OSS9097 – Collecting and Analyzing Big Data
This course is an introduction to collecting and analyzing "big data" for social scientists. Over the last decade, the variety and types of data available to researchers have exploded. This includes not only contemporary data, such as data from websites and social media platforms, but also historical data, from digitized interviews to 19th-century newspapers. At the same time, analytic techniques from computer science are increasingly being used to solve social science problems.
One week is not enough time to master the techniques for collecting and analyzing big data. You will, however, be able to establish the foundation for developing these skills. The course is designed as a practical overview. The emphasis in each class will be on applying specific techniques rather than on their mathematical basis. It is an overview in that each lesson introduces a new method in order to demonstrate the range of methods available. Taken together, the lessons will give students the skills and resources to apply these methods to theoretically relevant problems in the social sciences.
By the end of the course, I expect that students will be able to:
- Collect data from the internet using web scraping and APIs.
- Read and write digital text files.
- Analyze data using supervised learning techniques such as random forest models.
- Analyze data using unsupervised learning techniques such as topic models.
- Understand and apply current methods for analyzing texts.
- Link machine learning methods to relevant social science questions.
- Program in Python.
Formal prerequisite knowledge
Students should have a Python 3.7 distribution appropriate for data science installed on their computer. The recommended way to do this is to install Continuum's Anaconda Python distribution (https://www.anaconda.com/download/). It is free and available for all operating systems. Students are not expected to have any knowledge of Python. Specific directions on packages and installation will be provided at least one month before the start of the course.
Students have the option of submitting a research paper in order to receive ECTS credits. These research papers (6,000 to 8,000 words) should apply one or more of the techniques used in the course to a theoretically interesting research question. Papers should generally follow the format of a research article in the student's discipline, although the literature review may be more concise than normal. Additionally, students must provide code and, where feasible, data to replicate the analysis. The paper is to be completed within 8 weeks of the end of the course.
Neal Caren is an Associate Professor of Sociology at the University of North Carolina, Chapel Hill. His research interests center on the quantitative analysis of protest and social movements. His work has been published in the American Sociological Review, Social Forces, Social Problems, and the Annual Review of Sociology. Many of his publications use data that were scraped from the web, downloaded using APIs, or otherwise obtained by collecting and analyzing texts. He is the author of a well-used, publicly available script for converting Lexis-Nexis article downloads into a CSV file. For several years, he has run a graduate workshop on computational social science and digital data collection. He has also given external workshops on the topic and has several tutorials available online. He is also the editor of the social movements journal Mobilization.