There are two short course tracks: (1) the survey track and (2) the computer/data science track, presenting two short courses each. All courses will be taught in English. For the exact time and location of all short courses, see https://www.bigsurv18.org/program.
Survey Track
Biases and Their Consequences: Learning From the Total Survey Error Framework by Dr. Frauke Kreuter
Data science is often referred to as the art of extracting insights from data. As such, the focus is very much on techniques to analyze data and on tools to do so easily with large quantities of unstructured data. Increasingly, however, heavy users of data worry about (1) errors in the data collected, (2) biases that can creep into analyses, in particular machine learning models, and (3) the effects of decisions made during data curation and processing. Potential flaws in the creation of survey statistics, and remedies for them, have always been a core concern of survey methodologists. It is only natural, therefore, for our field to expand its methodological perspective to data beyond surveys. Moreover, it is often only by combining several different data sources that we are in a position to evaluate data quality and thereby gain confidence in the analytic results. Against this backdrop, this course will sketch the total survey error framework, build a bridge between the error sources identified there and the error sources that occur in other data sources, and discuss with participants strategies to mitigate errors and their effects on statistics and prediction results.
This course is geared towards social and computer/data scientists who are unfamiliar with the total survey error concept or with the discussion around fair machine learning models, and towards participants interested in the following:
- Discussing ways to identify errors across different data types,
- Understanding data generation processes,
- Identifying sources of errors in survey data and in common found data, and
- Understanding the effects of biases on estimates, in particular on prediction models in the machine learning context.
Frauke Kreuter is director of the Joint Program in Survey Methodology at the University of Maryland, head of the Statistical Methods group at the Institute for Employment Research in Nuremberg, and Professor at the University of Mannheim. Prior to her appointment at the University of Mannheim, she held a professorship at the Institute for Statistics at the Ludwig-Maximilians-Universität in Munich, Germany. She received her PhD from the University of Konstanz in 2001. Her research focuses on data quality, the use of paradata to improve surveys, and the joint use of survey and administrative data, as well as other newly emerging data sources.
Adaptive Survey Design by Dr. Andy Peytchev
Adaptive survey designs (ASDs) provide a framework for data-driven tailoring of data collection procedures to different sample members, often for cost and bias reduction. People vary in how likely they are to respond and in how they respond. They also vary in what can motivate them to participate in a survey. This heterogeneity leads to opportunities to selectively deploy design features in order to control costs as well as nonresponse and measurement errors. ASD aims at the optimal matching of design features and the characteristics of respondents given the survey budget. This calls for more complex designs that need additional planning, modeling, simulation and testing, monitoring, evaluation, and further optimization. The main objectives of this course are to provide an overview of ASDs and introduce each component of this approach needed for implementation by using several illustrative examples from surveys.
These designs may be particularly promising when combining multiple sources of data, such as administrative data and survey data. For example, a national survey of students in the United States combines administrative data from multiple sources with survey data that are subject to nonresponse. In an ASD, imputation models were estimated to identify the students for whom statistical models could not predict key variables very well. These models were then used to target data collection effort at those students in order to maximize the amount of information that could be collected.
This course is based on the recent book, Adaptive Survey Design, by Barry Schouten, Andy Peytchev, and James Wagner.
This course is geared towards survey methodologists, managers, statisticians, researchers, and computer/data scientists interested in understanding what constitutes ASDs and their potential utility in modern surveys and learning how to implement ASDs, with particular emphasis on stratification, strategies and interventions, modeling, monitoring, costs and logistics, and optimization of ASDs.
Andy Peytchev is a senior survey methodologist at RTI International, where his work includes the design and implementation of responsive and adaptive survey designs in web, telephone, and in-person surveys. Prior to that, he was an Assistant Research Professor at the University of Michigan, where he co-authored a book on Adaptive Survey Design with Barry Schouten and James Wagner.
Computer/Data Science Track
Introduction to Computational Text Analysis by Dr. Rochelle Terman
This short course introduces students to modern quantitative text analysis techniques. The goal is to provide an orientation for those wishing to go further with text analysis in their own research. We will discuss preprocessing texts into data (covering n-grams, stop words, stemming, and document-term matrices); comparing texts with discriminating words; and sentiment analysis using dictionary methods. Time permitting, we will introduce more advanced supervised and unsupervised machine learning methods, including topic models. We will demonstrate these techniques using the open source programming language R.
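The preprocessing and dictionary steps named above can be sketched in a few lines. The course itself uses R; the sketch below uses Python with only the standard library, and the two example documents, the stop-word list, and the sentiment dictionary are all hypothetical, chosen only for illustration. Stemming is left out because it normally relies on a dedicated stemmer library.

```python
import re
from collections import Counter

# Toy corpus and word lists: all hypothetical, for illustration only.
DOCS = [
    "The survey was great and the interviewers were helpful.",
    "The questions were confusing and the survey was too long.",
]
STOP_WORDS = {"the", "and", "was", "were", "too"}
SENTIMENT = {"great": 1, "helpful": 1, "confusing": -1, "long": -1}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def ngrams(tokens, n=2):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(token_lists):
    """Build a sorted vocabulary and one row of term counts per document."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def sentiment_score(tokens):
    """Sum the dictionary weights of the tokens; unknown words count as 0."""
    return sum(SENTIMENT.get(t, 0) for t in tokens)


tokens = [tokenize(doc) for doc in DOCS]
vocab, dtm = document_term_matrix(tokens)
scores = [sentiment_score(t) for t in tokens]

print(vocab)
for row, score in zip(dtm, scores):
    print(row, "sentiment:", score)
```

Each row of the document-term matrix counts how often each vocabulary word appears in one document, and the dictionary score is positive when positive words outnumber negative ones. Real analyses would use much larger, validated stop-word and sentiment lists.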
Prerequisites: Participants must have basic computer skills and be familiar with their computer’s file system. Basic knowledge of R programming is helpful but not required. Participants with no prior experience with R may wish to complete this brief tutorial (requiring 2-3 hours) to learn the basics of R before the course.
This course is geared towards social scientists who work with unstructured text data, including (but not limited to) news and media, open-ended survey responses, and social media posts. Participants must have basic computer skills and be familiar with their computer’s file system (e.g., paths). Basic knowledge of R programming is helpful but not required. By the end of the course, participants will:
- Be familiar with the main methods and techniques involved in modern computational text analysis.
- Be able to load, preprocess, and conduct simple analysis on text data.
- Know where to go next in their pursuit of more advanced computational text methods.
Rochelle Terman is currently a Provost Postdoctoral Fellow in the Department of Political Science at the University of Chicago, where she will begin as Assistant Professor in 2020. Her research examines international norms, gender, and advocacy, with a focus on the Muslim world, using a mix of quantitative, qualitative, and computational methods. She also teaches computational social science in a variety of capacities.
Big Data Processing for Social Science: An Introduction to Apache Spark by Ian Thomas
While enabling new research possibilities, Big Data also introduces new challenges to analysis and interpretation. Today, much research can still be done on personal computers or research servers. Sometimes, however, datasets get so large that more computing power is needed. For example, the posts on Reddit, a popular online forum, are available to researchers, but the full dataset is over 1.5TB. Common Crawl allows researchers to access 5 billion web pages, and GDELT (Global Database of Events, Language, and Tone) offers over 250M records monitoring the world's broadcast, print, and web news. A popular and effective tool for utilizing these resources is Apache Spark.
Spark is an open source computing platform, maintained by the Apache Software Foundation, that lets researchers perform analyses on many computers at once. Spark significantly reduces the complexity of analyzing large datasets, but it can also be used on a single computer. This makes it an ideal tool for researchers and analysts.
This course will introduce participants to the fundamentals of using Spark for Social Science by introducing modern approaches for working with large datasets; reviewing when these approaches are most appropriate; covering the fundamental mechanisms and basic internals of the Spark framework; and providing hands-on examples that demonstrate Spark's capabilities, speed, and programmatic idioms.
This course is geared towards anyone looking for an introduction to Apache Spark for Social Science. Participants should have some programming experience, ideally in R, Python, SAS, or SPSS. A laptop is required to work through the exercises.
Ian Thomas leads the development of data products and large-scale data processing in the Center for Data Science at RTI International. In his time at RTI, Mr. Thomas has led the Substance Abuse and Mental Health Data Archive (SAMHDA) Data Analysis System, web-based Twitter research tools, and an Apache Spark-based social media text collection and processing pipeline. Prior to joining RTI, he was a data engineer and reporting analyst for Epic Games, where he developed Hadoop-based large-scale data pipelines for collecting and analyzing incoming data from millions of users and used that information to build interactive dashboards.