Program at a glance
Location: Ciutadella Campus, Ramon Trias Fargas, 25-27, 08005 Barcelona. Conference Registration and the Conference Opening/Keynotes are located in the underground space between the Jaume I building (Building 20) and the Roger de Llúria building (Building 40). There will be signage and volunteers to direct you.
Full program online | Floor plans
Download the BigSurv18 App 
All sponsor exhibits will be in Room 30.S02 (S. Expo) on Friday 26th and Saturday 27th.
Methods to Improve Survey Representativeness Using High Dimensional Data 

Chair: Professor Michael Elliott (University of Michigan)
Time: Friday 26th October, 14:15–15:45
Room: 40.002
The explosion of "big data" in the 21st century has led some researchers to declare the end of the traditional probability sample paradigm. We propose three talks that refute that belief and discuss methods to combine the strengths of high-dimensional data, whether from probability or nonprobability samples, with the strengths of more traditional survey data.
Calibrating Big Data for Population Inference: Applying a Quasi-Randomization Approach to Naturalistic Driving Data Using Bayesian Additive Regression Trees
Professor Michael Elliott (University of Michigan)  Presenting Author
Mr Ali Rafei (University of Michigan)
Professor Carol Flannagan (University of Michigan)
Although probability sampling has been the “gold standard” of population inference, rising costs and downward trends in response rates have led to a growing interest in nonprobability samples. Nonprobability samples, however, can suffer from selection bias. Here we develop “quasi-randomization” weights to improve the representativeness of nonprobability samples. This method assumes that the nonprobability sample has an unknown probability sampling mechanism that can be estimated using a reference probability survey. We apply the proposed method to improve the representativeness of the University of Michigan Transportation Research Institute Safety Pilot Study, a convenience sample of over 3,000 vehicles that were instrumented and followed for an average of one year, using the National Household Travel Survey as our probability sample of drivers.
Bayesian Inference for Sample Surveys in the Presence of High-Dimensional Auxiliary Information
Mr Yutao Liu (Columbia University)
Professor Andrew Gelman (Columbia University)
Professor Qixuan Chen (Columbia University)  Presenting Author
The National Drug Abuse Treatment System Survey (NDATSS) is a panel survey of substance abuse treatment programs in the United States. In 2013, the NDATSS conducted its first wave of a panel survey on residential non-opioid treatment programs (non-OTPs). A random sample of programs was selected from a sampling frame constructed using the 2010 National Survey of Substance Abuse Treatment Services (NSSATS), the latest available annual census of all substance abuse treatment programs in the United States when the NDATSS was planned. From 2010 to 2013, the population of residential non-OTPs changed, with new programs opening and some old programs closing. To account for the change in the population as well as potential response bias, we propose a Bayesian multilevel model to improve the survey inference of population quantities in the NDATSS using the newly released 2013 NSSATS data, which contains a rich profile of up-to-date information about all residential non-OTPs in the nation. In the first level of the model, we regress each survey outcome on the propensity that a program was included in the NDATSS and auxiliary variables in the 2013 NSSATS that were associated with the survey outcomes of interest. To allow a flexible association between the survey outcomes and the inclusion propensities, we use a penalized spline. In the second level of the model, we further model the propensity of inclusion in the NDATSS using Bayesian classification trees, which can naturally handle high-dimensional auxiliary variables in the 2013 NSSATS data, interactions and nonlinearity, and uncertainty in the estimation of inclusion propensities for all programs in the population. We then predict each survey outcome for the non-sampled residential non-OTPs using the inclusion propensities and auxiliary variables associated with those programs.
We compute the posterior distributions of the population means and proportions via Markov chain Monte Carlo simulation using the Bayesian inference engine Stan.
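The first-level model can be sketched with a simple non-Bayesian stand-in: a truncated-linear spline of the outcome on the estimated inclusion propensity, fit by ridge-penalized least squares. The knot placement, penalty value, and data below are illustrative assumptions, not the authors' specification (which is fully Bayesian and fit in Stan).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: pi_hat is an estimated inclusion propensity for n
# sampled programs; y is a survey outcome that varies smoothly with it.
n = 300
pi_hat = rng.uniform(0.05, 0.95, n)
y = np.sin(3 * pi_hat) + rng.normal(0, 0.2, n)

# Truncated-linear spline basis in the propensity: [1, pi, (pi - k)_+, ...]
knots = np.quantile(pi_hat, np.linspace(0.1, 0.9, 8))
B = np.column_stack([np.ones(n), pi_hat] +
                    [np.clip(pi_hat - k, 0, None) for k in knots])

# Penalized least squares: ridge penalty on the spline coefficients only,
# a frequentist stand-in for the penalized spline in the first-level model.
lam = 1.0
P = np.eye(B.shape[1])
P[0, 0] = P[1, 1] = 0.0          # leave intercept and slope unpenalized
coef = np.linalg.solve(B.T @ B + lam * P, B.T @ y)
fitted = B @ coef

resid_sd = np.std(y - fitted)    # should approach the noise level
```

The penalty keeps the fitted curve smooth between knots; in the Bayesian version described above, the same effect comes from a prior on the spline coefficients, with the posterior computed in Stan.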
How Non-Ignorable is the Selection Bias in Nonprobability Samples? An Illustration of New Measures Using a Large Genetic Study on Facebook
Professor Brady West (University of Michigan)  Presenting Author
Professor Phil Boonstra (University of Michigan)
Professor Roderick Little (University of Michigan)
Mr Jingwei Hu (University of Michigan)
Many survey researchers are currently evaluating the utility of "big data" that are not selected by probability sampling. Measures of the degree of potential bias from nonrandom selection of cases from a given population are therefore sorely needed. Existing indices of the degree of departure from a representative probability sample, such as the R-Indicator, are functions of the propensity of inclusion, estimated by modeling the inclusion probability as a function of auxiliary variables. These methods are agnostic about the relationship between the inclusion probability and the survey outcomes, which is a crucial feature of the problem. We will first describe and empirically evaluate (via simulation) simple indices of the degree of departure from ignorable selection for estimates of means that correct this deficiency, called unadjusted and adjusted potential absolute bias (PAB). The indices are based on normal pattern-mixture models applied to the problem of sample selection, and are grounded in the model-based framework of non-ignorable selection first proposed in the context of nonresponse by Rubin (1976, Biometrika). The methodology also provides for sensitivity analyses to adjust inferences for departures from ignorable selection, before and after adjustment for auxiliary variables.
We will then apply the proposed indices to data from the Genes for Good (GfG) project (https://genesforgood.sph.umich.edu/), which recruits a nonprobability sample of study volunteers via Facebook for genetic profiling and also collects data on important risk factors for cancer (e.g., obesity) for predictive modeling purposes. Using matched genetic profiling data from the Health and Retirement Study (HRS, which is a probability sample; see http://hrsonline.isr.umich.edu/) as a population benchmark, we will use our proposed indices to evaluate the extent of bias in GfG estimates of the prevalence of particular risk factors for cancer for both males and females age 50 and above. We will conclude with recommendations for future work in this area.
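A simplified illustration of the pattern-mixture idea behind such sensitivity analyses: index the departure from ignorable selection by a parameter phi, where phi = 0 means selection depends on the data only through an auxiliary proxy x, and phi = 1 means it depends entirely on the outcome y. The adjustment below follows the standard proxy pattern-mixture slope form; the data are simulated and this is not the PAB index proposed in the talk.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated population with a proxy x correlated with the outcome y.
N = 100_000
x = rng.normal(0, 1, N)
y = 0.8 * x + rng.normal(0, 0.6, N)
truth = y.mean()                      # population mean, ~0 by construction

# Non-ignorable selection: inclusion depends directly on y.
sel = rng.random(N) < 1 / (1 + np.exp(-(y - 1)))
xs, ys = x[sel], y[sel]
rho = np.corrcoef(xs, ys)[0, 1]

def pmm_mean(phi):
    """Adjusted mean of y under sensitivity parameter phi in [0, 1].
    phi = 0: selection ignorable given the proxy x (regression adjustment);
    phi = 1: selection depends on y alone (inverse-regression adjustment)."""
    slope = (phi + (1 - phi) * rho) / (phi * rho + (1 - phi))
    return ys.mean() + slope * (ys.std() / xs.std()) * (x.mean() - xs.mean())

naive = ys.mean()            # unadjusted sample mean, biased upward
ignorable = pmm_mean(0.0)    # removes only the bias explained by x
worst_case = pmm_mean(1.0)   # here near the truth, since selection is on y
```

Varying phi over [0, 1] traces out a sensitivity band for the population mean; in this simulation, where selection really does operate on y, the phi = 1 endpoint recovers the truth while the ignorable adjustment removes only part of the bias.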
Evaluating Doubly Robust Estimation for Online Opt-In Samples With Bayesian Additive Regression Trees
Mr Andrew Mercer (Pew Research Center/Joint Program in Survey Methodology)  Presenting Author
Correcting selection bias in estimates from online, opt-in survey samples requires a statistical model that makes the outcome variable of interest conditionally independent of inclusion in the sample. Elliott and Valliant (2017) describe two broad approaches to this problem: quasi-randomization, which involves fitting a model to predict inclusion in the sample, and superpopulation inference, where the model predicts the survey outcome. A third approach, less commonly discussed in the context of opt-in surveys, is doubly robust estimation, in which both an outcome regression model and an inclusion propensity model are fit. If either one is correctly specified, the estimates will be asymptotically consistent.
Which approach works best in practice? If researchers have a high degree of confidence in their ability to predict either survey participation or the outcome variable, logic suggests choosing the approach that best fits their prior knowledge. However, ignorable selection can never be known to hold with certainty, and researchers may lack extensive knowledge of potential confounders, especially when there is little visibility into the recruitment and sampling process, as is often the case with online, opt-in samples. Doubly robust strategies can hedge against this uncertainty, but they can have high variance and may be even less accurate when both the propensity and outcome models are incorrect. Given that researchers often produce survey estimates under less-than-ideal circumstances, do any of these approaches tend to produce better results when ignorability assumptions are not met?
In this paper, we use Bayesian additive regression trees (BART) to compare each of these estimation strategies for opt-in samples. We compare four estimators: propensity weighting (PW), outcome regression (OR), and two doubly robust estimators: outcome regression with a residual bias correction (OR-RBC), and outcome regression with the propensity score as a covariate (OR-PSC). BART is an accurate, flexible Bayesian machine learning algorithm that has been shown to perform well in causal inference and missing-data imputation settings. It readily accommodates interactions and nonlinearities, reducing the need for assumptions about functional form.
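Under strong simplifying assumptions (a single covariate, a known propensity, and ordinary least squares in place of BART), the four estimators can be sketched as follows; everything here is an illustrative simulation, not the paper's data or models.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated population: outcome y depends on a single covariate x.
N = 50_000
x = rng.normal(0, 1, N)
y = 1.0 + 0.7 * x + rng.normal(0, 0.5, N)
truth = y.mean()

# Opt-in "sample": inclusion depends on x, making the sample selective.
p = 1 / (1 + np.exp(-(0.5 + x)))    # true propensity, treated as known here
s = rng.random(N) < p
w = 1 / p[s]                        # inverse-propensity weights

# Outcome model: OLS of y on x within the sample (stand-in for BART).
Xs = np.column_stack([np.ones(s.sum()), x[s]])
beta = np.linalg.lstsq(Xs, y[s], rcond=None)[0]
y_hat = np.column_stack([np.ones(N), x]) @ beta   # predictions for all units

naive = y[s].mean()                               # biased upward
pw = np.average(y[s], weights=w)                  # PW: propensity weighting
or_est = y_hat.mean()                             # OR: outcome regression
or_rbc = or_est + np.average(y[s] - y_hat[s], weights=w)  # OR-RBC

# OR-PSC: refit the outcome model with the propensity score as a covariate.
Xs2 = np.column_stack([np.ones(s.sum()), x[s], p[s]])
beta2 = np.linalg.lstsq(Xs2, y[s], rcond=None)[0]
or_psc = (np.column_stack([np.ones(N), x, p]) @ beta2).mean()
```

With both working models correctly specified, all four estimators remove the selection bias that the naive sample mean exhibits; the interesting comparisons in the paper arise when one or both models are misspecified.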
Using 10 parallel survey datasets from different online opt-in sample vendors, we evaluate the performance of each estimator with respect to bias, variance, and root mean squared error (RMSE) on five estimated measures of civic engagement for which ignorability assumptions are clearly invalid. With few exceptions, we find that OR-RBC and PW produce very similar point estimates with the lowest bias, though OR-RBC exhibits lower variance and RMSE. In contrast, OR-PSC and OR also produce very similar point estimates, with higher biases; among these, the doubly robust OR-PSC estimates display the highest variance and RMSE.