Synthetic Data Generation for Small Area Estimation with Application to Large-Scale Surveys


Confidentiality and Related Topics in SAE

Author: Joseph Sakshaug (University of Manchester, UK and Institute for Employment Research, Germany)

Small area statistics provide an important source of information used to study local trends related to social, health, and economic phenomena. However, most large-scale sample surveys that collect rigorous measures of these phenomena are not designed to produce reliable small area estimates. A further complication is that data disseminators are typically prohibited from releasing small-area identifiers in public-use survey data sets because of disclosure risk concerns. In this presentation, I will examine a method of generating synthetic microdata that permits detailed geographical information to be released in public-use data files. The method is based on a hierarchical Bayesian model that accounts for multiple levels of geography and complex sample design features (e.g., stratification, clustering). The model is used to simulate multiple, fully synthetic versions of the observed data. Inferences from these simulated (or synthetic) data files are then obtained using standard combining rules. The method is demonstrated on two large-scale national surveys for which small area estimates are desired: the National Health Interview Survey and the American Community Survey. The analytic properties of the resulting small area inferences are presented using direct comparison with the observed data, simulations, and a cross-validation study.
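
The abstract does not say which combining rules are meant. As an illustration only, the sketch below assumes the rules commonly used for fully synthetic data (Raghunathan, Reiter and Rubin, 2003): the point estimate is the average of the per-data-set estimates, and its variance is estimated as (1 + 1/m) times the between-data-set variance minus the average within-data-set variance. The function name and the input values are hypothetical and are not taken from the presentation.

```python
from statistics import mean, variance


def combine_fully_synthetic(estimates, variances):
    """Combine results from m fully synthetic data sets.

    Assumed rules (Raghunathan, Reiter and Rubin, 2003), not confirmed
    by the abstract:
        q_bar = mean of the m point estimates
        b_m   = sample variance of the m point estimates (divisor m - 1)
        u_bar = mean of the m within-data-set variance estimates
        t_f   = (1 + 1/m) * b_m - u_bar
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two estimates and matching variances")

    q_bar = mean(estimates)
    b_m = variance(estimates)   # between-data-set variance
    u_bar = mean(variances)     # average within-data-set variance
    t_f = (1 + 1 / m) * b_m - u_bar
    # Note: t_f can come out negative in small samples; it is returned
    # unmodified here so the caller can decide how to handle that case.
    return q_bar, t_f


# Hypothetical small area estimates from three synthetic data sets.
q_bar, t_f = combine_fully_synthetic([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
```

Because the between-data-set term must outweigh the within-data-set term for the variance estimate to be positive, a larger number of synthetic data sets generally makes this estimator more stable.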
