Preserving Statistical Validity in Adaptive Data Analysis
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries resulting from misapplication of statistical data analysis. Existing approaches to ensuring validity of inferences drawn from data assume a fixed collection of hypotheses to be tested, or analysis to be applied, selected non-adaptively before the data are examined. In contrast, the practice of data analysis in scientific research is by its nature an adaptive process, in which new hypotheses are generated and new analyses are performed on the basis of data exploration and observed outcomes on the same data. We demonstrate a new approach for addressing the challenges of adaptivity based on insights from private data analysis. As an application we show how to safely reuse a holdout set a great many times without undermining its validation power, even when hypotheses, models, and algorithms are chosen adaptively.
Joint work with Cynthia Dwork, Moritz Hardt, Toni Pitassi, Omer Reingold and Aaron Roth.