Generating Evidence

Does a program or intervention result in improved outcomes?

This is the fundamental question for the generation of evidence. Randomized experiments provide the simplest and most rigorous means to answer this question, yet real-world constraints may limit the practicality, and complicate the implementation, of experiments. The STEPP Center endeavors to facilitate research designed to establish causal relations by addressing challenges to the process of scientific inference.

Current Research Projects

Our work has focused on approaches to designing and analyzing experiments, techniques for analyzing data from quasi-experiments and observational studies, and methods for generalizing experimental results to policy-relevant populations.

Education Research During the Pandemic

Hundreds of randomized trials in education have been interrupted by the pandemic. This project seeks to understand how to preserve the value of funded trials that have been disrupted. We will offer guidance and resources to investigators who are conducting ongoing trials.

Involves: Larry V. Hedges and Beth Tipton

Generalizability and Heterogeneity of Causal Effects

Research studies, including randomized trials, are only useful if they have a clear scope of application. However, randomized trials often focus only on providing an estimate of the average effect of an intervention in a convenience sample. This project seeks to develop new methods for improving generalizations from causal studies to target populations, as well as methods for designing studies with generalizability and treatment effect heterogeneity in mind.

Involves: Beth Tipton, Jessaca Spybrook, Katie Fitzgerald

The Generalizer

The Generalizer provides software to help investigators plan trials with a specified scope of application and interpret the scope of application of trials that have already been conducted. This project focuses on updating The Generalizer to include post-secondary data (IPEDS) as well as the ability to perform statistical power analyses.

Involves: Beth Tipton, Jessaca Spybrook, Michael Weiss, Katie Coburn, Beatrice Chao, and Lauren Chandler Holtz

Effect Sizes and Effect Size Estimation

Effect sizes are a fundamental tool in reporting, interpreting, and synthesizing research. This project seeks to understand fundamental problems in defining and estimating effect size and statistical inference about effect sizes in social research.

Involves: Larry Hedges, Beth Tipton, Rrita Zenullahi, Karina Diaz, Katie Coburn

Effect Sizes in Single Case Designs

Single case designs are used widely in special education and medicine and are often the predominant design used for studying low incidence diseases and disabilities. Evidence from single case designs has been difficult to synthesize and incorporate in evidence databases because it has lacked effect size measures comparable to those from more conventional (between-groups) studies. This project develops effect size measures that are comparable to those for other designs and can be used in syntheses and evidence databases.

Involves: Larry Hedges, Prathiba Natesan

Research Focus Areas

Designing Experiments

Working Papers

Hedges, L. V. & Schauer, J. (2019). The design of replication studies. Evanston, IL: Northwestern University Institute for Policy Research Working Paper.

Katsanes, R. (2017). Design and analysis of trials for developing adaptive interventions in education. Evanston, IL: Northwestern University Institute for Policy Research Working Paper.


Bilimoria, K. Y., J. W. Chung, L. V. Hedges, A. R. Dahlke, R. Love, M. E. Cohen, J. Tarpley, J. Mellinger, D. M. Mahvi, R. R. Kelz, C. Y. Ko, D. B. Hoyt, and F. H. Lewis. (2016). Development of the flexibility in duty hour requirements for surgical trainees (FIRST) trial protocol: A national cluster-randomized trial of resident duty hour policies. Journal of the American Medical Association, Surgery, 151, 273-81. DOI: 10.1001/jamasurg.2015.4990.

Hedberg, E. C. & Hedges, L. V. (2014). Reference values of within-district intraclass correlations of academic achievement by district characteristics: Results from a meta-analysis of district-specific data. Evaluation Review, 38, 546-582. DOI: 10.1177/0193841X14554212.

Hedges, L. V. & Borenstein, M. (2014). Constrained optimal design in three and four level experiments. Journal of Educational and Behavioral Statistics, 39, 257-281. DOI: 10.3102/1076998614534897

Spybrook, J., Hedges, L. V., & Borenstein, M. (2014). Understanding statistical power in cluster randomized trials: Challenges posed by differences in notation and terminology, Journal of Research on Educational Effectiveness, 7, 384-406. DOI: 10.1080/19345747.2013.848963

Hedges, L. V. & Hedberg, E. C. (2013). Intraclass correlations and covariate outcome correlations for planning two- and three-level cluster-randomized experiments in education. Evaluation Review, 37, 13-57. DOI: 10.1177/0193841X14529126

Hedges, L. V., Hedberg, E. C., & Kuyper, A. (2012). The variance of intraclass correlations in three and four level models. Educational and Psychological Measurement, 72, 893-909. DOI: 10.1177/0013164412445193

Analyzing Experiments

Pustejovsky, J. & Tipton, E. (2018). Small sample methods for cluster-robust variance estimation and hypothesis testing in fixed effects models. Journal of Business and Economic Statistics, 36(4), 672-683. DOI: 10.1080/07350015.2016.1247004.

Hedges, L. V. & Olkin, I. (2016). Overlap between treatment and control group distributions of an experiment as an effect size measure. Psychological Methods, 21, 61-68. DOI: 10.1037/met0000042

Hedges, L. V. & Citkowicz, M. (2015). Estimating effect size when there is clustering in one treatment group. Behavior Research Methods, 47, 1295-1308. DOI: 10.3758/s13428-014-0538-z.


Publications coming soon.

Generalizing Experimental Results

Working Papers

Tipton, E. Sample selection in randomized trials with multiple target populations. Working paper.

Tipton, E. Beyond the ATE: Designing randomized trials to understand treatment effect heterogeneity. Working paper.


Tipton, E., Yeager, D., Schneider, B., & Iachan, R. Designing probability samples to identify sources of treatment effect heterogeneity. In P.J. Lavrakas (Ed.). Experimental methods in survey research: Techniques that combine random sampling with random assignment. New York, NY: Wiley.

Chung, J. W., Hedges, L. V., Bilimoria, K. Y. et al. (2018). The estimation of population average treatment effects in the FIRST trial: Application of a propensity score-based stratification approach. Health Services Research, 2567–2590. DOI: 10.1111/1475-6773.12752.

Tipton, E., & Hedges, L. V. (2017). The role of the sample in estimating and explaining treatment effect heterogeneity. Journal of Research on Educational Effectiveness, 10, 903-909. DOI:10.1080/19345747.2017.1364563.

Tipton, E. & Peck, L. (2017). A design-based approach to improve external validity in welfare policy evaluations. Evaluation Review (Special Issue: External Validity 1), 41(4), 326-356. DOI:10.1177/0193841X16655656.

Levay, K. E., Freese, J., & Druckman, J. N. (2016). The demographic and political composition of Mechanical Turk samples. Sage Open, 6(1), 2158244016636433. DOI:10.1177/2158244016636433.

Tipton, E., Hedges, L. V., Hallberg, K. & Chan, W. (2016). Implications of small samples for generalization: Adjustments and rules of thumb. Evaluation Review, 40, 1-34. DOI:10.1177/0193841X16655665.

Mullinix, K. J., Leeper, T. J., Druckman, J. N., & Freese, J. (2015). The generalizability of survey experiments. Journal of Experimental Political Science, 2(2), 109-138. DOI:10.1017/XPS.2015.19.

O’Muircheartaigh, C. & Hedges, L. V. (2014). Generalizing from experiments with non-representative samples. Journal of the Royal Statistical Society, Series C, 63, 195-210. DOI:10.1111/rssc.12037.

Tipton, E. (2014). How generalizable is your experiment? Comparing a sample and population through a generalizability index. Journal of Educational and Behavioral Statistics, 39(6), 478-501. DOI:10.3102/1076998614558486.

Tipton, E. (2013). Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics, 38, 239-66. DOI:10.3102/1076998612441947.

Education & Training

To make these methods more accessible to researchers, the Center provides tutorial papers, online tools, and resources including working papers; seminars and short courses that train students and practitioners; and professional development institutes for established researchers.