awesome-replicability-data

This repository collects publicly available datasets for replicability analysis. Currently, we curate a collection of paired individual-level datasets of original and replication studies, and one-sided pairs with individual-level data for the replication study. We are non-selective in collecting these datasets, i.e., both successful and failed studies are included as long as they are available.

This repository accompanies the paper "Diagnosing the role of observable distribution shift in scientific replications" by Ying Jin, Kevin Guo and Dominik Rothenhäusler. [Reference]

Please feel free to contact us at ying531[at]stanford[dot]edu, or open an issue if you have suggestions for replication datasets not collected here!

Related resources

R package. Our R package repDiagnosis provides statistical tools for estimating the contribution of observable distribution shifts in replication studies, such as covariate difference and mediation shifts. Paired data 1, 3, 8 below are cleaned and pre-loaded in the R package for use.

Interactive diagnosis app. Play with our interactive analysis tools in our online R shiny app! Quick start with pre-loaded datasets in the app (datasets 1, 3, 8 below). You can also diagnose your own replication study, or probe the generalizability of your single study.

Example analysis. We provide in analysis.html an analysis report for other datasets that we did not elaborate on in our paper.

List of complete, paired datasets

Below we list links to papers and datasets for original and replication studies where both have individual-level data publicly available. The Processed column links to the data folder in this repo (if any) that we processed from publicly available data. Clicking the link in the Name column jumps to the text that summarizes the study.

Name	Original paper	Original data/repo	Replication paper	Replication data/repo	Processed
1. Covid information	Pennycook, et al., 2020	OSF link	Roozenbeek, et al., 2021	OSF link	Folder link
2. Empathy and SES	Côté, et al., 2013	no data	Babcock, et al., 2017 (two reps)	OSF link	Folder link
3. EMDR and misinformation	Houben, et al., 2018	OSF link	Calvillo and Emami, 2019	OSF link	Folder link
4. Self-centrality and mind-body practice	Gebauer, et al., 2018	yoga meditation analysis	Vaughan-Johnston, et al., 2021	yoga meditation	Folder link
5. Queueing design	Shunko, et al., 2018	data zipfile	Long, et al.	data zipfile	Folder link
6. Multi-lab disgust and moral judgement			Ghelfi, et al., 2020	OSF link (to all studies)	Folder link
7. Pain and cooperation	Bastian, et al., 2014	OSF link	Prochazka, et al., 2022	OSF link	Folder link
8. Cleanliness and moral judgement	Schnall, et al., 2008	OSF link	Johnson, et al., 2014	OSF link	Folder link
9. Lie and foreign language	Suchotzki and Gamer, 2008	OSF link	Frank, et al., 2019	OSF link	Folder link
10. Multi-lab ego depletion	Rep 1: Hagger, et al., 2016	OSF link	Rep 2: Dang, et al., 2020	OSF link
11. Honesty and time	Shalvi, et al., 2012	data in replication OSF link	Van der Gruyssen, et al., 2020	Rep 1, Rep 2	Folder link

List of one-sided datasets

Below we collect one-sided original-replication study pairs, i.e., pairs where the replication study has individual-level data while the original study has only summary statistics available. We include such datasets if the original paper contains rich summary statistics. These summary statistics, together with the individual-level data of the replication study, are processed and stored in the folders linked in the Processed column. Clicking the link in the Name column jumps to the text that summarizes the study.

Name	Original paper	Replication paper	Replication data/repo	Processed
1. Climate change misinformation	van der Linden, et al., 2015	Williams and Bond, 2020	OSF link	Folder link
2. Pain-tolerance metaphor	Sierra, et al., 2016	Pendrous, et al., 2020	OSF link	Folder link
3. Body dissatisfaction	Martijn, et al., 2010	Glashouwer, et al., 2019	Database link	Folder link
4. Priming and exercise	Pottratz, et al., 2021	Timme, et al., 2022	OSF link	Folder link
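
To illustrate how a one-sided pair might be used, the sketch below compares a covariate in the unit-level replication data against the summary statistics reported for the original study. It is a minimal example under stated assumptions: the function name `standardized_shift`, the use of an original-study standard deviation, and all numbers are hypothetical and are not taken from any dataset in this repository.

```python
from statistics import mean


def standardized_shift(replication_values, original_mean, original_sd):
    """Return (replication mean - original mean) / original SD for one covariate."""
    if original_sd <= 0:
        raise ValueError("original_sd must be positive")
    return (mean(replication_values) - original_mean) / original_sd


# Hypothetical numbers for illustration only (NOT from any study listed here).
replication_age = [20, 22, 24, 26, 28]          # unit-level replication data
original_age_mean, original_age_sd = 30.0, 6.0  # summary statistics from the original paper

shift = standardized_shift(replication_age, original_age_mean, original_age_sd)
print(shift)
```

A value far from zero indicates that the replication sample differs from the original sample on that covariate, measured in units of the original study's standard deviation.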

Details of paired studies and datasets

1. Covid information study dataset

Background. This study investigates the effect of a 'nudge' to think about the truthfulness of information on truth discernment when sharing COVID-related news. The treated participants were asked to rate the accuracy of several headlines, and all participants rated how likely they were to share them on social media.
Sample sizes. The original study by Pennycook et al. recruited n = 1145 participants, while the replication study by Roozenbeek et al. had sample size N = 1583.
Variables. The outcome variable is ratings, the rating of willingness to share the headlines. In addition, both studies measured demographic information including age, gender, education, and ethnicity. Other measures include cognitive reflection crt, science knowledge sciknow, medical maximizer-minimizer scale mms, etc. The binary treatment is encoded in the treatment column, and real is a binary indicator of whether the information is correct.
Results. The original study found a statistically significant interaction between treatment and news truthfulness, i.e., treated participants were less willing to share headlines that were perceived as less accurate. The replication study failed to detect such an effect in the first stage with N = 701, but found a significant, smaller effect after collecting a second round of data with pooled N = 1583.
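
The interaction between treatment and news truthfulness can be illustrated with a small sketch. With two binary factors, the interaction coefficient of a saturated linear model equals a difference-in-differences of the four cell means, so the example computes it directly. The keys ratings, treatment, and real follow the variable description above; the function name and the toy rows are hypothetical, and this is not necessarily the specification used in either paper.

```python
from statistics import mean


def interaction_estimate(rows):
    """Difference-in-differences of cell means for treatment x real.

    Each row is a dict with keys 'ratings', 'treatment' (0/1) and 'real' (0/1).
    """
    def cell_mean(treatment, real):
        return mean(
            row["ratings"]
            for row in rows
            if row["treatment"] == treatment and row["real"] == real
        )

    # Discernment = sharing rating for true headlines minus false headlines.
    discernment_treated = cell_mean(1, 1) - cell_mean(1, 0)
    discernment_control = cell_mean(0, 1) - cell_mean(0, 0)
    return discernment_treated - discernment_control


# Toy rows for illustration only (NOT data from either study).
toy_rows = [
    {"ratings": 4.0, "treatment": 1, "real": 1},
    {"ratings": 2.0, "treatment": 1, "real": 0},
    {"ratings": 4.0, "treatment": 0, "real": 1},
    {"ratings": 3.0, "treatment": 0, "real": 0},
]
print(interaction_estimate(toy_rows))
```

A positive value means the treated group separates true from false headlines more strongly than the control group does.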

2. Empathy and SES dataset

Background. Babcock et al. conducted two replications of one study from Côté et al., regarding the effect of inducing empathy on utilitarian moral judgment across socioeconomic status (SES). Treated participants received an empathy nudge, and then all participants completed an allocation task.
Sample sizes. The original sample size was n = 91. The first replication study had sample size N1 = 230, and the second had N2 = 300.
Variables. The primary outcome is Decision_DV, i.e., how many dollars participants would take away from the 'lose' member in the allocation task, as a measure of utilitarian moral judgement. Control variables including age, gender, ethnicity, income, religiosity, political orientation, etc., were also collected, as were intermediate outcomes on how compassionate, moved, and sympathetic participants felt towards the 'lose' member. We clean the datasets for the two replication studies separately.
Results. The original study found a significant interaction between experimental condition and SES. The first replication study did not replicate this result, while the second did.

3. EMDR and misinformation dataset

Background. This study concerns the effect of eye movements on susceptibility to false memories. These eye movements are a standard component of "eye movement desensitization and reprocessing" (EMDR), a standard intervention for posttraumatic stress disorder.
Sample sizes. The original study by Houben et al. had sample size n = 82, while the direct replication by Calvillo and Emami had sample size N = 120.
Variables. The outcome variables are the total number of correct answers and the total number of misinformation items endorsed after the experiment. In addition, both studies collected gender, age, and pre- and post-intervention vividness of memory and emotionality. The depression measure differs between the studies (BDI versus BDI-II).
Results. The original study found a statistically significant effect of eye movement on increasing false memories, while the replication study did not.

4. Self-centrality and mind-body practice dataset

Background. This study investigates whether mind-body practices (yoga in experiment 1 and meditation in experiment 2) increase self-enhancement. In experiment 1, waves of local yoga participants were randomly assigned to treatment and control by week. In experiment 2, participants were recruited from an undergraduate psychology subject pool, with two waves completed offline and two online.
Sample sizes. The original study had n1 = 93 for experiment 1 and n2 = 162 for experiment 2 (potentially with repeated measures over a few weeks). The replication study had N1 = 97 and N2 = 300 for the two experiments.
Variables. There are a few outcome variables, including self-centrality, self-enhancement, self-esteem, etc. In our folder, we cleaned the datasets with easier-to-understand column names, and also provide the data cleaning scripts (adapted from the data sources) for reproducibility.
Results. In the replication, Experiment 1 showed no significant effect of yoga on self-centrality, but did (largely) replicate the effects on self-enhancement, self-esteem, and communal narcissism. Vaughan-Johnston et al. attributed the discrepancy to sampling differences. Experiment 2 showed no significant effect of meditation on self-centrality; frequentist and Bayesian analyses disagreed regarding self-enhancement; however, the replication found much stronger evidence for well-being effects than the original study.

5. Queueing design and service time dataset

Background. This study investigates the impact of queue design on worker productivity in service systems that involve human servers, comparing multiple parallel queues with a single pooled queue.
Sample sizes. The original study recruited n1 = 248 participants from a public university in the US and n2 = 481 participants on M-Turk. The replication study recruited N1 = 246 and N2 = 252 participants in two rounds.
Variables. The outcome variable is median speed. The treatment variable is the structure of the queue. Other baseline variables were also measured, including age, gender, device used in the experiment, and managerial experience of the participant.
Results. The original study found that the single-queue structure slows down servers, while the replication study failed to find such an effect.

6. Multi-lab disgust and moral judgement dataset

Background. This is a multi-lab replication of an original study from Eskine et al. (2011); unit-level data for the original study is not publicly available to our knowledge. They studied the effect of gustatory disgust on moral judgement, where participants were randomly assigned to bitter, neutral (control), or sweet beverages, and then judged the moral wrongness of six vignettes. We follow the ordering on OSF to clean the datasets and preserve common demographic, manipulation check, and outcome variables.
Sample sizes. The original study had sample size n = 57, while the replication studies had N = 1137 participants in total across k = 11 studies.
Variables. The outcome variable is the average moral rating of the six vignettes. The treatment variable is condition, coded as dummysweet, dummybitter and dummywater in the cleaned datasets. Baseline covariates include religiosity, gender, age, years in college, major, ethnicity, political orientation, etc. We preserve gender, age, and political orientation for consistency in the cleaned data. Subjective ratings of the beverages (bitter, disgusting, neutral, and sweet) were also collected to check the intended effect of the manipulation; they are named check_... in the cleaned data.
Results. The original study showed that gustatory disgust triggers a significantly heightened sense of moral wrongness. In the multi-lab replication study, the overall estimates of effect sizes were all smaller than in the original study; some were in the opposite direction; all had 0.95 confidence intervals containing zero.
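
As a small illustration of the treatment coding described above, the sketch below maps a three-arm condition label to the indicator columns dummysweet, dummybitter and dummywater. The raw labels "sweet", "bitter" and "water" and the function name are assumptions made for this example; the cleaned files may store the condition differently.

```python
def dummy_code(condition):
    """Map a beverage condition label to three 0/1 indicator columns."""
    levels = ("sweet", "bitter", "water")  # assumed labels for the three arms
    if condition not in levels:
        raise ValueError(f"unknown condition: {condition!r}")
    # Exactly one indicator is 1 for any valid condition.
    return {f"dummy{level}": int(condition == level) for level in levels}


print(dummy_code("bitter"))
```

When fitting a regression with an intercept, one of the three indicators (typically the control arm) should be dropped to avoid perfect collinearity.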

7. Pain and cooperation dataset

Background. Experiment 2 of Bastian et al. (2014) studied the effect of sharing painful experience on intergroup cooperation. Small groups (2-6 people each) of participants performed either two painful or two painless tasks and then played an economic game. Prochazka et al. (2022) conducted a pilot nonpreregistered direct replication and a second preregistered direct replication, with group sizes fixed at three.
Sample sizes. The original study had sample size n = 62. The pilot replication had N1 = 153 participants from the Czech Republic, and the second preregistered replication had N2 = 158 students from Slovakia.
Variables. The outcome variable is cooperation, the average score from the six games. The treatment variable is condition. We cleaned the datasets by preserving overlapping variables, while the original data additionally contains group size information. Baseline covariates include age and gender. After the experiments, intermediate outcomes such as the level of pain and unpleasantness of sensations were measured as a manipulation check.
Results. The original study found that shared pain increases cooperation among group members. Both replication studies failed to replicate this finding.

8. Cleanliness and moral judgement dataset

Background. This study investigates the impact of physical cleanliness on the severity of moral judgement. Participants were randomly assigned to be primed with the concept of cleanliness (Exp.1) or to wash their hands after experiencing disgust (Exp.2), and then rated six moral vignettes.
Sample sizes. The original study had n1 = 40 for Exp.1 and n2 = 44 for Exp.2. The replication study had N1 = 219 for Exp.1 and N2 = 132 for Exp.2.
Variables. We cleaned the datasets and preserved the covariates common to both studies. The outcome variable is vignette, the mean rating across all vignettes. The treatment variable is condition, with treatment coded as 1. Other variables include the emotionality measures collected after the experiments.
Results. The original study found statistically significant effects in both experiments, while Johnson et al. failed to replicate either of them.

9. Lie and language dataset

Background. This study investigates the impact of foreign versus native language on lying. In the original study, German-speaking participants took a lie test where questions were presented randomly in German or English, and they answered with truth or lying in different languages. In the replication study, participants were Dutch-speaking.
Sample sizes. The original study had n = 41 participants, and the replication study had N = 63.
Variables. The measured outcome is the response time for truth-or-lie-telling answers in both languages. In our cleaned data, each row contains the mean response time of a participant (indicated by ID) for questions of different Emotionality, Veracity (Lie or Truth) and Language, as well as the participant's evaluation of emotionality for each category of (Emotionality times Language). Due to limited access, only the replication data contains demographic features including age, gender, major, language proficiency as introduced in Frank, et al., 2019.
Results. The original study showed smaller reaction time differences between lying and truth telling in the foreign compared to the native language condition, which was mostly driven by prolonged truth responses. The replication study found a statistically significant effect in the same direction, yet with a smaller effect size.
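
The comparison described in the results can be sketched from rows shaped like the cleaned data, one mean response time per participant and cell. The keys ID, Veracity and Language follow the variable description above; the key "rt", the labels "native" and "foreign", the function name, and the toy values are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean


def lie_truth_gap_by_language(rows):
    """Mean response time for Lie minus Truth, computed separately per Language."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row["Language"], row["Veracity"])].append(row["rt"])
    languages = {language for language, _ in cells}
    return {
        language: mean(cells[(language, "Lie")]) - mean(cells[(language, "Truth")])
        for language in languages
    }


# Toy rows for illustration only (NOT data from either study).
toy_rows = [
    {"ID": 1, "Language": "native", "Veracity": "Lie", "rt": 1500.0},
    {"ID": 1, "Language": "native", "Veracity": "Truth", "rt": 1200.0},
    {"ID": 1, "Language": "foreign", "Veracity": "Lie", "rt": 1600.0},
    {"ID": 1, "Language": "foreign", "Veracity": "Truth", "rt": 1500.0},
]
gaps = lie_truth_gap_by_language(toy_rows)
print(gaps["native"] - gaps["foreign"])
```

A positive final value corresponds to the pattern reported above: a smaller lie-truth difference in the foreign language than in the native language.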

10. Multi-lab ego depletion dataset

There are two multi-lab replications. Hagger, et al., [2016] failed to replicate the effect, but Dang, et al., [2020] succeeded. Dang, et al., [2020] also pointed out that inconsistent implementation of the intervention may be a reason for the replication failure in Hagger, et al., [2016]. Both OSF links contain datasets for each lab, including individual-level characteristics.
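
Since the data are organized per lab, one common way to summarize them is to pool per-lab effect estimates. The sketch below uses fixed-effect inverse-variance weighting, a standard technique that is not necessarily the method used in either multi-lab paper. The function name and the per-lab numbers are hypothetical.

```python
import math


def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted mean of per-lab estimates and its standard error."""
    if len(estimates) != len(standard_errors) or not estimates:
        raise ValueError("need equally many estimates and standard errors")
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Hypothetical per-lab effect estimates and standard errors (NOT real results).
lab_estimates = [0.2, 0.4]
lab_ses = [0.1, 0.1]

pooled, pooled_se = pool_fixed_effect(lab_estimates, lab_ses)
# Normal-approximation 95% confidence interval.
lower, upper = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(round(pooled, 3), round(lower, 3), round(upper, 3))
```

A random-effects model would be the natural alternative when the labs are expected to differ in their true effects.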

11. Honesty and time dataset

Background. This study investigates the impact of time pressure on cheating. In the original study, participants privately rolled a die and were paid according to the number they reported (which did not have to be true). The reported number is used as the outcome.
Sample sizes. The original study had n = 72. The replication study consisted of two experiments; the first had larger sessions with N1 = 426, and the second had the same session size as the original study with N2 = 297.
Variables. The outcome of interest is the reported die roll. The treatment variable (=1) indicates whether there is time pressure (i.e., having to report the die roll in a short time). Data for the original study only contains gender as demographic information. Data for the replication study contains age, gender, education, etc., as demographics, as well as ratings of participants' belief in the financial incentive and in the anonymity of their die roll. The original study and replication study 1 collected the participants' positive and negative feelings after the experiment; we preserve all such columns and put the common ones before the others.
Results. The original study found that time pressure increases cheating, while neither of the replication studies replicated this conclusion.
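
Because individual die rolls are private, cheating in this design can only be assessed at the group level, by comparing reported rolls with what a fair die would produce. The sketch below computes the mean reported roll minus the fair-die expectation of 3.5 for each condition. The function name and the reported values are hypothetical, and this is an illustration rather than the test used in either paper.

```python
from statistics import mean

FAIR_DIE_MEAN = 3.5  # expected value of a fair six-sided die: (1 + 2 + ... + 6) / 6


def excess_report(reports):
    """Mean reported roll minus the fair-die expectation."""
    if any(r not in range(1, 7) for r in reports):
        raise ValueError("reports must be integers from 1 to 6")
    return mean(reports) - FAIR_DIE_MEAN


# Hypothetical reports for illustration only (NOT data from either study).
time_pressure = [6, 5, 6, 4, 5, 4]  # treatment = 1
no_pressure = [3, 4, 2, 5, 4, 3]    # treatment = 0

print(excess_report(time_pressure), excess_report(no_pressure))
```

A larger excess under time pressure than without it would be in line with the original finding.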

Details of one-sided studies and datasets

1. Climate change misinformation dataset

Background. This study investigates the impact of information communication on protecting against misinformation about climate change.
Sample sizes. The original study had n = 2167. The replication study had N = 792.
Variables. The outcome of interest is the perceived consensus, and there are multiple treatment conditions. We clean the replication dataset (with unit-level data) and record the sample means of demographic information from the original study; the processing script is included for reproducibility.
Results. The original study had multiple hypotheses; the replication study replicated a subset of them.

2. Pain-tolerance metaphor dataset

Background. This study investigates the impact of common physical properties (such as 'cold') within a perseverance metaphor on increasing pain tolerance. Participants completed a cold pressor task before and after a randomly allocated metaphor exercise intervention.
Sample sizes. The original study had n = 87. The replication study had N = 89.
Variables. The outcome of interest is the difference in pain tolerance. We save the replication dataset (with unit-level data), and the sample mean of demographic information in the original dataset.
Results. The original study found that physical metaphor increases pain tolerance, while the replication study did not replicate this result.

3. Body dissatisfaction dataset

Background. This study investigates the impact of a computer-based evaluative conditioning (EC) procedure using positive social feedback on enhancing body satisfaction.
Sample sizes. The original study had n = 54. The replication study had N = 129.
Variables. The outcome of interest is the difference in body satisfaction and self-esteem before and after the intervention. We save the replication dataset (unit-level), and the sample mean of demographic information in the original dataset.
Results. The conclusion in the original study was not successfully replicated.

4. Priming and exercise dataset

Background. This study investigates the impact of affective priming as a behavioral intervention on the enhancement of exercise-related affect.
Sample sizes. The original study had n = 54. The replication study had N = 53.
Variables. The outcome of interest is the difference in body satisfaction and self-esteem before and after the intervention. We save the replication dataset (unit-level), and the sample mean of demographic information in the original dataset.
Results. The conclusion in the original study was not successfully replicated. The replication report pointed to heterogeneity among participants as a potential reason for the failure.

Reference

Please use the following citation if you use this collection in your study, or if you use our software for analyzing replication studies.

@article{jin2023diagnosing,
  title={Diagnosing the role of observable distribution shift in scientific replications},
  author={Jin, Ying and Guo, Kevin and Rothenh{\"a}usler, Dominik},
  journal={arXiv preprint arXiv:2309.01056},
  year={2023}
}
