Reproducibility is the cornerstone of trustworthy research and analysis.
And we definitely need more of it! The “reproducibility crisis” is so widespread that it even has its own Wikipedia entry 🙈.
As Data Analysts and Data Visualization professionals, we bear the responsibility to restore trust by championing reproducibility standards in our work. That’s why I’m working hard on my workflow project! 😀
Today, I would like to summarise the 11 pieces of advice I repeat again and again to make a study reproducible with R, Python or any other tool.
1️⃣ Never alter raw data.
Ever.
Upon acquiring a dataset, store it in a dedicated folder.
Access its content solely through scripts. Avoid duplication. Any modifications must be made programmatically as the initial step in your pipeline.
Manual changes risk irretrievable errors.
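Here is a minimal sketch of what that looks like in R (the file and column names are hypothetical):

```r
# Read the raw file as-is; data/raw/survey.csv is never edited by hand
raw <- read.csv("data/raw/survey.csv")

# All fixes happen in code, as the very first step of the pipeline
clean <- subset(raw, !is.na(age))
clean$age <- as.numeric(clean$age)

# The cleaned copy lives in its own folder; the raw file stays untouched
write.csv(clean, "data/clean/survey_clean.csv", row.names = FALSE)
```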
2️⃣ Embrace self-contained projects.
Ensure all elements needed for your results reside within the same folder. Handing this folder to another person should enable them to execute the analysis seamlessly.
If some files live elsewhere, they can be moved or deleted by accident, leaving the analysis impossible to rerun.
In R, RStudio Projects are a great help.
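One possible layout for such a self-contained folder (the names are just a suggestion):

```
my-analysis/
├── my-analysis.Rproj    # RStudio project file
├── data/
│   ├── raw/             # untouched raw data (tip 1)
│   └── clean/
├── R/                   # scripts and functions
├── output/              # figures and tables produced by the code
└── report.qmd           # the final report (tip 7)
```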
3️⃣ Automate everything.
Manual interventions are forbidden.
Every operation should be executable with a single click or a single script, guaranteeing reproducibility.
Dependence on manual steps risks forgotten processes over time, jeopardizing the integrity of your analysis.
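In practice, a single entry-point script that chains every step does the trick; something like this (the script names are hypothetical):

```r
# run_all.R -- rerun the entire analysis with one command
source("R/01_clean_data.R")     # tip 1: programmatic cleaning
source("R/02_fit_models.R")
source("R/03_make_figures.R")
```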
4️⃣ Exclusively use relative paths.
Utilizing absolute paths, such as ~/NicolasComputer/Desktop/myFile.txt, renders your analysis fragile.
Relocating files out of Nicolas’s computer breaks the entire process, resulting in lost analyses.
Use relative paths instead!
In R, do not use the setwd() function anymore!
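A small sketch of the difference, using the here package to resolve paths from the project root (the file paths are made up, and here must first be installed with install.packages("here")):

```r
# Fragile: only works on one specific machine
# df <- read.csv("~/NicolasComputer/Desktop/myFile.txt")

# Robust: resolved relative to the project root, works for anyone who receives the folder
library(here)
df <- read.csv(here("data", "raw", "myFile.txt"))
```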
5️⃣ Prioritize clean code
Write clean code from the beginning.
Employ descriptive variable names, develop functions, write comments and document thoroughly…
Such efforts benefit future users, including your future self.
Do it now, you won’t do it tomorrow.
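A tiny illustration of the idea (the function and the sales data frame are made up for the example):

```r
# Opaque one-liner: what is d? what is c? what is v?
# m <- mean(d[d$c == "FR", ]$v)

# Clearer: a named, commented function that future you can reuse
mean_by_country <- function(data, country, variable) {
  country_rows <- data[data$country == country, ]
  mean(country_rows[[variable]], na.rm = TRUE)
}

# 'sales' is a hypothetical data frame with 'country' and 'amount' columns
french_mean <- mean_by_country(sales, "FR", "amount")
```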
6️⃣ Test your code.
Test complex functions on small dummy datasets where the expected outcome is known. If a function fails on a simple dataset, it will almost certainly misbehave on your real, larger one.
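A minimal sketch with base R’s stopifnot(), reusing the hypothetical mean_by_country() from the previous tip:

```r
# A tiny dummy dataset where the expected result is known by construction
dummy <- data.frame(country = c("FR", "FR", "DE"),
                    amount  = c(10, 20, 100))

# The FR average must be 15; if this fails, fix the function before touching real data
stopifnot(mean_by_country(dummy, "FR", "amount") == 15)
```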
7️⃣ Directly publish results from code.
Avoid manually copying results into .doc files: it is slow and a major source of errors.
Leverage tools like Quarto or Jupyter for direct report generation, enhancing productivity.
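For instance, a minimal Quarto report might look like this (the file and column names are hypothetical); every number and figure in the output is computed by the code, so nothing can get out of sync:

````markdown
---
title: "Sales report"
format: html
---

```{r}
#| include: false
library(here)
sales <- read.csv(here("data", "clean", "sales_clean.csv"))
```

The dataset contains `r nrow(sales)` rows.

```{r}
hist(sales$amount)
```
````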
8️⃣ Conduct full-scale reruns periodically.
Running analyses from scratch helps uncover environment inconsistencies that may arise from overlooked variables.
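Concretely, this can be as simple as restarting R and re-running the hypothetical entry-point script from tip 3:

```r
# In a completely fresh session (Session > Restart R in RStudio),
# so nothing left over in memory can silently rescue the run
source("run_all.R")
```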
9️⃣ Document your environment configuration.
Utilize functions like sessionInfo() in R to capture package details and versions.
This documentation proves invaluable in future debugging, ensuring compatibility with evolving software versions.
In R you can do even better with the renv package.
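For example, a minimal sketch:

```r
# Save the exact R version and loaded package versions alongside the project
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")

# Or, better: let renv record (and later restore) the exact package library
renv::init()      # set up a project-local library
renv::snapshot()  # write the package versions to renv.lock
```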
🔟 Use version control with Git.
Track project changes meticulously to understand who made alterations and when.
This practice prevents chaos and safeguards against confusing file names like result_2_final.png.
Trust me, being able to go back in time will make you feel so much safer!
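The day-to-day usage is only a handful of commands (the commit message and file name below are just examples):

```bash
git init                        # start tracking the project folder
git add .
git commit -m "Clean raw survey data"

git log --oneline               # browse the compact history of changes
git checkout <commit-hash> -- R/01_clean_data.R   # bring back an older version of a file
```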
1️⃣1️⃣ Store data online.
Eliminate reliance on vulnerable hard drives by hosting analyses on platforms like GitHub.
Bonus: it lets you turn your reports into websites, which enhances accessibility and resilience.
If you don’t apply these guidelines already, check out my productive workflow project! In less than 3 hours, I guarantee your data analysis project will look a hundred times better.
Have a good, reproducible day!
Yan