Thoughts on data management for scientists
Scratch notes on resources for data management for scientists
Twitter got me to thinking about data management the other day. My sense is that scientists generally get very little training in how to manage their data. There are resources available for people interested in learning, but these tend to cater more towards big-ish datasets (e.g. genomics) where the bioinformatics ecosystem is more mature and good standards/best-practices for handling data are already in place. There’s not a whole lot available focused on the smaller, more specialized kinds of data that show up everywhere in biology.
I’ve spent a lot of time now working with this type of data. I’ve made a lot of mistakes with it, but have also learned a lot along the way. Writing this out to collect some of my thoughts on best practices for managing small data; hopefully I’ll find time to turn this into a more coherent post at some point.
Resources
- Ten Simple Rules for Digital Data Storage and Good enough practices in scientific computing are excellent
- Thinking about the FAIR Principles (in more detail here) is also really useful. Making data publicly accessible is great to do, but it might be low priority or inapplicable for many kinds of research; the points about structuring data/metadata, though, are, I think, universally useful.
- Tidy Data
- The Carpentries
- Axiom has a great set of best practices and cheatsheets
Tips
- Pick ONE system for naming your data and stick to it. Naming is a hard problem. A name should correspond to one and only one file, and this should remain true over a reasonably large scope, e.g. if you move the file to a new directory (i.e. don't rely on filesystem structure) or transfer it to a labmate.
- A timestamp plus an experimenter/machine ID (e.g. 20200623090000_AN) is reasonably good within the scope of a lab. Consider using something truly arbitrary, though, i.e. a UUID.
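A minimal sketch of what this could look like in Python; the experimenter ID and the exact pattern are just illustrations, not a standard:

```python
from datetime import datetime
from uuid import uuid4

def timestamp_id(experimenter: str) -> str:
    """Timestamp + experimenter/machine ID, e.g. '20200623090000_AN'."""
    return f"{datetime.now():%Y%m%d%H%M%S}_{experimenter}"

def arbitrary_id() -> str:
    """Truly arbitrary alternative: a UUID is unique regardless of who/when/where."""
    return uuid4().hex

print(timestamp_id("AN"))  # e.g. 20200623090000_AN
print(arbitrary_id())      # e.g. 9f1c2e4b...
```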
- Store your data in open formats. If your data are being generated in a proprietary format, convert them to something like HDF5, parquet, csv, tiff, etc. NWB is a great option for many kinds of neuro data.
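For tabular data the conversion step can be as small as the sketch below (assumes pandas plus a Parquet engine such as pyarrow; the column names and values are made up):

```python
import pandas as pd

# Pretend this table came out of a vendor-specific export; once the values are
# in memory, also save a copy in an open, well-supported format.
df = pd.DataFrame({"time_s": [0.0, 0.1, 0.2], "voltage_mV": [-65.2, -64.8, -40.1]})

df.to_parquet("20200623090000_AN.parquet")       # columnar, compressed, typed
df.to_csv("20200623090000_AN.csv", index=False)  # plain text, readable anywhere
```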
- Use filesystems to organize your data, but try to keep the tree pretty flat. There are many cases where one folder works great for an entire dataset.
- Make good metadata. Explicitly store metadata with the corresponding data. Scientists do crazy shit like having essential metadata (which ephys traces did I apply the drug to? what was the genotype of this animal?) stored in a single place ON PAPER. If taking notes on paper is part of your workflow, digitizing those notes needs to become part of your data acquisition protocol too. Keep it simple: a .txt or .csv file for each experiment is a good choice; .xlsx is not.
- Any metadata that can be generated by a machine should be (e.g. nobody should be typing timestamps). For metadata that has to be entered by a human, data entry should be contemporaneous with data acquisition. Make a plan to catch typos.
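A sketch of what this could look like per experiment (field names and values are invented for illustration): one small .csv written next to the data file at acquisition time, with the machine-generated fields filled in automatically and only the genuinely human fields typed by a person.

```python
import csv
from datetime import datetime, timezone

def write_metadata(recording_id: str, human_fields: dict, path: str) -> None:
    """Write a one-row metadata file alongside the corresponding data file."""
    row = {
        "recording_id": recording_id,
        # machine-generated: nobody should be typing this
        "acquired_at": datetime.now(timezone.utc).isoformat(),
        # human-entered fields, captured at acquisition time
        **human_fields,
    }
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)

write_metadata(
    "20200623090000_AN",
    {"experimenter": "AN", "genotype": "WT", "drug": "TTX", "drug_sweeps": "12-20"},
    "20200623090000_AN_metadata.csv",
)
```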
- Seriously, make a plan to catch typos. They will happen. More generally, make quality control an explicit step in your data acquisition and allocate time to do this. Ideally, get a second set of eyes. This sucks, but it sucks less than not catching a key error.
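A sketch of an automated check along these lines (the allowed values are placeholders for whatever your lab has agreed on); run it right after data entry so errors surface while the experiment is still fresh:

```python
import pandas as pd

ALLOWED_GENOTYPES = {"WT", "KO", "HET"}   # illustrative local vocabulary
ALLOWED_EXPERIMENTERS = {"AN", "JS"}

def check_metadata(meta: pd.DataFrame) -> None:
    """Fail loudly on the most common data-entry errors."""
    bad = ~meta["genotype"].isin(ALLOWED_GENOTYPES)
    assert not bad.any(), f"Unknown genotypes:\n{meta.loc[bad]}"

    bad = ~meta["experimenter"].isin(ALLOWED_EXPERIMENTERS)
    assert not bad.any(), f"Unknown experimenters:\n{meta.loc[bad]}"

    assert not meta["recording_id"].duplicated().any(), "Duplicate recording IDs"
    pd.to_datetime(meta["acquired_at"], errors="raise")  # typos in timestamps fail here

check_metadata(pd.read_csv("20200623090000_AN_metadata.csv"))
```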
- Make your metadata more comprehensive than you think it needs to be.
- Wherever possible, adopt universal conventions for your metadata, e.g. use RRIDs. For resources specific to a lab, establish a local consensus on nomenclature (e.g. IDs for animals, equipment, batches of reagents) and stick to it.
- Humans cannot be trusted to protect their data. Make backups happen automatically after acquisition: at least three copies, at least one offsite.
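How you automate this depends on your infrastructure (cron + rsync, a sync client, institutional storage); purely as an illustration of the idea, here is a small Python routine that copies each newly acquired file to a backup location and verifies the copy by checksum (paths are placeholders):

```python
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def backup_file(src: Path, backup_dir: Path) -> None:
    """Copy a freshly acquired file to a backup location and verify the copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    dst = backup_dir / src.name
    shutil.copy2(src, dst)
    assert sha256(src) == sha256(dst), f"Backup of {src} did not verify"

# run automatically at the end of acquisition; repeat for each backup target
backup_file(Path("20200623090000_AN.parquet"), Path("/mnt/lab_server/backup"))
```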
- Keep an indelible copy of raw data/metadata. Build analysis pipelines to transform/aggregate/model your data and save the outputs of these pipelines separately.
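One way to make this concrete (directory names and the processing step are arbitrary): everything under raw/ is written once at acquisition and treated as read-only, and every pipeline reads from there and writes its outputs elsewhere, so derived data can always be deleted and regenerated.

```python
from pathlib import Path
import pandas as pd

RAW = Path("data/raw")              # written once at acquisition, never edited
PROCESSED = Path("data/processed")  # safe to delete and regenerate at any time

def preprocess(recording_id: str) -> None:
    """Example pipeline step: read raw data, write the derived output separately."""
    df = pd.read_parquet(RAW / f"{recording_id}.parquet")
    cleaned = df.dropna()           # stand-in for real preprocessing
    PROCESSED.mkdir(parents=True, exist_ok=True)
    cleaned.to_parquet(PROCESSED / f"{recording_id}_cleaned.parquet")

preprocess("20200623090000_AN")
```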
- Use a scripting language (Python, R) rather than GUI-based programs (Excel, Prism) to process data, run analyses, and generate plots. GUI-based programs lead people toward practices that make data more human-readable but much less machine-readable (e.g. inconsistently using blank rows/columns as spacers between groups of datapoints in Excel). This can make it necessary to clean data later, which is time-consuming and error-prone. Format your raw/processed data in ways that make it easy to interact with programmatically.
- Putting your data into long/narrow form is a really good way to go; I wish I had discovered this much earlier in my research career. Two big advantages: 1) it makes it much easier to handle imbalanced/ragged datasets, and 2) it makes it easy to construct a design matrix/regressors for statistical modeling/machine learning (see the sketch after the next bullet).
- More broadly, embracing the principles of Tidy Data (see also the closely related concept of 3NF) pretty much always makes things better.
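For example (column names are made up), a wide table with one column per condition becomes long/tidy with a single melt; missing conditions just drop out instead of leaving ragged rows, and the result is already shaped like a design matrix:

```python
import pandas as pd

# Wide: one row per cell, one column per condition (NaN where a condition wasn't run)
wide = pd.DataFrame({
    "cell_id": ["c1", "c2", "c3"],
    "baseline_hz": [12.0, 9.5, 11.2],
    "ttx_hz": [0.3, None, 0.2],
})

# Long/narrow: one row per observation
long = wide.melt(id_vars="cell_id", var_name="condition", value_name="firing_rate_hz")
long = long.dropna(subset=["firing_rate_hz"])   # ragged data handled naturally
print(long)
```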
- Write your analysis code so that people (most notably, other scientists and your future self) can read it. Use version control. The Carpentries are a great resource.
- Write pipelines that can be re-run quickly and painlessly, i.e. with minimal human input/oversight. You'll re-run them more often than you expect.
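A sketch of the shape this can take (the processing step is a placeholder): one entry point that rebuilds all derived outputs from the raw data with no prompts or manual steps, so a full re-run is a single command.

```python
# run_pipeline.py -- rebuild every derived file from raw data in one command
from pathlib import Path

RAW = Path("data/raw")
PROCESSED = Path("data/processed")

def process(raw_file: Path) -> None:
    """Placeholder for real preprocessing/analysis/figure steps."""
    PROCESSED.mkdir(parents=True, exist_ok=True)
    out = PROCESSED / f"{raw_file.stem}_summary.txt"
    out.write_text(f"derived from {raw_file.name}\n")

def main() -> None:
    for raw_file in sorted(RAW.glob("*")):
        process(raw_file)

if __name__ == "__main__":
    main()
```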