Introduction
Handling files is something we all have to deal with, whether they contain code, data, or something else. If you own or collaborate on a code base or dataset, you have many ad hoc means at your disposal for changing, copying, sharing, and processing its files. That freedom makes it hard to manage the files in a well-controlled manner, because questions like these soon arise:
What changes were made to this file and why?
What is different between Alice’s copy of the files and Bob’s?
Which file changes should be kept, and which can be discarded?
Should these curious files be kept or are they processing artifacts?
Alice and Bob both changed their personal copies of the same file; how can their modifications be reconciled?
What data was used for last month’s run?
Managing a collection of files requires answering such questions. It does not take many changes, copies, collaborators, or weeks of interruption[1] before this becomes practically impossible without software assistance. The following pages suggest means of practical software-assisted management that can be adapted to fit your use case.
When you adopt these means, you can set up a workflow, customized for your own or your team’s needs, that allows you to evolve and release well-controlled versions of the file tree, with attached and verified copies of the files located wherever they are needed for editing, processing, review, or backup, all under unified management.
This helps make your research reproducible[2] and, in the long run, will save much time and avoid many mistakes compared with managing files manually or with outdated tools and workflows. The choice is yours.
[1] Even when you manage a set of files by yourself, you will be collaborating: with your future self. Your future self may have forgotten what you know now, particularly when returning to the project after a long interruption. You will thank yourself for having managed change properly by keeping a good record of the what and the why.
[2] If it is not reproducible, it is not science.