Specialized Data Stores

Not all data management fits into the pattern of evolving a file tree. Sometimes a specialized data store, such as a geo-spatial database or a relational database, is needed to attain sufficient performance. Such data stores index the data for fast access and provide some kind of query interface or language for data retrieval and updates. When introducing such a data store to a project, you risk losing many of the benefits of unified code and data management via the Git ecosystem, as discussed on the prior pages, since there is now an additional place to perform updates that bypasses Git. Moreover, such data stores introduce their own complexities, such as hosting, maintenance, and the need for an additional access control mechanism. This page suggests ways to mitigate these risks and complexities.

Do you really need one?

Carefully consider whether a specialized data store is really required. Sometimes one is added to a project for relatively frivolous reasons, for example when a few data files loaded into an efficient data structure such as a hash table would be perfectly adequate and performant for meeting requirements.
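
As a minimal illustration of the file-based alternative, the Python sketch below loads a data file into a hash table and queries it by key; the file name and key are hypothetical:

    import json

    # Load a modest data file into an in-memory hash table (a dict).
    # "measurements.json" is a hypothetical file mapping sample IDs to records.
    with open("measurements.json") as f:
        records = json.load(f)

    # Constant-time lookup by key, which is often fast enough to make
    # a dedicated data store unnecessary.
    sample = records["sample-0042"]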

It may be that the developers are familiar with using a particular data store together with their preferred coding framework. This will ease development, but may burden later data management in addition to adding operational complexity. These costs should be weighed when deciding whether to introduce a specialized data store.

Use read-only stores

Speed of access is the main reason for requiring a specialized data store. The indexing and selective querying allow you to quickly retrieve just the information needed, for example for interactive visualization. This use case can be covered by read-only access to the data store: only queries are allowed, updates are disallowed. Instead, the data store is updated and re-indexed only when the data is released at the end of a well-managed workflow.
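
As a concrete sketch of read-only access, the Python snippet below uses SQLite as a stand-in for whatever data store the project uses; the file name and table are hypothetical. Opening the connection in read-only mode means queries succeed while any update attempt fails:

    import sqlite3

    # Open the released data read-only: queries work, while any attempt
    # to modify the data raises sqlite3.OperationalError.
    # "release.db" and the "observations" table are hypothetical.
    conn = sqlite3.connect("file:release.db?mode=ro", uri=True)
    rows = conn.execute(
        "SELECT station, value FROM observations WHERE value > ?", (42.0,)
    ).fetchall()
    conn.close()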

Update/initialize from a Git (LFS) repository

By providing scripting to update/initialize the data store from a Git repository, or a Git LFS repository when the data is large, you can bring the post-update or instantiation state of the data store under unified Git management (a minimal sketch follows the list below). This can be useful for:

  • Ensuring that the data store contains only well-vetted and verified data.

  • Publishing a practical means for others to set up their own data store.

    • This is needed because the published code and data of the project won’t be usable without a means to obtain an instance of the specialized data store.

  • Unified management of the initial state of a test instance of the data store, as used for development or integration testing, together with the code.
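
A minimal sketch of such an update/initialization script, in Python with SQLite standing in for the actual data store; the repository URL, file names, and table layout are hypothetical:

    import csv
    import sqlite3
    import subprocess
    from pathlib import Path

    REPO_URL = "https://example.org/project/data.git"  # hypothetical data repository
    CHECKOUT = Path("data-checkout")

    # Clone or update the repository; "git lfs pull" materializes the large files.
    if CHECKOUT.exists():
        subprocess.run(["git", "-C", str(CHECKOUT), "pull"], check=True)
    else:
        subprocess.run(["git", "clone", REPO_URL, str(CHECKOUT)], check=True)
    subprocess.run(["git", "-C", str(CHECKOUT), "lfs", "pull"], check=True)

    # Rebuild the store from scratch so it contains exactly the vetted release.
    conn = sqlite3.connect("store.db")
    conn.execute("DROP TABLE IF EXISTS observations")
    conn.execute("CREATE TABLE observations (station TEXT, value REAL)")
    with open(CHECKOUT / "observations.csv", newline="") as f:
        rows = [(r["station"], float(r["value"])) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO observations VALUES (?, ?)", rows)
    conn.commit()
    conn.close()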

Exporting to a Git LFS repository

Specialized data stores typically allow the data they contain to be exported to files. These files will be large and can be stored in a Git LFS repository (a sketch follows the list below) for purposes of:

  • Publishing an evolved version of the data for the public or external researchers to use when setting up their own instance.

    • For example to enable reproduction of results associated with a publication.

  • Backup with the option to retrieve a history of restore points.

  • Managing data store backups together with matching code and data.

    • For example to be able to reproduce past states of the overall system.
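
A sketch of such an export, again with SQLite as a stand-in and hypothetical names; it assumes the Git LFS repository is already cloned locally and already routes *.csv files to LFS (set up with git lfs track):

    import sqlite3
    import subprocess
    from pathlib import Path

    EXPORT_REPO = Path("data-export")  # hypothetical clone of the Git LFS repository

    # Dump the store's contents to a file inside the LFS-tracked repository.
    conn = sqlite3.connect("store.db")
    with open(EXPORT_REPO / "observations.csv", "w") as f:
        f.write("station,value\n")
        for station, value in conn.execute("SELECT station, value FROM observations"):
            f.write(f"{station},{value}\n")
    conn.close()

    # Commit and push the snapshot; the repository's .gitattributes is
    # assumed to route *.csv files to LFS.
    subprocess.run(["git", "-C", str(EXPORT_REPO), "add", "observations.csv"], check=True)
    subprocess.run(
        ["git", "-C", str(EXPORT_REPO), "commit", "-m", "Export data store snapshot"],
        check=True,
    )
    subprocess.run(["git", "-C", str(EXPORT_REPO), "push"], check=True)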

Note

When using a read-only data store, there is no need to export: the data does not change, and the data store can therefore be re-instantiated, or the update re-run, from its original data source.

Containerized store

Instead of deploying the data store on server infrastructure with system administration requirements, consider instantiating the store in a self-managed container via scripting. This reduces operational complexities.

The container can be transient when the data is pulled in from a Git LFS repository on instantiation and, in the case of data stores that are not read-only, exported to a Git LFS repository on destruction (see above). A sketch of this lifecycle follows below.
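
The Python sketch below uses Docker and PostgreSQL purely as examples; the container name, image, and credentials are illustrative only:

    import subprocess

    # Start a transient PostgreSQL container; --rm discards the container
    # (and its data) on stop, since no persistent volume is mounted.
    subprocess.run(
        [
            "docker", "run", "--detach", "--rm",
            "--name", "transient-store",
            "-e", "POSTGRES_PASSWORD=dev-only",
            "-p", "5432:5432",
            "postgres:16",
        ],
        check=True,
    )

    # ... initialize from the Git LFS repository and run queries (see above) ...

    # For a writable store, export to the Git LFS repository before this point.
    subprocess.run(["docker", "stop", "transient-store"], check=True)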

This approach makes it easier for others to reproduce your processing setup and hence your research. It also makes it easy to set up additional instances of the data store, for example for development or integration testing. When done testing, the instances can be destroyed.

If updates to the data really have to be done through the data store — for example to ensure referential integrity via transactional updates on a relational database — a containerized store will make it easier to set up a transient management instance against which those updates are performed.
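
A sketch of such a transactional update, with SQLite standing in for the relational database and hypothetical tables; the transaction ensures that either both inserts apply or neither does, preserving referential integrity:

    import sqlite3

    # Apply a multi-table update atomically on the management instance.
    # The "stations" and "observations" tables are hypothetical.
    conn = sqlite3.connect("store.db")
    conn.execute("PRAGMA foreign_keys = ON")
    with conn:  # transaction: commits on success, rolls back on error
        conn.execute("INSERT INTO stations (id, name) VALUES (?, ?)", (7, "north"))
        conn.execute(
            "INSERT INTO observations (station_id, value) VALUES (?, ?)", (7, 3.14)
        )
    conn.close()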