Understanding Git

Having seen how various use cases can be accommodated by workflows that leverage the Git ecosystem for managing code and data, it is important to gain a better understanding of Git itself. Instead of diving into Git’s many features, which can be confusing, let’s take a step back to gain an overview.

In essence, Git allows you to tightly manage the evolution of trees of files. Each new tree version resulting from changes to a tree (a commit) is stored by Git in a repository in unmodifiable and retrievable form, with a unique identifier.[1] This unique identifier is a so-called commit hash[2] that is obtained by applying a cryptographic hash function to:

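The file tree (a snapshot of all tracked content, which is how the file changes enter the hash)
The parent commit hash (or hashes, in the case of a merge)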
The commit message
The commit author
The committer (not necessarily the same as the author)
The date and time (of both authoring and committing)

None of these can change without invalidating the commit hash. This also holds for the parent of the parent, all the way back to an empty initial repository: the hash fixes the whole history of changes and hence also the state of the file tree at the time of the commit. The same hash cannot[3] result from the same file tree but with a different sequence of changes leading up to it. It cannot result from different authors making the very same changes, or the same authors committing the very same changes at a different time or with a different descriptive message.
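The way the hash pins down the history can be made concrete with a minimal sketch. It assumes the classic SHA-1 object format, in which Git hashes a short header followed by the text of the commit object, and in which the file content enters through the hash of a tree object. The function names `git_object_id` and `commit_id` and the identity string are hypothetical, chosen only for illustration; real repositories may also use SHA-256 or add further headers such as signatures.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """Hash an object as SHA-1 repositories do: '<kind> <size>' + NUL + body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree, parents, author, committer, message):
    """Assemble the text of a commit object and return its hash."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return git_object_id("commit", "\n".join(lines).encode() + b"\n")


# The tree with no entries; every SHA-1 repository agrees on its hash.
EMPTY_TREE = git_object_id("tree", b"")

# Hypothetical identity: name, e-mail, Unix timestamp, time zone.
who = "A. Author <author@example.org> 1700000000 +0000"

# Two commits, the second pointing at the first.
root = commit_id(EMPTY_TREE, [], who, who, "Initial commit")
child = commit_id(EMPTY_TREE, [root], who, who, "Second commit")

# The same two commits, except that the first message is worded differently.
other_root = commit_id(EMPTY_TREE, [], who, who, "Initial commit, reworded")
other_child = commit_id(EMPTY_TREE, [other_root], who, who, "Second commit")

print(child)
print(other_child)
print(child != other_child)
```

Both second commits have the same file tree, author, time, and message. They still receive different hashes, because the hash of the parent is part of what gets hashed, so a change anywhere in the history propagates to every later commit.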

You can therefore think of a Git repository as a store that preserves the content, provenance, and retrievability of the entire history of a tree of files while giving tight control over changes. This is essential because copying and changing information has become so cheap and easy that files tend to lead a life of their own: being copied at will, modified ad hoc, with ownership and responsibility soon lost.

This does not mean that Git makes it difficult to copy and change information. Quite the reverse. Git embraces the convenience of easily making copies and applying changes for many purposes, yet gives you the tools to keep track of those changes and to consolidate the significant ones in a repository. As a result, the validated/canonical/official/upstream copy of a file tree can be maintained in a well-managed manner inside a repository even when a storm of copying, revision, development, and experimentation is going on.

That, in a nutshell, is the power of Git.

Warning

Git does not manage code and data for you; it helps you manage them. You will need to tell Git what is important and should be tracked, and what can be ignored. You will need to describe why you made a change so that Git can keep a record. You will need an overview of your code or dataset and a goal for its evolution. It is perfectly possible to use Git and still make a mess.

In short, Git is stupid: it needs your insight and intelligence.