Tracking with Git

This guide explains how to track datakit runs in a version control repository using Git.

One of the key advantages of using datakits is the ability to version control your entire analysis process. By tracking changes to your datakit with Git, every step of your analysis becomes reproducible and shareable, allowing others with access to your repository to replicate your work.

It is a brief introduction for readers who are new to Git and want to start tracking datakit changes.

Installing Git

Before getting started, you’ll need to install Git. Follow the installation instructions on the official Git website for your operating system.

Initialising your repository

Once Git is installed, you can initialise your datakit as a Git repository:

cd helloworld-datakit    # Navigate to your datakit root folder
git init                 # Initialise a Git repository

Now check the status of your repository:

git status

You should see a list of “untracked files”: these are files that Git is not yet monitoring for changes.

Let’s add and commit all files to start tracking them:

git add --all
git commit -m "Initial commit"

Now, all files in your datakit are tracked. You can revert to this state at any time if needed.
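
As an illustration of reverting, the sketch below discards an unwanted edit to a tracked file. It runs in a throwaway directory so it cannot touch a real datakit. The file contents and the demo identity are placeholders invented for the example, not the actual datakit.json format.

```shell
set -eu

# Work in a temporary directory, not in a real datakit
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Example User"          # demo-only identity
git config user.email "user@example.com"

# Placeholder content standing in for a datakit file
echo '{"runs": []}' > datakit.json
git add --all
git commit -q -m "Initial commit"

# Simulate an unwanted change, then discard it
echo '{"runs": ["unwanted"]}' > datakit.json
git checkout -- datakit.json                 # on Git 2.23+, "git restore datakit.json" does the same

cat datakit.json                             # shows the committed version again
```

This only discards uncommitted changes to files Git already tracks; files that were never added are left untouched.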

Tracking changes

After running an analysis in your datakit, some files will be modified.

For example, if you run the following commands:

dk init
dk load data data/tabulardata.csv
dk run

Git will detect changes in your repository. You can check this by running:

git status

You might see output like this:

On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   datakit.json

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        multipleruns.run/

no changes added to commit (use "git add" and/or "git commit -a")

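Git has noticed the changes, but they haven’t been committed yet. The run also created a new, untracked folder, which git commit -a would skip because it only picks up files Git already tracks. To save everything, stage all changes and then commit:

git add --all
git commit -m "Put a description of your run here"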

It’s important to commit after each significant run you want to preserve. If you don’t, Git has no record of that run’s state, and subsequent runs could overwrite it.
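
Once several runs have been committed, the history can be inspected to see what each run changed. The following is a minimal sketch in a throwaway directory. The file results.txt and the commit messages are hypothetical stand-ins for real run output.

```shell
set -eu

# Work in a temporary directory, not in a real datakit
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Example User"          # demo-only identity
git config user.email "user@example.com"

# Two commits standing in for two datakit runs
echo "run 1" > results.txt
git add --all
git commit -q -m "First run"

echo "run 2" > results.txt
git add --all
git commit -q -m "Second run"

git log --oneline                 # one line per commit, newest first
git diff --stat HEAD~1 HEAD       # files that changed between the last two commits
```

Descriptive commit messages make this log much easier to read later, which is one reason to describe each run when committing.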

Publishing to GitHub

To share your datakit or make it available publicly, you can upload your repository to GitHub.

First, create a new repository on GitHub and copy its URL.

Now, link your local repository to the remote GitHub repository:

git remote add origin https://github.com/your-account/your-repository.git

Push your changes to GitHub:

git push origin main

Your datakit is now published to GitHub and can be accessed by others if your repository is set to public.
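
To check that a push has worked, you can list the configured remotes and the branches on the remote. The sketch below rehearses the whole sequence offline, using a local bare repository as a stand-in for GitHub. The directory names and demo identity are assumptions made for the example; with a real GitHub repository you would use its URL instead of the local path.

```shell
set -eu

work_dir="$(mktemp -d)"
git init -q --bare "$work_dir/remote.git"    # local stand-in for the GitHub repository
git init -q "$work_dir/datakit"
cd "$work_dir/datakit"
git config user.name "Example User"          # demo-only identity
git config user.email "user@example.com"

echo "demo" > README.txt
git add --all
git commit -q -m "Initial commit"
git branch -M main                           # make sure the local branch is named main

git remote add origin "$work_dir/remote.git" # for GitHub, pass the repository URL here
git push -q -u origin main                   # -u remembers origin/main as the upstream branch

git remote -v                                # list configured remotes
git ls-remote --heads origin                 # list branches that exist on the remote
```

After a push with -u, later pushes from the same branch can be made with a plain git push.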