Skip to content

Handling multiple runs

Say you want to compare the results of two different analyses on the same set of data. To do this, you can create a separate run for each analysis.

In this tutorial, we’ll first modify our algorithm to calculate either the median or the mean of the input dataset. We’ll then create two runs - one to calculate the median, and one the mean. This allows us to record and compare the output of the two analyses.

Modifying the algorithm

First, let’s modify our algorithm to give us the option of calculating a median instead of a mean.

Modify helloworld/algorithm.py:

helloworld/algorithm.py
def main(data, function):
"""Take the average along the long axis with the specified function"""
return {
"result": getattr(data, function)(axis=1).to_frame(name="result"),
}

Now wire up the new function argument to an input in algorithm.json:

"helloworld/algorithm.json
{
"name": "helloworld",
...
"signature": {
"inputs": [
{
"name": "data",
...
},
{
"name": "function",
"title": "Function",
"description": "The function to use for averaging (mean or median)",
"type": "string",
"null": false,
"enum": [
{
"title": "Mean",
"value": "mean"
},
{
"title": "Median",
"value": "median"
}
],
"default": {
"value": "mean"
}
}
],
"outputs": [
{
"name": "result",
...
}
]
}
}

Here we’re adding a simple input that takes a string, and specifying that it can take only two valid values: mean and median with the enum property.

To see how this works, we can try and set our new input to the wrong value:

Terminal window
dk reset
dk init # Ensure we initialise a new run to use the updated algorithm signature
dk set function "mode"

This should throw the following error:

Variable value must be one of ['mean', 'median']

However we can set the value to “median” with no issue:

Terminal window
dk set function "median"

Creating a new run

Let’s create a new run to find the median of this dataset.

Terminal window
dk reset # Reset the datakit to a clean state again
dk init helloworld.median

This command tells the CLI tool that we want to create a new run named helloworld.median. helloworld specifies the algorithm that we want to use for this run, and median is a user-chosen name for the run. The algorithm name before the period must match an algorithm listed in datakit.json. The string following the period can be anything you like.

You can check which run is currently active by running:

Terminal window
dk get-run

Executing the run

Let’s find the median of our dataset. Run the following:

Terminal window
dk load data data/tabulardata.csv # Load our data
dk set function "median" # Set our calculation function to "median"
dk run # Run the algorithm

Your result should look like:

Terminal window
dk show result
timeresult
01.92299
10.309318
20.83881
33.8687
43.89235

The run folder structure

Your datakit should now look like this:

  • Directoryhelloworld-datakit/
    • Directoryhelloworld/
      • algorithm.json
      • algorithm.py
      • Directoryresources/
        • data.json
        • result.json
    • Directoryhelloworld.median.run/
      • run.json
      • Directoryresources/
        • data.json
        • result.json
      • Directoryviews/
    • datakit.json

As you can see, the resources appear in both the run directory (helloworld.median.run) and the algorithm definition directory (helloworld).

The resources inside the algorithm definition directory serve as templates from which to generate the resources for a run.

The resources inside the run directory represent the actual inputs and outputs that are recorded in the course of running an algorithm.

Additionally, run.json records the values of any single value variable during a run (such as our new input variable - function).

In this way, a run folder contains a complete description of the state of a single execution of the algorithm. This state can then be tracked in a version control repository to maintain a history of the analysis performed. This is covered in the next topic.

Creating a second run

Now, say we want to calculate the mean of the same dataset without losing our previous result. To do this, we can create a new named run:

Terminal window
dk init helloworld.mean

Now we can load our input data, and set the function we want to use to calculate the result:

Terminal window
dk load data data/tabulardata.csv
dk set function "mean"

Your result should look like:

Terminal window
dk show result
timeresult
01.41565
10.567011
21.25404
33.81149
44.06375

Now, since we saved the result of our previous run in helloworld.median, we can easily compare the two results:

Terminal window
dk set-run helloworld.median
dk show result # Print the "median" run result

Here they are side by side:

timeresult (median)result (mean)
01.922991.41565
10.3093180.567011
20.838811.25404
33.86873.81149
43.892354.06375

Comparing run results with an algorithm

To be implemented.