Handling multiple runs
Say you want to compare the results of two different analyses on the same set of data. To do this, you can create a separate run for each analysis.
In this tutorial, we’ll first modify our algorithm to calculate either the median or the mean of the input dataset. We’ll then create two runs: one to calculate the median and one to calculate the mean. This allows us to record and compare the output of the two analyses.
Modifying the algorithm
First, let’s modify our algorithm to give us the option of calculating a median instead of a mean.
Modify helloworld/algorithm.py:
def main(data, function):
    """Take the average along the long axis with the specified function"""
    return {
        "result": getattr(data, function)(axis=1).to_frame(name="result"),
    }
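The getattr(data, function) call looks up a method on data by its name, so the string passed as function selects which averaging method is invoked. The sketch below shows the same dispatch pattern using only the standard library: Python's statistics module stands in for the data object (which is presumably a pandas DataFrame, given the axis argument and the to_frame call). The average_rows helper and the sample rows are hypothetical and are not part of the tutorial's algorithm.

```python
import statistics


def average_rows(rows, function):
    """Apply statistics.<function> to each row (hypothetical stand-in for main)."""
    # getattr resolves the callable by name, e.g. "mean" -> statistics.mean
    averager = getattr(statistics, function)
    return [averager(row) for row in rows]


# Made-up sample data, one list per row
rows = [[1, 2, 6], [4, 4, 10]]
print(average_rows(rows, "mean"))
print(average_rows(rows, "median"))
```

Because the name is resolved at call time, any unexpected string would raise an AttributeError, which is one reason to restrict the input to a fixed set of values.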
Now wire up the new function argument to an input in algorithm.json:
{
  "name": "helloworld",
  ...
  "signature": {
    "inputs": [
      {
        "name": "data",
        ...
      },
      {
        "name": "function",
        "title": "Function",
        "description": "The function to use for averaging (mean or median)",
        "type": "string",
        "null": false,
        "enum": [
          { "title": "Mean", "value": "mean" },
          { "title": "Median", "value": "median" }
        ],
        "default": { "value": "mean" }
      }
    ],
    "outputs": [
      {
        "name": "result",
        ...
      }
    ]
  }
}
Here we’re adding a simple input that takes a string, and using the enum property to specify that it can take only two valid values: mean and median.
To see how this works, we can try and set our new input to the wrong value:
dk reset
dk init  # Ensure we initialise a new run to use the updated algorithm signature
dk set function "mode"
This should throw the following error:
Variable value must be one of ['mean', 'median']
However, we can set the value to “median” with no issue:
dk set function "median"
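To make the enum check concrete, here is a minimal Python sketch of how a value could be validated against the enum entries of an input definition. This is an illustration only, not the CLI's actual implementation: the validate_enum function is hypothetical, and the error text simply mirrors the message shown above.

```python
def validate_enum(value, signature_input):
    """Return value if it is one of the input's enum values, else raise ValueError."""
    # Collect the permitted values from the enum entries of the input definition
    allowed = [option["value"] for option in signature_input["enum"]]
    if value not in allowed:
        raise ValueError(f"Variable value must be one of {allowed}")
    return value


# The relevant part of the "function" input from algorithm.json
function_input = {
    "name": "function",
    "type": "string",
    "enum": [
        {"title": "Mean", "value": "mean"},
        {"title": "Median", "value": "median"},
    ],
}

validate_enum("median", function_input)  # accepted without error
try:
    validate_enum("mode", function_input)  # not in the enum, so rejected
except ValueError as error:
    print(error)
```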
Creating a new run
Let’s create a new run to find the median of this dataset.
dk reset  # Reset the datakit to a clean state again
dk init helloworld.median
This command tells the CLI tool that we want to create a new run named helloworld.median. helloworld specifies the algorithm that we want to use for this run, and median is a user-chosen name for the run. The algorithm name before the period must match an algorithm listed in datakit.json. The string following the period can be anything you like.
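The naming rule can be sketched in a few lines of Python. This is a hypothetical illustration rather than the CLI's own parsing code: the parse_run_name function is invented for this example, it assumes the name is split at the first period, and the list of known algorithms is passed in directly because the layout of datakit.json is not shown here.

```python
def parse_run_name(run_name, known_algorithms):
    """Split "<algorithm>.<label>" and check the algorithm part is known."""
    # Assumption: everything before the first period is the algorithm name
    algorithm, _, label = run_name.partition(".")
    if algorithm not in known_algorithms:
        raise ValueError(f"Unknown algorithm: {algorithm!r}")
    return algorithm, label


# Hypothetical list of algorithm names, standing in for datakit.json
known = ["helloworld"]
print(parse_run_name("helloworld.median", known))
print(parse_run_name("helloworld.mean", known))
```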
You can check which run is currently active by running:
dk get-run
Executing the run
Let’s find the median of our dataset. Run the following:
dk load data data/tabulardata.csv  # Load our data
dk set function "median"           # Set our calculation function to "median"
dk run                             # Run the algorithm
Your result should look like:
dk show result
time | result |
---|---|
0 | 1.92299 |
1 | 0.309318 |
2 | 0.83881 |
3 | 3.8687 |
4 | 3.89235 |
… | … |
The run folder structure
Your datakit should now look like this:
helloworld-datakit/
  helloworld/
    - algorithm.json
    - algorithm.py
    resources/
      - data.json
      - result.json
  helloworld.median.run/
    - run.json
    resources/
      - data.json
      - result.json
    views/
      - …
  - datakit.json
As you can see, the resources appear in both the run directory (helloworld.median.run) and the algorithm definition directory (helloworld).
The resources inside the algorithm definition directory serve as templates from which to generate the resources for a run.
The resources inside the run directory represent the actual inputs and outputs that are recorded in the course of running an algorithm.
Additionally, run.json records the values of any single-value variables during a run (such as our new input variable, function).
In this way, a run folder contains a complete description of the state of a single execution of the algorithm. This state can then be tracked in a version control repository to maintain a history of the analysis performed. This is covered in the next topic.
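The relationship between the template resources and the run resources can be pictured with a short Python sketch that builds the same directory layout in a temporary folder. It is only an illustration of the layout described above: the create_run_resources function is hypothetical, the JSON files hold placeholder content, and the real tool may well do more than copy files when it generates a run.

```python
import shutil
import tempfile
from pathlib import Path


def create_run_resources(datakit_dir, algorithm, label):
    """Copy template resources from the algorithm directory into a run directory."""
    templates = datakit_dir / algorithm / "resources"
    run_resources = datakit_dir / f"{algorithm}.{label}.run" / "resources"
    run_resources.mkdir(parents=True, exist_ok=True)
    for template in sorted(templates.glob("*.json")):
        shutil.copy(template, run_resources / template.name)
    return run_resources


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the algorithm definition directory with placeholder templates
    datakit = Path(tmp) / "helloworld-datakit"
    templates = datakit / "helloworld" / "resources"
    templates.mkdir(parents=True)
    for name in ("data.json", "result.json"):
        (templates / name).write_text("{}")  # placeholder content

    run_resources = create_run_resources(datakit, "helloworld", "median")
    run_dir_name = run_resources.parent.name
    copied = sorted(path.name for path in run_resources.iterdir())
    print(run_dir_name, copied)
```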
Creating a second run
Now, say we want to calculate the mean of the same dataset without losing our previous result. To do this, we can create a new named run:
dk init helloworld.mean
Now we can load our input data, set the function we want to use to calculate the result, and run the algorithm:
dk load data data/tabulardata.csv
dk set function "mean"
dk run
Your result should look like:
dk show result
time | result |
---|---|
0 | 1.41565 |
1 | 0.567011 |
2 | 1.25404 |
3 | 3.81149 |
4 | 4.06375 |
… | … |
Now, since we saved the result of our previous run in helloworld.median, we can easily compare the two results:
dk set-run helloworld.median
dk show result  # Print the "median" run result
Here they are side by side:
time | result (median) | result (mean) |
---|---|---|
0 | 1.92299 | 1.41565 |
1 | 0.309318 | 0.567011 |
2 | 0.83881 | 1.25404 |
3 | 3.8687 | 3.81149 |
4 | 3.89235 | 4.06375 |
… | … | … |
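Until a built-in comparison is available, the two result columns can be lined up outside the datakit. The Python sketch below does this for the five rows shown in the tables above and adds the difference between mean and median; the side_by_side function is hypothetical, and the values are typed in by hand rather than read from the run folders.

```python
# First five rows of each run's result, copied from the tables above
median_result = [1.92299, 0.309318, 0.83881, 3.8687, 3.89235]
mean_result = [1.41565, 0.567011, 1.25404, 3.81149, 4.06375]


def side_by_side(median, mean):
    """Return (time, median, mean, mean - median) rows for two result columns."""
    return [
        (time, med, avg, round(avg - med, 6))
        for time, (med, avg) in enumerate(zip(median, mean))
    ]


for row in side_by_side(median_result, mean_result):
    print(row)
```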
Comparing run results with an algorithm
To be implemented.