Working with tabular data

In this tutorial, we’ll create a datakit that takes a table containing three columns of time series data, and averages them together at each time step.

Creating resources

To use tabular data in a datakit, we need to define a resource to contain it.

First, reset your datakit to remove any previous runs and return it to a clean state.

Then, create files for your new input data and output result resources under the helloworld algorithm folder:

Terminal window
dk reset
mkdir helloworld/resources
touch helloworld/resources/data.json
touch helloworld/resources/result.json

Your datakit should now look like:

  • helloworld-datakit/
    • helloworld/
      • algorithm.json
      • algorithm.py
      • resources/
        • data.json
        • result.json
    • datakit.json

Open up helloworld/resources/data.json and write the following resource configuration:

helloworld/resources/data.json
{
    "name": "data",
    "title": "Data",
    "description": "Time series data",
    "profile": "tabular-data-resource",
    "schema": {
        "primaryKey": "time",
        "fields": [
            {
                "name": "time",
                "title": "Time",
                "unit": "s",
                "type": "number"
            },
            {
                "name": "y1",
                "title": "Y1",
                "unit": "",
                "type": "number"
            },
            {
                "name": "y2",
                "title": "Y2",
                "unit": "",
                "type": "number"
            },
            {
                "name": "y3",
                "title": "Y3",
                "unit": "",
                "type": "number"
            }
        ]
    },
    "data": []
}

This file defines an empty tabular data resource that has four numerical columns: time, y1, y2 and y3.
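
To make the structure of the resource definition concrete, here is a minimal sketch that reads an abbreviated copy of it with Python's standard json module and separates the primary key from the value columns. It only illustrates the layout of the file; it is not how the datakit itself loads resources, and the variable names are our own.

```python
import json

# Abbreviated copy of helloworld/resources/data.json (titles and description omitted).
RESOURCE_JSON = """
{
  "name": "data",
  "profile": "tabular-data-resource",
  "schema": {
    "primaryKey": "time",
    "fields": [
      {"name": "time", "unit": "s", "type": "number"},
      {"name": "y1", "unit": "", "type": "number"},
      {"name": "y2", "unit": "", "type": "number"},
      {"name": "y3", "unit": "", "type": "number"}
    ]
  },
  "data": []
}
"""

resource = json.loads(RESOURCE_JSON)
schema = resource["schema"]

# The primary key names the column that identifies each row.
key = schema["primaryKey"]
field_names = [field["name"] for field in schema["fields"]]
value_columns = [name for name in field_names if name != key]

print("key column:", key)
print("value columns:", value_columns)
print("rows stored:", len(resource["data"]))
```

The empty "data" list is why the resource starts out with no rows until a table is loaded into it.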

Now define the output result resource:

helloworld/resources/result.json
{
    "name": "result",
    "title": "Result",
    "description": "The average of the Y values in the input data",
    "profile": "tabular-data-resource",
    "schema": {
        "primaryKey": "time",
        "fields": [
            {
                "name": "time",
                "title": "Time",
                "unit": "s",
                "type": "number"
            },
            {
                "name": "result",
                "title": "Result",
                "unit": "",
                "type": "number"
            }
        ]
    },
    "data": []
}

Linking resources to inputs and outputs

For our algorithm to have access to our newly minted resources, we need to link them to its input and output variables. Open up helloworld/algorithm.json and modify its inputs and outputs configuration like so:

helloworld/algorithm.json
{
    "name": "helloworld",
    ...
    "signature": {
        "inputs": [
            {
                "name": "data",
                "title": "Data",
                "description": "Input time series",
                "type": "resource",
                "profile": "tabular-data-resource",
                "null": false,
                "default": {
                    "resource": "data"
                }
            }
        ],
        "outputs": [
            {
                "name": "result",
                "title": "Result",
                "description": "Averaged result",
                "type": "resource",
                "profile": "tabular-data-resource",
                "null": true,
                "default": {
                    "resource": "result"
                }
            }
        ]
    }
}

This tells our algorithm that the input data should be a tabular data resource that conforms to the schema we defined in helloworld/resources/data.json. In other words, we expect a table with four columns named time, y1, y2 and y3. Similarly, the output should be a table with two columns: time and result.
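
As a rough illustration of what "conforms to the schema" means for column names, the sketch below compares a table header against the fields listed in a schema. The helper missing_columns is hypothetical and written only for this example; the datakit's own validation is not shown in this tutorial and may check more than names.

```python
def missing_columns(schema, header):
    """Return the schema field names that are absent from a table header.

    Hypothetical helper for illustration; not part of the datakit tooling.
    """
    expected = [field["name"] for field in schema["fields"]]
    return [name for name in expected if name not in header]


# Field names taken from the two resource definitions in this tutorial.
data_schema = {
    "primaryKey": "time",
    "fields": [{"name": name} for name in ("time", "y1", "y2", "y3")],
}
result_schema = {
    "primaryKey": "time",
    "fields": [{"name": name} for name in ("time", "result")],
}

# A header with all four input columns has nothing missing.
input_ok = missing_columns(data_schema, ["time", "y1", "y2", "y3"])

# A table without a "result" column does not match the output schema.
output_bad = missing_columns(result_schema, ["time", "y1"])

print(input_ok)
print(output_bad)
```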

Modifying the algorithm

Finally, we can modify our algorithm to take the input data and use it to calculate the average Y values. Note that by default, tabular data resources are passed to the algorithm as named pandas DataFrames.

helloworld/algorithm.py
def main(data):
    """Average the input columns together at each time step"""
    return {
        # axis=1 averages across the columns, giving one value per row
        "result": data.mean(axis=1).to_frame(name="result"),
    }
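
To see what this expression does in isolation, here is a small sketch on a hand-made DataFrame. It assumes that the time primary key arrives as the DataFrame index rather than as an ordinary column, which is consistent with the results shown later in this tutorial (the averages cover only y1, y2 and y3) but is not stated explicitly here. The toy values are made up for the example.

```python
import pandas as pd

# Assumption: "time" is the index, so only y1, y2 and y3 are averaged.
data = pd.DataFrame(
    {
        "time": [0, 1],
        "y1": [1.0, 4.0],
        "y2": [2.0, 5.0],
        "y3": [3.0, 6.0],
    }
).set_index("time")

# mean(axis=1) collapses the columns into one value per row (a Series);
# to_frame turns that Series back into a one-column table named "result".
result = data.mean(axis=1).to_frame(name="result")

print(result)
```

If time were an ordinary column instead, it would be included in the mean, so it would need to be dropped or set as the index first.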

Loading data

By convention, we put any input data files under a directory named data at the root of the datakit. Create the data directory, open data/tabulardata.csv and paste the following table:

data/tabulardata.csv
time,y1,y2,y3
0,0.3466236167739929,1.9229864112201938,1.9773385447179885
1,-0.423531510762412,1.815244986837378,0.3093182634287297
2,0.8388100195422679,0.5294092767447518,2.3939153553311407
3,2.948000020477774,4.617762280030524,3.8687019172620136
4,2.45167677777786,5.847212829121721,3.8923485802253577
5,5.4766419552463645,4.72575745115321,8.532849237909025
6,4.879323325885309,5.686661651185716,7.544480088340276
7,8.25876172347388,5.351542885848568,9.593184043973753
8,6.442649354518041,8.124701252873464,12.028011411613441
9,10.17816299109899,11.070663450465652,11.38092186367298
10,10.638275752972232,10.21068506001679,14.502901841924242
11,10.39655032109732,10.075927947286825,14.548589170502087
12,13.752753943890596,11.453232126495813,13.637309525792178
13,12.871660648894213,14.52673208008687,15.977125832484408
14,15.105225370942517,14.906371462715821,16.344717175111533
15,13.082135038555602,13.934245668163236,18.448606908623518
16,17.012134132693777,14.813559971997769,21.021432688836136
17,15.068090876715434,16.30445338493393,22.422137078073558
18,16.356537810690007,19.958208464469358,20.464579389514533
19,20.825401723837388,18.37126725210334,24.895029343034434

Now we can load our data to be analysed. As always, first we need to initialise a new run:

Terminal window
dk init

And now we can load our data into the data resource:

Terminal window
dk load data/tabulardata.csv

Let’s check the data was loaded correctly:

Terminal window
dk show data

This should return something like:

time  y1         y2        y3
0     0.346624   1.92299   1.97734
1     -0.423532  1.81524   0.309318
2     0.83881    0.529409  2.39392
3     2.948      4.61776   3.8687

And we’re ready to run the algorithm!

Running the algorithm

Execute the currently active run:

Terminal window
dk run

Now let’s look at our result:

Terminal window
dk show result

This should return:

time  result
0     1.41565
1     0.567011
2     1.25404
3     3.81149
4     4.06375

Here we’ve averaged the y1, y2 and y3 columns together to get the values in result.
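
As a quick sanity check, we can reproduce the first two result rows by hand using only the Python standard library. The values below are copied from the first two data rows of data/tabulardata.csv; the variable names are our own.

```python
from statistics import mean

# (y1, y2, y3) for time = 0 and time = 1, copied from data/tabulardata.csv.
rows = {
    0: (0.3466236167739929, 1.9229864112201938, 1.9773385447179885),
    1: (-0.423531510762412, 1.815244986837378, 0.3093182634287297),
}

# Average the three Y values at each time step.
averages = {time: mean(ys) for time, ys in rows.items()}

for time, value in averages.items():
    print(time, round(value, 6))
```

The printed values should agree with the first two rows of the result table above, to the precision displayed there.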

Next, we’ll learn about working with multiple runs in a datakit.