Working with tabular data
In this tutorial, we’ll create a datakit that takes a table containing three columns of time series data and averages them together at each time step.
Creating resources
To use tabular data in a datakit, we need to define a resource to contain it.
First, reset your datakit to remove any previous runs and return it to a clean state. Then, create files for your new input data and output result resources under the helloworld algorithm folder:
dk reset
mkdir helloworld/resources
touch helloworld/resources/data.json
touch helloworld/resources/result.json
Your datakit should now look like:
helloworld-datakit/
  helloworld/
    algorithm.json
    algorithm.py
    resources/
      data.json
      result.json
  datakit.json
Open up helloworld/resources/data.json and write the following resource configuration:
{ "name": "data", "title": "Data", "description": "Time series data", "profile": "tabular-data-resource", "schema": { "primaryKey": "time", "fields": [ { "name": "time", "title": "Time", "unit": "s", "type": "number" }, { "name": "y1", "title": "Y1", "unit": "", "type": "number" }, { "name": "y2", "title": "Y2", "unit": "", "type": "number" }, { "name": "y3", "title": "Y3", "unit": "", "type": "number" } ] }, "data": []}
This file defines an empty tabular data resource that has four numerical columns: time, y1, y2 and y3.
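As a mental model, this schema corresponds to a pandas DataFrame with y1, y2 and y3 as value columns and time as the index (assuming the primaryKey column becomes the index, which is consistent with the results shown later in this tutorial). The following is a minimal, standalone sketch of that shape in plain pandas, outside any datakit tooling:

import pandas as pd

# An empty DataFrame mirroring the "data" resource schema:
# "time" (the primaryKey) as the index, y1-y3 as numeric columns.
empty_data = pd.DataFrame(columns=["y1", "y2", "y3"], dtype="float64")
empty_data.index = pd.Index([], name="time", dtype="float64")

print(empty_data)
# Empty DataFrame
# Columns: [y1, y2, y3]
# Index: []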
Now define the output result resource:
{ "name": "result", "title": "Result", "description": "The average of the Y values in the input data", "profile": "tabular-data-resource", "schema": { "primaryKey": "time", "fields": [ { "name": "time", "title": "Time", "unit": "s", "type": "number" }, { "name": "result", "title": "Result", "unit": "", "type": "number" } ] }, "data": []}
Linking resources to inputs and outputs
In order for our algorithm to have access to our newly minted resources, we need to link them to its input and output variables. Open up helloworld/algorithm.json and modify its input and output configuration like so:
{ "name": "helloworld", ... "signature": { "inputs": [ { "name": "data", "title": "Data", "description": "Input time series", "type": "resource", "profile": "tabular-data-resource", "null": false, "default": { "resource": "data" } } ], "outputs": [ { "name": "result", "title": "Result", "description": "Averaged result", "type": "resource", "profile": "tabular-data-resource", "null": true, "default": { "resource": "result" } } ] }}
This tells our algorithm that the input data should be a tabular data resource that conforms to the schema we defined in helloworld/resources/data.json. In other words, we expect a table with four columns named time, y1, y2 and y3. Similarly, the output should be a table with two columns: time and result.
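If you want to double-check this wiring by hand, the short script below (plain Python run from the datakit root, not a dk command) reads algorithm.json, follows each signature default to its resource file, and prints the columns each resource declares. It assumes each resource lives in a file named after it under helloworld/resources/, as set up in this tutorial.

import json
from pathlib import Path

# Cross-check: does each resource referenced by the signature exist,
# and what columns does its schema declare?
algorithm = json.loads(Path("helloworld/algorithm.json").read_text())

for port in algorithm["signature"]["inputs"] + algorithm["signature"]["outputs"]:
    resource_name = port["default"]["resource"]
    resource_path = Path("helloworld/resources") / f"{resource_name}.json"
    resource = json.loads(resource_path.read_text())
    fields = [field["name"] for field in resource["schema"]["fields"]]
    print(f"{port['name']} -> {resource_path}: columns {fields}")

# Expected output (roughly):
# data -> helloworld/resources/data.json: columns ['time', 'y1', 'y2', 'y3']
# result -> helloworld/resources/result.json: columns ['time', 'result']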
Modifying the algorithm
Finally, we can modify our algorithm to take the input data and use it to calculate the average Y values. Note that by default, tabular data resources are passed to the algorithm as named pandas DataFrames.
def main(data):
    """Average the Y columns of the input data at each time step."""
    return {
        "result": data.mean(axis=1).to_frame(name="result"),
    }
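You can also sanity-check this logic outside the datakit runtime by calling the function directly with a small hand-made DataFrame. The snippet below is a self-contained sketch: it repeats the main function from above rather than importing it, and indexes the frame by time to mimic how the data resource is passed in.

import pandas as pd

def main(data):
    """Average the Y columns of the input data at each time step."""
    return {
        "result": data.mean(axis=1).to_frame(name="result"),
    }

# A tiny stand-in for the "data" resource, indexed by time.
data = pd.DataFrame(
    {"y1": [1.0, 2.0], "y2": [3.0, 4.0], "y3": [5.0, 6.0]},
    index=pd.Index([0, 1], name="time"),
)

print(main(data)["result"])
# roughly:
#       result
# time
# 0        3.0
# 1        4.0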
Loading data
By convention, we put any input data files under a directory named data at the root of the datakit. Create the data directory, open data/tabulardata.csv and paste the following table:
time,y1,y2,y3
0,0.3466236167739929,1.9229864112201938,1.9773385447179885
1,-0.423531510762412,1.815244986837378,0.3093182634287297
2,0.8388100195422679,0.5294092767447518,2.3939153553311407
3,2.948000020477774,4.617762280030524,3.8687019172620136
4,2.45167677777786,5.847212829121721,3.8923485802253577
5,5.4766419552463645,4.72575745115321,8.532849237909025
6,4.879323325885309,5.686661651185716,7.544480088340276
7,8.25876172347388,5.351542885848568,9.593184043973753
8,6.442649354518041,8.124701252873464,12.028011411613441
9,10.17816299109899,11.070663450465652,11.38092186367298
10,10.638275752972232,10.21068506001679,14.502901841924242
11,10.39655032109732,10.075927947286825,14.548589170502087
12,13.752753943890596,11.453232126495813,13.637309525792178
13,12.871660648894213,14.52673208008687,15.977125832484408
14,15.105225370942517,14.906371462715821,16.344717175111533
15,13.082135038555602,13.934245668163236,18.448606908623518
16,17.012134132693777,14.813559971997769,21.021432688836136
17,15.068090876715434,16.30445338493393,22.422137078073558
18,16.356537810690007,19.958208464469358,20.464579389514533
19,20.825401723837388,18.37126725210334,24.895029343034434
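The exact numbers are not important; they are just three noisy, roughly linear series. If you would rather generate your own input file, a sketch along these lines (using numpy with an arbitrary seed, and an example filename so it does not overwrite the table above) produces a CSV of the same shape:

import numpy as np
import pandas as pd

# Three noisy, roughly linear series over 20 time steps (illustrative only).
rng = np.random.default_rng(seed=0)
time = np.arange(20)
frame = pd.DataFrame(
    {
        "time": time,
        "y1": time + rng.normal(scale=1.0, size=time.size),
        "y2": time + rng.normal(scale=1.0, size=time.size),
        "y3": 1.2 * time + rng.normal(scale=1.0, size=time.size),
    }
)
frame.to_csv("data/my_tabulardata.csv", index=False)  # example filename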
Now we can load our data to be analysed. As always, first we need to initialise a new run:
dk init
And now we can load our data into the data resource:
dk load data/tabulardata.csv
Let’s check the data was loaded correctly:
dk show data
This should return something like:
time | y1 | y2 | y3 |
---|---|---|---|
0 | 0.346624 | 1.92299 | 1.97734 |
1 | -0.423532 | 1.81524 | 0.309318 |
2 | 0.83881 | 0.529409 | 2.39392 |
3 | 2.948 | 4.61776 | 3.8687 |
… | … | … | … |
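If you would like to double-check with plain pandas (outside dk), reading the CSV with time as the index should show the same values:

import pandas as pd

# Read the raw CSV as the schema describes it: "time" (the primaryKey) as the index.
data = pd.read_csv("data/tabulardata.csv", index_col="time")
print(data.head(4))
# Prints the first four rows, e.g. y1 ≈ 0.346624 and y3 ≈ 1.977339 at time 0.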
And we’re ready to run the algorithm!
Running the algorithm
Execute the currently active run:
dk run
Now let’s look at our result:
dk show result
This should return:
time | result |
---|---|
0 | 1.41565 |
1 | 0.567011 |
2 | 1.25404 |
3 | 3.81149 |
4 | 4.06375 |
… | … |
Here we’ve averaged the y1, y2 and y3 columns together to get the values in result.
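As a final cross-check outside dk, the same averages can be reproduced with plain pandas, mirroring what the algorithm does internally:

import pandas as pd

# Recompute the row-wise mean of y1, y2 and y3 straight from the CSV.
data = pd.read_csv("data/tabulardata.csv", index_col="time")
result = data.mean(axis=1).to_frame(name="result")
print(result.head(3).round(5))
# roughly:
#        result
# time
# 0     1.41565
# 1     0.56701
# 2     1.25404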
Next, we’ll learn about working with multiple runs in a datakit.