Working with tabular data
In this tutorial, we’ll create a datakit that takes a table containing three columns of time series data and averages them together at each time step.
Creating resources
To use tabular data in a datakit, we need to define a resource to contain it.
First, reset your datakit to remove any previous runs and return it to a clean state. Then, create files for your new input data and output result resources under the helloworld algorithm folder:
dk reset
mkdir helloworld/resources
touch helloworld/resources/data.json
touch helloworld/resources/result.json
Your datakit should now look like:
helloworld-datakit/
  helloworld/
    algorithm.json
    algorithm.py
    resources/
      data.json
      result.json
  datakit.json
Open up helloworld/resources/data.json and write the following resource configuration:
{ "name": "data", "title": "Data", "description": "Time series data", "profile": "tabular-data-resource", "schema": { "primaryKey": "time", "fields": [ { "name": "time", "title": "Time", "unit": "s", "type": "number" }, { "name": "y1", "title": "Y1", "unit": "", "type": "number" }, { "name": "y2", "title": "Y2", "unit": "", "type": "number" }, { "name": "y3", "title": "Y3", "unit": "", "type": "number" } ] }, "data": []}
This file defines an empty tabular data resource that has four numerical columns: time, y1, y2 and y3.
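As a mental model, this schema corresponds to a pandas DataFrame with y1, y2 and y3 as value columns and time as the index (assuming the primaryKey column becomes the index, which is consistent with the results shown later in this tutorial). The following is a minimal, standalone sketch of that shape in plain pandas, outside any datakit tooling:

import pandas as pd

# An empty DataFrame mirroring the "data" resource schema:
# "time" (the primaryKey) as the index, y1-y3 as numeric columns.
empty_data = pd.DataFrame(columns=["y1", "y2", "y3"], dtype="float64")
empty_data.index = pd.Index([], name="time", dtype="float64")

print(empty_data)
# Empty DataFrame
# Columns: [y1, y2, y3]
# Index: []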
Now define the output result resource:
{ "name": "result", "title": "Result", "description": "The average of the Y values in the input data", "profile": "tabular-data-resource", "schema": { "primaryKey": "time", "fields": [ { "name": "time", "title": "Time", "unit": "s", "type": "number" }, { "name": "result", "title": "Result", "unit": "", "type": "number" } ] }, "data": []}
Linking resources to inputs and outputs
In order for our algorithm to have access to our newly minted resources, we need to link them to its input and output variables. Open up helloworld/algorithm.json and modify its input and output configuration like so:
{ "name": "helloworld", ... "signature": { "inputs": [ { "name": "data", "title": "Data", "description": "Input time series", "type": "resource", "profile": "tabular-data-resource", "null": false, "default": { "resource": "data" } } ], "outputs": [ { "name": "result", "title": "Result", "description": "Averaged result", "type": "resource", "profile": "tabular-data-resource", "null": true, "default": { "resource": "result" } } ] }}
This tells our algorithm that the input data should be a tabular data resource that conforms to the schema we defined in helloworld/resources/data.json. In other words, we expect a table with four columns named time, y1, y2 and y3. Similarly, the output should be a table with two columns: time and result.
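If you want to double-check this wiring by hand, the short script below (plain Python run from the datakit root, not a dk command) reads algorithm.json, follows each signature default to its resource file, and prints the columns each resource declares. It assumes each resource lives in a file named after it under helloworld/resources/, as set up in this tutorial.

import json
from pathlib import Path

# Cross-check: does each resource referenced by the signature exist,
# and what columns does its schema declare?
algorithm = json.loads(Path("helloworld/algorithm.json").read_text())

for port in algorithm["signature"]["inputs"] + algorithm["signature"]["outputs"]:
    resource_name = port["default"]["resource"]
    resource_path = Path("helloworld/resources") / f"{resource_name}.json"
    resource = json.loads(resource_path.read_text())
    fields = [field["name"] for field in resource["schema"]["fields"]]
    print(f"{port['name']} -> {resource_path}: columns {fields}")

# Expected output (roughly):
# data -> helloworld/resources/data.json: columns ['time', 'y1', 'y2', 'y3']
# result -> helloworld/resources/result.json: columns ['time', 'result']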
Modifying the algorithm
Finally, we can modify our algorithm to take the input data and use it to calculate the average Y values. Note that by default, tabular data resources are passed to the algorithm as named pandas DataFrames.
def main(data):
    """Average the Y columns of the input data at each time step."""
    return {
        "result": data.mean(axis=1).to_frame(name="result"),
    }
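You can also sanity-check this logic outside the datakit runtime by calling the function directly with a small hand-made DataFrame. The snippet below is a self-contained sketch: it repeats the main function from above rather than importing it, and indexes the frame by time to mimic how the data resource is passed in.

import pandas as pd

def main(data):
    """Average the Y columns of the input data at each time step."""
    return {
        "result": data.mean(axis=1).to_frame(name="result"),
    }

# A tiny stand-in for the "data" resource, indexed by time.
data = pd.DataFrame(
    {"y1": [1.0, 2.0], "y2": [3.0, 4.0], "y3": [5.0, 6.0]},
    index=pd.Index([0, 1], name="time"),
)

print(main(data)["result"])
# roughly:
#       result
# time
# 0        3.0
# 1        4.0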
Loading data
By convention, we put any input data files under a directory named data at the root of the datakit. Create the data directory, open data/tabulardata.csv and paste the following table:
time,y1,y2,y3
0,0.3466236167739929,1.9229864112201938,1.9773385447179885
1,-0.423531510762412,1.815244986837378,0.3093182634287297
2,0.8388100195422679,0.5294092767447518,2.3939153553311407
3,2.948000020477774,4.617762280030524,3.8687019172620136
4,2.45167677777786,5.847212829121721,3.8923485802253577
5,5.4766419552463645,4.72575745115321,8.532849237909025
6,4.879323325885309,5.686661651185716,7.544480088340276
7,8.25876172347388,5.351542885848568,9.593184043973753
8,6.442649354518041,8.124701252873464,12.028011411613441
9,10.17816299109899,11.070663450465652,11.38092186367298
10,10.638275752972232,10.21068506001679,14.502901841924242
11,10.39655032109732,10.075927947286825,14.548589170502087
12,13.752753943890596,11.453232126495813,13.637309525792178
13,12.871660648894213,14.52673208008687,15.977125832484408
14,15.105225370942517,14.906371462715821,16.344717175111533
15,13.082135038555602,13.934245668163236,18.448606908623518
16,17.012134132693777,14.813559971997769,21.021432688836136
17,15.068090876715434,16.30445338493393,22.422137078073558
18,16.356537810690007,19.958208464469358,20.464579389514533
19,20.825401723837388,18.37126725210334,24.895029343034434
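The exact numbers are not important; they are just three noisy, roughly linear series. If you would rather generate your own input file, a sketch along these lines (using numpy with an arbitrary seed, and an example filename so it does not overwrite the table above) produces a CSV of the same shape:

import numpy as np
import pandas as pd

# Three noisy, roughly linear series over 20 time steps (illustrative only).
rng = np.random.default_rng(seed=0)
time = np.arange(20)
frame = pd.DataFrame(
    {
        "time": time,
        "y1": time + rng.normal(scale=1.0, size=time.size),
        "y2": time + rng.normal(scale=1.0, size=time.size),
        "y3": 1.2 * time + rng.normal(scale=1.0, size=time.size),
    }
)
frame.to_csv("data/my_tabulardata.csv", index=False)  # example filename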
Now we can load our data to be analysed. As always, first we need to initialise a new run:
dk init
And now we can load our data into the data resource:
dk load data/tabulardata.csv
Let’s check the data was loaded correctly:
dk show data
This should return something like:
time | y1 | y2 | y3 |
---|---|---|---|
0 | 0.346624 | 1.92299 | 1.97734 |
1 | -0.423532 | 1.81524 | 0.309318 |
2 | 0.83881 | 0.529409 | 2.39392 |
3 | 2.948 | 4.61776 | 3.8687 |
… | … | … | … |
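If you would like to double-check with plain pandas (outside dk), reading the CSV with time as the index should show the same values:

import pandas as pd

# Read the raw CSV as the schema describes it: "time" (the primaryKey) as the index.
data = pd.read_csv("data/tabulardata.csv", index_col="time")
print(data.head(4))
# Prints the first four rows, e.g. y1 ≈ 0.346624 and y3 ≈ 1.977339 at time 0.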
And we’re ready to run the algorithm!
Running the algorithm
Execute the currently active run:
dk run
Now let’s look at our result:
dk show result
This should return:
time | result |
---|---|
0 | 1.41565 |
1 | 0.567011 |
2 | 1.25404 |
3 | 3.81149 |
4 | 4.06375 |
… | … |
Here we’ve averaged the y1, y2 and y3 columns together to get the values in result.
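As a final cross-check outside dk, the same averages can be reproduced with plain pandas, mirroring what the algorithm does internally:

import pandas as pd

# Recompute the row-wise mean of y1, y2 and y3 straight from the CSV.
data = pd.read_csv("data/tabulardata.csv", index_col="time")
result = data.mean(axis=1).to_frame(name="result")
print(result.head(3).round(5))
# roughly:
#        result
# time
# 0     1.41565
# 1     0.56701
# 2     1.25404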
Next, we’ll learn about working with multiple runs in a datakit.