Skip to content

Using metaschemas

Sometimes, we need to accept tabular data without knowing its exact schema beforehand. For instance, we may want to allow an input dataset with one X column and an unknown number of Y columns.

To handle this, we can use metaschema definitions. Metaschemas describe the allowable structure of a tabular data resource.

In this tutorial, we’ll build on the datakit from the tabular data tutorial. You can find it in the helloworld-datakit repository, under the “tabulardata” algorithm folder.

Algorithm overview

The tabulardata algorithm takes an input dataset with 4 columns - x, y1, y2 and y3, and returns its mean along the x axis.

However, the algorithm function itself is able to take an input dataset with any number of input y columns:

helloworld/algorithmpy
def main(data):
"""Take the average of the input data along the long axis"""
return {
"result": data.mean(axis=1).to_frame(name="result"),
}

How can we modify our algorithm configuration to allow for any shape of input data?

Creating the metaschema

We can do this by defining a new metaschema for our input tabular data resource.

First, create a metaschemas folder in the algorithm configuration directory:

Terminal window
mkdir tabulardata/metaschemas

Now create a metaschema for our input data named tabulardata/metaschemas/data.json:

tabulardata/metaschemas/data.json
{
"name": "data",
"title": "Metaschema for input data",
"profile": "datakit-metaschema",
"schema": {
"primaryKey": "time",
"fields": [
{
"name": "time",
"title": "Time",
"unit": "",
"type": "number",
"index": "0"
},
{
"name": "y",
"title": "Y",
"unit": "",
"type": "number",
"index": "1:"
}
]
}
}

This metaschema matches any tabular data with a single time column, and any number of y columns following it.

Metaschemas are identical to tabular data schemas in syntax, with the addition of an index property that specifies the columns that each field applies to. In this case, the index 1: on the y field specifies that this field applies to every column after the first column.

Applying the metaschema

In order to apply this metaschema to our data input resource, we need to specify it in the algorithm configuration.

Edit tabulardata/algorithm.json:

tabulardata/algorithm.json
{
"name": "tabulardata",
"title": "Time series average",
"description": "A simple algorithm that takes a time series input table and averages its values on the time axis",
"profile": "datakit-algorithm",
"code": "algorithm.py",
"container": "opendatastudio/python-run-base:latest",
"signature": {
"inputs": [
{
"name": "data",
"title": "Data",
"description": "Input time series",
"type": "resource",
"profile": "tabular-data-resource",
"metaschemas": ["data"],
"null": false,
"default": {
"resource": "data",
"metaschema": "data"
}
}
],
"outputs": [
{
"name": "result",
"title": "Result",
"description": "Averaged result",
"type": "resource",
"profile": "tabular-data-resource",
"null": true,
"default": {
"resource": "result"
}
}
]
}
}

The first metaschemas key specifies all of the allowable metaschemas for this variable. In this case, we only have one.

The second metaschema key inside the default object specifies the default metaschema to use when running the algorithm.

Modifying the data resource

Finally, we need to modify the data resource to remove its schema rules. The metaschema will automatically generate its schema for us when the user loads their data.

Open up tabulardata/resources/data.json and replace its schema with an empty object:

tabulardata/resources/data.json
{
"name": "data",
"title": "Data",
"description": "Time series data",
"profile": "tabular-data-resource",
"schema": {},
"data": []
}

Loading data

Let’s test our algorithm by loading some data with 6 y columns. Save the following in data/tabulardata2.csv:

time,y0,y1,y2,y3,y4,y5
0,1,2,3,4,5,6
1,7,8,9,10,11,12
2,13,14,15,16,17,18

Initialise and load it into our algorithm:

Terminal window
dk reset # Remember to reset to remove any previous runs of our algorithm
dk init
dk load data data/tabulardata2.csv

Check it loaded correctly:

Terminal window
dk show data
╭────────┬──────┬──────┬──────┬──────┬──────┬──────╮
│ time │ y0 │ y1 │ y2 │ y3 │ y4 │ y5 │
├────────┼──────┼──────┼──────┼──────┼──────┼──────┤
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │
├────────┼──────┼──────┼──────┼──────┼──────┼──────┤
│ 1 │ 7 │ 8 │ 9 │ 10 │ 11 │ 12 │
├────────┼──────┼──────┼──────┼──────┼──────┼──────┤
│ 2 │ 13 │ 14 │ 15 │ 16 │ 17 │ 18 │
╰────────┴──────┴──────┴──────┴──────┴──────┴──────╯

Open up tabulardata.run/resources/data.json to see your new input data’s generated schema:

tabulardata.run/resources/data.json
...
"schema": {
"fields": [
{
"name": "time",
"title": "time",
"unit": "",
"type": "number"
},
{
"name": "y0",
"title": "y0",
"unit": "",
"type": "number"
},
{
"name": "y1",
"title": "y1",
"unit": "",
"type": "number"
},
{
"name": "y2",
"title": "y2",
"unit": "",
"type": "number"
},
{
"name": "y3",
"title": "y3",
"unit": "",
"type": "number"
},
{
"name": "y4",
"title": "y4",
"unit": "",
"type": "number"
},
{
"name": "y5",
"title": "y5",
"unit": "",
"type": "number"
}
],
"primaryKey": "time"
}
...

Using the metaschema, the datakit has automatically generated a 7-column schema for this dataset.

Executing the algorithm

To check everything is working correctly, we can execute the algorithm:

Terminal window
dk run

And check our result:

Terminal window
dk show result
╭────────┬──────────╮
│ time │ result │
├────────┼──────────┤
│ 0 │ 3.5 │
├────────┼──────────┤
│ 1 │ 9.5 │
├────────┼──────────┤
│ 2 │ 15.5 │
╰────────┴──────────╯

Now you know how to handle dynamic input datasets by using metaschemas in a datakit.