
Datakit overview

Datakits are based on the open-source Frictionless Data Package specification.

At the most basic level, a datakit is a git repository consisting of .json configuration files describing each element of a data analysis.

This reference uses bindfit-datakit, a datakit containing a binding constant fitting algorithm, as its running example.

An example of an initialised datakit structure is below:

  • bindfit/ # The algorithm configuration
    • algorithm.json
    • algorithm.py
    • relationships.json
    • resources/ # Resource templates
      • data.json
      • inputParams.json
      • outputParams.json
      • fit.json
    • metaschemas/ # Metaschemas
      • data.json
      • dataAgg.json
    • container/
      • Dockerfile
    • views/ # Visualisations
      • fitGraphMatplotlib.json
      • fitGraphMatplotlib.py
    • interfaces/ # User interfaces
      • main.json
  • bindfit.run/ # The run state configuration - contains a single run of the algorithm
    • run.json
    • resources/ # Resource instances
      • data.json
      • inputParams.json
      • outputParams.json
      • fit.json
    • views/
      • # View artefacts generated by the run go here
  • datakit.json # Global datakit configuration
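The pairing between resource templates and resource instances in this layout can be sketched as a small consistency check. This is only an illustration: the helper missing_instances is hypothetical, and the rule that every template should have an instance of the same file name in the run directory is an assumption drawn from the example tree, not a documented datakit requirement.

```python
import tempfile
from pathlib import Path

# File names taken from the example tree above.
RESOURCE_FILES = ["data.json", "inputParams.json", "outputParams.json", "fit.json"]


def missing_instances(root: Path, algorithm: str) -> list[str]:
    """List resource templates that have no same-named instance in the run.

    ASSUMPTION: templates live in <algorithm>/resources and instances in
    <algorithm>.run/resources, as in the example tree.
    """
    templates = root / algorithm / "resources"
    instances = root / f"{algorithm}.run" / "resources"
    return sorted(
        template.name
        for template in templates.glob("*.json")
        if not (instances / template.name).exists()
    )


# Build a throwaway copy of the layout, leaving one instance out on purpose.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for directory in ("bindfit/resources", "bindfit.run/resources"):
        (root / directory).mkdir(parents=True)
    for name in RESOURCE_FILES:
        (root / "bindfit" / "resources" / name).write_text("{}")
        if name != "fit.json":
            (root / "bindfit.run" / "resources" / name).write_text("{}")
    missing = missing_instances(root, "bindfit")
    print(missing)
```

Here the run directory is deliberately created without fit.json, so the check reports that one file.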

Data

Simple data

Simple data in a datakit is defined by individual variable values inside the run configuration. For example, the model input below holds a simple string value:

bindfit.run/run.json
{
  "name": "bindfit.run",
  "title": "Run configuration for bindfit",
  ...
  "data": {
    "inputs": [
      ...
      {
        "name": "model",
        "value": "nmr1to1"
      },
      ...
    ],
    "outputs": [
      ...
    ]
  }
}
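As a rough illustration of how such a value might be read, the sketch below parses a trimmed copy of the run configuration and looks up an input by name. The helper get_input_value is hypothetical and not part of any datakit tooling; the elided parts of the file are simply left out.

```python
import json

# Trimmed copy of bindfit.run/run.json; elided ("...") entries are omitted.
RUN_JSON = """
{
  "name": "bindfit.run",
  "title": "Run configuration for bindfit",
  "data": {
    "inputs": [
      {"name": "model", "value": "nmr1to1"}
    ],
    "outputs": []
  }
}
"""


def get_input_value(run_config: dict, name: str):
    """Return the value of the named input (hypothetical helper)."""
    for variable in run_config["data"]["inputs"]:
        if variable["name"] == name:
            return variable["value"]
    raise KeyError(f"no input named {name!r} in run {run_config['name']!r}")


run_config = json.loads(RUN_JSON)
model = get_input_value(run_config, "model")
print(model)
```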

The configuration for this simple variable is defined inside the algorithm signature:

bindfit/algorithm.json
{
  "name": "bindfit",
  ...
  "signature": {
    "inputs": [
      ...
      {
        "name": "model",
        "title": "Model",
        "description": "The model to fit to the data",
        "type": "string",
        "enum": [
          ...
        ],
        "null": false,
        "default": {
          "value": "nmr1to1"
        }
      },
      ...
    ],
    "outputs": [
      ...
    ]
  }
}
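One way to read the fields of a signature variable is as a set of checks applied to a supplied value. The sketch below is a minimal interpretation under stated assumptions, not datakit's actual validation logic: resolve_value and PYTHON_TYPES are hypothetical, the order of the checks is a guess, and because the enum list is elided in the source it contains only the known default value here.

```python
# Trimmed copy of the "model" variable from bindfit/algorithm.json.
# ASSUMPTION: the real "enum" list is elided in the source; only the default
# value is known, so it is the single entry used here.
MODEL_VARIABLE = {
    "name": "model",
    "title": "Model",
    "description": "The model to fit to the data",
    "type": "string",
    "enum": ["nmr1to1"],
    "null": False,
    "default": {"value": "nmr1to1"},
}

# ASSUMPTION: mapping from signature type names to Python types.
PYTHON_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def resolve_value(variable: dict, supplied=None):
    """Apply the default, then check null, type and enum (hypothetical helper)."""
    value = supplied
    if value is None and "default" in variable:
        value = variable["default"]["value"]
    if value is None:
        if variable.get("null", False):
            return None
        raise ValueError(f"{variable['name']} must not be null")
    expected = PYTHON_TYPES.get(variable["type"])
    if expected is not None and not isinstance(value, expected):
        raise TypeError(f"{variable['name']} must be of type {variable['type']}")
    if "enum" in variable and value not in variable["enum"]:
        raise ValueError(f"{value!r} is not an allowed value for {variable['name']}")
    return value


print(resolve_value(MODEL_VARIABLE))             # falls back to the default
print(resolve_value(MODEL_VARIABLE, "nmr1to1"))  # explicit value is accepted
```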

Tabular data resources

Tabular data is described by the Frictionless Tabular Data Resource specification.

  • bindfit/
    • resources/
      • data.json
      • inputParams.json
      • outputParams.json
      • fit.json

Schemas

Tabular data resources contain schemas which define their structure.
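To make the relationship between a resource and its schema concrete, the sketch below builds a small descriptor in the style of a Frictionless Tabular Data Resource and checks its inline rows against the schema's fields. The field names x and y, the rows, and the helper find_schema_errors are all invented for illustration and are not taken from bindfit's own data.json; only a small subset of Table Schema types is handled.

```python
# Hypothetical descriptor in the style of a Frictionless Tabular Data Resource.
# ASSUMPTION: the field names "x" and "y" and the rows are invented for
# illustration; they are not taken from bindfit's data.json.
RESOURCE = {
    "name": "data",
    "profile": "tabular-data-resource",
    "schema": {
        "fields": [
            {"name": "x", "type": "number"},
            {"name": "y", "type": "number"},
        ]
    },
    "data": [
        {"x": 0.0, "y": 1.5},
        {"x": 1.0, "y": 2.5},
    ],
}

# Only a small subset of Table Schema field types is handled here.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def find_schema_errors(resource: dict) -> list[str]:
    """Return human-readable problems found in the inline rows."""
    fields = resource["schema"]["fields"]
    errors = []
    for index, row in enumerate(resource["data"]):
        for field in fields:
            name, type_name = field["name"], field["type"]
            if name not in row:
                errors.append(f"row {index}: missing field {name!r}")
            elif type_name in TYPE_CHECKS and not TYPE_CHECKS[type_name](row[name]):
                errors.append(f"row {index}: field {name!r} is not a {type_name}")
    return errors


print(find_schema_errors(RESOURCE))
```

A production setup would more likely rely on a Frictionless implementation for validation; the hand-written check is only meant to show what a schema constrains.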

Metaschemas

Metaschemas are a datakit concept that defines the range of tabular data schemas an algorithm variable can accept. See the metaschemas tutorial.

Algorithms

Algorithms define the analysis code in a workflow.

Signature

All algorithms have a signature. The signature defines the inputs the algorithm accepts and the outputs it produces. It is made up of variables, each described by fields such as name, type and default, as in the model example above.

Run configurations

A run configuration describes a single execution of the algorithm it refers to. It links the algorithm's variables to input and output resources and, where present, to the metaschemas those resources must conform to.

Views

Views describe visualisations of algorithm data that can be rendered individually or embedded as web components.

Interfaces

Interfaces describe how to render a collection of views into an interactive user interface.