Datakit overview
Datakits are based on the open-source Frictionless Data Package specification.
At its most basic, a datakit is a Git repository of .json configuration files
that describe each element of a data analysis, alongside the code they refer to.
This reference uses bindfit-datakit, a datakit containing a binding-constant fitting algorithm, as its running example.
The structure of an initialised datakit is shown below:
- bindfit/ # The algorithm configuration
  - algorithm.json
  - algorithm.py
  - relationships.json
  - resources/ # Resource templates
    - data.json
    - inputParams.json
    - outputParams.json
    - fit.json
    - …
  - metaschemas/ # Metaschemas
    - data.json
    - dataAgg.json
  - container/
    - Dockerfile
    - …
  - views/ # Visualisations
    - fitGraphMatplotlib.json
    - fitGraphMatplotlib.py
    - …
  - interfaces/ # User interfaces
    - main.json
- bindfit.run/ # The run state configuration (a single run of the algorithm)
  - run.json
  - resources/ # Resource instances
    - data.json
    - inputParams.json
    - outputParams.json
    - fit.json
  - views/
    - … # View artefacts generated by the run go here
- datakit.json # Global datakit configuration
Data
Simple data
Simple data in a datakit is defined by individual variable values inside the run
configuration. For example, model below is a simple string value:
{
  "name": "bindfit.run",
  "title": "Run configuration for bindfit",
  ...
  "data": {
    "inputs": [
      ...
      { "name": "model", "value": "nmr1to1" },
      ...
    ],
    "outputs": [ ... ]
  }
}
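As a rough illustration of how a simple value could be read back out of a run configuration, the sketch below parses a trimmed copy of the example above using only the Python standard library. The helper name get_input_value is hypothetical and not part of any datakit tooling; the elided fields of the real run.json are simply left out.

```python
import json

# Trimmed copy of the run configuration above; elided fields are omitted.
RUN_JSON = """
{
  "name": "bindfit.run",
  "title": "Run configuration for bindfit",
  "data": {
    "inputs": [{"name": "model", "value": "nmr1to1"}],
    "outputs": []
  }
}
"""


def get_input_value(run_config, name):
    """Return the value of the named simple input variable (hypothetical helper)."""
    for variable in run_config["data"]["inputs"]:
        if variable["name"] == name:
            return variable["value"]
    raise KeyError(f"no input variable named {name!r}")


run_config = json.loads(RUN_JSON)
print(get_input_value(run_config, "model"))  # prints nmr1to1
```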
The configuration for this simple variable is defined inside the algorithm
signature:
{
  "name": "bindfit",
  ...
  "signature": {
    "inputs": [
      ...
      {
        "name": "model",
        "title": "Model",
        "description": "The model to fit to the data",
        "type": "string",
        "enum": [ ... ],
        "null": false,
        "default": { "value": "nmr1to1" }
      },
      ...
    ],
    "outputs": [ ... ]
  }
}
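To make the role of the type, enum, null and default keys more concrete, here is a minimal sketch of how a value might be checked against a signature variable like the one above. It is an illustration under assumptions, not datakit's own validation logic: resolve_value and PYTHON_TYPES are hypothetical names, enum is assumed to be a flat list of permitted values, and because the enum entries are elided in the example only "nmr1to1" is listed.

```python
# Signature entry for "model", trimmed from the example above.
# Assumption: "enum" is a flat list of allowed values; only the value
# visible in the source is included here.
MODEL_VARIABLE = {
    "name": "model",
    "type": "string",
    "enum": ["nmr1to1"],
    "null": False,
    "default": {"value": "nmr1to1"},
}

# Assumed mapping from signature type names to Python types.
PYTHON_TYPES = {"string": str, "number": (int, float)}


def resolve_value(variable, value=None):
    """Return a validated value, falling back to the variable's default."""
    if value is None:
        value = variable.get("default", {}).get("value")
    if value is None:
        if variable.get("null", True):
            return None
        raise ValueError(f"{variable['name']} must not be null")
    expected = PYTHON_TYPES.get(variable["type"])
    if expected is not None and not isinstance(value, expected):
        raise TypeError(f"{variable['name']} must be of type {variable['type']}")
    if "enum" in variable and value not in variable["enum"]:
        raise ValueError(f"{value!r} is not an allowed value for {variable['name']}")
    return value


print(resolve_value(MODEL_VARIABLE))  # prints nmr1to1 (the default)
```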
Tabular data resources
Tabular data is described by the Frictionless Tabular Data Resource specification.
- bindfit/
  - …
  - resources/
    - data.json
    - inputParams.json
    - outputParams.json
    - fit.json
    - …
  - …
Schemas
Tabular data resources contain schemas which define their structure.
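As a sketch of what a schema does, the example below checks inline rows against a Frictionless-style schema. The resource is invented for illustration: the field names x and y, the inline data, and the check_rows helper are all assumptions, and the real bindfit resources may be laid out differently.

```python
# Hypothetical tabular data resource; field names and rows are illustrative.
RESOURCE = {
    "name": "data",
    "profile": "tabular-data-resource",
    "schema": {
        "fields": [
            {"name": "x", "type": "number"},
            {"name": "y", "type": "number"},
        ]
    },
    "data": [
        {"x": 0.0, "y": 1.2},
        {"x": 0.5, "y": 1.9},
    ],
}

# Python types for a few Table Schema field types.
FIELD_TYPES = {"number": (int, float), "integer": int, "string": str, "boolean": bool}


def check_rows(resource):
    """Return a list of error messages; empty if every row matches the schema."""
    errors = []
    fields = resource["schema"]["fields"]
    names = [field["name"] for field in fields]
    for index, row in enumerate(resource["data"]):
        if sorted(row) != sorted(names):
            errors.append(f"row {index}: expected fields {names}, got {list(row)}")
            continue
        for field in fields:
            expected = FIELD_TYPES.get(field["type"])
            if expected is not None and not isinstance(row[field["name"]], expected):
                errors.append(f"row {index}: {field['name']} is not a {field['type']}")
    return errors


print(check_rows(RESOURCE))  # prints []
```

A production setup would more likely rely on a Frictionless validation library than on a hand-written check; the point here is only that the schema, not the data, states which fields and types a resource must have.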
Metaschemas
Metaschemas are a datakit concept that defines the range of tabular data schemas an algorithm variable can accept. See the metaschemas tutorial.
Algorithms
Algorithms define the analysis code in a workflow.
Signature
All algorithms have a signature. The signature defines the inputs the algorithm accepts and the outputs it produces. It is made up of variables, each describing a single input or output.
Run configurations
Run configurations describe a single execution of the algorithm they refer to. Each one records the setup of that run, linking algorithm variables to input and output resources and, where present, to the metaschemas those resources must conform to.
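One consequence of this pairing is that a run configuration should only mention variables its algorithm declares. The sketch below illustrates that check using the signature and data layouts from the earlier examples. The undeclared_variables helper is hypothetical, the "typo" variable is invented to show a failure, and how datakit itself enforces this is not covered here.

```python
def undeclared_variables(algorithm, run):
    """Map each section to run variable names missing from the signature."""
    problems = {}
    for section in ("inputs", "outputs"):
        declared = {v["name"] for v in algorithm["signature"][section]}
        used = [v["name"] for v in run["data"][section]]
        missing = [name for name in used if name not in declared]
        if missing:
            problems[section] = missing
    return problems


# Trimmed versions of the algorithm and run configurations shown earlier.
algorithm = {
    "name": "bindfit",
    "signature": {"inputs": [{"name": "model", "type": "string"}], "outputs": []},
}
run = {
    "name": "bindfit.run",
    "data": {
        "inputs": [
            {"name": "model", "value": "nmr1to1"},
            {"name": "typo", "value": 1},  # invented, not in the signature
        ],
        "outputs": [],
    },
}

print(undeclared_variables(algorithm, run))  # prints {'inputs': ['typo']}
```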
Views
Views describe visualisations of algorithm data which can be rendered individually, or embedded as web components.
Interfaces
Interfaces describe how to render a collection of views into an interactive user interface.