Overview
This document describes the process to follow when loading a new dataset into Cantabular.
Everything required by Cantabular to present a dataset to a user is compiled into a single dataset file with a filename ending in “.dat” for deployment simplicity and runtime performance.
Cantabular datasets can be based on one of two types of data: either a single microdata file, where each row represents a set of observations about a statistical unit such as a person, or a set of tabular files, where each row represents a single observation such as a count of people with specific characteristics.
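To make the distinction concrete, the following Python sketch derives tabular-style counts from a handful of microdata-style rows. The variable names (sex, region) and all values are invented for illustration; they are not taken from the Cantabular example data, and this is not how Cantabular itself processes either kind of source.

```python
from collections import Counter

# Hypothetical microdata: one row per person, one column per variable.
microdata = [
    {"sex": "F", "region": "North"},
    {"sex": "M", "region": "North"},
    {"sex": "F", "region": "South"},
    {"sex": "F", "region": "North"},
]


def to_tabular(rows, variables):
    """Count the rows that share each combination of the given variables."""
    counts = Counter(tuple(row[v] for v in variables) for row in rows)
    # One output row per combination: the category values plus a count.
    return [dict(zip(variables, key), count=n) for key, n in sorted(counts.items())]


tabular = to_tabular(microdata, ["sex", "region"])
for row in tabular:
    print(row)
```

Each microdata row describes one person, while each derived row is a single observation: the number of people with a given combination of characteristics.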
cantabular-make-dataset is a program that produces this dataset (.dat) file from both the source data and a codebook, described in the Codebook section of this document. The program is named cantabular-make-dataset.exe on Windows.
The program is controlled by a JSON configuration file whose filename must be supplied on the command line.
The example microdata dataset can be built using the supplied example/example.json configuration file.
The example tabular dataset can be built using the supplied example/example-tabular.json configuration file.
The example flow dataset can be built using the supplied example/example-flow.json configuration file.
The runtime of this program is short, taking only a few minutes to produce a .dat file from tens of millions of rows of data.
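A build could be scripted along the following lines. This is a minimal sketch rather than documented usage: the helper name make_dataset_command is hypothetical, and the sketch assumes the configuration filename is passed as a plain argument with no flag in front of it. The program's built-in help is the authority on the actual command-line syntax.

```python
import sys


def make_dataset_command(config_path, platform=sys.platform):
    """Return an argument list for building a dataset from a JSON config."""
    program = "cantabular-make-dataset"
    # The program carries an .exe suffix on Windows.
    if platform.startswith("win"):
        program += ".exe"
    # Assumption: the config filename is supplied as a bare argument.
    return [program, config_path]


print(make_dataset_command("example/example.json"))
```

The returned list can be handed to subprocess.run(command, check=True), provided the program is on the PATH.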
cantabular-make-dataset contains built-in documentation of its command-line arguments and environment variables. Run the program with no command-line parameters to display the documentation.
The help as displayed by the program should be taken as definitive for that version of the software.
Much of the data for Cantabular is provided in comma-separated values (CSV) format. We use the format described in RFC 4180, encoded in UTF-8: https://tools.ietf.org/html/rfc4180.
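As an illustration of what this format looks like in practice, the sketch below writes and re-reads a small table with Python's standard csv module. Its default dialect separates fields with commas, doubles any embedded quotation marks, and ends each record with CRLF, which follows the RFC 4180 conventions. The column names and values are invented for the example, and nothing here is specific to Cantabular.

```python
import csv
import io

# Hypothetical rows: a header plus fields containing a comma, quotes,
# and a non-ASCII character.
rows = [
    ["code", "label"],
    ["1", 'Says "hello", then leaves'],
    ["2", "Zürich"],
]


def write_csv_bytes(rows):
    """Serialise rows to CSV text and encode the result as UTF-8."""
    buffer = io.StringIO(newline="")  # keep the writer's CRLF untouched
    csv.writer(buffer).writerows(rows)  # default "excel" dialect
    return buffer.getvalue().encode("utf-8")


def read_csv_bytes(data):
    """Decode UTF-8 bytes and parse them back into a list of rows."""
    text = io.StringIO(data.decode("utf-8"), newline="")
    return list(csv.reader(text))


data = write_csv_bytes(rows)
print(data.decode("utf-8"))
```

Fields that contain a comma or a quotation mark are wrapped in double quotes on output, and reading the bytes back yields the original rows.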