Overview
This document describes the process to follow when loading a new dataset into Cantabular.
Everything required by Cantabular to present a dataset to a user is compiled into a single dataset file with a filename ending in “.dat” for deployment simplicity and runtime performance.
Cantabular datasets can be based on one of two types of data: either a single microdata file, where each row represents a set of observations about a statistical unit such as a person, or a set of tabular files, where each row represents a single observation such as a count of people with specific characteristics.
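The difference between the two row shapes can be illustrated with a minimal Python sketch. The variable names (sex, age_band), the sample rows and the to_tabular helper are hypothetical and are not part of Cantabular; the sketch only shows how person-level microdata rows relate to tabular rows of counts.

```python
from collections import Counter

# Hypothetical microdata: one row per person, one column per variable.
microdata = [
    {"sex": "F", "age_band": "16-64"},
    {"sex": "M", "age_band": "16-64"},
    {"sex": "F", "age_band": "65+"},
    {"sex": "F", "age_band": "16-64"},
]


def to_tabular(rows, variables):
    """Count the rows that share each combination of the given variables."""
    counts = Counter(tuple(row[v] for v in variables) for row in rows)
    # One output row per combination, with the count as the single observation.
    return [dict(zip(variables, key), count=n) for key, n in sorted(counts.items())]


tabular = to_tabular(microdata, ["sex", "age_band"])
for row in tabular:
    print(row)
```

Each tabular row here carries one observation (a count), whereas each microdata row carries every observation recorded for one person.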
cantabular-make-dataset is a program that produces this dataset (.dat) file from both the source data and a codebook, described in the Codebook section of this document. The program is named cantabular-make-dataset.exe on Windows.
The program is controlled by a JSON configuration file whose filename must be supplied on the command line.
The example microdata dataset can be built using the supplied example/example.json configuration file.
The example tabular dataset can be built using the supplied example/example-tabular.json configuration file.
The example flow dataset can be built using the supplied example/example-flow.json configuration file.
The runtime of this program is short, taking only a few minutes to produce a .dat file from tens of millions of rows of data.
cantabular-make-dataset contains built-in documentation of its command-line arguments and environment variables. Run the program with no command line parameters to display the documentation.
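As a rough illustration of driving the program from a script, the Python sketch below builds the command line for each supplied example configuration. It assumes that the configuration filename is passed as a plain positional argument, that the program is on the PATH, and that a non-zero exit status signals failure; none of this is confirmed here, so check the program's built-in help for the exact usage. The helper names make_dataset_command and build are hypothetical.

```python
import subprocess
import sys


def make_dataset_command(config_path, platform=sys.platform):
    """Return the argument list for one cantabular-make-dataset run."""
    # The program carries an .exe suffix on Windows.
    program = "cantabular-make-dataset"
    if platform.startswith("win"):
        program += ".exe"
    # Assumption: the configuration filename is a plain positional argument.
    return [program, config_path]


def build(config_path):
    """Run the program; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(make_dataset_command(config_path), check=True)


if __name__ == "__main__":
    # Print the commands for the three supplied example configurations.
    for config in (
        "example/example.json",
        "example/example-tabular.json",
        "example/example-flow.json",
    ):
        print(" ".join(make_dataset_command(config)))
```

Calling build("example/example.json") would then run the program against the microdata example, under the assumptions stated above.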
The help as displayed by the program should be taken as definitive for that version of the software.
Much of the data for Cantabular is provided in comma-separated values (CSV) format. The format used is the one described in RFC 4180, encoded in UTF-8: https://tools.ietf.org/html/rfc4180.
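A minimal Python sketch of producing and reading such a file is shown below. It relies on the standard csv module, whose default dialect quotes fields that contain commas or double quotes, doubles embedded quotes and ends records with CRLF, in line with RFC 4180. The column names and values are hypothetical, and whether Cantabular accepts every construct permitted by the RFC is not confirmed here.

```python
import csv
import io


def write_csv(rows):
    """Serialise rows to UTF-8 bytes using the csv module's default dialect."""
    buffer = io.StringIO(newline="")
    # Fields containing commas, quotes or line breaks are quoted automatically,
    # embedded quotes are doubled and each record ends with CRLF.
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode("utf-8")


def read_csv(data):
    """Parse UTF-8 encoded CSV bytes back into a list of rows."""
    return list(csv.reader(io.StringIO(data.decode("utf-8"), newline="")))


# Hypothetical rows: a header, a field with a comma and a field with quotes.
rows = [
    ["code", "label"],
    ["1", "Café, bar or restaurant"],
    ["2", 'Known as "other"'],
]
encoded = write_csv(rows)
print(encoded.decode("utf-8"))
```

Reading the bytes back with read_csv(encoded) returns the original rows, which is a quick way to confirm that quoting and encoding survive a round trip.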