Overview

This document describes the process to follow when loading a new dataset into Cantabular.

Everything required by Cantabular to present a dataset to a user is compiled into a single dataset file with a filename ending in “.dat” for deployment simplicity and runtime performance.
Cantabular datasets can be based on one of two types of data: either a single microdata file, where each row represents a set of observations about a statistical unit such as a person, or a set of tabular files, where each row represents a single observation such as a count of people with specific characteristics.
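
The difference between the two row shapes can be illustrated with a small Python sketch. It is not part of Cantabular: the column names (person_id, sex, age_band) and the tabulate helper are hypothetical, chosen only to show how microdata rows (one per person) relate to tabular rows (one per count).

```python
import csv
import io
from collections import Counter

# Hypothetical microdata: one row per person, one column per variable.
MICRODATA_CSV = """person_id,sex,age_band
1,F,16-24
2,M,16-24
3,F,25-34
4,F,16-24
"""


def tabulate(microdata_text, variables):
    """Count the microdata rows sharing each combination of values."""
    reader = csv.DictReader(io.StringIO(microdata_text))
    counts = Counter(tuple(row[v] for v in variables) for row in reader)
    # Each output row is a single observation: a count for one combination.
    return [dict(zip(variables, key), count=n) for key, n in sorted(counts.items())]


tabular_rows = tabulate(MICRODATA_CSV, ["sex", "age_band"])
for row in tabular_rows:
    print(row)
```

Here four microdata rows collapse into three tabular rows, each carrying a count of people with a specific combination of characteristics.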
cantabular-make-dataset is a program that produces this dataset (.dat) file from the source data and a codebook, which is described in the Codebook section of this document. On Windows the program is named cantabular-make-dataset.exe.
- The program is controlled by a JSON configuration file whose filename must be supplied on the command line.
- The example microdata dataset can be built using the supplied example/example.json configuration file.
- The example tabular dataset can be built using the supplied example/example-tabular.json configuration file.
- The example flow dataset can be built using the supplied example/example-flow.json configuration file.
- The program runs quickly, taking only a few minutes to produce a .dat file from tens of millions of rows of data.
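
A build could be scripted along the following lines. This is a minimal Python sketch, not a documented interface: it assumes the configuration filename is passed as a single positional argument, which the text above implies but does not spell out, and the helper names make_dataset_command and build_dataset are hypothetical. The help displayed by the program itself remains the definitive reference for the exact syntax.

```python
import platform
import subprocess


def make_dataset_command(config_path, system=None):
    """Return the argument list used to build a dataset from a JSON config."""
    system = system or platform.system()
    program = "cantabular-make-dataset"
    if system == "Windows":
        # The program is named cantabular-make-dataset.exe on Windows.
        program += ".exe"
    # Assumption: the config filename is the only argument, passed positionally.
    return [program, config_path]


def build_dataset(config_path):
    """Run the build, raising CalledProcessError on a non-zero exit status."""
    return subprocess.run(make_dataset_command(config_path), check=True)


# Show the command for the supplied example microdata configuration.
print(make_dataset_command("example/example.json"))
```

Calling build_dataset("example/example.json") would then run the build, provided the program is on the PATH.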
cantabular-make-dataset contains built-in documentation of its command-line arguments and environment variables.
- Run the program with no command-line arguments to display the documentation.
- The help as displayed by the program should be taken as definitive for that version of the software.
- Much of the data for Cantabular is provided in comma-separated values (CSV) format. The format used is the one described in RFC 4180, encoded in UTF-8: https://tools.ietf.org/html/rfc4180.
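
For source data prepared with Python, the standard csv module can produce output in this style. The sketch below is illustrative only: the to_rfc4180 helper and the sample rows are hypothetical. It uses comma delimiters, CRLF line endings, and double quotes that are doubled when they appear inside a field, then encodes the result as UTF-8.

```python
import csv
import io


def to_rfc4180(rows):
    """Serialise rows as RFC 4180 style CSV and return UTF-8 encoded bytes."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        doublequote=True,           # an embedded " is written as ""
        lineterminator="\r\n",      # RFC 4180 line break
    )
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


data = to_rfc4180([["id", "label"], ["1", 'Café, "central"']])
print(data)
```

When writing straight to a file, opening it with open(path, "w", newline="", encoding="utf-8") avoids platform-specific newline translation.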