Overview

This document describes the process to follow when loading a new dataset into Cantabular.

  • Everything required by Cantabular to present a dataset to a user is compiled into a single dataset file with a filename ending in “.dat” for deployment simplicity and runtime performance.

  • Cantabular datasets can be based on one of two types of data: either a single microdata file, where each row represents a set of observations about a statistical unit such as a person, or a set of tabular files, where each row represents a single observation such as a count of people with specific characteristics.

  • cantabular-make-dataset is the program that produces this dataset (.dat) file from the source data and a codebook, which is described in the Codebook section of this document. On Windows the program is named cantabular-make-dataset.exe.

    • The program is controlled by a JSON configuration file whose filename must be supplied on the command line.

    • The example microdata dataset can be built using the supplied example/example.json configuration file.

    • The example tabular dataset can be built using the supplied example/example-tabular.json configuration file.

    • The example flow dataset can be built using the supplied example/example-flow.json configuration file.

    • The program runs quickly: it takes only a few minutes to produce a .dat file from tens of millions of rows of data.

  • cantabular-make-dataset contains built-in documentation of its command-line arguments and environment variables.

    • Run the program with no command-line arguments to display the documentation.

    • The help as displayed by the program should be taken as definitive for that version of the software.

  • Much of the data for Cantabular is provided in comma-separated values (CSV) format. We use the format described in RFC 4180, encoded in UTF-8: https://tools.ietf.org/html/rfc4180.
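
To make the distinction between microdata and tabular rows concrete, the following Python sketch writes a small microdata table as RFC 4180 CSV encoded in UTF-8, then aggregates it into tabular counts. The column names (person_id, sex, region), the sample values and the helper functions to_rfc4180 and tabulate are hypothetical and chosen only for illustration. They are not the layout that cantabular-make-dataset requires, which is defined by the codebook and the JSON configuration file.

```python
import csv
import io
from collections import Counter

# Hypothetical microdata: one row per statistical unit (here, a person).
# Column names and values are illustrative, not a Cantabular requirement.
MICRODATA = [
    {"person_id": "1", "sex": "F", "region": "North"},
    {"person_id": "2", "sex": "M", "region": "North"},
    {"person_id": "3", "sex": "F", "region": "Côte-d'Or, East"},
    {"person_id": "4", "sex": "F", "region": "North"},
]


def to_rfc4180(fieldnames, rows):
    """Serialise rows as RFC 4180 CSV and return UTF-8 encoded bytes."""
    buf = io.StringIO(newline="")  # leave line endings exactly as written
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        lineterminator="\r\n",      # RFC 4180 records end with CRLF
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def tabulate(rows, variables):
    """Aggregate microdata rows into one count row per combination of values."""
    counts = Counter(tuple(row[v] for v in variables) for row in rows)
    return [
        dict(zip(variables, key), count=str(n))
        for key, n in sorted(counts.items())
    ]


microdata_csv = to_rfc4180(["person_id", "sex", "region"], MICRODATA)
tabular_rows = tabulate(MICRODATA, ["sex", "region"])
tabular_csv = to_rfc4180(["sex", "region", "count"], tabular_rows)

print(microdata_csv.decode("utf-8"))
print(tabular_csv.decode("utf-8"))
```

In the microdata output each line describes one person, while in the tabular output each line is a single count for one combination of characteristics. The region value that contains a comma is wrapped in double quotes, as RFC 4180 requires.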