Tabular input

Format

  • A CSV file, with one observation per row, with an arbitrary number of columns.

    • The observation column contains numerical observations, for example a count of people, for each row of data. The header value of the observation column is specified in the configuration file.

    • A set of variable columns specify the category (using its code) of each codebook variable to which the observation relates. The names of these columns, and how they relate to the variables used, are specified in the configuration file.

    • Additional columns may be present and, if not specified in the configuration file, will be implicitly ignored when the dataset is built. These can be used to, for example, contain labels for categories that may need to be included for other purposes.

    • Column names must be unique when compared case-insensitively.

  • A CSV header line is required. It provides machine readable names for the columns of data.

  • The data provided in the CSV file must provide exactly one observation for each combination of categories within the specified variables, with one exception:

    • Where data for a specific rule variable category (such as a geographic area) is not provided due to being suppressed, the set of rows for this rule variable category should be omitted entirely. This is known as redacted data and can be enabled in the configuration file.

  • Input CSV files can be compressed using gzip or bzip2. Cantabular automatically decompresses files ending in .gz or .bz2 as it reads them. The filename in the dataset configuration file should include any extension indicating a compressed format.

Errors

A limited number of errors are reported when processing the tabular input files to avoid generating a very large number of output messages. You can control the limit by setting the CANTABULAR_MAX_TABULAR_ERRORS environment variable to the maximum number of errors that you wish to display. The default value for this setting is 50.

Example tabular input files

Two example input CSV files for a tabular dataset have been provided to illustrate different aspects of their configuration.

Including additional columns

The file example/src/tabular/Example-city+sex+health.csv includes additional columns containing labels for the variable categories. These label columns are ignored by cantabular-make-dataset when the tabular dataset is built, as category labels are sourced from the codebook file for each variable.

The contents of the file are shown below:

city,city_label,sex,sex_label,health,health_label,observation
0,London,0,Male,1,Very good health,1
0,London,0,Male,2,Good health,0
0,London,0,Male,3,Fair health,0
0,London,0,Male,4,Bad health,0
0,London,0,Male,5,Very bad health,0
0,London,1,Female,1,Very good health,0
0,London,1,Female,2,Good health,0
0,London,1,Female,3,Fair health,1
0,London,1,Female,4,Bad health,0
0,London,1,Female,5,Very bad health,0
1,Liverpool,0,Male,1,Very good health,0
1,Liverpool,0,Male,2,Good health,0
1,Liverpool,0,Male,3,Fair health,1
1,Liverpool,0,Male,4,Bad health,0
1,Liverpool,0,Male,5,Very bad health,0
1,Liverpool,1,Female,1,Very good health,1
1,Liverpool,1,Female,2,Good health,0
1,Liverpool,1,Female,3,Fair health,0
1,Liverpool,1,Female,4,Bad health,0
1,Liverpool,1,Female,5,Very bad health,0
2,Belfast,0,Male,1,Very good health,0
2,Belfast,0,Male,2,Good health,1
2,Belfast,0,Male,3,Fair health,0
2,Belfast,0,Male,4,Bad health,0
2,Belfast,0,Male,5,Very bad health,0
2,Belfast,1,Female,1,Very good health,0
2,Belfast,1,Female,2,Good health,0
2,Belfast,1,Female,3,Fair health,1
2,Belfast,1,Female,4,Bad health,0
2,Belfast,1,Female,5,Very bad health,1

Redacted areas

The file example/src/tabular/Example-city+siblings_3+health.csv only provides data for two of the three categories within the city variable, omitting category 2 which has the label Belfast. When this file is loaded by cantabular-make-dataset, and the Redacted field is set to true in the table’s configuration, this missing data is treated as having been redacted at source. When loaded into cantabular-server, all API outputs behave consistently with this.

See the Dataset configuration section for more information on configuring individual tables.

The contents of the file are shown below:

city,siblings_3,health,observation
0,0,1,1
0,0,2,0
0,0,3,0
0,0,4,0
0,0,5,0
0,1-2,1,0
0,1-2,2,0
0,1-2,3,0
0,1-2,4,0
0,1-2,5,0
0,3+,1,0
0,3+,2,0
0,3+,3,1
0,3+,4,0
0,3+,5,0
1,0,1,0
1,0,2,0
1,0,3,0
1,0,4,0
1,0,5,0
1,1-2,1,1
1,1-2,2,0
1,1-2,3,0
1,1-2,4,0
1,1-2,5,0
1,3+,1,0
1,3+,2,0
1,3+,3,1
1,3+,4,0
1,3+,5,0