Dataset configuration

Both microdata and tabular datasets are specified using a configuration file in the JSON data format (see https://www.json.org/).

Files specified in the configuration file can be compressed using gzip or bzip2. Cantabular automatically decompresses files ending in .gz or .bz2 as it reads them. The filenames in the dataset configuration file should include any extension indicating a compressed format.
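
As a minimal sketch of the point above, assuming a hypothetical gzip-compressed input file named microdata.csv.gz, the path in the config keeps the .gz extension. Only the relevant field is shown; the other mandatory settings are omitted.

```json
{
  "Microdata": {
    "File": "src/microdata/microdata.csv.gz"
  }
}
```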

The config file must contain a single JSON object. The available configuration fields vary depending on whether the dataset is a microdata or a tabular dataset.

Shared configuration settings

Some settings are common to both microdata-based and tabular datasets:

Name

(string)

Used to identify the dataset in the API. The only valid characters in Name are alphanumeric characters, ‘-’ and ‘_’.

Mandatory

Description

(string)

Used to describe the dataset in the user interface.

Mandatory

OutputDir

(string)

Path to the directory in which to place the output data file. The output data file has the same base name (prefix) as the dataset config file (JSON) but with a .dat suffix.

Default: directory containing the dataset config file (JSON)

CodebookFile

(string)

Path to a Zip file or directory containing the codebook of allowed values for each variable (column) in the input data file.

Mandatory

CodebookIndexFile

(string)

Path to the index file within the codebook which lists the variables.

Default: codebook.csv

AllowDuplicateCategoryLabels

(array of strings)

Names of variables whose category labels are not unique.

Default: categories must have unique labels on a per-variable basis.

RuleBaseVariables

(array of strings)

Names of the variables that are the source of all other rule variables and against which disclosure rules are evaluated. Datasets with two or more rule base variables are flow datasets and are not currently loaded by cantabular-server or cantabular-admin.

Default: no rule variables.

RuleBaseVariable

(string)

Equivalent to RuleBaseVariables with a single name. Deprecated. Use RuleBaseVariables instead.
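
For illustration, a config that currently uses the deprecated single-name form "RuleBaseVariable": "city" could be written with the array form instead. The variable name city is borrowed from the examples later in this section, and only the relevant field is shown.

```json
{
  "RuleBaseVariables": ["city"]
}
```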

Microdata

(microdata object)

Defines a dataset based on microdata with disclosure controls.

Either this or Tabular must be supplied.

Tabular

(tabular object)

Defines a dataset in terms of a collection of tabular data sharing a common codebook.

Either this or Microdata must be supplied.

ObservationTypes

(array of observation type objects)

Observation types, common to all data in this dataset. These define the name, label, and other properties of data values in this dataset. These are analogous to units for the data (e.g. hectares, percent, confidence etc.).

Default: a single “Count” observation type, storing whole numbers.

Each observation type has the following possible members:

Name

(string)

Unique name of the observation type.

Mandatory

Label

(string)

Human-readable label of the observation type.

Mandatory

Description

(string)

Human-readable description of the observation type.

DecimalPlaces

(integer)

Number of decimal places of the observation type. This property determines the maximum number of decimal places to which data will be supplied and output. For example, at 2 decimal places, values of 1, 1.2, 1.23 are valid but 1.234 is not.

Default: 0 (only integers are valid)

Format

(format object)

The format object contains information to support the display of observations.

The format object has the following possible members:

Prefix

(string)

The prefix contains text that should be placed before each value when it is presented to end users. For example, a prefix might be “£”.

Default: No prefix

Suffix

(string)

The suffix contains text that should be placed after each value when it is presented to end users. For example, the suffix for hectares might be “ha.”.

Default: No suffix

Multiplier

(integer, power of 10)

The multiplier should be applied to values before display. For example, a multiplier of 100 would be used to display the value 0.8 as 80.

Default: 1

FillTrailingZeros

(boolean)

Indicates that output values should be padded with trailing zeros up to the number of decimal places set by the DecimalPlaces property of the observation type.

Default: false
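
The Format members can be combined as in the following sketch of a percentage observation type. The name PCT, its label and the chosen values are illustrative assumptions, not part of any shipped dataset. Here data would be supplied as fractions with up to three decimal places, and going by the Multiplier description above, a client applying the multiplier of 100 would show the value 0.805 as 80.5 followed by the suffix.

```json
{
  "ObservationTypes": [
    {
      "Name": "PCT",
      "Label": "Percentage",
      "DecimalPlaces": 3,
      "Format": {
        "Suffix": "%",
        "Multiplier": 100
      }
    }
  ]
}
```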

Microdata-only settings

Configuration settings for microdata-based datasets should be nested within the Microdata field described above. Valid fields within this are shown below:

File

(string)

Path to file containing input CSV data.

Mandatory

WeightColumn

(string)

Name of column containing the weight that the row contributes when totalling during the aggregation step. Each string in this column should be a decimal number greater than zero and less than or equal to 32767.

Default: unused: all rows contribute one to the count in aggregation.

SelectColumns

(array of strings)

Column names to select from the input CSV. Used if you want to restrict the variables that are available.

Default: all columns are used.

IgnoreColumns

(array of strings)

Column names to ignore from the input CSV. Typical usage is to ignore a unique “ID” column.

Default: all columns are used.

SelectMappings

(array of strings)

Names of the mappings to include. Use this only if you don’t want all of the mappings. All mappings that are rule variables are automatically included.

Default: all mappings are used.

IgnoreMappings

(array of strings)

Names of the mappings to remove from the codebook. Use this only if you don’t want all of the mappings.

Default: all mappings are used.

SelectRows

(array of criteria objects)

Used to select rows from the input CSV file. Use only if you don’t want all the rows, e.g. for a population dataset.

Each criteria object has two members:

Column

String containing the column name.

Values

Array of strings listing the allowed values.
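
As a sketch of how the column and row selection settings above might be combined, the fragment below ignores an identifier column and keeps only rows whose city value is in a given list. The column names ID and city and the category codes "0" and "1" are assumptions chosen for illustration, and the other mandatory settings are omitted.

```json
{
  "Microdata": {
    "File": "src/microdata/microdata.csv",
    "IgnoreColumns": ["ID"],
    "SelectRows": [
      {
        "Column": "city",
        "Values": ["0", "1"]
      }
    ]
  }
}
```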

FilterOnlyVariables

(array of strings)

Names of variables which cannot be used as query variables. They may only be used to filter the categories of lower-level sources.

Default: no variables are filter-only variables.

ConvertToBaseVariable

(array of strings)

Names of mappings which will be converted to base variables. These must be direct mappings of microdata variables. There can be no more than one entry per microdata variable. The source microdata variables will be removed from the dataset.

Perturbation

(object)

Contains all the parameters related to perturbation. It has the following members:

Disabled

(boolean)

If true then all perturbation is disabled. No perturbation table or category keys will be read or generated, and all row keys will be zeroed. No other settings in Perturbation should be specified when Disabled is true. If false then PTable, CategoryKeys and RowKeys must be specified.

Default: false

PTable

(object)

Contains parameters related to the perturbation table. It has the following members:

Generate

(boolean)

If true then a random perturbation table is generated.

Default: false

File

(string)

Path to a file containing the perturbation table. This must be specified when Generate (above) is false and must not be specified when Generate is true.

RepeatFromCellValue

(number: integer > 0)

Specifies the cell value at which rows in the perturbation table should be repeated in order to handle arbitrarily large cell values. This must be specified when Generate (above) is false and must not be specified when Generate is true.

RowKeys

(object)

Contains parameters related to row keys. It has the following members:

Generate

(boolean)

If true then the row keys are randomly generated.

Column

(string)

Name of column containing the row key used for the Cell Key Method. This must be specified when Generate is false and must not be specified when Generate is true.

CategoryKeys

(object)

Contains parameters related to category keys. It has the following members:

Generate

(boolean)

If true then the category keys are randomly generated.

File

(string)

Path to the file containing a category key for every code. This must be specified when Generate is false and must not be specified when Generate is true.

The header and format of the CSV file are: variable name,category code,category key

Mappings

(string)

Method used to determine the category keys for univariate mappings. Must be specified if univariate mappings are present in the codebook and Generate is false.

Must be one of:

FromCategoryKeysFile

SumSources

MultivariateMappings

(string)

Method used to determine the category keys for multivariate mappings. Must be specified if multivariate mappings are present in the codebook and Generate is false.

Must be one of:

SumLastVariable

FromCategoryKeysFile

SumAllMappings

SumFirstMapping

SumLastMapping

SumFirstVariable

StructuralZeros

(object)

Contains parameters related to structural zeros. It has the following members:

Variables

(array of strings)

Names of rule variables to use as structural zeros variables when performing zero perturbation. The entries are processed in order. Each source of a specified variable uses that variable as the structural zeros variable unless it has already been assigned to a previous entry in the array. Queries on any sources of these variables have zero perturbation applied. An equivalent query using the appropriate structural zeros variable is used to determine zeros that should not be perturbed.

Default: zero perturbation is never applied.

PercentOnesToZero

(number: decimal >= 0.0 and <= 100.0)

Percentage of cells with an unperturbed value of one at the structural zeros variable that are perturbed down to zero, to match the number of zeros perturbed up. This therefore determines the number of zeros perturbed for a query.

Default: 29.999000
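
The Generate options for PTable, RowKeys and CategoryKeys described above can be read together as in the following sketch, which asks for all three to be randomly generated. Following the rules stated for each member, no File, RepeatFromCellValue or Column is given when Generate is true. The input file path is reused from the microdata example later in this section, and the other mandatory settings are omitted. This is an illustration of the documented constraints rather than a recommended production setup.

```json
{
  "Microdata": {
    "File": "src/microdata/microdata.csv",
    "Perturbation": {
      "PTable": {
        "Generate": true
      },
      "RowKeys": {
        "Generate": true
      },
      "CategoryKeys": {
        "Generate": true
      }
    }
  }
}
```

To switch perturbation off entirely instead, the Disabled description above indicates that the Perturbation object would contain only "Disabled": true.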

RulesFile

(string)

Path to file containing rules code in the disclosure rules language (DRL). See “Disclosure Rules Language” in product documentation for details.

Tabular-only settings

Configuration settings for tabular datasets should be nested within the Tabular field as described above. Valid fields within this are shown below:

Tables

(array of table objects)

Used to specify the individual pre-computed tables that should be loaded into a tabular dataset. The fields within each of these are documented below.

Mandatory

Each table object has five possible members:

DataColumn

(string)

Name of column containing observation value for each row.

Mandatory

Redacted

(boolean)

Whether to expect redacted data for some rule variable categories.

Default: false

File

(string)

Path to file containing table data.

Mandatory

Variables

(array of strings)

Names of variables in the order they are to appear in the table. The variable column headings in the data are assumed to match these names unless VariableColumns is specified (see below).

Mandatory

VariableColumns

(array of strings)

Names of columns containing variables, if these differ from the corresponding variable names.

Default: unused
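
A table object whose CSV column headings differ from the variable names might look like the sketch below. The file path and the column names city_code and sex_code are hypothetical, and the fragment assumes that VariableColumns lists the columns in the same order as the corresponding entries in Variables.

```json
{
  "Tabular": {
    "Tables": [
      {
        "File": "src/tabular/Example-city+sex.csv",
        "DataColumn": "observation",
        "Variables": ["city", "sex"],
        "VariableColumns": ["city_code", "sex_code"]
      }
    ]
  }
}
```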

Example microdata dataset configuration

The configuration file for the example microdata dataset is shown below:

{
  "Name": "Example",
  "Description": "Example microdata dataset for validation",
  "CodebookFile": "codebook",
  "RuleBaseVariable": "city",
  "Microdata": {
    "File": "src/microdata/microdata.csv",
    "Perturbation": {
      "RowKeys": {
        "Column": "RKEY"
      },
      "CategoryKeys": {
        "File": "src/microdata/catkeys.csv",
        "Mappings": "SumSources"
      },
      "PTable": {
        "File": "src/shared/ptable.csv",
        "RepeatFromCellValue": 3
      }
    },
    "RulesFile": "src/microdata/example-rules.txt"
  }
}

Example tabular dataset configuration

The configuration file for the example tabular dataset is shown below:

{
  "Name": "Example-Tabular",
  "Description": "Example tabular dataset for validation",
  "CodebookFile": "codebook",
  "RuleBaseVariable": "city",
  "Tabular": {
    "Tables": [
      {
        "File": "src/tabular/Example-city+sex+health.csv",
        "DataColumn": "observation",
        "Variables": [
          "city",
          "sex",
          "health"
        ]
      },
      {
        "File": "src/tabular/Example-city+siblings_3+health.csv",
        "DataColumn": "observation",
        "Redacted": true,
        "Variables": [
          "city",
          "siblings_3",
          "health"
        ]
      }
    ]
  }
}

Example tabular dataset configuration with decimals

The configuration file for the example tabular dataset with decimal data (using an observation type) is shown below:

{
  "Name": "Example-Tabular-Decimals",
  "Description": "Example tabular dataset with decimals for validation",
  "CodebookFile": "codebook",
  "RuleBaseVariable": "city",
  "ObservationTypes": [
    {
      "Name": "EARN",
      "Label": "Earnings (in GBP)",
      "Description": "Gross monthly earnings",
      "DecimalPlaces": 2,
      "Format": {
        "Prefix": "£",
        "Suffix": "pw",
        "FillTrailingZeros": true
      }
    }
  ],
  "Tabular": {
    "Tables": [
      {
        "File": "src/tabular/Example-earnings-city+sex+health.csv",
        "DataColumn": "observation",
        "Variables": [
          "city",
          "sex",
          "earnings",
          "health"
        ]
      }
    ]
  }
}

Example flow dataset configuration

A flow dataset is a microdata dataset with more than one rule base variable.

The configuration file for the example flow dataset is shown below:

{
  "Name": "Example-Flow",
  "Description": "Example flow dataset for validation",
  "CodebookFile": "codebook",
  "CodebookIndexFile": "codebook_with_city_work.csv",
  "RuleBaseVariables": [
    "city",
    "city_work"
  ],
  "Microdata": {
    "File": "src/flow/microdata.csv",
    "Perturbation": {
      "RowKeys": {
        "Column": "RKEY"
      },
      "PTable": {
        "File": "src/shared/ptable.csv",
        "RepeatFromCellValue": 3
      }
    },
    "RulesFile": "src/flow/rules.txt"
  }
}