Dataset configuration
Both microdata and tabular datasets are specified using a configuration file in the JSON data format (see https://www.json.org/).
Files specified in the configuration file can be compressed using gzip or bzip2. Cantabular automatically decompresses files ending in .gz or .bz2 as it reads them. The filenames in the dataset configuration file should include any extension indicating a compressed format.
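The extension-based behaviour described above can be pictured with a minimal Python sketch. It is not Cantabular's implementation: the helper name `open_maybe_compressed` and the UTF-8 encoding are assumptions made for illustration. It only shows why the configured filename needs to keep its .gz or .bz2 extension.

```python
import bz2
import gzip


def open_maybe_compressed(path):
    """Open a text file, picking a decompressor from the filename extension."""
    # Assumption: files are UTF-8 text; the documentation does not state this.
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    # No recognised compression extension: read the file as-is.
    return open(path, "r", encoding="utf-8")
```

A file named `microdata.csv.gz` would be read through gzip here, while the same compressed bytes saved as `microdata.csv` would be read raw, which is why the extension matters.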
The config file must contain a single JSON object. The available configuration fields vary depending on whether the dataset is a microdata or a tabular dataset.
Microdata-only settings
Configuration settings for microdata-based datasets should be nested within the Microdata field described above. Valid fields within this are shown below:
File
(string)
Path to file containing input CSV data.
Mandatory
WeightColumn
(string)
Name of column containing the weight that the row contributes when totalling during the aggregation step. Each string in this column should be a decimal number greater than zero and less than or equal to 32767.
Default: unused: all rows contribute one to the count in aggregation.
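The range stated for WeightColumn values can be checked before a dataset is built. The following is a rough sketch only: the function name `is_valid_weight` is hypothetical, and Python's `Decimal` parser may accept notations (such as exponents) that Cantabular does not.

```python
from decimal import Decimal, InvalidOperation

MAX_WEIGHT = Decimal("32767")


def is_valid_weight(text):
    """Return True if text is a decimal number > 0 and <= 32767."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False
    # Decimal also parses "NaN" and "Infinity"; reject those explicitly.
    if not value.is_finite():
        return False
    return Decimal(0) < value <= MAX_WEIGHT
```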
SelectColumns
(array of strings)
Column names to select from the input CSV. Used if you want to restrict the variables that are available.
Default: all columns are used.
IgnoreColumns
(array of strings)
Column names to ignore from the input CSV. Typical usage is to ignore a unique “ID” column.
Default: all columns are used.
SelectMappings
(array of strings)
Names of the mappings to include. Use only if you don’t want all of them. All mappings that are rule variables are automatically included.
Default: all mappings are used.
IgnoreMappings
(array of strings)
Mapping names to remove from codebook if you don’t want all of them.
Default: all mappings are used.
SelectRows
(array of criteria objects)
Used to select rows from the input CSV file. Use only if you don’t want all the rows, e.g. for a population dataset.
Each criteria object has two members:
Column
string containing the column name.
Values
array containing list of strings of allowed values.
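The effect of SelectRows can be sketched in Python as follows. The text above does not say how several criteria objects combine, so this sketch assumes a row is kept only when it satisfies every one of them. The function name `select_rows`, the column names and the data are all hypothetical.

```python
import csv
import io


def select_rows(rows, criteria):
    """Keep rows whose value in each criterion's Column is one of its Values.

    Assumption: multiple criteria objects are combined with AND.
    """
    allowed = [(c["Column"], set(c["Values"])) for c in criteria]
    return [
        row for row in rows
        if all(row[column] in values for column, values in allowed)
    ]


# Hypothetical SelectRows setting and input CSV, for illustration only.
criteria = [{"Column": "city", "Values": ["0", "1"]}]
data = io.StringIO("ID,city,sex\n1,0,1\n2,2,0\n3,1,0\n")
kept = select_rows(csv.DictReader(data), criteria)
print([row["ID"] for row in kept])
```

Values are compared as strings, matching the description of Values as a list of strings of allowed values.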
FilterOnlyVariables
(array of strings)
Names of variables which cannot be used as query variables. They may only be used to filter the categories of lower-level sources.
Default: no variables are filter-only variables.
ConvertToBaseVariable
(array of strings)
Names of mappings which will be converted to base variables. These must be direct mappings of microdata variables. There can be no more than one entry per microdata variable. The source microdata variables will be removed from the dataset.
Perturbation
(object)
Contains all the parameters related to perturbation. It has the following members:
Disabled
(bool)
If true then all perturbation is disabled. No perturbation table or category keys will be read or generated, and all row keys will be zeroed. No other settings in Perturbation should be specified when Disabled is true. If false then PTable, CategoryKeys and RowKeys must be specified.
Default: false
PTable
(object)
Contains parameters related to the perturbation table. It has the following members.
Generate
(bool)
If true then a random perturbation table is generated.
Default: false
File
(string)
Path to a file containing the perturbation table. This must be specified when Generate (above) is false and must not be specified when Generate is true.
RepeatFromCellValue
(number: integer > 0)
Specifies the cell value at which rows in the perturbation table should be repeated in order to handle arbitrarily large cell values. This must be specified when Generate (above) is false and must not be specified when Generate is true.
RowKeys
(object)
Contains parameters related to row keys. It has the following members:
Generate
(bool)
If true then the row keys are randomly generated.
Column
(string)
Name of column containing the row key used for the Cell Key Method. This must be specified when Generate is false and must not be specified when Generate is true.
CategoryKeys
(object)
Contains parameters related to category keys. It has the following members:
Generate
(bool)
If true then the category keys are randomly generated.
File
(string)
Path to the file containing a category key for every code. This must be specified when Generate is false and must not be specified when Generate is true.
Header and format of CSV file is:
variable name,category code,category key
Mappings
(string)
Method used to determine the category keys for univariate mappings. Must be specified if univariate mappings are present in the codebook and Generate is false.
Must be one of:
FromCategoryKeysFile
SumSources
MultivariateMappings
(string)
Method used to determine the category keys for multivariate mappings. Must be specified if multivariate mappings are present in the codebook and Generate is false.
Must be one of:
SumLastVariable
FromCategoryKeysFile
SumAllMappings
SumFirstMapping
SumLastMapping
SumFirstVariable
StructuralZeros
(object)
Contains parameters related to structural zeros. It has the following members:
Variables
(array of strings)
Names of rule variables to use as structural zeros variables when performing zero perturbation. The entries are processed in order. Each source of a specified variable uses that variable as the structural zeros variable unless it has already been assigned to a previous entry in the array. Queries on any sources of these variables have zero perturbation applied. An equivalent query using the appropriate structural zeros variable is used to determine zeros that should not be perturbed.
Default: zero perturbation is never applied.
PercentOnesToZero
(number: decimal >= 0.0 and <= 100.0)
Percentage of cells with an unperturbed value of one at the structural zeros variable that are perturbed down to zero, to match the number of zeros perturbed up. This therefore determines the number of zeros perturbed for a query.
Default: 29.999000
RulesFile
(string)
Path to file containing rules code in the disclosure rules language (DRL). See “Disclosure Rules Language” in product documentation for details.
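Several of the Perturbation members above are mutually exclusive with Generate. The Python sketch below checks a subset of those rules against a Perturbation object loaded as a dict. It is an illustration, not a Cantabular tool: the function name `check_perturbation` is hypothetical, Generate is assumed to default to false wherever the text gives no default, and the sketch does not check whether PTable, CategoryKeys and RowKeys are all present.

```python
# Members that must be present when Generate is false and absent when it is true.
SOURCE_MEMBERS = {
    "PTable": ["File", "RepeatFromCellValue"],
    "RowKeys": ["Column"],
    "CategoryKeys": ["File"],
}


def check_perturbation(perturbation):
    """Return a list of problems found in a Perturbation object (a dict)."""
    problems = []
    if perturbation.get("Disabled", False):
        extra = sorted(k for k in perturbation if k != "Disabled")
        if extra:
            problems.append(f"Disabled is true but also set: {', '.join(extra)}")
        return problems
    for section, members in SOURCE_MEMBERS.items():
        settings = perturbation.get(section)
        if settings is None:
            continue  # presence of each section is not checked in this sketch
        # Assumption: Generate defaults to false where no default is documented.
        generate = settings.get("Generate", False)
        for member in members:
            if generate and member in settings:
                problems.append(
                    f"{section}.{member} must not be set when Generate is true")
            elif not generate and member not in settings:
                problems.append(
                    f"{section}.{member} is required when Generate is false")
    return problems


# Hypothetical fragment mixing Generate with an explicit file.
print(check_perturbation({"PTable": {"Generate": True, "File": "p.csv"}}))
```

Returning a list of problems rather than raising on the first one lets a caller report every conflict in a configuration at once.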
Tabular-only settings
Configuration settings for tabular datasets should be nested within the Tabular field as described above. Valid fields within this are shown below:
Tables
(array of table objects)
Used to specify the individual pre-computed tables that should be loaded into a tabular dataset. The fields within each of these are documented below.
Mandatory
Each table object has a set of five possible members, described below:
DataColumn
(string)
Name of column containing observation value for each row.
Mandatory
Redacted
(boolean)
Whether to expect redacted data for some rule variable categories.
Default: false.
File
(string)
Path to file containing table data.
Mandatory
Variables
(array of strings)
Names of variables in the order they are to appear in the table. The variable column headings in the data are assumed to match these names, unless VariableColumns is specified, see below.
Mandatory
VariableColumns
(array of strings)
Names of columns containing variables, if these differ from the corresponding variable names.
Default: unused
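The relationship between Variables and VariableColumns can be sketched in Python as below. This assumes the two arrays pair up by position, which the text implies but does not state outright. The function name `variable_columns` and the column headings in the usage example are hypothetical.

```python
def variable_columns(table):
    """Map each variable name to the column heading expected to hold it."""
    variables = table["Variables"]
    # Column headings default to the variable names themselves.
    columns = table.get("VariableColumns", variables)
    if len(columns) != len(variables):
        # Assumption: VariableColumns pairs up with Variables by position.
        raise ValueError("VariableColumns and Variables differ in length")
    return dict(zip(variables, columns))


# Hypothetical table object whose CSV headings differ from the variable names.
table = {
    "File": "table.csv",
    "DataColumn": "observation",
    "Variables": ["city", "sex"],
    "VariableColumns": ["CITY_CODE", "SEX_CODE"],
}
print(variable_columns(table))
```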
Example microdata dataset configuration
The configuration file for the example microdata dataset is shown below:
{
  "Name": "Example",
  "Description": "Example microdata dataset for validation",
  "CodebookFile": "codebook",
  "RuleBaseVariable": "city",
  "Microdata": {
    "File": "src/microdata/microdata.csv",
    "Perturbation": {
      "RowKeys": {
        "Column": "RKEY"
      },
      "CategoryKeys": {
        "File": "src/microdata/catkeys.csv",
        "Mappings": "SumSources"
      },
      "PTable": {
        "File": "src/shared/ptable.csv",
        "RepeatFromCellValue": 3
      }
    },
    "RulesFile": "src/microdata/example-rules.txt"
  }
}
Example tabular dataset configuration
The configuration file for the example tabular dataset is shown below:
{
  "Name": "Example-Tabular",
  "Description": "Example tabular dataset for validation",
  "CodebookFile": "codebook",
  "RuleBaseVariable": "city",
  "Tabular": {
    "Tables": [
      {
        "File": "src/tabular/Example-city+sex+health.csv",
        "DataColumn": "observation",
        "Variables": [
          "city",
          "sex",
          "health"
        ]
      },
      {
        "File": "src/tabular/Example-city+siblings_3+health.csv",
        "DataColumn": "observation",
        "Redacted": true,
        "Variables": [
          "city",
          "siblings_3",
          "health"
        ]
      }
    ]
  }
}
Example tabular with decimals dataset configuration
The configuration file for the example tabular dataset with decimal data (using an observation type) is shown below:
{
  "Name": "Example-Tabular-Decimals",
  "Description": "Example tabular dataset with decimals for validation",
  "CodebookFile": "codebook",
  "RuleBaseVariable": "city",
  "ObservationTypes": [
    {
      "Name": "EARN",
      "Label": "Earnings (in GBP)",
      "Description": "Gross monthly earnings",
      "DecimalPlaces": 2,
      "Format": {
        "Prefix": "£",
        "Suffix": "pw",
        "FillTrailingZeros": true
      }
    }
  ],
  "Tabular": {
    "Tables": [
      {
        "File": "src/tabular/Example-earnings-city+sex+health.csv",
        "DataColumn": "observation",
        "Variables": [
          "city",
          "sex",
          "earnings",
          "health"
        ]
      }
    ]
  }
}
Example flow dataset configuration
A flow dataset is a microdata dataset with more than one rule base variable.
The configuration file for the example flow dataset is shown below:
{
  "Name": "Example-Flow",
  "Description": "Example flow dataset for validation",
  "CodebookFile": "codebook",
  "CodebookIndexFile": "codebook_with_city_work.csv",
  "RuleBaseVariables": [
    "city",
    "city_work"
  ],
  "Microdata": {
    "File": "src/flow/microdata.csv",
    "Perturbation": {
      "RowKeys": {
        "Column": "RKEY"
      },
      "PTable": {
        "File": "src/shared/ptable.csv",
        "RepeatFromCellValue": 3
      }
    },
    "RulesFile": "src/flow/rules.txt"
  }
}