Perturbation

This section is relevant only to microdata-based datasets.

Cantabular applies cell key perturbation to output tables from microdata-based datasets. The amount of perturbation applied to each cell in a table is determined by perturbation parameters that are specified when the dataset is created.

Some cells with a value of 0 are also perturbed upwards using zero perturbation. The parameters for this algorithm are also specified when the dataset is created.

Cell key parameters

Row keys

Row keys (also called record keys) are assigned to each record in the microdata. Each record should have a row key with a value between 1 and 255. The keys can be embedded in the microdata, with the column name containing the keys specified in the configuration file.

Alternatively, Cantabular can be configured to generate random row keys automatically.

Perturbation table

The perturbation table (or ptable) specifies how much perturbation should be applied to a cell with a given frequency count and cell key.

  • The ptable should be specified as a CSV file with three columns: cell_value, cell_key and perturbation.

    • cell_value is the frequency count.

    • cell_key is the cell key value (the sum, modulo 256, of the record keys of all records that contribute to the cell).

    • perturbation is the perturbation value, i.e. how much a cell with a given cell_value and cell_key should be perturbed.

    • An alternative legacy header is also supported with columns pcv, ckey and pvalue. pcv corresponds to cell_value, ckey corresponds to cell_key and pvalue corresponds to perturbation.

  • There should be one entry for each combination of cell_value and cell_key.

    • A - operator can be used to specify an inclusive range of cell_key values. This allows multiple entries to be specified on a single line in cases where the same perturbation is applied to cells with a given cell_value over a range of consecutive cell_key values. The following entry specifies that a perturbation of -2 should be applied to cells with a cell_value of 5, where the cell_key is in the range 4 to 28:

      5,4-28,-2
      
  • No entries should be specified with a cell_value of zero. The cell key algorithm is not applied to cells with a cell_value of zero.

  • The maximum frequency count for a cell is equal to the number of records in a dataset.

  • cell_key values range from 0 to 255.

  • perturbation values range from -128 to 127.

  • It is not necessary to supply entries for all values of cell_value up to the record count. Instead, ptable entries are reused for higher cell values. The cell_value of the first row that is reused is configurable in the dataset JSON file via the RepeatFromCellValue parameter.

  • For example, if the ptable has entries for cell_value 1 to 102 and the RepeatFromCellValue parameter is set to 3, the entries for cell_value 1 and 2 are used exclusively for cells with those values, whilst the entries for cell_value 3-102 are used repeatedly for cells with a cell_value of 3-102, 103-202, 203-302 etc.
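
The cell key calculation and the reuse of ptable rows described above can be sketched in a few lines of Python. This is an illustration only, not Cantabular's implementation: the function names cell_key and ptable_row, and the arguments table_max and repeat_from, are hypothetical and simply mirror the rules stated in the list.

```python
def cell_key(record_keys):
    """Sum the record keys of all contributing records, modulo 256."""
    return sum(record_keys) % 256


def ptable_row(cell_value, table_max, repeat_from):
    """Return the ptable cell_value row to use for a given frequency count.

    table_max is the largest cell_value present in the ptable and
    repeat_from plays the role of RepeatFromCellValue. Both argument
    names are hypothetical, chosen for this sketch.
    """
    if cell_value <= 0:
        raise ValueError("the cell key algorithm is not applied to zero cells")
    if cell_value <= table_max:
        return cell_value
    # Rows repeat_from..table_max form a block that is reused cyclically.
    block = table_max - repeat_from + 1
    return repeat_from + (cell_value - repeat_from) % block


# Three records with row keys 200, 100 and 7 contribute to one cell.
print(cell_key([200, 100, 7]))
# A ptable covering cell_value 1 to 102 with repeat_from=3.
print(ptable_row(103, 102, 3), ptable_row(202, 102, 3), ptable_row(203, 102, 3))
```

With a ptable covering cell_value 1 to 102 and a repeat value of 3, counts 103, 202 and 203 map to rows 3, 102 and 3 respectively, matching the ranges given in the example above.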

An abridged version of the ptable included with the example microdata dataset is shown below. It contains entries for cell_value 1 to 4. The RepeatFromCellValue parameter is set to 3. Hence, the entries for cell_value 3 and 4 will be reused for values of 5-6, 7-8 etc.

It can be seen that if a cell has a cell_value of 1 and a cell_key of 3, then a perturbation of -1 is applied to the cell.

example/src/shared/ptable.csv:

cell_value,cell_key,perturbation
1,0-2,0
1,3,-1
1,4-16,0
1,17,-1
1,18-19,0
1,20,1
1,21,0
...
4,216-226,0
4,227,1
4,228-241,0
4,242,1
4,243-255,0

Cantabular can be configured to generate a random ptable if one is not specified.
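
As an illustration of the file format, the sketch below expands the cell_key range syntax into one entry per combination of cell_value and cell_key. It is a minimal reading of the format described in this section, not the parser Cantabular uses; load_ptable and SAMPLE are hypothetical names, and the sample rows are taken from the abridged listing above.

```python
import csv
import io

# A few rows copied from the abridged example ptable.
SAMPLE = """cell_value,cell_key,perturbation
1,0-2,0
1,3,-1
1,4-16,0
1,17,-1
"""


def load_ptable(text):
    """Parse ptable CSV text into {(cell_value, cell_key): perturbation}."""
    table = {}
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row (standard or legacy column names)
    for row in reader:
        if not row:
            continue
        cell_value, key_spec, perturbation = row
        # "4-16" is an inclusive range; a bare "3" is a single key.
        low, _, high = key_spec.partition("-")
        first, last = int(low), int(high or low)
        if not 0 <= first <= last <= 255:
            raise ValueError(f"cell_key out of range: {key_spec}")
        for key in range(first, last + 1):
            table[(int(cell_value), key)] = int(perturbation)
    return table


ptable = load_ptable(SAMPLE)
print(ptable[(1, 3)])   # perturbation for cell_value 1, cell_key 3
print(len(ptable))      # number of expanded entries
```

Storing one dictionary entry per (cell_value, cell_key) pair keeps the lookup trivial; a real loader would also want to check that every combination is present.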

Zero perturbation parameters

Category keys

Category keys are used to identify which cells with a frequency count of 0 should be perturbed upwards.

  • Every category for each variable should have a category key.

  • The category keys can be specified as a CSV file with three columns: variable name, category code and category key.

    • The category keys should have a value between 1 and 65535.

    • The category key file should have entries for all variables that represent columns in the microdata.

  • Entries for mapped variables (with a single source variable) can either be specified in the category keys file or derived by summing the category keys for the contributing source categories. If the dataset contains mapped variables then a method for determining the category keys must be specified in the configuration file.

  • If the dataset contains multivariate mappings then a method for determining the category keys must be specified in the configuration file.

  • Alternatively, Cantabular can be configured to generate random category keys.

The category key file supplied with the example microdata dataset is shown below.

example/src/microdata/catkeys.csv:

variable name,category code,category key
city,0,23898
city,1,53015
city,2,10564
sex,0,60090
sex,1,50310
siblings,0,7126
siblings,1,20954
siblings,2,27394
siblings,3,2003
siblings,4,27061
siblings,5,18056
siblings,6,36261
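
The sketch below reads a category key file in this layout, checks that each key lies between 1 and 65535, and derives a key for a mapped category by summing the keys of its source categories. It is an assumption-laden illustration: load_category_keys and derived_key are hypothetical names, the mapped category combining siblings codes 0 and 3 is invented for the example, and the text above does not say how sums above 65535 are handled, so the sketch does not model that case.

```python
import csv
import io

# A subset of the example category key file.
CATKEYS = """variable name,category code,category key
sex,0,60090
sex,1,50310
siblings,0,7126
siblings,3,2003
"""


def load_category_keys(text):
    """Return {(variable name, category code): category key}."""
    keys = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = int(row["category key"])
        if not 1 <= key <= 65535:
            raise ValueError(f"category key out of range: {key}")
        keys[(row["variable name"], row["category code"])] = key
    return keys


def derived_key(keys, variable, source_codes):
    """Sum the keys of the source categories feeding one mapped category.

    Plain summation only; any wrapping of large sums is not modelled.
    """
    return sum(keys[(variable, code)] for code in source_codes)


keys = load_category_keys(CATKEYS)
print(keys[("sex", "1")])
# Hypothetical mapped category built from siblings codes 0 and 3.
print(derived_key(keys, "siblings", ["0", "3"]))
```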

Structural zeros variables

This is an array consisting of the names of rule variables to use as structural zeros variables when performing zero perturbation. The entries are processed in order. Each source of a specified variable uses that variable as its structural zeros variable, unless the source has already been assigned to a previous entry in the array. Queries on any sources of these variables have zero perturbation applied. An equivalent query using the appropriate structural zeros variable is used to determine which zeros should not be perturbed.

A base variable cannot be used as a structural zeros variable, nor can a variable that is a source of another structural zeros variable.

Percentage of ones perturbed to zero

The percentage of ones perturbed to zero is specified in the configuration file. This is the percentage of cells with an unperturbed value of one at the structural zeros variable that are perturbed down to zero, to match the number of zeros perturbed up. It therefore determines the number of zeros perturbed for a query. The default is 29.999000%.
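
As a rough worked example of the relationship described here, the sketch below estimates how many zeros would be perturbed up for a query, given the number of cells with a value of one at the structural zeros variable. It shows the arithmetic only: the function name expected_zeros_perturbed is hypothetical, and how Cantabular selects the cells or rounds the count is not described in this section.

```python
DEFAULT_PERCENTAGE = 29.999  # default percentage of ones perturbed to zero


def expected_zeros_perturbed(ones_count, percentage=DEFAULT_PERCENTAGE):
    """Approximate number of zeros perturbed up for a query.

    Assumes the zeros perturbed up match the ones perturbed down, so the
    figure is simply the given percentage of the count of ones.
    """
    return ones_count * percentage / 100


# 1000 cells with a value of one at the structural zeros variable.
print(expected_zeros_perturbed(1000))
```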