Perturbation¶
This section is relevant only to microdata-based datasets.
Cantabular applies cell key perturbation on output tables from microdata-based datasets. The amount of perturbation applied to each cell in the table is determined by perturbation parameters that are specified when the dataset is created.
Some cells with a value of 0 are also perturbed upwards using zero perturbation. The parameters for this algorithm are also specified when the dataset is created.
Cell key parameters¶
Row keys¶
Row keys (also called record keys) are assigned to each record in the microdata. Each record should have a row key with a value between 1 and 255. The keys can be embedded in the microdata, with the column name containing the keys specified in the configuration file.
Alternatively, Cantabular can be configured to automatically generate random keys.
Perturbation table¶
The perturbation table (or ptable) specifies how much perturbation should be applied to a cell with a given
frequency count and cell key.
The ptable should be specified as a CSV file with 3 columns: cell_value, cell_key and perturbation. cell_value is the frequency count. cell_key is the cell key value (determined by modulo 256 summing of all the record keys that contribute to the cell). perturbation is the perturbation value, i.e. how much a cell with a given cell_value and cell_key should be perturbed.
An alternative legacy header is also supported with columns pcv, ckey and pvalue. pcv corresponds to cell_value, ckey corresponds to cell_key and pvalue corresponds to perturbation.
There should be one entry for each combination of cell_value and cell_key.
A - operator can be used to specify an inclusive range of cell_key values. This allows multiple entries to be specified on a single line in cases where the same perturbation is applied to cells with a given cell_value over a range of consecutive cell_key values. The following entry specifies that a perturbation of -2 should be applied to cells with a cell_value of 5, where the cell_key is in the range 4 to 28:
5,4-28,-2
No entries should be specified with a cell_value of zero. The cell key algorithm is not applied to cells with a cell_value of zero.
The maximum frequency count for a cell is equal to the number of records in a dataset. cell_key values range from 0 to 255. perturbation values range from -128 to 127.
It is not necessary to supply entries for all values of cell_value up to the record count. Instead, ptable entries are reused. The cell_value of the first row that is reused for high cell values is configurable in the dataset JSON file via the RepeatFromCellValue parameter.
If the ptable has entries for 102 rows and the RepeatFromCellValue parameter is set to 3: the entries for cell_value values of 1 and 2 are used exclusively for values of 1 and 2, whilst the entries for cell_value 3-102 are used repeatedly for cells with a cell_value of 3-102, 103-202, 203-302 etc.
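The cell key calculation and the reuse of ptable rows can be illustrated with a short Python sketch. This is not Cantabular's implementation: the function names cell_key and ptable_row and the parameter names max_cell_value and repeat_from are hypothetical, and the wrap-around arithmetic is inferred from the 102-row example above.

```python
def cell_key(record_keys):
    """Sum the row keys of the records contributing to a cell, modulo 256."""
    return sum(record_keys) % 256


def ptable_row(cell_value, max_cell_value, repeat_from):
    """Return the ptable cell_value whose entries apply to a frequency count.

    max_cell_value is the highest cell_value present in the ptable and
    repeat_from plays the role of RepeatFromCellValue (both names are
    assumptions made for this sketch).
    """
    if cell_value <= 0:
        raise ValueError("the cell key algorithm is not applied to zero cells")
    if cell_value <= max_cell_value:
        return cell_value
    # Rows repeat_from..max_cell_value are cycled for higher counts.
    period = max_cell_value - repeat_from + 1
    return repeat_from + (cell_value - repeat_from) % period


# Three contributing records with row keys 200, 100 and 17.
print(cell_key([200, 100, 17]))

# With 102 rows and a repeat point of 3, a count of 103 reuses row 3.
print(ptable_row(103, max_cell_value=102, repeat_from=3))
```

With these assumptions a count of 202 maps to row 102 and a count of 203 maps back to row 3, which matches the 103-202 and 203-302 ranges described above.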
An abridged version of the ptable included with the example microdata dataset is shown below. It contains entries for cell_value
1 to 4. The RepeatFromCellValue parameter is set to 3. Hence, the entries for cell_value 3 and 4 will
be reused for values of 5-6, 7-8 etc.
It can be seen that if a cell has a cell_value of 1 and a cell_key of 3, then a perturbation of -1 is applied to the cell.
example/src/shared/ptable.csv:
cell_value,cell_key,perturbation
1,0-2,0
1,3,-1
1,4-16,0
1,17,-1
1,18-19,0
1,20,1
1,21,0
...
4,216-226,0
4,227,1
4,228-241,0
4,242,1
4,243-255,0
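As an illustration of the file format, the following Python sketch expands the - range operator into one entry per cell_key and then looks up a perturbation. It is a reading aid under stated assumptions rather than Cantabular's own loader: the function name load_ptable is hypothetical, and only the first rows of the listing above are included because the rest of the file is abridged.

```python
import csv
import io

# First rows of the abridged example ptable shown above.
PTABLE_CSV = """cell_value,cell_key,perturbation
1,0-2,0
1,3,-1
1,4-16,0
1,17,-1
1,18-19,0
1,20,1
1,21,0
"""


def load_ptable(text):
    """Expand a ptable CSV into a {(cell_value, cell_key): perturbation} dict."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        # "4-16" is an inclusive range; a plain "3" is a single key.
        first, _, last = row["cell_key"].partition("-")
        last = last or first
        for key in range(int(first), int(last) + 1):
            table[(int(row["cell_value"]), key)] = int(row["perturbation"])
    return table


ptable = load_ptable(PTABLE_CSV)
print(ptable[(1, 3)])   # entry for cell_value 1, cell_key 3
print(ptable[(1, 10)])  # covered by the 4-16 range
```

Expanding ranges up front keeps the lookup a single dictionary access, at the cost of holding up to 256 entries per cell_value in memory.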
Cantabular can be configured to generate a random ptable if one is not specified.
Zero perturbation parameters¶
Category keys¶
Category keys are used to identify which cells with a frequency count of 0 should be perturbed upwards.
Every category for each variable should have a category key.
The category keys can be specified as a CSV file with three columns: variable name, category code and category key.
The category keys should have a value between 1 and 65535.
The category key file should have entries for all variables that represent columns in the microdata.
Entries for mapped variables (with a single source variable) can either be specified in the category keys file or derived by summing the category keys for the contributing source categories. If the dataset contains mapped variables then a method for determining the category keys must be specified in the configuration file.
If the dataset contains multivariate mappings then a method for determining the category keys must be specified in the configuration file.
Alternatively, Cantabular can be configured to generate random category keys.
The category key file supplied with the example microdata dataset is shown below.
example/src/microdata/catkeys.csv:
variable name,category code,category key
city,0,23898
city,1,53015
city,2,10564
sex,0,60090
sex,1,50310
siblings,0,7126
siblings,1,20954
siblings,2,27394
siblings,3,2003
siblings,4,27061
siblings,5,18056
siblings,6,36261
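The sketch below shows one way to read a category key file, check the 1 to 65535 range, and derive a key for a mapped category by summing the keys of its source categories. The function names and the mapping that merges city codes 1 and 2 are hypothetical, chosen only for illustration. The text does not say how a sum above 65535 is handled, so the sketch leaves the sum unchanged.

```python
import csv
import io

# A subset of the example category key file shown above.
CATKEYS_CSV = """variable name,category code,category key
city,0,23898
city,1,53015
city,2,10564
sex,0,60090
sex,1,50310
"""


def load_category_keys(text):
    """Read category keys into a {(variable, code): key} dict, validating range."""
    keys = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = int(row["category key"])
        if not 1 <= key <= 65535:
            raise ValueError(f"category key out of range: {key}")
        keys[(row["variable name"], row["category code"])] = key
    return keys


def mapped_category_key(keys, variable, source_codes):
    """Sum the keys of the source categories that feed one mapped category."""
    return sum(keys[(variable, code)] for code in source_codes)


keys = load_category_keys(CATKEYS_CSV)
# Hypothetical mapped category that merges city codes 1 and 2.
print(mapped_category_key(keys, "city", ["1", "2"]))
```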
Structural zeros variables¶
The structural zeros variables are specified as an array of the names of rule variables to use when performing zero perturbation. The entries are processed in order. Each source of a specified variable uses that variable as its structural zeros variable unless the source has already been assigned to a previous entry in the array. Queries on any sources of these variables have zero perturbation applied. An equivalent query using the appropriate structural zeros variable is used to determine which zeros should not be perturbed.
A base variable cannot be used as a structural zeros variable, nor can a variable that is a source of another structural zeros variable.
Percentage of ones perturbed to zero¶
The percentage of ones perturbed to zero is specified in the configuration file. It is the percentage of cells with an unperturbed value of one at the structural zeros variable that are perturbed down to zero. The number of zeros perturbed up is matched to this, so the parameter determines the number of zeros perturbed for a query. The default is 29.999000%.