Perturbation¶
This section is relevant only to microdata-based datasets.
Cantabular applies cell key perturbation to output tables from microdata-based datasets. The amount of perturbation applied to each cell in the table is determined by perturbation parameters that are specified when the dataset is created.
Some cells with a value of 0 are also perturbed upwards using zero perturbation. The parameters for this algorithm are also specified when the dataset is created.
Cell key parameters¶
Row keys¶
Row keys (also called record keys) are assigned to each record in the microdata. Each record should have a row key with a value between 1 and 255. The keys can be embedded in the microdata, with the column name containing the keys specified in the configuration file.
Alternatively, Cantabular can be configured to automatically generate random keys.
Perturbation table¶
The perturbation table (or ptable) specifies how much perturbation should be applied to a cell with a given frequency count and cell key.
The ptable should be specified as a CSV file with 3 columns: cell_value, cell_key and perturbation.
cell_value is the frequency count.
cell_key is the cell key value (determined by modulo 256 summing of all the record keys that contribute to the cell).
perturbation is the perturbation value, i.e. how much a cell with a given cell_value and cell_key should be perturbed.
An alternative legacy header is also supported with columns pcv, ckey and pvalue. pcv corresponds to cell_value, ckey corresponds to cell_key and pvalue corresponds to perturbation.
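The cell key calculation described above can be illustrated with a minimal Python sketch. The function name cell_key and the sample row keys are hypothetical and chosen only for illustration; this is not Cantabular's own implementation.

```python
def cell_key(record_keys):
    """Combine the row keys of the records in a cell by summing modulo 256."""
    for key in record_keys:
        # Row keys are expected to lie between 1 and 255.
        if not 1 <= key <= 255:
            raise ValueError(f"row key {key} is outside the range 1 to 255")
    return sum(record_keys) % 256


# Three hypothetical records contributing to a single cell.
sample_keys = [200, 57, 3]
print(cell_key(sample_keys))  # 260 % 256 = 4
```

Because the sum is reduced modulo 256, the result always falls in the range 0 to 255, however many records contribute to the cell.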
There should be one entry for each combination of cell_value and cell_key.
A - operator can be used to specify an inclusive range of cell_key values. This allows multiple entries to be specified on a single line in cases where the same perturbation is applied to cells with a given cell_value over a range of consecutive cell_key values. The following entry specifies that a perturbation of -2 should be applied to cells with a cell_value of 5, where the cell_key is in the range 4 to 28:
5,4-28,-2
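As a rough illustration of how a range entry stands in for several single-key entries, the following Python sketch expands one line into its individual (cell_value, cell_key) combinations. The helper name expand_entry is hypothetical, and the parsing is a simplified assumption rather than Cantabular's actual parser.

```python
def expand_entry(line):
    """Expand one ptable line into {(cell_value, cell_key): perturbation}."""
    cell_value, key_field, perturbation = (part.strip() for part in line.split(","))
    if "-" in key_field:
        # cell_key values are never negative, so "-" here can only be the range operator.
        first, last = (int(k) for k in key_field.split("-"))
    else:
        first = last = int(key_field)
    # The range is inclusive at both ends.
    return {(int(cell_value), key): int(perturbation) for key in range(first, last + 1)}


entries = expand_entry("5,4-28,-2")
print(len(entries))                       # 25 cell keys: 4 to 28 inclusive
print(entries[(5, 4)], entries[(5, 28)])  # -2 -2
```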
No entries should be specified with a cell_value of zero. The cell key algorithm is not applied to cells with a cell_value of zero.
The maximum frequency count for a cell is equal to the number of records in a dataset.
cell_key values range from 0 to 255.
perturbation values range from -128 to 127.
It is not necessary to supply entries for all values of cell_value up to the record count. Instead, ptable entries are reused. The cell_value of the first row that is reused for high cell values is configurable in the dataset JSON file via the RepeatFromCellValue parameter.
If the ptable has entries for cell_value values of 1 to 102 and the RepeatFromCellValue parameter is set to 3, the entries for cell_value values of 1 and 2 are used exclusively for values of 1 and 2, whilst the entries for cell_value 3-102 are used repeatedly for cells with a cell_value of 3-102, 103-202, 203-302 etc.
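The reuse rule can be expressed as a small piece of arithmetic. The Python sketch below is one way to reproduce the example above; the function and parameter names are hypothetical, with repeat_from standing for the RepeatFromCellValue parameter and max_cell_value for the highest cell_value present in the ptable.

```python
def ptable_row_for(cell_value, max_cell_value, repeat_from):
    """Return the ptable cell_value whose entries apply to a given frequency count."""
    if cell_value < 1:
        raise ValueError("the cell key algorithm is not applied to zero cells")
    if cell_value <= max_cell_value:
        # Values covered directly by the ptable use their own entries.
        return cell_value
    # Higher values cycle through the block repeat_from..max_cell_value.
    period = max_cell_value - repeat_from + 1
    return repeat_from + (cell_value - repeat_from) % period


# ptable with entries for cell_value 1 to 102, repeating from 3.
print(ptable_row_for(2, 102, 3))    # 2
print(ptable_row_for(103, 102, 3))  # 3
print(ptable_row_for(202, 102, 3))  # 102
print(ptable_row_for(203, 102, 3))  # 3
```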
An abridged version of the ptable included with the example microdata dataset is shown below. It contains entries for cell_value 1 to 4. The RepeatFromCellValue parameter is set to 3. Hence, the entries for cell_value 3 and 4 will be reused for values of 5-6, 7-8 etc.
It can be seen that if a cell has a cell_value of 1 and a cell_key of 3, then a perturbation of -1 is applied to the cell.
example/src/shared/ptable.csv:
cell_value,cell_key,perturbation
1,0-2,0
1,3,-1
1,4-16,0
1,17,-1
1,18-19,0
1,20,1
1,21,0
...
4,216-226,0
4,227,1
4,228-241,0
4,242,1
4,243-255,0
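The lookup described above can be checked with a short Python sketch that loads only the rows visible in the abridged listing and reads off the perturbation for a given cell_value and cell_key. The elided rows are left out rather than guessed, and the names VISIBLE_ROWS and load_ptable are hypothetical.

```python
import csv
import io

# Only the rows shown in the abridged listing; the elided middle section is omitted.
VISIBLE_ROWS = """cell_value,cell_key,perturbation
1,0-2,0
1,3,-1
1,4-16,0
1,17,-1
1,18-19,0
1,20,1
1,21,0
4,216-226,0
4,227,1
4,228-241,0
4,242,1
4,243-255,0
"""


def load_ptable(text):
    """Read ptable CSV text into {(cell_value, cell_key): perturbation}."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        # "4-16" is an inclusive range; "3" is a single key.
        first, _, last = row["cell_key"].partition("-")
        for key in range(int(first), int(last or first) + 1):
            table[(int(row["cell_value"]), key)] = int(row["perturbation"])
    return table


ptable = load_ptable(VISIBLE_ROWS)
print(ptable[(1, 3)])         # -1
print(ptable[(4, 227)])       # 1
print(ptable.get((4, 100)))   # None: this row is elided from the listing
```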
Cantabular can be configured to generate a random ptable if one is not specified.
Zero perturbation parameters¶
Category keys¶
Category keys are used to identify which cells with a frequency count of 0 should be perturbed upwards.
Every category for each variable should have a category key.
The category keys can be specified as a CSV file with three columns: variable name, category code and category key.
The category keys should have a value between 1 and 65535.
The category key file should have entries for all variables that represent columns in the microdata.
Entries for mapped variables (with a single source variable) can either be specified in the category keys file or derived by summing the category keys for the contributing source categories. If the dataset contains mapped variables then a method for determining the category keys must be specified in the configuration file.
If the dataset contains multivariate mappings then a method for determining the category keys must be specified in the configuration file.
Alternatively, Cantabular can be configured to generate random category keys.
The category key file supplied with the example microdata dataset is shown below.
example/src/microdata/catkeys.csv:
variable name,category code,category key
city,0,23898
city,1,53015
city,2,10564
sex,0,60090
sex,1,50310
siblings,0,7126
siblings,1,20954
siblings,2,27394
siblings,3,2003
siblings,4,27061
siblings,5,18056
siblings,6,36261
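The following Python sketch loads the category keys listed above, checks that each lies between 1 and 65535, and derives keys for a mapped variable by summing the keys of its source categories. The grouping of siblings into bands is a hypothetical mapping invented for illustration, and the function names are assumptions. The documentation does not say how a sum larger than 65535 would be handled, so the sketch uses a plain sum and a mapping whose sums stay within that range.

```python
import csv
import io

CATKEYS = """variable name,category code,category key
city,0,23898
city,1,53015
city,2,10564
sex,0,60090
sex,1,50310
siblings,0,7126
siblings,1,20954
siblings,2,27394
siblings,3,2003
siblings,4,27061
siblings,5,18056
siblings,6,36261
"""


def load_category_keys(text):
    """Read category key CSV text into {(variable name, category code): key}."""
    keys = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = int(row["category key"])
        if not 1 <= key <= 65535:
            raise ValueError(f"category key {key} is outside the range 1 to 65535")
        keys[(row["variable name"], row["category code"])] = key
    return keys


catkeys = load_category_keys(CATKEYS)

# Hypothetical mapped variable: siblings grouped into four bands.
siblings_banded = {"0": ["0"], "1-2": ["1", "2"], "3-4": ["3", "4"], "5-6": ["5", "6"]}

# Derive each mapped category key by summing its source category keys.
derived = {
    band: sum(catkeys[("siblings", code)] for code in sources)
    for band, sources in siblings_banded.items()
}
print(derived)  # {'0': 7126, '1-2': 48348, '3-4': 29064, '5-6': 54317}
```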
Structural zeros variables¶
This parameter is an array consisting of the names of rule variables to use as structural zeros variables when performing zero perturbation. The entries are processed in order. Each source of a specified variable uses that variable as its structural zeros variable unless it has already been assigned to a previous entry in the array. Queries on any sources of these variables have zero perturbation applied. An equivalent query using the appropriate structural zeros variable is used to determine zeros that should not be perturbed.
A base variable cannot be used as a structural zeros variable, nor can a variable that is a source of another structural zeros variable.
Percentage of ones perturbed to zero¶
The percentage of ones perturbed to zero is specified in the configuration file. This is the percentage of cells with an unperturbed value of one at the structural zeros variable that are perturbed down to zero, to match the number of zeros perturbed up. It therefore determines the number of zeros perturbed upwards for a query. The default is 29.999000%.
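The relationship between the percentage and the number of zeros perturbed can be sketched as simple arithmetic. The Python example below only computes the expected count implied by the percentage. How Cantabular rounds this figure and which individual cells it selects are not described here, so those steps are left out, and the function name and the sample count of 1000 are hypothetical.

```python
DEFAULT_PERCENTAGE = 29.999  # documented default percentage of ones perturbed to zero


def expected_zeros_perturbed(ones_count, percentage=DEFAULT_PERCENTAGE):
    """Expected number of ones perturbed down, and hence zeros perturbed up."""
    if ones_count < 0:
        raise ValueError("ones_count must not be negative")
    return ones_count * percentage / 100.0


# Hypothetical query with 1000 cells of value one at the structural zeros variable.
print(round(expected_zeros_perturbed(1000), 2))
```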