Microdata input
Format
A CSV file, one record per row, with an arbitrary number of columns.
Each column represents categorical data.
A record may have a weight specified, so may represent a fraction of an individual or multiple individuals.
Microdata can be compressed using gzip or bzip2. Cantabular automatically decompresses files ending in .gz or .bz2 as it reads them. The filename in the dataset configuration file should include any extension indicating a compressed format.
Using gzip-compressed microdata is recommended and typically results in a faster dataset build. This is because decompression and CSV parsing are done in parallel (assuming you have multiple cores, as is likely) and there is substantially less file system access.
bzip2 produces smaller files than gzip, but decompression is slower and compression is very slow. It is recommended only if you have hard file size constraints.
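As a minimal sketch of preparing a gzip-compressed microdata file, the following uses only the Python standard library. The helper name gzip_file is hypothetical and is not part of Cantabular; any tool that produces a standard .gz file should serve equally well.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src: Path) -> Path:
    """Write a gzip-compressed copy of src next to it and return the new path."""
    # Keep the original name and append .gz, e.g. microdata.csv -> microdata.csv.gz
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        # Stream the file in chunks so large microdata files are not loaded into memory
        shutil.copyfileobj(fin, fout)
    return dst
```

The dataset configuration file would then refer to the compressed name, including the .gz extension.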
A CSV header is required. It provides machine readable names for the columns of data.
Column names must be unique when compared case-insensitively.
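A header could be checked for this rule before building a dataset. The sketch below is illustrative only: the function name is hypothetical, and it assumes that comparing with str.casefold() is close enough to the case-insensitive comparison Cantabular performs, which may differ in detail.

```python
from collections import defaultdict


def find_duplicate_columns(header):
    """Return groups of column names that collide when case is ignored."""
    seen = defaultdict(list)
    for name in header:
        # Assumption: casefold() approximates the case-insensitive comparison
        seen[name.casefold()].append(name)
    return [names for names in seen.values() if len(names) > 1]


print(find_duplicate_columns(["city", "sex", "City"]))  # [['city', 'City']]
```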
Typically each column of data is represented by short alphanumeric codes for reasons of space efficiency and easy amendment of human readable descriptions.
As an example, one column may contain values of M and F, where M stands for “Male” and F stands for “Female”.
Special columns (names of these columns can be specified in configuration):
- Row keys
Used to determine the “cell key” for the cell key method. The row keys for all rows contributing to a single cell are summed modulo 256 in order to produce an effectively random number used to choose the perturbation for the cell.
- Row weight
Used to determine the contribution of an individual row to the frequency count of a single cell in an output table.
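The cell key arithmetic described for row keys can be sketched as follows. This shows only the sum modulo 256; how the resulting key selects a perturbation is not covered here, and the function name is hypothetical.

```python
def cell_key(row_keys):
    """Sum the row keys of all rows contributing to a cell, modulo 256."""
    return sum(row_keys) % 256


# Two rows with keys 255 and 43 contribute to the same cell
print(cell_key([255, 43]))  # (255 + 43) % 256 = 42
```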
Row keys may be omitted, in which case they are generated by cantabular-make-dataset using a cryptographically secure pseudo random number generator seeded from a secure hash of the microdata.
Row weights, if omitted, are assumed to be "1" for each row. If specified, they should be given in a decimal format (e.g. 0.125).
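To illustrate how weights affect a frequency count, the sketch below sums the weights of the rows falling into one cell and treats a missing weight as 1. The column name WEIGHT and the function name are assumptions made for the example; the actual column name is whatever the configuration specifies.

```python
from decimal import Decimal


def weighted_count(rows, weight_column="WEIGHT"):
    """Sum row weights; rows without the (hypothetical) weight column count as 1."""
    total = Decimal(0)
    for row in rows:
        # Decimal parses values such as "0.125" exactly, avoiding float rounding
        total += Decimal(row.get(weight_column, "1"))
    return total


weighted = [{"sex": "F", "WEIGHT": "0.125"}, {"sex": "F", "WEIGHT": "2"}]
unweighted = [{"sex": "F"}, {"sex": "F"}]

print(weighted_count(weighted))    # 0.125 + 2 = 2.125
print(weighted_count(unweighted))  # each row counts as 1, giving 2
```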
Errors
A limited number of errors are reported when processing the microdata file, to avoid generating a very large number of output messages. You can control the limit by setting the CANTABULAR_MAX_MICRODATA_ERRORS environment variable to the maximum number of errors that you wish to display.
Example microdata
The example microdata dataset contains a small microdata file, example/src/microdata/microdata.csv:
city,sex,siblings,RKEY
0,0,0,27
1,1,1,255
2,0,2,21
0,1,3,1
1,0,4,43
2,1,5,66
2,1,6,156
In the above example, there are 7 persons. The first person is in the city of London (code 0), is a Male (code 0), has 0 siblings and has an RKEY (row key) value of 27.
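As a rough illustration of how this file reads as records, the sketch below parses the same seven rows with Python's csv module and counts rows per (city, sex) combination. It is only a way of inspecting the data, not how Cantabular builds its tables, and the variable names are made up for the example.

```python
import csv
import io
from collections import Counter

# The example microdata, embedded here so the sketch is self-contained
MICRODATA = """\
city,sex,siblings,RKEY
0,0,0,27
1,1,1,255
2,0,2,21
0,1,3,1
1,0,4,43
2,1,5,66
2,1,6,156
"""

# The header row supplies the column names for each record
rows = list(csv.DictReader(io.StringIO(MICRODATA)))

# Unweighted frequency of each (city, sex) code pair
counts = Counter((row["city"], row["sex"]) for row in rows)

print(len(rows))           # 7
print(rows[0])             # {'city': '0', 'sex': '0', 'siblings': '0', 'RKEY': '27'}
print(counts[("2", "1")])  # 2
```

All values are read as strings, which matches the treatment of each column as categorical codes.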
The meaning of all the codes is specified in the codebook and is discussed in the next section.