Codebook¶
Format¶
The codebook is fully described in the “Cantabular Codebook Format” document. In brief, its purpose is:
* To describe the coded variables and categories in natural language, e.g. English:
  * for the user to select the variables and categories they are interested in;
  * for presentation to the user in tabular output.
* To describe recategorizations of the data which the user may select:
  * e.g. the data may contain ‘age in single years’, but the user may want the output cross-tabulated by ‘age in 10-year bands’;
  * these recategorizations are referred to in Cantabular documentation as “mappings” but may also be called:
    * “mapped variables”;
    * “target variables”;
    * “derived variables”;
  * (in this case, ‘age in single years’ would be an example of a “source variable”).
Take the example CSV data above with "city, age, sex". The codebook needs to describe these three variables and
their categories. The format has one file per variable.
Errors¶
Only a limited number of errors are reported when processing each codebook variable or mapping file,
to avoid generating a very large number of output messages. You can control the limit by setting
the CANTABULAR_MAX_CODEBOOK_FILE_ERRORS environment variable to the maximum number of errors
that you wish to display.
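As a minimal sketch, the limit could be set from Python when preparing the environment for a child process. Only the variable name comes from this document; the helper name `codebook_env` and the value 50 are illustrative assumptions, and the command that would actually be launched is deliberately left out.

```python
import os


def codebook_env(max_errors: int) -> dict[str, str]:
    """Return a copy of the current environment with the error limit set.

    `codebook_env` is a hypothetical helper, not part of Cantabular.
    """
    env = dict(os.environ)
    # Environment variable values are always strings.
    env["CANTABULAR_MAX_CODEBOOK_FILE_ERRORS"] = str(max_errors)
    return env


env = codebook_env(50)
# `env` can then be passed as subprocess.run(..., env=env) when starting
# the process that loads the codebook.
```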
Example codebook¶
The codebook files are put together in a zip file or a directory. Each variable has its own CSV file describing each valid category for the variable. Additionally, there is an “index” CSV file giving a human-readable name for each variable, and there may be mapping files describing recategorizations of source data.
The following files constitute the codebook for the example datasets supplied with the Cantabular software release.
Index file¶
example/codebook/codebook.csv:
variable name,variable label
city,City
country,Country
health,Health
sex,Sex
siblings,Number of siblings
siblings_3,Number of siblings (3 mappings)
earnings,Earnings
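To illustrate how the index file ties variable names to their labels, here is a minimal sketch that reads it with Python's standard `csv` module. The function name `read_index` is hypothetical, and the sketch assumes a plain two-column file with a single header row, as in the example index.

```python
import csv
import io

# Contents of example/codebook/codebook.csv, copied from the example.
INDEX_CSV = """\
variable name,variable label
city,City
country,Country
health,Health
sex,Sex
siblings,Number of siblings
siblings_3,Number of siblings (3 mappings)
earnings,Earnings
"""


def read_index(text: str) -> dict[str, str]:
    """Map each variable name to its human-readable label."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    # Ignore any blank lines; every other row is assumed to have two fields.
    return {row[0]: row[1] for row in reader if row}


labels = read_index(INDEX_CSV)
print(labels["siblings_3"])  # prints: Number of siblings (3 mappings)
```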
Variable files for city, sex and siblings¶
example/codebook/city.csv:
city code,city label
0,London
1,Liverpool
2,Belfast
example/codebook/sex.csv:
sex code,sex label
0,Male
1,Female
example/codebook/siblings.csv:
siblings code,siblings label
0,No siblings
1,1 sibling
2,2 siblings
...
5,5 siblings
6,6 or more siblings
Note that the "..." notation causes Cantabular to generate all the intervening categories.
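The effect of the "..." row can be sketched in Python. This is not Cantabular's implementation: it assumes integer codes, a normal row on either side of the "..." row, and that each generated label reuses the preceding row's label with the code number substituted. The exact rules are given in the “Cantabular Codebook Format” document. The function name `expand_ellipsis` is hypothetical.

```python
import re


def expand_ellipsis(rows):
    """Expand a "..." row into the categories between its neighbours."""
    expanded = []
    for i, row in enumerate(rows):
        if row != ["..."]:
            expanded.append((row[0], row[1]))
            continue
        prev_code, prev_label = rows[i - 1]
        next_code = rows[i + 1][0]
        for code in range(int(prev_code) + 1, int(next_code)):
            # Assumption: derive the label by replacing the previous code
            # number inside the previous label.
            pattern = rf"\b{re.escape(prev_code)}\b"
            expanded.append((str(code), re.sub(pattern, str(code), prev_label)))
    return expanded


# Rows of example/codebook/siblings.csv without the header.
rows = [
    ["0", "No siblings"],
    ["1", "1 sibling"],
    ["2", "2 siblings"],
    ["..."],
    ["5", "5 siblings"],
    ["6", "6 or more siblings"],
]
expanded = expand_ellipsis(rows)
for code, label in expanded:
    print(code, label)
```

Under these assumptions the six rows expand to seven categories, with codes 3 and 4 filled in between "2 siblings" and "5 siblings".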
example/codebook/health.csv:
health code,health label
1,Very good health
2,Good health
3,Fair health
4,Bad health
5,Very bad health
Variable file for mappings¶
siblings_3 is a mapping of siblings. The microdata does not contain a column for siblings_3. The
variable file for a mapping has the same structure as that for any other variable.
example/codebook/siblings_3.csv:
siblings_3 code,siblings_3 label
0,No siblings
1-2,1 or 2 siblings
3+,3 or more siblings
Rule variables¶
country is a mapping of city. These two variables are rule variables, with city as the
rule base variable.
example/codebook/country.csv:
country code,country label
E,England
N,Northern Ireland
Disclosure control rules are applied independently to each category in a rule variable. The rule base variable is the variable from which all other rule variables are derived. There can only be one rule base variable, but multiple rule variables can be directly derived from any other rule variable including the rule base variable.
Mapping files¶
A mapping file specifies how codes in the source variable map to codes in the mapped variable. The following file
shows how categories in siblings are mapped to categories in siblings_3:
example/codebook/siblings_3.mapping.csv:
siblings code,siblings_3 code
0,0
1>2,1-2
3>6,3+
The greater-than sign (>) indicates a range of codes in the original variable.
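As a rough illustration of the range notation, the sketch below expands a mapping file into a lookup from each source code to its mapped code. It assumes that range bounds are integers and inclusive, which matches the example where 3>6 covers the siblings codes 3 to 6. The function name `expand_mapping` is hypothetical and this is not Cantabular's own parser.

```python
import csv
import io

# Contents of example/codebook/siblings_3.mapping.csv.
MAPPING_CSV = """\
siblings code,siblings_3 code
0,0
1>2,1-2
3>6,3+
"""


def expand_mapping(text: str) -> dict[str, str]:
    """Return a dict from every source code to its mapped code."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    mapping = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        source, target = row
        if ">" in source:
            # "low>high" stands for every integer code from low to high.
            low, high = source.split(">")
            codes = [str(code) for code in range(int(low), int(high) + 1)]
        else:
            codes = [source]
        for code in codes:
            mapping[code] = target
    return mapping


mapping = expand_mapping(MAPPING_CSV)
print(mapping["4"])  # prints: 3+
```

A record whose siblings value is 4 would therefore be counted in the "3 or more siblings" category of siblings_3.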
The equivalent file for country shows how categories in city are mapped to categories in country:
example/codebook/country.mapping.csv:
city code,country code
0>1,E
2,N