Codebook

Format

The codebook is fully described in the “Cantabular Codebook Format” document. In brief, its purpose is:

  • To describe the coded variables and categories in natural language, e.g. English:

    • for the user to select the variables and categories they are interested in;

    • for presentation to the user in tabular output.

  • To describe recategorizations of the data which the user may select:

    • e.g. the data may contain ‘age in single years’, but the user may want the output cross-tabulated by ‘age in 10-year bands’;

    • these recategorizations are referred to in Cantabular documentation as “mappings” but may also be called:

      • “mapped variables”;

      • “target variables”;

      • “derived variables”;

    • (in this case, ‘age in single years’ would be an example of a “source variable”).

Take the example CSV data above with "city, age, sex". The codebook needs to describe these three variables and their categories. The format has one file per variable.

Errors

To avoid generating a very large number of output messages, only a limited number of errors are reported when processing each codebook variable or mapping file. You can control the limit by setting the CANTABULAR_MAX_CODEBOOK_FILE_ERRORS environment variable to the maximum number of errors that you wish to display.

Example codebook

The codebook files are put together in a zip file or a directory. Each variable has its own CSV file describing each valid category for the variable. Additionally, there is an “index” CSV file giving a human-readable name for each variable, and there may be mapping files describing recategorizations of source data.

The following files constitute the codebook for the example datasets supplied with the Cantabular software release.

Index file

example/codebook/codebook.csv:

variable name,variable label
city,City
country,Country
health,Health
sex,Sex
siblings,Number of siblings
siblings_3,Number of siblings (3 mappings)
earnings,Earnings

Variable files for city, sex, siblings and health

example/codebook/city.csv:

city code,city label
0,London
1,Liverpool
2,Belfast

example/codebook/sex.csv:

sex code,sex label
0,Male
1,Female

example/codebook/siblings.csv:

siblings code,siblings label
0,No siblings
1,1 sibling
2,2 siblings
...
5,5 siblings
6,6 or more siblings

Note that the "..." notation causes Cantabular to generate all the intervening categories.
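
The expansion can be sketched in Python as follows. This is only an illustration, not Cantabular's implementation: the function name expand_ellipsis is hypothetical, the codes are assumed to be integers, and the way labels are produced for the generated rows (reusing the preceding label with its number replaced) is an assumption that the text above does not confirm.

```python
import csv
import io
import re

SIBLINGS_CSV = """siblings code,siblings label
0,No siblings
1,1 sibling
2,2 siblings
...
5,5 siblings
6,6 or more siblings
"""


def expand_ellipsis(rows):
    """Replace each "..." row with the integer codes between its neighbours.

    Hypothetical helper: assumes every "..." row sits between two rows
    whose codes are integers.
    """
    expanded = []
    for i, row in enumerate(rows):
        if row and row[0].strip() == "...":
            prev_code, prev_label = rows[i - 1]
            next_code = int(rows[i + 1][0])
            for code in range(int(prev_code) + 1, next_code):
                # Assumption: the label is the previous label with its
                # first number swapped for the generated code.
                label = re.sub(r"\d+", str(code), prev_label, count=1)
                expanded.append([str(code), label])
        else:
            expanded.append(row)
    return expanded


reader = csv.reader(io.StringIO(SIBLINGS_CSV))
header = next(reader)  # "siblings code,siblings label"
categories = expand_ellipsis(list(reader))
for code, label in categories:
    print(code, label)
```

Under these assumptions the "..." row is replaced by rows for codes 3 and 4, giving seven categories in total.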

example/codebook/health.csv:

health code,health label
1,Very good health
2,Good health
3,Fair health
4,Bad health
5,Very bad health

Variable file for mappings

siblings_3 is a mapping of siblings. The microdata does not contain a column for siblings_3. The variable file for a mapping has the same structure as that for any other variable.

example/codebook/siblings_3.csv:

siblings_3 code,siblings_3 label
0,No siblings
1-2,1 or 2 siblings
3+,3 or more siblings

Rule variables

country is a mapping of city. These two variables are rule variables with city as the rule base variable.

example/codebook/country.csv:

country code,country label
E,England
N,Northern Ireland

Disclosure control rules are applied independently to each category in a rule variable. The rule base variable is the variable from which all other rule variables are derived. There can only be one rule base variable, but multiple rule variables can be directly derived from any other rule variable including the rule base variable.

Mapping files

A mapping file specifies how codes in the source variable map to codes in the mapping. The following file shows how categories in siblings are mapped to categories in siblings_3:

example/codebook/siblings_3.mapping.csv:

siblings code,siblings_3 code
0,0
1>2,1-2
3>6,3+

The greater-than sign (>) indicates a range of codes in the original variable. For example, 3>6 covers the siblings codes 3, 4, 5 and 6.

The equivalent file for country shows how categories in city are mapped to categories in country:

example/codebook/country.mapping.csv:

city code,country code
0>1,E
2,N
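
To make the range notation concrete, the following Python sketch expands each ">" range into individual source codes and builds a lookup from source code to mapped code. It is not part of Cantabular: load_mapping is a hypothetical helper, and it assumes that ranges are inclusive at both ends and that range bounds are integers, as in the two example files above.

```python
import csv
import io

SIBLINGS_3_MAPPING = """siblings code,siblings_3 code
0,0
1>2,1-2
3>6,3+
"""

COUNTRY_MAPPING = """city code,country code
0>1,E
2,N
"""


def load_mapping(text):
    """Return a dict from each source code to its mapped code.

    Hypothetical helper: assumes "low>high" is an inclusive integer range.
    """
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    mapping = {}
    for source, target in reader:
        if ">" in source:
            low, high = source.split(">")
            for code in range(int(low), int(high) + 1):
                mapping[str(code)] = target
        else:
            mapping[source] = target
    return mapping


siblings_to_siblings_3 = load_mapping(SIBLINGS_3_MAPPING)
city_to_country = load_mapping(COUNTRY_MAPPING)

print(siblings_to_siblings_3["4"])  # mapped code for 4 siblings
print(city_to_country)
```

With the example files, every one of the seven siblings codes and all three city codes receives exactly one mapped code, which is what allows a source category to be recoded unambiguously.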