Specification

The “codebook” includes both variables and mappings, but not category keys, which are kept separate for security reasons.

Points common to the whole codebook:

  • The codebook is a set of files in a directory or the same files in a single ZIP file.

  • There is an “index” file which is the start point for referencing all the other files.

  • Files not referenced by the index file are ignored.

  • Files are in CSV format encoded in UTF-8. Blank lines are ignored. The CSV specification is RFC 4180. See also https://golang.org/pkg/encoding/csv/ for a pithy description.

  • All CSV files must have a header line as described.

  • Full stop is used as a separator between variable names. If you need to use a full stop in a variable name then double it up in the file name, e.g. var..name.base..name.mapping.csv
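The full stop doubling rule can be illustrated with a minimal Python sketch. The helper names escape_name and split_file_name are hypothetical and not part of Cantabular. How a run of three or more full stops should be read is not stated above, so this sketch simply consumes doubled full stops from left to right.

```python
def escape_name(name: str) -> str:
    """Double every full stop so the name can be embedded in a file name."""
    return name.replace(".", "..")


def split_file_name(file_name: str) -> list[str]:
    """Split on single full stops, reading '..' as one literal full stop."""
    parts, current, i = [], "", 0
    while i < len(file_name):
        if file_name.startswith("..", i):
            current += "."          # doubled full stop: literal character
            i += 2
        elif file_name[i] == ".":
            parts.append(current)   # single full stop: separator
            current = ""
            i += 1
        else:
            current += file_name[i]
            i += 1
    parts.append(current)
    return parts
```

Applied to the example file name above, split_file_name yields the components var.name, base.name, mapping and csv.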

Index file

The index file is by default called: codebook.csv

An alternate name for the index file can be specified via the Cantabular configuration file.

It is a CSV file with one line per variable or mapping and the following header:

variable name,variable label

The first field (name) is the variable or mapping name used in the API. The second field (label) is the variable name displayed to the user.

The label may be omitted if it is the same as name, but the comma is still required: each line must contain two fields if the header has two fields.

If all labels are the same as the names then you can alter the header to have only one field, thus:

variable name

In this case each line must have only one field, containing the variable name.

Both variable names and labels must be unique when compared case-insensitively. This is to avoid confusion for users when making queries and processing output.
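The index file rules above can be checked with a short Python sketch. It is an illustration only, not Cantabular's own validation: the function name read_index is hypothetical, the header text itself is not verified, names and labels are assumed to be checked for uniqueness separately, and str.casefold is assumed to be an acceptable case-insensitive comparison.

```python
import csv
import io


def read_index(text: str) -> list[tuple[str, str]]:
    """Parse index CSV text into (name, label) pairs and check uniqueness."""
    # csv.reader yields an empty list for a blank line, so those are skipped.
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, body = rows[0], rows[1:]
    entries = []
    seen_names, seen_labels = set(), set()
    for row in body:
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(row)}: {row}")
        name = row[0]
        # An omitted label (empty field or one-field header) defaults to the name.
        label = row[1] if len(row) > 1 and row[1] else name
        if name.casefold() in seen_names:
            raise ValueError(f"duplicate variable name: {name}")
        if label.casefold() in seen_labels:
            raise ValueError(f"duplicate variable label: {label}")
        seen_names.add(name.casefold())
        seen_labels.add(label.casefold())
        entries.append((name, label))
    return entries
```

With a two-field header, a line such as SEX, (name followed by a comma) is accepted and its label falls back to SEX, whereas an index listing both AGE and age is rejected.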

Variable file

Variable files are named after the variable name with a CSV extension, all in lowercase: varname.csv

These files have the following header (where varname is replaced with the actual variable name):

varname code,varname label

“varname” must match the filename prefix, but can be in a different case.

The following lines are pairs of code,label. Each line corresponds to one category of the variable.

The first field (code) is the string used in the microdata or tabular input CSV. The second field (label) is the description of the code displayed to the user.

The label may be omitted if it is the same as code, but the comma is still required (each line must contain two fields if the header has two fields).

If all labels are the same as the codes then you can alter the header to have only one field, thus:

varname code

In this case each line must have only one field containing the code.

Cantabular can also generate ranges of categories using a line containing just “...” thus:

startcode,startlabel
...
endcode,endlabel

Each of startcode, startlabel, endcode and endlabel must contain a single sequence of digits. The integer formed from these digits will be incremented to generate a sequence of codes and labels to replace the “...”. The number must be the same in the startcode and startlabel. Similarly the number must be the same in endcode and endlabel. There must be a difference of at least two between the start and end numbers.

For example you could generate an age variable thus:

AGE code,AGE label
0,Aged 0 years
...
79,Aged 79 years
80,Aged 80-89 years
90,Aged 90 or over
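The range expansion can be sketched in Python as follows. This is a hedged illustration rather than Cantabular's implementation: expand_ranges is a hypothetical name, rows are assumed to be already parsed from a two-field variable file, the text around the digits is assumed to be taken from the start line, and generated numbers are written without zero padding.

```python
import re

DIGITS = re.compile(r"\d+")


def _number(code: str, label: str) -> int:
    """Return the single integer shared by a code and its label."""
    code_runs, label_runs = DIGITS.findall(code), DIGITS.findall(label)
    if len(code_runs) != 1 or len(label_runs) != 1:
        raise ValueError(f"expected one digit sequence in {code!r} and {label!r}")
    if int(code_runs[0]) != int(label_runs[0]):
        raise ValueError(f"numbers differ in {code!r} and {label!r}")
    return int(code_runs[0])


def expand_ranges(rows: list[list[str]]) -> list[tuple[str, str]]:
    """Replace each ['...'] row with the categories between its neighbours."""
    out = []
    for i, row in enumerate(rows):
        if row != ["..."]:
            out.append((row[0], row[1]))
            continue
        if i == 0 or i == len(rows) - 1:
            raise ValueError("'...' needs a line before and after it")
        start_code, start_label = rows[i - 1]
        end_code, end_label = rows[i + 1]
        start = _number(start_code, start_label)
        end = _number(end_code, end_label)
        if end - start < 2:
            raise ValueError("start and end must differ by at least two")
        for n in range(start + 1, end):
            # Assumption: substitute the new number into the start line's text.
            out.append((DIGITS.sub(str(n), start_code),
                        DIGITS.sub(str(n), start_label)))
    return out
```

For the AGE example above, the single “...” line expands to the 78 categories from 1,Aged 1 years to 78,Aged 78 years, giving 82 categories in total.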

Mapping file

A mapping is a variable which is not present in the microdata. Instead each category in the variable is mapped from a number of categories in another variable (which can itself be a mapping).

Mappings can be used to group categories together, e.g. for creating age ranges or bands. Cross tabulations using grouped mappings result in larger counts in cells. Such tables are less likely to be disclosive.

Each mapping has a variable file as above together with a separate mapping file with the file name format: varname.mapping*.csv

* is a wildcard that matches any string. Some examples of valid mapping file names are:

varname.mapping.csv
varname.mapping1.csv
varname.mapping.1.csv
varname.mapping.source.csv

It has the following header:

srcvarname code,varname code

srcvarname is the name of the variable from which this mapping is derived. Note that this can be another mapping.

varname is the mapping variable name and must match the filename prefix.

Each line has the form:

srccode,mapcode

srccode is a code from the source variable file. mapcode is a code from the mapping variable file.

srccode can also be a range of the form code1>code2, where the range separator is unicode GREATER-THAN SIGN (U+003E: >). The range covers code1, code2 and every code between them, in the order the codes are listed in the srcvarname.csv file.

srccode can be * (unicode ASTERISK: U+002A) which is used to specify a default mapping code. Any codes in the source variable that are not explicitly mapped to a code in the mapping variable will be mapped to the mapcode associated with *. It can only occur once.

If a code string includes a GREATER-THAN SIGN then it must be quoted by preceding with a \ (REVERSE SOLIDUS: U+005C). Any instances of REVERSE SOLIDUS must also be quoted by doubling up thus: \\
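A minimal Python sketch of reading a srccode field under these quoting rules is shown below. The function name parse_srccode is hypothetical. The sketch assumes that a REVERSE SOLIDUS makes the next character literal whatever it is, and it does not treat * specially, so a caller would need to check for the default entry on the raw field first.

```python
def parse_srccode(field: str) -> tuple[str, ...]:
    """Return (code,) for a single code or (first, last) for a range."""
    parts, current, i = [], "", 0
    while i < len(field):
        ch = field[i]
        if ch == "\\":
            if i + 1 >= len(field):
                raise ValueError("REVERSE SOLIDUS at end of field")
            current += field[i + 1]   # quoted character is taken literally
            i += 2
        elif ch == ">":
            parts.append(current)     # unquoted separator ends the first code
            current = ""
            i += 1
        else:
            current += ch
            i += 1
    parts.append(current)
    if len(parts) > 2:
        raise ValueError(f"more than one range separator in {field!r}")
    return tuple(parts)
```

For example, Zone_002>Zone_003 parses as a range of two codes, while a\>b parses as the single code a>b.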

A mapping with a single source must not have more categories than its source variable. A multivariate mapping must not have more categories than the product of the number of categories in all source variables.

Multivariate mapping file

A multivariate mapping with 2 sources has a header of:

srcvarname1 code,srcvarname2 code,varname code

Mappings with more than 2 sources will have additional columns for each extra source variable. For instance a multivariate mapping with 3 sources has a header of:

srcvarname1 code,srcvarname2 code,srcvarname3 code,varname code

The default mapping entry for a multivariate mapping has a * in the column for each source variable. The following line would set the default mapping for a multivariate mapping with 2 sources to Other:

*,*,Other

Other field meanings are the same as for a mapping file.
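The default entry of a multivariate mapping can be illustrated with a small Python lookup. This is only a sketch: the function name lookup and the category codes in the usage note are hypothetical, each mapping line is assumed to have been loaded into a dictionary keyed by the tuple of source codes, and ranges are left out.

```python
from typing import Optional


def lookup(mapping: dict[tuple[str, ...], str],
           codes: tuple[str, ...]) -> Optional[str]:
    """Return the mapcode for a combination of source codes.

    Falls back to the all-asterisk default entry when the exact
    combination is not listed, and to None when there is no default.
    """
    if codes in mapping:
        return mapping[codes]
    default_key = ("*",) * len(codes)   # one * per source variable
    return mapping.get(default_key)
```

With entries for (M, Young) and (F, Young) plus the default line *,*,Other, the combination (F, Old) resolves to Other.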

Multiple mapping files for a variable

Variables can be independently derived from different sources that share a common base variable. For instance a variable Country might be independently mapped from Region and Electoral District. This allows complex hierarchies to be constructed.

In such cases the variable will have multiple mapping files, one for each independent mapping. Each mapping file must have a unique name in the format already outlined. For instance a mapping called varname that has two independent sources might have mapping files with these names:

varname.mapping.src1.csv
varname.mapping.src2.csv

Multiple mappings can only be added if the following criteria are met:

  • The mapping and all sources are ultimately derived from the same base variable. That can be a source data variable or a multivariate mapping.

  • Each mapping must have a single immediate source variable.

  • Each category in the mapping variable must map to the same set of categories in the base variable.

  • The immediate source variable for a mapping cannot be mapped from any other source, either directly or via intermediate mappings. Consider the situation where Zone is the rule base variable, Region is mapped from Zone and Country is mapped from Region. A second mapping cannot be specified with Country mapped directly from Zone. If Electoral District is mapped from Zone, then Country could have mappings specified from both Region and Electoral District.

Incomplete mappings

Mappings that are rule variables can also be incomplete mappings. These are mappings where some categories in a source variable are not mapped to any category in the mapping variable.

They are used in situations where a rule variable does not cover the entire population. For instance if Zone is the rule base variable, and City is derived from Zone, then not every Zone category need be mapped to a City category.

Source categories are marked as being unmapped by specifying mapcode as an empty string in the mapping file. For instance the following is the content of a file that maps Zone to City:

Zone code,City code
Zone_001,City_001
Zone_002>Zone_003,City_002
Zone_004,
Zone_005,City_003
Zone_006,

The categories Zone_004 and Zone_006 are not mapped to any category in City. Note that the commas after Zone_004 and Zone_006 are required for this file to be interpreted correctly.

An empty string can also be specified as the default mapping code, so the mapping file for City could also be written as shown below, which has the effect of marking any unspecified categories in Zone as unmapped:

Zone code,City code
Zone_001,City_001
Zone_002>Zone_003,City_002
Zone_005,City_003
*,

When an incomplete mapping is used in a query, records that belong to unmapped categories will not be included in the output table. Using the above example, a query on City would not count any records with a Zone value of Zone_004 or Zone_006.
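The effect of an incomplete mapping on a query can be sketched in Python. The names build_mapping and count_by_mapping are hypothetical, and the sketch makes several assumptions: rows are already parsed into (srccode, mapcode) pairs, code strings contain no quoted characters, and a source code with neither an explicit line nor a default is treated as an error.

```python
from collections import Counter
from typing import Optional


def build_mapping(source_codes: list[str],
                  rows: list[tuple[str, str]]) -> dict[str, Optional[str]]:
    """Resolve mapping rows to {source code: mapcode}, None meaning unmapped."""
    position = {code: i for i, code in enumerate(source_codes)}
    resolved: dict[str, Optional[str]] = {}
    default, has_default = None, False
    for srccode, mapcode in rows:
        target = mapcode or None      # empty string marks the source as unmapped
        if srccode == "*":
            if has_default:
                raise ValueError("* can only occur once")
            default, has_default = target, True
        elif ">" in srccode:
            first, last = srccode.split(">")
            # Ranges follow the order of codes in the source variable file.
            for code in source_codes[position[first]:position[last] + 1]:
                resolved[code] = target
        else:
            resolved[srccode] = target
    for code in source_codes:
        if code not in resolved:
            if not has_default:
                raise ValueError(f"no mapping for {code}")
            resolved[code] = default
    return resolved


def count_by_mapping(records: list[str],
                     mapping: dict[str, Optional[str]]) -> Counter:
    """Count records per mapped category, skipping unmapped source codes."""
    return Counter(mapping[r] for r in records if mapping[r] is not None)
```

Both versions of the Zone to City file above resolve to the same mapping under this sketch, and records with a Zone value of Zone_004 or Zone_006 contribute to no City count.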