Specification¶
The “codebook” includes both variables and mappings, but not category keys, which are kept separate for security reasons.
Points common to the whole codebook:
The codebook is a set of files in a directory or the same files in a single ZIP file.
There is an “index” file which is the start point for referencing all the other files.
Files not referenced by the index file are ignored.
Files are in CSV format encoded in UTF-8. Blank lines are ignored. The CSV specification is RFC 4180. See also https://golang.org/pkg/encoding/csv/ for a pithy description.
All CSV files must have a header line as described.
Full stop is used as a separator between variable names. If you need to use a full stop in a variable name then double it up in the file name, e.g.
var..name.base..name.mapping.csv
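The doubling rule can be sketched in Python as follows. The function names are hypothetical. The left-to-right treatment of runs of three or more full stops is an assumption, since the text above does not cover that case.

```python
def escape_name(name: str) -> str:
    """Double every full stop so the name can be embedded in a file name."""
    return name.replace(".", "..")


def split_file_name(file_name: str) -> list[str]:
    """Split on single full stops; a doubled full stop yields a literal '.'."""
    parts, current, i = [], [], 0
    while i < len(file_name):
        if file_name.startswith("..", i):
            current.append(".")  # doubled full stop: literal character
            i += 2
        elif file_name[i] == ".":
            parts.append("".join(current))  # single full stop: separator
            current = []
            i += 1
        else:
            current.append(file_name[i])
            i += 1
    parts.append("".join(current))
    return parts


print(split_file_name("var..name.base..name.mapping.csv"))
```

With the example file name above, the parts are var.name, base.name, mapping and csv.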
Index file¶
The index file is by default called: codebook.csv
An alternate name for the index file can be specified via the Cantabular configuration file.
It is a CSV file with one line per variable or mapping and the following header:
variable name,variable label
The first field (name) is the variable or mapping name used in the API. The second field (label) is the variable name displayed to the user.
The label may be omitted if it is the same as name, but the comma is still required: each line must contain two fields if the header has two fields.
If all labels are the same as the names then you can alter the header to have only one field, thus:
variable name
In this case each line must have only one field containing the variable name.
Both variable names and labels must be unique when compared case-insensitively. This is to avoid confusion for users when making queries and processing output.
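The index file rules can be checked with a short Python sketch like the one below. It is not Cantabular's own loader. The function name read_index is hypothetical, and the exact, case-sensitive header match is an assumption.

```python
import csv
import io


def read_index(text: str) -> dict[str, str]:
    """Return {name: label} from index file content, applying the rules above."""
    # csv.reader yields an empty list for a blank line, so those are skipped.
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        raise ValueError("missing header line")
    header, body = rows[0], rows[1:]
    # Assumption: the header must match exactly, including case.
    if header not in (["variable name", "variable label"], ["variable name"]):
        raise ValueError(f"unexpected header: {header}")
    variables, seen_names, seen_labels = {}, set(), set()
    for row in body:
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} field(s) in {row}")
        name = row[0]
        # An omitted (empty) label defaults to the name.
        label = row[1] if len(row) == 2 and row[1] else name
        if name.casefold() in seen_names:
            raise ValueError(f"duplicate variable name: {name}")
        if label.casefold() in seen_labels:
            raise ValueError(f"duplicate variable label: {label}")
        seen_names.add(name.casefold())
        seen_labels.add(label.casefold())
        variables[name] = label
    return variables


print(read_index("variable name,variable label\nAGE,Age\n\nSEX,\n"))
```

The variable names AGE and SEX are illustrative only.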
Variable file¶
Variable files are named after the variable name with a CSV extension, all in lowercase: varname.csv
These files have the following header (where varname is replaced with the actual variable name):
varname code,varname label
“varname” must match the filename prefix, but can be in a different case.
Each line has the form:
code,label
If all labels are the same as the codes then you can alter the header to have only one field, thus:
varname code
In this case each line must have only one field containing the code.
Cantabular can also generate ranges of categories using a line containing just “...” thus:
startcode,startlabel
...
endcode,endlabel
All of the startcode, startlabel, endcode and endlabel must contain a single sequence of digits.
The integer formed from these digits will be incremented to generate a sequence of codes and labels to replace the “...”.
The number must be the same in the startcode and startlabel.
Similarly the number must be the same in endcode and endlabel.
There must be a difference of at least two between the start and end numbers.
For example you could generate an age variable thus:
AGE code,AGE label
0,Aged 0 years
...
79,Aged 79 years
80,Aged 80-89 years
90,Aged 90 or over
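The range expansion can be illustrated with the Python sketch below. It is not Cantabular's implementation. The function names are hypothetical. Using the start row as the template for generated rows is an assumption, and so is writing the generated numbers without leading zeros.

```python
import re

DIGITS = re.compile(r"\d+")


def range_number(row: list[str]) -> int:
    """Return the single integer shared by every field of a range endpoint row."""
    numbers = set()
    for field in row:
        runs = DIGITS.findall(field)
        if len(runs) != 1:
            raise ValueError(f"expected exactly one run of digits in {field!r}")
        numbers.add(int(runs[0]))
    if len(numbers) != 1:
        raise ValueError(f"code and label numbers differ in {row}")
    return numbers.pop()


def expand_ranges(rows: list[list[str]]) -> list[list[str]]:
    """Replace each ['...'] row with the rows it stands for."""
    expanded = []
    for i, row in enumerate(rows):
        if row != ["..."]:
            expanded.append(row)
            continue
        if i == 0 or i + 1 >= len(rows):
            raise ValueError("'...' needs a row before and after it")
        start_row = rows[i - 1]
        start, end = range_number(start_row), range_number(rows[i + 1])
        if end - start < 2:
            raise ValueError("start and end numbers must differ by at least two")
        for n in range(start + 1, end):
            # Assumption: the start row is the template for generated rows.
            expanded.append([DIGITS.sub(str(n), field) for field in start_row])
    return expanded


age_rows = [
    ["0", "Aged 0 years"],
    ["..."],
    ["79", "Aged 79 years"],
    ["80", "Aged 80-89 years"],
    ["90", "Aged 90 or over"],
]
print(len(expand_ranges(age_rows)))
```

Applied to the AGE example, the sketch yields one row for each age from 0 to 79, followed by the two explicit band rows.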
Mapping file¶
A mapping is a variable which is not present in the microdata. Instead each category in the variable is mapped from a number of categories in another variable (which can itself be a mapping).
Mappings can be used to group categories together, e.g. for creating age ranges or bands. Cross tabulations using grouped mappings result in larger counts in cells. Such tables are less likely to be disclosive.
Each mapping has a variable file as above together with a separate mapping file with the file name format:
varname.mapping*.csv
* is a wildcard that matches any string. Some examples of valid mapping file names are:
varname.mapping.csv
varname.mapping1.csv
varname.mapping.1.csv
varname.mapping.source.csv
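The file name pattern can be tested with Python's standard fnmatch module, as in this small sketch. The helper name is hypothetical. The sketch assumes the variable name contains no characters that fnmatch treats specially (such as [ or ?).

```python
from fnmatch import fnmatchcase


def is_mapping_file(file_name: str, varname: str) -> bool:
    """Check a file name against the varname.mapping*.csv pattern."""
    return fnmatchcase(file_name, f"{varname}.mapping*.csv")


for name in ["varname.mapping.csv", "varname.mapping.source.csv", "varname.csv"]:
    print(name, is_mapping_file(name, "varname"))
```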
It has the following header:
srcvarname code,varname code
srcvarname is the name of the variable from which this mapping is derived.
varname is the mapping variable name and must match the filename prefix.
Each line has the form:
srccode,mapcode
srccode is a code from the source variable file.
mapcode is a code from the mapping variable file.
srccode can also be a range: code1>code2. The range refers to codes in the srcvarname.csv file.
srccode can be * (unicode ASTERISK: U+002A) which is used to specify a default mapping code. Source codes that are not otherwise listed are mapped to the mapcode associated with *. It can only occur once.
The range separator is the GREATER-THAN SIGN (U+003E: >). If a code string includes a GREATER-THAN SIGN then it must be quoted by preceding it with a \ (REVERSE SOLIDUS: U+005C).
Any instances of REVERSE SOLIDUS must also be quoted by doubling up thus: \\
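The quoting rules for srccode can be sketched in Python as below. The function name and the returned tuple shapes are hypothetical. The sketch assumes that a REVERSE SOLIDUS makes the next character literal, whatever that character is.

```python
def parse_srccode(field: str) -> tuple[str, ...]:
    """Classify a srccode field as a default, a single code or a range."""
    if field == "*":
        return ("default",)
    parts, current, i = [], [], 0
    while i < len(field):
        ch = field[i]
        if ch == "\\":
            if i + 1 >= len(field):
                raise ValueError("REVERSE SOLIDUS at end of field")
            current.append(field[i + 1])  # quoted character is taken literally
            i += 2
        elif ch == ">":
            parts.append("".join(current))  # unquoted '>' separates a range
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    if len(parts) == 1:
        return ("code", parts[0])
    if len(parts) == 2:
        return ("range", parts[0], parts[1])
    raise ValueError(f"more than one unquoted '>' in {field!r}")


print(parse_srccode("Zone_002>Zone_003"))
print(parse_srccode("a\\>b"))
```

The first call is classified as a range. The second is a single code containing a literal GREATER-THAN SIGN.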
A mapping with a single source must not have more categories than its source variable. A multivariate mapping must not have more categories than the product of the number of categories in all source variables.
Multivariate mapping file¶
A multivariate mapping with 2 sources has a header of:
srcvarname1 code,srcvarname2 code,varname code
Mappings with more than 2 sources will have additional columns for each extra source variable. For instance a multivariate mapping with 3 sources has a header of:
srcvarname1 code,srcvarname2 code,srcvarname3 code,varname code
The default mapping entry for a multivariate mapping has a * in the column for each source variable. The following line would set the default mapping for a multivariate mapping with 2 sources to Other:
*,*,Other
Other field meanings are the same as for a mapping file.
Multiple mapping files for a variable¶
Variables can be independently derived from different sources that share a common base variable. For instance a variable Country might be independently mapped from Region and Electoral District. This allows complex hierarchies to be constructed.
In such cases the variable will have multiple mapping files associated with it, one for each independent mapping.
Each mapping file must have a unique name in the format already outlined. For instance a mapping called varname that has two independent sources might have mapping files with these names:
varname.mapping.src1.csv
varname.mapping.src2.csv
Multiple mappings can only be added if the following criteria are met:
The mapping and all sources are ultimately derived from the same base variable. That can be a source data variable or a multivariate mapping.
Each mapping must have a single immediate source variable.
Each category in the mapping variable must map to the same set of categories in the base variable.
The immediate source variable for a mapping cannot be mapped from any other source, either directly or via intermediate mappings. Consider the situation where Zone is the rule base variable, Region is mapped from Zone and Country is mapped from Region. A second mapping cannot be specified with Country mapped directly from Zone. If Electoral District is mapped from Zone, then Country could have mappings specified from both Region and Electoral District.
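The criterion that each mapping category must cover the same set of base categories can be checked by composing the mappings down to the base variable. The Python sketch below does so. The function name and all category codes are hypothetical, chosen only to mirror the Zone, Region, Electoral District and Country example.

```python
def base_sets(base_to_source: dict[str, str], source_to_target: dict[str, str]) -> dict[str, set[str]]:
    """For each target category, collect the base categories that reach it."""
    result: dict[str, set[str]] = {}
    for base_code, source_code in base_to_source.items():
        target = source_to_target[source_code]
        result.setdefault(target, set()).add(base_code)
    return result


# Hypothetical codes: Zone is the base variable.
zone_to_region = {"Z1": "North", "Z2": "North", "Z3": "South", "Z4": "South"}
zone_to_district = {"Z1": "D1", "Z2": "D2", "Z3": "D3", "Z4": "D3"}
region_to_country = {"North": "A", "South": "B"}
district_to_country = {"D1": "A", "D2": "A", "D3": "B"}

via_region = base_sets(zone_to_region, region_to_country)
via_district = base_sets(zone_to_district, district_to_country)
print(via_region == via_district)
```

If the two results differ for any Country category, the second mapping does not meet the criterion.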
Incomplete mappings¶
Mappings that are rule variables can also be incomplete mappings. These are mappings where some categories in a source variable are not mapped to any category in the mapping variable.
They are used in situations where a rule variable does not cover the entire population. For instance, if Zone is the rule base variable and City is derived from Zone, then not every Zone category need be mapped to a City category.
Source categories are marked as being unmapped by specifying mapcode as an empty string in the mapping file. For instance the following is the content of a file that maps Zone to City:
Zone code,City code
Zone_001,City_001
Zone_002>Zone_003,City_002
Zone_004,
Zone_005,City_003
Zone_006,
The categories Zone_004 and Zone_006 are not mapped to any category in City.
Note that the commas after Zone_004 and Zone_006 are required for this file to be interpreted correctly.
An empty string can also be specified as the default mapping code, so the mapping file for City could also be written as shown below, which has the effect of marking any unspecified categories in Zone as unmapped:
Zone code,City code
Zone_001,City_001
Zone_002>Zone_003,City_002
Zone_005,City_003
*,
When an incomplete mapping is used in a query, records that belong to unmapped categories will not be included in the output table. Using the above example, a query on City would not count any records with a Zone value of Zone_004 or Zone_006.