Improve your data
Lessons learned from analysis, explained with the science of information entropy
These slides were made with reveal.js. Watch out: the slides go up and down as well as left and right!
You can navigate with arrow keys or hit escape for the overview.
The talk was originally presented at Open Data Manchester on 27/05/2014.
A more in-depth discussion of these ideas can be found on the Infonomics blog.
Robin Gower / @robsteranium
What makes data good?
I'm often asked by data-owners for guidance on sharing data.
This has also come up at ODM.
I prepared a few suggestions.
And found a theme that linked them.
TBL's 5 Star Scheme
This scheme certainly provides a strategic overview (release early and improve later, embrace openness, aim to create linked open data), but it doesn't say much about specific questions, such as: how should the data be structured or presented, and what should it include?
Increase Information Entropy
In writing this, it occurs to me that the general principle is to increase information entropy
Entropy in science refers to the number of ways a thermodynamic system can be arranged
Commonly thought of as a measure of disorder, or of our lack of information about it
Isolated systems may never decrease in entropy
Uncertainty - range of possible states
Can count states in terms of binary digits, or bits
A letter has 26 possible states, or about 4.7 bits (2^4.7 ≈ 26); adding capitals adds a bit, a numeral on its own is worth just over three bits, etc
Ultimately the entropy of a common word is lower than the combined entropy of its individual characters because it is predictable.
Given that you know the word begins with a "q", what do you think the next letter will be?
What really increases entropy is the number of characters, not swapping in punctuation etc
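As a rough sketch (my own illustration, not part of the original talk), the bit counts above can be reproduced with Python's `math.log2`; note how widening the alphabet adds far less entropy than adding another character.

```python
import math

def bits(states: int) -> float:
    """Bits needed to pick out one of `states` equally likely possibilities."""
    return math.log2(states)

print(round(bits(26), 2))       # one lower-case letter: 4.70 bits
print(round(bits(52), 2))       # letters plus capitals: 5.70 bits (one more bit)
print(round(bits(62), 2))       # letters, capitals and numerals: 5.95 bits
print(round(bits(10), 2))       # a numeral on its own: 3.32 bits
print(round(bits(26 ** 8), 2))  # an 8-letter string: states multiply, so bits add (8 x 4.70 = 37.6)
```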
Entropy is this range of possible data states
Data resolves uncertainty
Quantity
Clarity
Novelty
[Information entropy](http://en.wikipedia.org/wiki/Entropy_%28information_theory%29) is a measure of the expected value of a message.
It is higher when that message (once delivered) is able to resolve more uncertainty.
That is to say, the message is able to say more things, more clearly, that are novel to the recipient.
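To make the "expected value of a message" idea concrete, here is a minimal sketch (again my own illustration, not from the talk) of the standard Shannon entropy formula, H = -Σ p·log2(p):

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits: the expected information content of one message."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # a fair coin flip resolves a full bit of uncertainty
print(entropy([0.99, 0.01]))  # a near-certain outcome tells the recipient almost nothing (~0.08 bits)
print(entropy([1/26] * 26))   # a uniformly random letter matches the 4.7 bits above
```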
More is better than less
While it is (comparatively) easy to ignore irrelevant or useless data, it is impossible to consider data that you don't have.
If it's easy enough to share everything then do so. Bandwidth is cheap and it's relatively straightforward to filter data.
Those analysing your data may have a different perspective on what's useful - you don't know what they don't know.
This may be inefficient, particularly if the receiver is already in possession of the data you're sending.
Where your data set includes data from a third party it may be better to provide a linking index to that data, rather than to replicate it wholesale.
Indeed even if the data you have available to release is small, it may be made larger through linking it to other sources.
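For example (a hypothetical sketch using pandas; the area codes and figures are purely illustrative), a small release can publish just its own measurements plus a standard linking key, leaving the analyst to join on third-party data themselves:

```python
import pandas as pd

# Our release: only our own measurements plus a standard key (ONS-style area codes here).
ours = pd.DataFrame({
    "area_code": ["E08000003", "E08000009"],
    "visits": [1200, 340],
})

# The third party's data doesn't need to be replicated wholesale -
# anyone can join it on via the shared key when they need it.
population = pd.DataFrame({
    "area_code": ["E08000003", "E08000009"],
    "population": [503000, 227000],
})

print(ours.merge(population, on="area_code", how="left"))
```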
Normalise
Efficiency
Reliability
Integrity
Here I'm referring to [database normalisation](http://en.wikipedia.org/wiki/Database_normalization), rather than [statistical normalisation](http://en.wikipedia.org/wiki/Normalization_%28statistics%29).
If you have a table with two or more rows that need to be changed at the same time (because in some place they're referring to the same thing) then some normalisation is required.
A normalised database is one with minimal redundancy - the same data isn't repeated in multiple places.
Look-up tables are used, for example, so that a categorical variable doesn't need to have its categories repeated (and possibly misspelled).
Database normalisation ensures integrity (otherwise if two things purporting to be the same are different then how do you know which one is right?) and efficiency (repetition is waste).
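A minimal sketch of the idea (illustrative data, using pandas): repeated labels invite discrepancies, whereas a look-up table stores each label once and rows refer to it by code.

```python
import pandas as pd

# Denormalised: the label is repeated on every row, so one typo
# ("Greater Manchster") silently creates a second, spurious category.
denormalised = pd.DataFrame({
    "station": ["Piccadilly", "Victoria", "Oxford Road"],
    "region": ["Greater Manchester", "Greater Manchster", "Greater Manchester"],
})

# Normalised: the label lives once in a look-up table; rows carry only a code.
regions = pd.DataFrame({"region_code": [1], "region_name": ["Greater Manchester"]})
observations = pd.DataFrame({
    "station": ["Piccadilly", "Victoria", "Oxford Road"],
    "region_code": [1, 1, 1],
})

# The analyst can always denormalise again with a join - not the other way round.
print(observations.merge(regions, on="region_code"))
```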
Rail Passenger Survey Raw
It is already cross-tabulated (denormalised)
It's machine readable, but it's very difficult to parse
The variables are actually:
question, in green
answer, in red
sector or operating company, in blue
and of course count of responses (the actual data in yellow)
The same data can be normalised like this
And the aspects that were only identifiable by layout, are now codified into fields (columns)
Again the cases are duplicated for every comment on each case. The highlighted columns are all redundant.
This isn't a major problem here as the data clearly comes from a normalised system underneath.
In some circumstances, particularly where manual data entry is involved, this leaves scope for discrepancies.
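As a rough illustration of that normalisation (a made-up miniature of the survey layout, not the real file), pandas' melt turns the cross-tab back into explicit fields:

```python
import pandas as pd

# A simplified, made-up cross-tab in the spirit of the survey slide:
# one row per question/answer and one column per operating company.
crosstab = pd.DataFrame({
    "question": ["Overall satisfaction", "Overall satisfaction"],
    "answer": ["Satisfied", "Dissatisfied"],
    "Northern": [640, 120],
    "TransPennine": [410, 90],
})

# melt codifies what was only identifiable by layout into columns.
normalised = crosstab.melt(
    id_vars=["question", "answer"],
    var_name="operator",
    value_name="responses",
)
print(normalised)
```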
Be Precise, allow user to simplify
There's no such thing as unsimplification
Don't categorise continuous variables unless you can't help it
This seems to be particularly common with Age.
The problem is, of course, that different datasets make different choices about the age intervals, and so can't be compared.
One might use 'working age' 16-74 and another 'adult' 15+.
Unless data with the original precision (e.g. yearly age bands) can be found, the analyst will need to apportion or interpolate values between categories.
If you have to categorise, do it after data has been collected
Categories that do not divide a continuous dimension evenly are also problematic.
This is particularly common in survey data, where respondents are presented with a closed-list of intervals as options, rather than being asked to provide an estimate of the value itself.
The result is often that the majority of responses fall into one category, with few in the others.
Presenting a closed list of options is sometimes to be preferred for other reasons (e.g. in questions about income, categories might elicit more responses) - if so, the bounds should be chosen with reference to the expected frequencies of responses, not the linear scale of the dimension (i.e. the categories should have similar numbers of observations in them, not occupy similar-sized intervals along the range of the variable being categorised).
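A small sketch of the point (illustrative ages, using pandas): collect the precise value and only band it afterwards; if bands are needed, quantile-based bounds give categories with similar numbers of observations rather than similar widths.

```python
import pandas as pd

# Collect the precise value (age in years) and only band it afterwards, if at all.
ages = pd.Series([17, 19, 21, 22, 24, 25, 26, 28, 29, 44, 67])

# Equal-width bands pile most of these responses into one category...
equal_width = pd.cut(ages, bins=[0, 30, 60, 90])
# ...whereas quantile-based bands give each category a similar number of observations.
equal_frequency = pd.qcut(ages, q=3)

print(equal_width.value_counts())
print(equal_frequency.value_counts())
```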
Once you've lost precision, you can't get it back
VIDEO
Represent Nothingness Accurately
Not available
Null
Zero
It's important to distinguish between different types of nothingness. Nothing can be:
Not available - where no value has been provided (the value is unknown);
Null - where the value is known to be nothing;
Zero - which is actually a specific number (although it may sometimes be used to represent null).
A blank space or a number defaulting to 0 could be any of these types of nothingness. Not knowing which type of nothing you're dealing with can undermine analysis.
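A minimal sketch of why the distinction matters (made-up survey records, plain Python; the loose mapping of None to "null" and a missing field to "not available" is my own):

```python
# Three kinds of nothing that a blank cell or a default 0 can hide.
survey_responses = [
    {"respondent": "A", "children": 2},     # a known, specific number
    {"respondent": "B", "children": 0},     # zero: also a known, specific number
    {"respondent": "C", "children": None},  # null: explicitly recorded as no value
    {"respondent": "D"},                    # not available: the question wasn't answered
]

# Average over the genuinely known values only:
known = [r["children"] for r in survey_responses
         if r.get("children") is not None]
print(sum(known) / len(known))      # (2 + 0) / 2 = 1.0

# Collapsing every kind of nothing into zero quietly changes the answer:
assumed = [r.get("children") or 0 for r in survey_responses]
print(sum(assumed) / len(assumed))  # (2 + 0 + 0 + 0) / 4 = 0.5
```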
Provide Metadata
Metadata is data about data. It describes provenance (how the data was collected or derived) and coverage (e.g. years, places, limits to scope, criteria for categories), and provides warnings about assumptions and their implications for interpretation.
Metadata isn't just a descriptive narrative. It can be analysed as data itself. It can tell someone whether or not your data is relevant to their requirements without them having to download and review it.
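For instance, even a small machine-readable record (a hypothetical example below, loosely in the spirit of vocabularies such as DCAT) lets someone judge relevance without downloading the data itself:

```python
import json

# A hypothetical, minimal metadata record accompanying a data release.
metadata = {
    "title": "Visits to leisure centres by local authority",
    "provenance": "Derived from monthly till records, aggregated to calendar years",
    "coverage": {"years": [2011, 2012, 2013], "geography": "Local authorities in England"},
    "caveats": [
        "2011 figures exclude one centre closed for refurbishment",
        "Counts below 5 are suppressed and recorded as not available",
    ],
}

print(json.dumps(metadata, indent=2))
```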
Increase Information Entropy
Open more data to resolve more uncertainty
Duplication leads to uncertainty
Normalised data - same variety, but smaller/ clearer
Precise data - more possible states
Accurate nothingness - don't leave questions
Metadata makes your data more certain
Don't interpret or summarise
These tips are all related to a general principle of increasing entropy. As explained above, [Information entropy](http://en.wikipedia.org/wiki/Entropy_%28information_theory%29) is a measure of the expected value of a message. It is higher when that message (once delivered) is able to resolve more uncertainty. That is to say, the message is able to say more things, more clearly, that are novel to the recipient.
More data, whether in the original release or in the other sources that may be linked to it, means more variety, which means more uncertainty can be resolved, and thus more value provided.
Duplication (and thus the potential for inconsistency) in the message means that it doesn't resolve uncertainty, and thus doesn't add value.
Normalised data retains the same variety in a smaller, clearer message.
Precise data can take on more possible values and thus clarify more uncertainty than codified data.
Inaccurately represented nothingness also means that the message isn't able to resolve uncertainty (about which type of nothing applies).
Metadata makes the recipient more certain about the content of your data
Herein lies a counter-intuitive aspect of releasing data. It seems to be sensible to reduce variety and uncertainty in the data, to make sense and interpret the raw data before it is presented. To provide more rather than less ordered data. In fact such actions make the data less informative, and make it harder to re-interpret the data in a wider range of contexts. Indeed much of the impetus behind Big Data is the recognition that unstructured, raw data has immense information potential. It is the capacity for re-interpretation that makes data valuable.