Information Entropy is a way of measuring data in terms of the amount of uncertainty it resolves. We'll use this perspective to explore techniques for structuring and analysing your data. You will learn practical ideas for how to extract more value from your data and leave with a framework for understanding the value proposition of data-driven products.
Lessons learned from analysis
explained with information theory
Location: Berlin
Date: 2019-09-24
Temperature: 15°C
Yes, you'll need a coat
Understand what information is contained in your data
Discover what's really valuable about your data
Learn how to structure and organise it effectively
Source → Encoder → Symbol → Decoder → Destination
Freedom of expression
Measurable quantity
Thirst for knowledge
Source → Encoder → Symbol → Decoder → Destination
Unobserved choices
Measure of information entropy
Resolved uncertainty
So how do we measure choice or uncertainty?
Correctly guessing the result of a biased coin toss:
Outcome Probabilities:$$\begin{align} P(H) &= 3/4 \\ P(T) &= 1/4\\ \end{align}$$
Simple binary coding:$$\begin{align} H &= 0 \\ T &= 1\\ \end{align}$$
$$\begin{align} \text{Average bits per toss} = & P(H) \times 1 \text{ bit} + \\ & P(T) \times 1 \text{ bit} \\ = & 1 \text{ bit} \end{align}$$
Encoding two tosses at a time. Outcome probabilities:$$\begin{align} P(HH) &= 9/16 \\ P(HT) &= 3/16 \\ P(TH) &= 3/16 \\ P(TT) &= 1/16 \\ \end{align}$$
Huffman coding:$$\begin{align} HH &= 0 \\ HT &= 10 \\ TH &= 110 \\ TT &= 111 \\ \end{align}$$
$$\begin{align} \text{Average bits per sequence} = & P(HH) \times 1 \text{ bit } + P(HT) \times 2 \text{ bits } + \\ & P(TH) \times 3 \text{ bits } + P(TT) \times 3 \text{ bits} \\ = & 1.6875 \text{ bits} \\ \text{Average bits per toss} = & 1.6875/2 \\ = & 0.84375 \text{ bits} \\ \end{align}$$
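A quick way to check that arithmetic is to compute the expected codeword length directly; a minimal Python sketch using the probabilities and code lengths from the example above:

```python
# Expected length of the two-toss Huffman code above.
probs = {"HH": 9/16, "HT": 3/16, "TH": 3/16, "TT": 1/16}
code_bits = {"HH": 1, "HT": 2, "TH": 3, "TT": 3}  # bits per codeword

bits_per_sequence = sum(p * code_bits[s] for s, p in probs.items())
print(bits_per_sequence)      # 1.6875
print(bits_per_sequence / 2)  # 0.84375 bits per toss
```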
Tells us the theoretical limit for compression
The irreducible information content of a source/variable
A way of thinking about how valuable data is
Bits of information required to communicate an outcome$$\begin{aligned} I(\text{outcome}) &= \log_2 \frac{1}{P(\text{outcome})} \\ &= - \log_2 P(\text{outcome}) \end{aligned}$$
Average bits of information per outcome$$H(\text{source}) = \sum_{\text{outcome} \in \text{source}} {-P(\text{outcome}) \log_2 {P(\text{outcome})}}$$
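Both formulas translate directly into code; a minimal Python sketch, applied to the biased coin above. The entropy comes out at roughly 0.811 bits per toss, the compression limit mentioned earlier, which the two-toss Huffman code (0.84375 bits per toss) approaches but cannot beat.

```python
import math

def self_information(p):
    """Bits required to communicate an outcome with probability p."""
    return -math.log2(p)

def entropy(probabilities):
    """Average bits of information per outcome."""
    return sum(p * self_information(p) for p in probabilities if p > 0)

print(self_information(3/4))  # ~0.415 bits for heads
print(self_information(1/4))  # 2 bits for tails
print(entropy([3/4, 1/4]))    # ~0.811 bits per toss
```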
More information requires more data
Keep more decimal places than you need to show
Don't convert continuous variables (numbers) to discrete ones (categories)
You can't unsimplify
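A minimal pandas sketch of the same advice, assuming a hypothetical table of raw measurements: derive the rounded and categorical views from the raw column instead of replacing it.

```python
import pandas as pd

# Hypothetical raw measurements: keep these at full precision.
df = pd.DataFrame({"temperature": [15.237, 18.914, 21.502]})

# Derive simplified views for presentation only.
df["temperature_display"] = df["temperature"].round(1)
df["temperature_band"] = pd.cut(
    df["temperature"],
    bins=[-float("inf"), 18, float("inf")],
    labels=["cool", "warm"],
)

# The raw column can always be re-simplified later; the reverse is impossible.
```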
More data doesn't mean more information
Normalisation: a way to structure your data to reduce redundancy
| Show | Actor |
|---|---|
| Brooklyn 99 | Stephanie Beatriz, Terry Crews |
| Show | Actor |
|---|---|
| Brooklyn 99 | Stephanie Beatriz |
| Brooklyn 99 | Terry Crews |
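In pandas, this first step (one row per show and actor pair) is a split followed by an explode; a minimal sketch assuming the toy table above and a recent pandas version:

```python
import pandas as pd

shows = pd.DataFrame({
    "Show": ["Brooklyn 99"],
    "Actor": ["Stephanie Beatriz, Terry Crews"],
})

# Split the multi-valued cell, then give each actor its own row.
shows["Actor"] = shows["Actor"].str.split(", ")
shows = shows.explode("Actor")  # one row per (Show, Actor) pair
```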
| Show | Rating | Language | Subtitles |
|---|---|---|---|
| Rick and Morty | 4 | English | Available |
| Rick and Morty | 4 | German | Unavailable |
| Show | Rating |
|---|---|
| Rick and Morty | 4 |
| Show | Language | Subtitles |
|---|---|---|
| Rick and Morty | English | Available |
| Rick and Morty | German | Unavailable |
| Show | Genre | Sub-Genre |
|---|---|---|
| Doctor Who | Fiction | Sci-Fi |
| Show | Sub-Genre |
|---|---|
| Doctor Who | Sci-Fi |
| Sub-Genre | Genre |
|---|---|
| Sci-Fi | Fiction |
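The same split can be expressed in code; a sketch assuming the toy catalogue above, where Genre depends on Sub-Genre rather than on the show itself:

```python
import pandas as pd

catalogue = pd.DataFrame({
    "Show": ["Doctor Who"],
    "Genre": ["Fiction"],
    "Sub-Genre": ["Sci-Fi"],
})

# Genre is determined by Sub-Genre, so keep it in its own lookup table.
show_subgenre = catalogue[["Show", "Sub-Genre"]]
subgenre_genre = catalogue[["Sub-Genre", "Genre"]].drop_duplicates()
```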
What data scientists want...
| Example ID | Feature A | Feature B | Feature C | Classification |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
Informative variables correlate with the objective, but not with each other
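One way to see this is a plain correlation matrix; a sketch with synthetic data, where feature B is nearly a copy of feature A and so adds almost no information about the objective:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = a + rng.normal(scale=0.1, size=n)  # redundant: mirrors A
c = rng.normal(size=n)                 # independent of A and B
objective = a + c + rng.normal(scale=0.5, size=n)

df = pd.DataFrame({"A": a, "B": b, "C": c, "objective": objective})
corr = df.corr()

print(corr["objective"])   # how strongly each feature tracks the objective
print(corr.loc["A", "B"])  # close to 1: A and B carry the same information
```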
Defining the Question...
If in doubt, follow the money!
e.g. Unitless measures
Describe and explain your data
Aid discovery and interpretation
Track provenance
Handles to manipulate your data
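A minimal sketch of what such metadata might look like when kept alongside a dataset; all of the field names here are illustrative, not a standard:

```python
# Illustrative metadata record kept next to a dataset (field names are examples).
metadata = {
    "description": "Yearly pageviews for the site",   # describe and explain
    "units": "pageviews per year",                     # aid interpretation
    "source": "web server logs, aggregated nightly",   # track provenance
    "extracted_at": "2019-09-24",
    "columns": {"date": "ISO 8601 year", "pageviews": "non-negative integer"},
}
```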
```json
{
  "2017": 8000,
  "2018": 10000,
  "2019": 15000
}
```
```json
[
  { "date": "2017", "pageviews": 8000 },
  { "date": "2018", "pageviews": 10000 },
  { "date": "2019", "pageviews": 15000 }
]
```
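The difference shows up as soon as you work with the data; a sketch assuming pandas, where the record layout loads directly while the keyed layout first has to be told what its values mean:

```python
import pandas as pd

# Keyed layout: the field names have to be supplied at load time.
keyed = {"2017": 8000, "2018": 10000, "2019": 15000}
df_keyed = pd.DataFrame(list(keyed.items()), columns=["date", "pageviews"])

# Record layout: self-describing, loads directly, and new fields
# simply become new columns.
records = [
    {"date": "2017", "pageviews": 8000},
    {"date": "2018", "pageviews": 10000},
    {"date": "2019", "pageviews": 15000},
]
df_records = pd.DataFrame(records)
```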