Taming Data with Science

Lessons learned from analysis, explained with information theory

Information Entropy is a way of measuring data in terms of the amount of uncertainty it resolves. We'll use this perspective to explore techniques for structuring and analysing your data. You will learn practical ideas for how to extract more value from your data and leave with a framework for understanding the value proposition of data-driven products.

Robin Gower, Equal Experts




[Diagram: a question-and-answer exchange about the weather]

"Do I need to bring a coat?" / "Should I take a coat?"

"What's the weather like?" / "Is it cold in Berlin today?"

Location: Berlin
Date: 2019-09-24
Temperature: 15°

"Yes, you'll need a coat"

Big Data !=
Big Information

Why should you care about taming your data?

Understand what information is contained in your data

Discover what's really valuable about your data

Learn how to structure and organise it effectively

Structure of the talk

  1. Theory: Information entropy
  2. Practice: Tame your data for analysis

Information Entropy

Communication Theory

Shannon's communication system

Source → Encoder → Symbol → Decoder → Destination

At the source: unobserved choices. In the transmitted symbols: a measure of information entropy. At the destination: resolved uncertainty.

So how do we measure choice or uncertainty?

How can knowing the distribution of responses reduce the amount of data we need to communicate?

Sending one toss at a time

Outcome Probabilities:$$\begin{align} P(H) &= 3/4 \\ P(T) &= 1/4\\ \end{align}$$

$$\begin{align} \text{Average bits per toss} = & P(H) \times 1 \text{ bit} + \\ & P(T) \times 1 \text{ bit} \\ = & 1 \text{ bit} \end{align}$$

Block encoding (two tosses)

Outcome Probabilities:$$\begin{align} P(HH) &= 9/16 \\ P(HT) &= 3/16 \\ P(TH) &= 3/16 \\ P(TT) &= 1/16 \\ \end{align}$$

$$\begin{align} \text{Average bits per sequence} = & P(HH) \times 1 \text{ bit } + P(HT) \times 2 \text{ bits } + \\ & P(TH) \times 3 \text{ bits } + P(TT) \times 3 \text{ bits} \\ = & 1.6875 \text{ bits} \\ \text{Average bits per toss} = & 1.6875/2 \\ = & 0.84375 \text{ bits} \\ \end{align}$$
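A quick sketch to check that arithmetic, assuming the prefix code HH→0, HT→10, TH→110, TT→111 (one Huffman code consistent with the lengths used above):

# Expected code length for the two-toss block code
# (codeword lengths: HH -> 1 bit, HT -> 2, TH -> 3, TT -> 3)
probabilities = {"HH": 9/16, "HT": 3/16, "TH": 3/16, "TT": 1/16}
code_lengths = {"HH": 1, "HT": 2, "TH": 3, "TT": 3}

bits_per_block = sum(probabilities[o] * code_lengths[o] for o in probabilities)
print(bits_per_block)      # 1.6875
print(bits_per_block / 2)  # 0.84375 bits per toss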

Information Entropy (motivation)

Tells us the theoretical limit for compression

The irreducible information content of a source or variable

A way of thinking about how valuable data is

Information Content

Bits of information required to communicate an outcome$$\begin{aligned} I(\text{outcome}) &= \log_2 \frac{1}{P(\text{outcome})} \\ &= - \log_2 P(\text{outcome}) \end{aligned}$$

Information Entropy

Average bits of information per outcome$$H(\text{source}) = \sum_{\text{outcome} \in \text{source}} {-P(\text{outcome}) \log_2 {P(\text{outcome})}}$$
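A minimal Python sketch of this formula, applied to the biased coin from the toss examples:

from math import log2

def entropy(probabilities):
    """Average bits of information per outcome (Shannon entropy)."""
    return sum(-p * log2(p) for p in probabilities if p > 0)

print(entropy([3/4, 1/4]))  # ~0.811 bits per toss

The two-toss block code above already gets down to 0.84375 bits per toss; longer blocks approach this 0.811-bit limit.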

Information Entropy (interpretation)

More information
requires
more data

Retain precision

Keep more decimal places than you need to show

Don't convert continuous variables (numbers) to discrete ones (categories)

You can't unsimplify
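A small sketch of why the conversion is one-way (the 18° cut-off is an arbitrary choice for illustration):

# Binning a continuous variable into categories discards detail for good
def to_category(temperature_celsius):
    return "cold" if temperature_celsius < 18 else "warm"  # hypothetical cut-off

readings = [14.8, 15.0, 17.9, 21.3]
print([to_category(t) for t in readings])  # ['cold', 'cold', 'cold', 'warm']
# Nothing recovers 14.8 vs 17.9 from "cold": keep the raw values
# and derive the categories on demand.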

More data doesn't
mean more information

Normalisation

A way to structure your data to reduce redundancy

1NF: one value per cell

Before (several values in one cell):

Show        | Actor
Brooklyn 99 | Stephanie Beatriz, Terry Crews

After (one value per cell):

Show        | Actor
Brooklyn 99 | Stephanie Beatriz
Brooklyn 99 | Terry Crews
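One way to repair this kind of 1NF violation, sketched with pandas (DataFrame.explode needs pandas 0.25 or later):

import pandas as pd

shows = pd.DataFrame({
    "Show": ["Brooklyn 99"],
    "Actor": ["Stephanie Beatriz, Terry Crews"],  # several values in one cell
})

# Split the multi-valued cell, then give each value its own row
shows["Actor"] = shows["Actor"].str.split(", ")
print(shows.explode("Actor"))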

2NF: cells depend on the whole key

Show           | Rating | Language | Subtitles
Rick and Morty | 4      | English  | Available
Rick and Morty | 4      | German   | Unavailable

Here Rating depends on Show alone, not on the whole key (Show, Language), so it belongs in its own table.

3NF: cells only depend on the key


Show       | Genre   | Sub-Genre
Doctor Who | Fiction | Sci-Fi

Here Genre depends on Sub-Genre rather than directly on the key: a transitive dependency to split out.
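A sketch of the split these rules imply, using the 2NF example above (pandas here is just one convenient way to express it):

import pandas as pd

episodes = pd.DataFrame({
    "Show": ["Rick and Morty", "Rick and Morty"],
    "Rating": [4, 4],
    "Language": ["English", "German"],
    "Subtitles": ["Available", "Unavailable"],
})

# Rating depends on Show alone, so it moves to its own table...
ratings = episodes[["Show", "Rating"]].drop_duplicates()
# ...leaving cells that depend on the whole key (Show, Language)
subtitles = episodes[["Show", "Language", "Subtitles"]]

The Genre table gets the same treatment: a Sub-Genre → Genre lookup removes the transitive dependency.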

What data scientists want...

Tidy Data

Example ID | Feature A | Feature B | Feature C | Classification
...        | ...       | ...       | ...       | ...
...        | ...       | ...       | ...       | ...
...        | ...       | ...       | ...       | ...
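A sketch of getting data into this shape: pandas' melt turns columns-as-data into one observation per row (the temperatures are made up for illustration):

import pandas as pd

wide = pd.DataFrame({
    "city": ["Berlin", "London"],
    "2018": [10.2, 11.8],   # data hiding in the column headers
    "2019": [10.9, 12.1],
})

tidy = wide.melt(id_vars="city", var_name="year", value_name="temperature")
print(tidy)  # one row per (city, year) observation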

Orthogonality

Informative variables correlate with the objective
but not with each other
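A sketch of how you might check this, with synthetic data (the names and numbers are illustrative):

import numpy as np

rng = np.random.default_rng(0)
feature_a = rng.normal(size=200)
feature_b = feature_a * 0.9 + rng.normal(scale=0.1, size=200)  # nearly redundant
objective = feature_a + rng.normal(scale=0.5, size=200)

# Off-diagonal values near 1 between features flag redundancy
print(np.corrcoef([feature_a, feature_b, objective]).round(2))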

MECE (mutually exclusive, collectively exhaustive)

Rarity is
informative

Visualise your probability distributions
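A minimal matplotlib sketch, with synthetic skewed values to make the rare outcomes visible:

import matplotlib.pyplot as plt
import numpy as np

values = np.random.default_rng(0).lognormal(sigma=1, size=1000)

plt.hist(values, bins=50)  # the long tail holds the informative, rare outcomes
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()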

Data is part of
a conversation

Start by defining the question

Defining the Question...

What decision needs to be taken?

If in doubt, follow the money!

Incorporate Context

e.g. unitless measures mean little without their context

Provide Metadata

Describe and explain your data

Aid discovery and interpretation

Track provenance
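A sketch of the kind of record this implies; the field names are illustrative, not a standard:

metadata = {
    "title": "Daily temperature readings",
    "description": "Surface air temperature, one reading per city per day",
    "units": "degrees Celsius",               # aid interpretation
    "source": "https://example.org/weather",  # provenance (hypothetical URL)
    "retrieved": "2019-09-24",
    "licence": "CC-BY-4.0",
}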

Handles to manipulate your data

Don't put data in the keys


{
  "2017":  8000,
  "2018": 10000,
  "2019": 15000
}

Data frames are easier to manipulate


[
  { "date": "2017", "pageviews":  8000 },
  { "date": "2018", "pageviews": 10000 },
  { "date": "2019", "pageviews": 15000 }
]
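For instance, once the years are values rather than keys, common manipulations are one-liners (sketched with pandas):

import pandas as pd

df = pd.DataFrame([
    {"date": "2017", "pageviews": 8000},
    {"date": "2018", "pageviews": 10000},
    {"date": "2019", "pageviews": 15000},
])

print(df["pageviews"].pct_change())  # year-on-year growth
print(df[df["pageviews"] > 9000])    # filter on values, not keys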

Add value by
resolving
uncertainty

Information Entropy

Taming Data with Science

github.com/robsteranium/tame-data-with-science

rgower@equalexperts.com

@robsteranium