Functional Reproducibility



robsteranium.github.io/functional-reproducibility

robin@infonomics.ltd.uk

robingower.com

Reproducible adj.
(of a measurement, experiment etc) capable of being reproduced at a different time or place and by different people.

Increasing Disorder CC BY Htkym

Maxwell’s Demon CC BY Htkym

\[f(x) = \frac{1}{\sigma \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]

theregister.com/2016/03/23/npm_left_pad_chaos/

CC-BY-NC 2.5 xkcd.com/2347/

Engineering reproducibility
in the face of entropy

Isn’t code reproducible?

Reading from a database:

users = pd.read_sql_table('users', connection)
users <- DBI::dbReadTable(connection, "users")

Drawing a sample at random:

coin_tosses = random.choices(['heads','tails'], k=10)
coin_tosses <- sample(c("heads","tails"), 10, replace = TRUE)

Writing to a filesystem:

data.to_csv('output.csv')
readr::write_csv(data, "output.csv")

None of them are reproducible!

Pure functions are reproducible

[Diagram: Input -> Function -> Output]
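A minimal sketch of a pure function, to make the diagram concrete. The name `normalise` and the sample values are illustrative assumptions, not part of the original talk. The result depends only on the argument, so repeated calls agree.

```python
def normalise(values):
    """Scale values so they sum to 1; depends only on its argument."""
    total = sum(values)
    return [v / total for v in values]

# Same input, same output, whenever and wherever it is called
first = normalise([1, 1, 2])
second = normalise([1, 1, 2])
print(first == second)
```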

Side-effects aren’t reproducible

[Diagram: Input -> Function -> Output; Function -> Side Effect; Side Effect -> Function]

Non-local state makes functions sensitive to context

counter = 0

def show_count():
  print(f'Count is {counter}')

show_count()
Count is 0
counter <- 0

show_count <- function() {
  cat("Count is", counter) # concatenate and print
}

show_count()
Count is 0

Sometime later…

counter = counter + 1

show_count() # now prints a different value
Count is 1
counter <- counter + 1

show_count() # now prints a different value
Count is 1

Explicit inputs/outputs let us separate code and context

counter = 0

def describe_count(count):
  return f'Count is {count}' # interpolation

print(describe_count(counter)) # rendering
Count is 0
counter <- 0

describe_count <- function(count) {
  paste("Count is", count) # string concatenation
}

print(describe_count(counter))
[1] "Count is 0"

Sometime later…

counter = counter + 1

print(describe_count(counter)) # receives a different value
Count is 1
counter <- counter + 1

print(describe_count(counter)) # receives a different value
[1] "Count is 1"

The context isn’t always apparent

import random

def toss_coins(n):
    return random.choices(['heads','tails'], k=n)

toss_coins(5)
['tails', 'tails', 'heads', 'heads', 'heads']
toss_coins <- function(n) sample(c("heads","tails"), n, replace = TRUE)

toss_coins(5)
[1] "heads" "tails" "tails" "heads" "heads"

Sometime later…

toss_coins(5) # result may differ from run to run
['tails', 'tails', 'heads', 'heads', 'heads']
toss_coins(5) # result differs
[1] "tails" "tails" "tails" "tails" "heads"

We can make the context explicit

random.seed(1234) # set state deterministically

toss_coins(5)
['tails', 'heads', 'heads', 'tails', 'tails']
set.seed(1234) # set state deterministically

toss_coins(5)
[1] "tails" "tails" "tails" "tails" "heads"

Sometime later…

random.seed(1234) # reset the state again

toss_coins(5) # result matches
['tails', 'heads', 'heads', 'tails', 'tails']
set.seed(1234) # reset the state again

toss_coins(5) # result matches
[1] "tails" "tails" "tails" "tails" "heads"
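Seeding the global generator still leaves the state hidden from the function signature. One alternative, sketched here as an assumption rather than the approach shown in the talk, is to pass a `random.Random` instance in as an argument so the randomness becomes an explicit input. The `rng` parameter is a name introduced for this illustration.

```python
import random

def toss_coins(n, rng):
    """Toss n coins using the generator supplied by the caller."""
    return rng.choices(['heads', 'tails'], k=n)

# Two generators built from the same seed give the same sequence
first = toss_coins(5, random.Random(1234))
second = toss_coins(5, random.Random(1234))
print(first == second)
```

Because each call receives its own generator, no other code that draws random numbers can disturb the result.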

I/O is side-effecting

Input is a side-effect

df = pandas.read_csv('/home/robin/data-290224-final.csv')
df <- readr::read_csv("/home/robin/data-290224-final.csv")

Output is a side-effect

result.to_csv('~/results/output.csv')
readr::write_csv(result, "~/results/output.csv")
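One way to contain these side-effects is to keep the calculation pure and confine reading and writing to a thin outer layer. The sketch below assumes a hypothetical `amount` column and uses in-memory text buffers in place of real files; the function names `summarise` and `run` are illustrative.

```python
import csv
import io

def summarise(rows):
    """Pure core: total the 'amount' column of already-parsed rows."""
    return sum(float(row['amount']) for row in rows)

def run(source, sink):
    """Thin I/O shell: read from source, compute, write to sink."""
    rows = list(csv.DictReader(source))
    sink.write(f'total,{summarise(rows)}\n')

# In-memory buffers stand in for files, so nothing touches the filesystem
source = io.StringIO('amount\n1.5\n2.5\n')
sink = io.StringIO()
run(source, sink)
print(sink.getvalue())
```

In production the same `run` could be handed real file handles, while `summarise` stays testable without any files at all.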

Dependency-injection is explicit

import sqlalchemy

my_engine = sqlalchemy.create_engine(...)
my_conn = my_engine.connect()
library(DBI)

my_conn <- dbConnect(...)

Instead of relying on global state:

def users():
  return pd.read_sql_table('users', my_conn)

users()
users <- function() {
  dbReadTable(my_conn, "users")
}

users()

We can make dependencies explicit:

def users(connection):
  return pd.read_sql_table('users', connection)

users(my_conn)
users <- function(connection) {
  dbReadTable(connection, "users")
}

users(my_conn)
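A benefit of injecting the connection is that a test can supply a throwaway database. This sketch swaps in the standard-library `sqlite3` module and plain SQL instead of pandas and SQLAlchemy; the table contents are made up for illustration.

```python
import sqlite3

def users(connection):
    """Read all user names through whichever connection is injected."""
    cursor = connection.execute('SELECT name FROM users ORDER BY name')
    return [name for (name,) in cursor]

# An in-memory database stands in for the production connection
test_conn = sqlite3.connect(':memory:')
test_conn.execute('CREATE TABLE users (name TEXT)')
test_conn.executemany('INSERT INTO users VALUES (?)', [('ada',), ('grace',)])

result = users(test_conn)
print(result)
```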

Execution context is explicit

Command-line arguments

python -m pipeline input.csv output.parquet
Rscript pipeline.r input.csv output.parquet

Environment variables

env DB="http://user:pw@localhost:1337" python -m pipeline
env DB="http://user:pw@localhost:1337" R -f pipeline.r

Configuration data

python -m pipeline configuration.yaml
Rscript pipeline.r configuration.yaml
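Inside the program, the execution context can be gathered once at the entry point and passed onward as a plain value. A minimal sketch, assuming a hypothetical `load_config` helper that takes the argument list and environment as parameters rather than reading them globally:

```python
import argparse

def load_config(argv, environ):
    """Collect the execution context into one explicit dictionary."""
    parser = argparse.ArgumentParser(prog='pipeline')
    parser.add_argument('input')
    parser.add_argument('output')
    args = parser.parse_args(argv)
    return {'input': args.input, 'output': args.output, 'db': environ.get('DB')}

# A real entry point would call load_config(sys.argv[1:], os.environ)
config = load_config(
    ['input.csv', 'output.parquet'],
    {'DB': 'http://user:pw@localhost:1337'},
)
print(config)
```

Everything downstream can then take `config` as an argument instead of consulting `sys.argv` or `os.environ` directly.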

Functional pipeline,
configured context

[Diagram: Input -> Function1 -> Function2 -> Function3 -> Output]
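The chain in the diagram can be sketched as ordinary function composition. The `pipeline` helper and the two steps below are hypothetical names chosen for illustration; each step is pure, so the composed pipeline is too.

```python
from functools import reduce

def pipeline(*steps):
    """Compose steps left to right into a single function."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)

def drop_missing(values):
    """Remove None entries."""
    return [v for v in values if v is not None]

def square(values):
    """Square every value."""
    return [v * v for v in values]

process = pipeline(drop_missing, square, sum)
result = process([1, None, 2, 3])
print(result)
```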

Lessons from software development

Versions are values over time

It is impossible to step in the same river twice

Heraclitus, c. 500 BC, possibly apocryphal

alisonhorst/palmerpenguins v0.1.0

s3://alisonhorst/palmerpenguins.csv?versionId=c19a904
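When a data source offers no version identifier, one option is to derive one from the content itself. This is a sketch under that assumption, not a description of how the sources above assign versions; the CSV bytes are made-up sample data and `content_version` is a hypothetical helper.

```python
import hashlib

def content_version(data: bytes) -> str:
    """Identify a dataset by the SHA-256 digest of its bytes."""
    return hashlib.sha256(data).hexdigest()

# Identical bytes always share an identifier; any change yields a new one
v1 = content_version(b'species,island\nAdelie,Torgersen\n')
v1_again = content_version(b'species,island\nAdelie,Torgersen\n')
v2 = content_version(b'species,island\nAdelie,Biscoe\n')
print(v1 == v1_again, v1 == v2)
```

Recording the digest alongside results makes it possible to check later that an analysis is being rerun against the same data.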

Automation proves reproducibility

Builds should be:

  • deterministic
  • automated
  • ephemeral

Continuous Everything

  • Continuous integration: run, test and package code
  • Continuous training: fit parameters or run experiments
  • Continuous delivery: deploy models and applications for inference
  • Continuous evaluation: calculate and monitor performance metrics

Notebooks as an anti-pattern

Read-Evaluate-Print Loop

Engineering reproducibility

  • Decay is a physical inevitability but maths is eternal
  • Pure functions on immutable data are reproducible
  • Keep the core of your data flow as pure as possible
  • Extract side-effects and inject dependencies explicitly
  • Use version control for code, models and data
  • Automate workflows for continuous development
