Functional Reproducibility



robsteranium.github.io/functional-reproducibility

robin@infonomics.ltd.uk

Reproducible adj.
(of a measurement, experiment etc) capable of being reproduced at a different time or place and by different people.

Increasing Disorder CC BY Htkym

Maxwell’s Demon CC BY Htkym

\[f(x) = \frac{1}{\sigma \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]

plot(dnorm, -3, 3, main="The Standard Normal Distribution")

theregister.com/2016/03/23/npm_left_pad_chaos/

CC-BY-NC 2.5 xkcd.com/2347/

Engineering reproducibility
in the face of entropy

Isn’t code reproducible?

Reading from a database:

users <- DBI::dbReadTable(connection, "users")

Drawing a sample at random:

coin_tosses <- sample(c("heads","tails"), 10, replace = TRUE)

Writing to a filesystem:

readr::write_csv(data, "output.csv")

None of them are reproducible!

Pure functions are reproducible

[Diagram: Input -> Function -> Output]
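As a minimal sketch of the idea (the function `celsius_to_fahrenheit` is a hypothetical example, not taken from the talk), a pure function's result depends only on its arguments, so repeating the call repeats the result:

```r
# Hypothetical example: the result depends only on the argument,
# and nothing outside the function is read or modified.
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}

first_call  <- celsius_to_fahrenheit(c(0, 100))
second_call <- celsius_to_fahrenheit(c(0, 100))

# Calling it again with the same input gives the same output.
stopifnot(identical(first_call, second_call))
```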

Side-effects aren’t reproducible

[Diagram: Input -> Function -> Output; Function <-> Side Effect]

Non-local state makes functions sensitive to context

counter <- 0

show_count <- function() {
  cat("Count is", counter) # concatenate and print
}

show_count()
Count is 0

Sometime later…

counter <- counter + 1

show_count()
Count is 1

Explicit inputs/outputs let us separate code and context

counter <- 0

describe_count <- function(count) {
  paste("Count is", count) # build the string and return it, without printing
}

print(describe_count(counter))
[1] "Count is 0"

Sometime later…

counter <- counter + 1

print(describe_count(counter))
[1] "Count is 1"

The context isn’t always apparent

toss_coins <- function(n) sample(c("heads","tails"), n, replace = TRUE)

toss_coins(5)
[1] "heads" "tails" "tails" "heads" "heads"

Sometime later…

toss_coins(5)
[1] "tails" "tails" "heads" "tails" "heads"

We can make the context explicit

set.seed(1234) # set state deterministically

toss_coins(5)
[1] "tails" "tails" "tails" "tails" "heads"

Sometime later…

set.seed(1234) # reset the state again

toss_coins(5)
[1] "tails" "tails" "tails" "tails" "heads"
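One possible way to go a step further is to pass the seed as an argument, so the state is visible in the call itself. This is only a sketch: `toss_coins_seeded` is a hypothetical name, and the function still resets R's global random number generator as a side-effect.

```r
# Hypothetical variant of toss_coins(): the seed is an explicit argument,
# so the caller can see (and record) the state the result depends on.
toss_coins_seeded <- function(n, seed) {
  set.seed(seed) # note: this still resets the global RNG state
  sample(c("heads", "tails"), n, replace = TRUE)
}

a <- toss_coins_seeded(5, seed = 1234)
b <- toss_coins_seeded(5, seed = 1234)

# Same arguments, same tosses.
stopifnot(identical(a, b))
```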

I/O is side-effecting

Input is a side-effect

df <- readr::read_csv("/home/robin/data-290224-final.csv")

Output is a side-effect

readr::write_csv(result, "~/results/output.csv")
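A rough sketch of how the side-effects might be pushed to the edges, leaving a pure core in the middle. The names `mean_by_group` and `run_pipeline` and the columns `group` and `value` are assumptions for illustration, and base R's `read.csv`/`write.csv` stand in for readr to keep the example self-contained.

```r
# Pure core: the result depends only on the data frame passed in.
# The function name and column names are hypothetical.
mean_by_group <- function(df) {
  aggregate(value ~ group, data = df, FUN = mean)
}

# Side-effecting shell: the paths are arguments, not hard-coded.
run_pipeline <- function(input_path, output_path) {
  df <- read.csv(input_path)
  result <- mean_by_group(df)
  write.csv(result, output_path, row.names = FALSE)
  invisible(result)
}

# Usage with temporary files so the example runs anywhere.
input_path <- tempfile(fileext = ".csv")
output_path <- tempfile(fileext = ".csv")
write.csv(data.frame(group = c("a", "a", "b"), value = c(1, 3, 5)),
          input_path, row.names = FALSE)
run_pipeline(input_path, output_path)
```

Only `run_pipeline` touches the filesystem; `mean_by_group` can be tested with an in-memory data frame.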

Dependency-injection is explicit

library(DBI)

my_conn <- dbConnect(...)

Instead of relying on global state:

get_data <- function() {
  dbReadTable(my_conn, "users")
}

get_data()

We can make dependencies explicit:

get_data <- function(connection) {
  dbReadTable(connection, "users")
}

get_data(my_conn)
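One benefit of injecting the dependency is that a test can substitute a stand-in. The sketch below takes this one step further by injecting a reader function rather than a connection; `get_users` and `fake_reader` are hypothetical names, not part of DBI.

```r
# Hypothetical variant: the function receives the reader it should use,
# so a test can pass a stand-in instead of a live database connection.
get_users <- function(read_table) {
  read_table("users")
}

# A fake reader that returns fixed data for any table name.
fake_reader <- function(table_name) {
  data.frame(id = 1:2, name = c("ada", "grace"))
}

users <- get_users(fake_reader)

# With a real connection this might be called as:
# get_users(function(table) DBI::dbReadTable(my_conn, table))
```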

Execution context is explicit

Command-line arguments

Rscript pipeline.r input.csv output.parquet

Environment variables

env DB="http://user:pw@localhost:1337" R -f pipeline.r

Configuration data

Rscript pipeline.r configuration.yaml
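Inside the script, the context could be gathered into a single explicit value. This is a sketch under assumptions: `read_context` is a hypothetical helper, the `DB` variable name follows the example above, and the script is assumed to be launched so that trailing arguments reach `commandArgs()` (for example with `Rscript`).

```r
# Hypothetical helper: gather the execution context into one explicit value.
# Defaults read the real command line and environment; tests can override them.
read_context <- function(args = commandArgs(trailingOnly = TRUE),
                         db_url = Sys.getenv("DB", unset = NA)) {
  if (length(args) < 2) {
    stop("usage: Rscript pipeline.r <input> <output>")
  }
  list(input = args[1], output = args[2], db_url = db_url)
}

# Supplying the context by hand, as a test might:
context <- read_context(
  args = c("input.csv", "output.parquet"),
  db_url = "http://user:pw@localhost:1337"
)
```

The rest of the pipeline can then take `context` as an ordinary argument instead of reaching for global state.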

Functional pipeline,
configured context

[Diagram: Input -> Function1 -> Function2 -> Function3 -> Output]
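A minimal sketch of such a pipeline, assuming three hypothetical pure steps (`drop_missing`, `to_percent`, `summarise_values`) that are not from the talk:

```r
# Three hypothetical pure steps; each takes a value and returns a new one.
drop_missing     <- function(x) x[!is.na(x)]
to_percent       <- function(x) x * 100
summarise_values <- function(x) c(mean = mean(x), max = max(x))

# The input is the only thing that comes from the outside world.
input <- c(0.25, NA, 0.75)

# The pipeline itself is just function composition.
output <- summarise_values(to_percent(drop_missing(input)))
```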

Lessons from software development

Versions are values over time

It is impossible to step in the same river twice

Heraclitus, c. 500 BC, possibly apocryphal

alisonhorst/palmerpenguins v0.1.0

s3://alisonhorst/palmerpenguins.csv?versionId=c19a904
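Where a tagged release or object version isn't available, one rough alternative is to record a checksum of the data and refuse to run if it changes. This is a sketch under assumptions, not a method from the talk: `read_pinned_csv` is a hypothetical helper built on base R's `tools::md5sum`.

```r
# Hypothetical helper: fail loudly if a file is not the version we expect.
read_pinned_csv <- function(path, expected_md5) {
  actual_md5 <- unname(tools::md5sum(path))
  if (!identical(actual_md5, expected_md5)) {
    stop("Checksum mismatch for ", path)
  }
  read.csv(path)
}

# Self-contained usage: write a file, record its checksum, read it back.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), path, row.names = FALSE)
recorded_md5 <- unname(tools::md5sum(path))
df <- read_pinned_csv(path, recorded_md5)
```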

Automation proves reproducibility

Builds should be:

  • deterministic
  • automated
  • ephemeral

Continuous Everything

Continuous integration
run, test and package code
Continuous training
fit parameters or run experiments
Continuous delivery
deploy models and applications for inference
Continuous evaluation
calculate and monitor performance metrics

Notebooks as an anti-pattern

Read-Evaluate-Print Loop

Engineering reproducibility

  • Decay is a physical inevitability but maths is eternal
  • Pure functions on immutable data are reproducible
  • Keep the core of your data flow as pure as possible
  • Extract side-effects and inject dependencies explicitly
  • Use version control for code, models and data
  • Automate workflows for continuous development
