Functional Reproducibility

robsteranium.github.io/functional-reproducibility

robin@infonomics.ltd.uk

robingower.com

Reproducible adj.: (of a measurement, experiment etc) capable of being reproduced at a different time or place and by different people.

Increasing Disorder CC BY Htkym

Maxwell’s Demon CC BY Htkym

\[f(x) = \frac{1}{\sigma \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]

theregister.com/2016/03/23/npm_left_pad_chaos/

Engineering reproducibility
in the face of entropy

Isn’t code reproducible?

Reading from a database:

Python
R

users = pd.read_sql_table('users', connection)

users <- DBI::dbReadTable(connection, "users")

Drawing a sample at random:

Python
R

coin_tosses = random.choices(['heads','tails'], k=10)

coin_tosses <- sample(c("heads","tails"), 10, replace = TRUE)

Writing to a filesystem:

Python
R

data.to_csv('output.csv')

readr::write_csv(data, "output.csv")

None of them are reproducible!

Pure functions are reproducible

Pure functions are reproducible.

The output of a pure function depends only on its input. The result doesn’t change if it’s calculated a second time or by another person; for the same input you always get the same output. You could replace a function call with its return value in your program. This is known as referential transparency. Indeed you could replace the function body with a lookup table that records the relevant output for each input.

A pure function has no side effects. Running it doesn’t change the state of the world; the only consequence is the output value it returns. Pure functions are idempotent: you can run them as many times as you like and get the same result.

Pure functions are composable. If you combine one pure function with another then the result will also be pure. A pipeline composed of functions is much like one giant function.
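The three properties above can be sketched in a few lines of Python. The helper names (double, increment, compose) are invented for illustration and do not come from the talk.

```python
from functools import reduce

def double(x):
    """Pure: the result depends only on x."""
    return 2 * x

def increment(x):
    """Pure: nothing outside the function is read or written."""
    return x + 1

def compose(*functions):
    """Combine functions left to right into a single function."""
    return lambda value: reduce(lambda acc, f: f(acc), functions, value)

# Composing pure functions gives another pure function
pipeline = compose(double, increment)

# Referential transparency: a lookup table can stand in for the function
lookup = {x: pipeline(x) for x in range(5)}

assert pipeline(3) == lookup[3] == 7
assert pipeline(3) == pipeline(3)  # same input, same output
```

A lookup table is only practical for a small input domain, but the fact that it would work at all is what makes caching and testing pure functions straightforward.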

Side-effects aren’t reproducible

Non-local state makes functions sensitive to context

Python
R

counter = 0

def show_count():
  print(f'Count is {counter}')

show_count()

Count is 0

counter <- 0

show_count <- function() {
  cat("Count is", counter) # concatenate and print
}

show_count()

Count is 0

Sometime later…

Python
R

counter = counter + 1

show_count() # now returns a different value

Count is 1

counter <- counter + 1

show_count() # now returns a different value

Count is 1

Explicit inputs/ outputs let us separate code and context

Python
R

counter = 0

def describe_count(count):
  return f'Count is {count}' # interpolation

print(describe_count(counter)) # rendering

Count is 0

counter <- 0

describe_count <- function(count) {
  paste("Count is", count) # string interpolation
}

print(describe_count(counter))

[1] "Count is 0"

Sometime later…

Python
R

counter = counter + 1

print(describe_count(counter)) # receives a different value

Count is 1

counter <- counter + 1

print(describe_count(counter)) # receives a different value

[1] "Count is 1"

We define describe_count in terms of the count value. Now the dependency on this variable is explicit. Given the same input, this function will always return the same output. It’s reproducible. Indeed it now returns an explicit output value reproducibly. The output is now data, not a side-effect, so we can use it elsewhere. The call to print happens outside of the core logic of string interpolation.

The code on this slide has essentially the same functionality as before but now that it’s reproducible it’s much easier to reason about. We’ve made the dependency on the context explicit rather than implicitly relying on global state. The focus of the function is on the value of the count, not the variable and its place in memory.

State mutations like this are exactly what’s happening when you reassign a variable or values within a data frame.

This example might look trivial but state mutation is a pernicious source of subtle bugs. Mutating state in place may yield savings in computer memory, but it imposes costs on the human capacity to reason about the flow of data through your code as you try to keep track of a running program in your mind.

Pure functions operating on immutable data are reproducible. You lose these guarantees once your pipeline has side-effects and state mutations.
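As a rough illustration of the difference, here is a sketch contrasting in-place mutation with a function that returns a new value. The function names and the scores data are hypothetical.

```python
def add_bonus_in_place(scores, bonus):
    """Impure: mutates the caller's list."""
    for i in range(len(scores)):
        scores[i] += bonus
    return scores

def add_bonus(scores, bonus):
    """Pure: builds a new tuple and leaves the input untouched."""
    return tuple(score + bonus for score in scores)

original = (1, 2, 3)                # tuples are immutable
adjusted = add_bonus(original, 10)  # original is unchanged

mutable = [1, 2, 3]
add_bonus_in_place(mutable, 10)
add_bonus_in_place(mutable, 10)     # running it twice compounds the change
```

With the pure version the result depends only on the arguments. With the mutating version it also depends on how many times the code has already been run.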

The context isn’t always apparent

Python
R

import random

def toss_coins(n):
    return random.choices(['heads','tails'], k=n)

toss_coins(5)

['tails', 'tails', 'heads', 'heads', 'heads']

toss_coins <- function(n) sample(c("heads","tails"), n, replace = TRUE)

toss_coins(5)

[1] "heads" "tails" "tails" "heads" "heads"

Sometime later…

Python
R

toss_coins(5) # result differs

['heads', 'tails', 'heads', 'tails', 'tails']

toss_coins(5) # result differs

[1] "tails" "tails" "tails" "tails" "heads"

We can make the context explicit

Python
R

random.seed(1234) # set state deterministically

toss_coins(5)

['tails', 'heads', 'heads', 'tails', 'tails']

set.seed(1234) # set state deterministically

toss_coins(5)

[1] "tails" "tails" "tails" "tails" "heads"

Sometime later…

Python
R

random.seed(1234) # reset the state again

toss_coins(5) # result matches

['tails', 'heads', 'heads', 'tails', 'tails']

set.seed(1234) # reset the state again

toss_coins(5) # result matches

[1] "tails" "tails" "tails" "tails" "heads"
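Setting a global seed still relies on shared state. One alternative, sketched here in Python, is to pass the random number generator in as an argument. This two-argument toss_coins is a hypothetical variant of the function on the slide, not the original.

```python
import random

def toss_coins(n, rng):
    """Draw n coin tosses from the supplied random number generator."""
    return rng.choices(['heads', 'tails'], k=n)

# Each call receives its own generator, seeded explicitly
first = toss_coins(5, random.Random(1234))
second = toss_coins(5, random.Random(1234))

assert first == second  # same seed, same sequence
```

The function no longer reads or advances the module-level generator, so other code that draws random numbers cannot disturb its result.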

I/O is side-effecting

Input is a side-effect

Python
R

df = pandas.read_csv('/home/robin/data-290224-final.csv')

df <- readr::read_csv("/home/robin/data-290224-final.csv")

Output is a side-effect

Python
R

result.to_csv('~/results/output.csv')

readr::write_csv(result, "~/results/output.csv")

A more ubiquitous source of side-effects is I/O (input/ output). This doesn’t just apply to third-party APIs or our own database, even the filesystem is non-local state as far as our programs are concerned.

It’s not uncommon to see pipelines or notebooks start like this.

The filepath is idiosyncratic and this code won’t be reproducible on other people’s machines unless they coincidentally have a user called robin with this file in the home directory. Indeed this probably won’t be reproducible on Robin’s machine at a future date unless care is taken to keep that file in place.

More generally, even if the filepath is dependable, there’s no guarantee that the content of the CSV file itself won’t change.

Even the writing of files is not reproducible. This function alone can’t guarantee that the destination directory ~/results exists.

These problems become all the more apparent with APIs or databases where network interruptions and other external change can jeopardise the reproducibility of your program.
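One way to contain these problems is to keep the calculation pure and push the reading and writing into small functions that take their paths as arguments. This is a minimal sketch using only the standard library. The amount column and the function names are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path

def summarise(rows):
    """Pure core: total the 'amount' column of already-loaded rows."""
    return sum(float(row['amount']) for row in rows)

def read_rows(path):
    """Input side-effect, isolated at the edge."""
    with Path(path).open(newline='') as f:
        return list(csv.DictReader(f))

def write_total(total, path):
    """Output side-effect: create the destination directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f'total\n{total}\n')

# Demonstration in a throwaway directory rather than a hard-coded home path
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'input.csv'
    source.write_text('amount\n1.5\n2.5\n')
    destination = Path(tmp) / 'results' / 'output.csv'
    write_total(summarise(read_rows(source)), destination)
    written = destination.read_text()
```

Only summarise is reproducible in the strict sense. The two I/O functions still depend on the filesystem, but that dependence is now visible in their arguments.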

Dependency-injection is explicit

Python
R

from sqlalchemy import create_engine

my_engine = create_engine(...)
my_conn = my_engine.connect()

library(DBI)

my_conn <- dbConnect(...)

Instead of relying on global state:

Python
R

def users():
  return pd.read_sql_table('users', my_conn)

users()

users <- function() {
  dbReadTable(my_conn, "users")
}

users()

We can make dependencies explicit:

Python
R

def users(connection):
  return pd.read_sql_table('users', connection)

users(my_conn)

users <- function(connection) {
  dbReadTable(connection, "users")
}

users(my_conn)

Execution context is explicit

Command-line arguments

Python
R

python -m pipeline input.csv output.parquet

R -f pipeline.r --args input.csv output.parquet

Environment variables

Python
R

env DB="http://user:pw@localhost:1337" python -m pipeline

env DB="http://user:pw@localhost:1337" R -f pipeline.r

Configuration data

Python
R

python -m pipeline configuration.yaml

R -f pipeline.r --args configuration.yaml

Functional pipeline,
configured context

Pursuing a separation of these concerns - what Gary Bernhardt has called functional core, imperative shell - leads us to a point where all of the dependencies are captured explicitly and their values gathered together into configuration.

Here we have a pipeline composed of pure functions with all the necessary side-effects confined to explicitly configured contexts at each end.
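Gathering the execution context into one value might look something like the following sketch. The Config class, its field names and load_config are hypothetical. The DB variable and the two positional arguments mirror the command lines shown on the earlier slides.

```python
import argparse
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    """Everything the pipeline needs to know about its context."""
    input_path: str
    output_path: str
    db_url: str

def load_config(argv, environ):
    """Build a Config from command-line arguments and environment variables."""
    parser = argparse.ArgumentParser(prog='pipeline')
    parser.add_argument('input_path')
    parser.add_argument('output_path')
    args = parser.parse_args(argv)
    return Config(args.input_path, args.output_path, environ['DB'])

# In a real entry point this would be load_config(sys.argv[1:], os.environ)
config = load_config(
    ['input.csv', 'output.parquet'],
    {'DB': 'http://user:pw@localhost:1337'},
)
```

Because load_config receives argv and environ as arguments rather than reading them globally, it can itself be exercised in a test with made-up values.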

This makes it easier to maintain reproducible code. When the code is changed you can see how the pieces fit together and what the consequences of a refactoring are on the rest of the pipeline’s code base. When the infrastructure changes you may be lucky enough to only need to change the configuration and not the code at all.

This also helps sub-divide the code into well-factored modules as each component explicitly declares its requirements making them easier to test in isolation. The preceding example makes it trivial to pass in a test database connection.
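To illustrate passing in a test connection, here is a sketch that swaps the pandas and SQLAlchemy calls from the slides for the standard library sqlite3 module, so that it runs without a database server. The table columns and sample names are made up.

```python
import sqlite3

def users(connection):
    """Read all users through whichever connection is supplied."""
    query = 'SELECT id, name FROM users ORDER BY id'
    return connection.execute(query).fetchall()

# An in-memory database stands in for the production one during tests
test_conn = sqlite3.connect(':memory:')
test_conn.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)')
test_conn.executemany(
    'INSERT INTO users (name) VALUES (?)',
    [('ada',), ('grace',)],
)

assert users(test_conn) == [(1, 'ada'), (2, 'grace')]
```

The function under test is unchanged between test and production. Only the argument differs.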

You’ll note that this hasn’t really solved the ultimate cause of our problems. As I explained at the outset, we shouldn’t expect to ever be able to fix the rest of the universe in place. The best we can do is hope to contain the unreliable bits, pushing them to the edges so we can carve out a space to pursue reproducibility. A small domain where we, like Maxwell’s Demon, can reverse entropy and bring order to our world.

Lessons from software development

Versions are values over time

It is impossible to step in the same river twice

Heraclitus c. 500 BC, possibly apocryphal

alisonhorst/palmerpenguins v0.1.0

s3://alisonhorst/palmerpenguins.csv?versionId=c19a904

Ultimately a version is uniquely identified only by the values in the data. Comparing values is cumbersome, so we need a more succinct version identifier. We can identify versions by a name formed of two parts: a location or label that we can use as a reference, and the version or state in time. Together these give us a named output and its versioned instances.

You should certainly use a version control tool like git to manage your code and models but line-by-line diffs and patches would be a very inefficient way to store data versions. For data, the parallel here isn’t to source control but to software releases.

Data should be treated as an artifact, frozen in time. Artifact repositories will let you record provenance - for example which version of the source or which run of the pipeline was responsible.

You don’t need a bespoke tool, even a simple S3 bucket will let you update a key in place while it records the version history. Some databases support versioning natively or you can create your own snapshots.

This version history lets us retrieve the exact conditions for a pipeline run so we can roll back and reproduce the results.

Be wary of any upstream source that can mutate over time without providing some means of identifying and distinguishing versions. You can of course defend against this to an extent by keeping track of the data you receive with hashes or recording copies of API transactions in caches.
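Tracking received data with hashes could be as simple as the following sketch. The sample bytes and the function names are invented for illustration. SHA-256 is one reasonable choice of digest, not something the talk prescribes.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Identify a version of the data by a digest of its content."""
    return hashlib.sha256(data).hexdigest()

def has_changed(data: bytes, expected_hash: str) -> bool:
    """Report whether the data differs from the version recorded earlier."""
    return content_hash(data) != expected_hash

# Record the hash when the data is first received...
received = b'species,island\nAdelie,Torgersen\n'
recorded = content_hash(received)

# ...and compare against it on later runs
assert not has_changed(received, recorded)
assert has_changed(received + b'Gentoo,Biscoe\n', recorded)
```

Storing the recorded hash alongside the pipeline configuration gives a cheap check that an upstream source has not changed silently.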

Be a good data citizen and surface versioning information about your own outputs to downstream consumers so that they might ensure reproducibility in their workflow.

Automation proves reproducibility

Builds should be:

deterministic
automated
ephemeral

Just because you can re-run your pipeline locally, it doesn’t prove that it’s reproducible. It’s not demonstrably reproducible until it’s running on ephemeral resources in a build system.

A build system runs your workflow recording the versions of the input data, source code, and output. The pipeline run itself is immutable and identified for posterity with a build number. Done correctly, we don’t need to reproduce the pipeline. This contract serves as a guarantee that you’ll simply get the same result.

This requires that the process of assembling your dependencies is automated and reproducible itself. No more hunting through messages or chasing colleagues to find the random Excel file that makes the pipeline work. You can’t expect a build server to study the readme and figure it out for itself. Tacit knowledge must be explicitly codified. You can’t claim that everything’s fine just because “it works on my machine”. The build server is a shared consensus on configuration. A canonical source you can consult to see how things work. Automation enables collaboration.

Building on an ephemeral stack further enforces the discipline by preventing you from relying on state. I’d argue that the success of containerisation comes as much from virtualisation as from the fact that Dockerfiles, for example, automate dependency management down to the operating system level.

Don’t just say it’s reproducible. Prove it!

Continuous Everything

Continuous integration: run, test and package code
Continuous training: fit parameters or run experiments
Continuous delivery: deploy models and applications for inference
Continuous evaluation: calculate and monitor performance metrics

Once you have versioned dependencies and an automated build system you can develop continuously.

A reproducible approach enables continuous cycles as it ensures you can always rebuild something from scratch and let it run idempotently (which is to say repeatedly with the same effect).

By having the discipline to avoid relying on state mutation you can stop worrying about it. When you don’t need to manage or even think about side-effects it’s easy to cancel things, restart them, or even scale horizontally (it’s easier to spin-up concurrent compute resources if you don’t need to coordinate state).

When you can run a system continuously you establish tighter feedback loops. You can understand the impact of changes sooner. Feedback loops help you to iterate quickly and catch issues early to prevent bugs. Repeated execution encourages us to streamline our architecture helping us become more efficient and reduce costs. Continuously exercising and testing your system helps to ensure reliability by giving weakness no space to hide.

There’s an overwhelming choice of Continuous Integration and Continuous Deployment (CI/CD) tools for software development, like GitHub Actions, Travis or Jenkins. There are also tools and workflow systems specific to data and model development, like Apache Airflow.

Notebooks as an anti-pattern

Read-Evaluate-Print-Loop

This is where REPL-driven development shines. In RStudio, Emacs, or VS Code you have two windows: one to edit code (here on the left) and the other to run it (on the right). You send code from the editor to the console in a read-evaluate-print-loop process: the code is read and evaluated by the interpreter with the result printed out to the console, we then loop back to a prompt ready for more input.

The REPL lets us write code experimentally - we can execute each line to confirm it works as intended. You can tweak code at the REPL’s prompt until you’ve got it right. Once you find something you want to use, share or just keep for future reference you can copy it back up to the editor. This lets us iterate quickly in a carefree, non-reproducible way before finding something worth making reproducible. We can explore methods without needing to worry about architecture.

The Jupyter notebook approach mixes these two concerns. You end up with a long document full of half-discarded experiments, which invariably becomes dependent on a non-linear execution sequence that’s impossible to reproduce. People don’t tend to think or write code linearly, and we should embrace that, not punish it. Editing is a fundamental programming skill.

Of course you can edit in Jupyter, but it doesn’t arise as naturally because the workflow doesn’t require it.

You can still do literate programming without notebooks. You can integrate code and documentation using alternative tools like Quarto for creating articles, presentations or books reproducibly. You can start from your report and develop backwards, pulling the analysis into shape. For those that haven’t tried it yet I’d also recommend documentation-driven design. Before you write any code you start with a vignette explaining how to use it which helps to ensure your syntax and semantics are intuitive and well-factored. You can then pull the implementation into existence documenting each function with examples that are executed as part of the package build process. This lets you do “specification by example” along with your unit tests.
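In Python, the standard library doctest module offers one way to execute documented examples. The sketch below reuses describe_count from the earlier slide and runs its docstring examples explicitly. A real project would more likely run them from its test suite or build.

```python
import doctest

def describe_count(count):
    """Describe a count as a sentence.

    >>> describe_count(0)
    'Count is 0'
    >>> describe_count(3)
    'Count is 3'
    """
    return f'Count is {count}'

# Collect the examples from the docstring and execute them
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for test in finder.find(describe_count, globs={'describe_count': describe_count}):
    runner.run(test)

results = runner.summarize()  # TestResults(failed=..., attempted=...)
```

If the documentation and the implementation drift apart, the examples fail and the build can be stopped.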

By separating reproducible, literate programming from ephemeral, interactive exploration we get the best of both worlds and avoid the risks of one contaminating the other. We can re-run the build process from scratch in an isolated environment and can’t get ourselves twisted up with state mutation. If it’s too slow to rebuild and threatening our fast feedback cycle then we can cache chunks or write intermediate results out to disk.
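Writing intermediate results out to disk might be sketched as below, assuming each step is a pure function of a JSON-serialisable payload. The cached helper and slow_totals are hypothetical names. Dedicated workflow tools provide more robust versions of the same idea.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached(step, payload, cache_dir):
    """Run step(payload) once per distinct payload, storing the result as JSON."""
    serialised = json.dumps(payload, sort_keys=True).encode()
    key = hashlib.sha256(serialised).hexdigest()
    path = Path(cache_dir) / f'{step.__name__}-{key}.json'
    if path.exists():
        return json.loads(path.read_text())
    result = step(payload)
    path.write_text(json.dumps(result))
    return result

calls = []

def slow_totals(rows):
    calls.append(1)  # counts executions, for demonstration only
    return {'total': sum(rows)}

with tempfile.TemporaryDirectory() as cache_dir:
    first = cached(slow_totals, [1, 2, 3], cache_dir)
    second = cached(slow_totals, [1, 2, 3], cache_dir)  # served from disk
```

Keying the cache on a hash of the input is only safe because the step is assumed to be pure. A step with hidden dependencies could return stale results.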

Engineering reproducibility

Decay is a physical inevitability but maths is eternal
Pure functions on immutable data are reproducible
Keep the core of your data flow as pure as possible
Extract side-effects and inject dependencies explicitly
Use version control for code, models and data
Automate workflows for continuous development
