You can’t escape entropy

Reproducible adj.: (of a measurement, experiment etc) capable of being reproduced at a different time or place and by different people.

Have you ever written the perfect program?

Has it still run unchanged 6 months later?

Can your colleagues run it without you?

Just because your analysis is executable, it doesn’t mean the results are reproducible.

Data ages. Libraries change. Machines differ. Servers go down. Bits rot. Entropy is inescapable.

We can learn how to engineer reproducibility by drawing on techniques from functional programming and the MLOps movement. The code examples are provided in Python but the lessons apply to any language.

NASA (2011) “Benevolent Monster” Solar Flare

Physics tells us that energy can’t be created or destroyed but that doesn’t make it an infinitely enduring resource.

We can only use energy when we transform it from one form to another.

Schreinmer, R.M. (1912) Deutsches Museum, München

We can turn the chemical energy stored in coal into heat by burning it. A fire turns water into steam. Steam expands to drive pistons becoming mechanical energy that we can use. We do work by transforming energy.

But once the water has become steam and expanded it can’t expand again unless it is first condensed by cooling.

Though we’ve not lost energy we have created entropy. Entropy is energy that’s not available to do work with.

Increasing Disorder CC BY Htkym

In order to use energy we need order in the universe. As depicted on the left, we have an ordered division between a cold place A and a hot place B.

Once we’ve made use of the difference the heat evens out and now both places are merely warm, as depicted on the right.

Maxwell’s Demon CC BY Htkym

The picture shows a thought experiment. Maxwell’s daemon opens a door between two vessels letting the faster gas molecules pass and closing it just in time to keep the slower ones separated. In doing so it appears to defy the 2nd law of thermodynamics by creating order from disorder and reversing entropy.

Abandoned industrial building by Dan Meyers on Unsplash

In an isolated system, absent any magical demons, while the amount of energy remains the same, entropy is always increasing. We can create pockets of order but the overall trend is towards disorder. Decay is inevitable.

Advanced Deep Extragalactic Survey - NASA, ESA, CSA

Eventually our universe will reach a state of complete disorder. A uniform cold vacuum with no exploitable differences. The heat death of the universe.

This is the way the world ends, not with a bang but a whimper - T. S. Eliot

Of course we don’t generally worry about this cosmic tragedy and I’m not saying that the work we do attempting to bring order to data is futile. Far from it.

This is merely a reminder that we should not have the hubris to expect our programs will work forever while our physical reality tends inexorably towards disorder. No-one, not even the greatest engineers or scientists, are exempt.

We have reason for hope though. Not least because we have a tool that exists beyond the physical constraints of reality…

Mathematics is eternal

\[f(x) = \frac{1}{\sigma \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]

The mathematics we use is pure and eternal. Functions are infinitely reproducible.

Here’s the probability density function of the normal distribution.

Pi and Eulers number are real constants. Given the same mean and standard deviation this formula will always plot the same bell-curve and so always return the same probability density for any given value of x.

This is all very well, but - as beautiful as this formula is - it’s not much use on it’s own.

We can put that maths to use by using code to plot the probability density of the normal distribution. Here I’ve used the default mean of zero and standard deviation of 1 which gives us the “Standard Normal”.

In practical terms we need to do computations to use maths and computers are certainly subject to entropy.

Computation sits on top of a whole stack of dependencies from your code though packages and programming languages down to machine instructions and the bare metal itself. None of this is permanent.

Computation is impermanent

theregister.com/2016/03/23/npm_left_pad_chaos/

Consider the tale of Leftpad… not even a dozen lines of JavaScript was removed from NPM in 2016. It brought front-end development to a screeching halt as build failures cascaded across the web.

Consider Gallery 404, a museum of obsolete digital art . None of these art works are reproducible any longer. Server’s go down. Domains expire. Adobe flash is long gone.

Some museums now require digital art to be supplied along with working hardware (and backup systems) so that it can continue running over the longer term.

Engineering reproducibility in the face of entropy

So, how can we guard against this inevitable bit rot?

How can we design and engineer our code to be reproducible?

It’s time to leave the metaphysics and consider some more practical guidance.

As I alluded to earlier, we can turn to maths for help. In code that manifests most directly in terms of functional programming. I’ll explain some lessons from that approach then go on to describe ideas from software engineering that have found a home in the data world under neologisms like “DataOps”, “MLOps” or “ModelOps”. Outside of the marketing hype we can think of this as just rigorous data engineering.

Let’s start though, with a question…

Isn’t code reproducible?

Which of these statements is reproducible?

Reading from a database:

users = pd.read_sql_table('users', connection)

users <- DBI::dbReadTable(connection, "users")

Drawing a sample at random:

coin_tosses = random.choices(['heads','tails'], 10)

coin_tosses <- sample(c("heads","tails"), 10, replace = TRUE)

Writing to a filesystem:

pd.to_csv(data, 'output.csv')

readr::write_csv(data, "output.csv")

None of them are. This was a trick question!

Reproducibility requires that doing the same thing gives you the same result.

None of those statements - even writing to disk - are reproducible.

Let’s explore why and what you can do about it.

We saw the definition at the beginning, but what does it really mean to be reproducible?

Pure functions are reproducible

The output of a pure function only depends on it’s input. The result doesn’t change if it’s calculated a second time or by another person; for the same input you always get the same output. You could replace a function call with it’s return value in your program. This is known as referential transparency. Indeed you could replace the function body with a lookup table that records the relevant output for each input.

A pure function has no side effects. Running it doesn’t change the state of the world, the only consequence is the output value it returns. Pure functions are idempotent, you can run them as many times as you like and get the same result.

Pure functions are composable. If you combine one pure function with another then the result will also be pure. A pipeline composed of functions is much like one giant function.

Side-effects aren’t reproducible

This is all very well in theory but we can’t continue piping data around in circles ad nauseum. At some point we have to interact with the outside world - read from a database, deploy a website, email a report. In practice we need side-effects.

Side-effects are intended consequences that happen outside of a function or a pipeline’s output. We also use the term “side-effect” to refer to causes that exist outside of a function or a pipeline’s direct inputs, that is to say non-local state.

These side-effects are what make our pipelines useful allowing them to interact with the world. They’re also what cause our pipelines to become non-reproducible.

This mightn’t be very obvious if it’s the first time you’re hearing this, so let’s look at some examples.

Non-local state makes functions sensitive to context

counter = 0

def show_count():
  print(f'Count is {counter}')

show_count()

Count is 0

counter <- 0

show_count <- function() {
  cat("Count is", counter) # concatenate and print
}

show_count()

Count is 0

Sometime later…

counter = counter + 1

show_count() # now returns a different value

Count is 1

counter <- counter + 1

show_count() # now returns a different value

Count is 1

Here we can see a side-effecting show_count() function. It implicitly depends on the counter variable that is defined in the global scope. This means that it’s result depends on the context in which it is run.

Any time you call show_count() you could get a different result. It isn’t reproducible. If the value of counter changes, then so does the result.

This function also has no return value. The call to print out the result is itself a side-effect. In this case a null value is returned. We can’t use it in a reproducible pipeline.

This might seem a bit contrived but using side-effects like this is it’s quite common. I count 51 pandas methods offering inplace mutation of their argument.

Let’s refactor this into a pure function.

Explicit inputs/ outputs let us separate code and context

counter = 0

def describe_count(count):
  return f'Count if {count}' # interpolation

print(describe_count(counter)) # rendering

Count if 0

counter <- 0

describe_count <- function(count) {
  paste("Count is", count) # string interpolation
}

print(describe_count(counter))

[1] "Count is 0"

Sometime later…

counter = counter + 1

print(describe_count(counter)) # receives a different value

Count if 1

counter <- counter + 1

print(describe_count(counter)) # receives a different value

[1] "Count is 1"

We define describe_count in terms of the count value. Now the dependency on this variable is explicit. Given the same input, this function will always return the same output. It’s reproducible. Indeed it now returns an explicit output value reproducibly. The output is now data, not a side-effect so we can use it elsewhere. The call to print happens outside of the core logic of string interpolation.

The code on this slide has essentially the same functionality as before but now that it’s reproducible it’s much easier to reason about. We’ve made the dependency on the context explicit rather than implicitly relying on global state. The focus of the function is on the value of the count, not the variable and it’s place in memory.

State mutations like this are exactly what’s happening when you reassign a variable or values within a data frame.

This example might look trivial but state mutation is a pernicious source of subtle bugs. Mutating state in place may yield savings in computer memory, but it imposes costs on the human capacity to reason about the flow of data through your code as you try to keep a track of a running program in your mind.

Pure functions operating on immutable data are reproducible. You lose these guarantees once your pipeline has side-effects and state mutations.

The context isn’t always apparent

import random

def toss_coins(n):
    return random.choices(['heads','tails'], k=n)

toss_coins(5)

['tails', 'tails', 'heads', 'heads', 'tails']

toss_coins <- function(n) sample(c("heads","tails"), n, replace = TRUE)

toss_coins(5)

[1] "heads" "tails" "heads" "heads" "tails"

Sometime later…

toss_coins(5) # result differs

['heads', 'heads', 'tails', 'heads', 'tails']

toss_coins(5) # result differs

[1] "heads" "tails" "heads" "heads" "heads"

Some side effects are not as obvious.

The random.choices function depends not just on the arguments you pass it but also the state of a random number generator (RNG).

In software the RNG is not truly random but rather a pseudo-random process. It gives highly erratic results but follows a predictable process if you know the starting state. The starting state is seeded by some source that varies such as the date or a hardware source like /dev/random which collects noise from device drivers.

This is a side-effect for our function, causing it’s output to differ each time it’s run.

Indeed each time I render this document I get a different set of coin tosses. Although executable, this example (and a consequence the whole document) isn’t reproducible.

We can make the context explicit

random.seed(1234) # set state deterministically

toss_coins(5)

['tails', 'heads', 'heads', 'tails', 'tails']

set.seed(1234) # set state deterministically

toss_coins(5)

[1] "tails" "tails" "tails" "tails" "heads"

Sometime later…

random.seed(1234) # reset the state again

toss_coins(5) # result matches

['tails', 'heads', 'heads', 'tails', 'tails']

set.seed(1234) # reset the state again

toss_coins(5) # result matches

[1] "tails" "tails" "tails" "tails" "heads"

We can make this function call reproducible by fixing the initial state of the random number generator to a constant value with random.seed.

This level of purity may be helpful from an engineering perspective (for example this slide has a consistent checksum meaning it can be cached), but it would completely undermine certain analytical procedures. For example in cross-validation where we want to see the test/ train split vary to ensure we’re not over-fitting a statistical model. We forego reproducibility in resampling in an effort to achieve reproducibility on out-of-sample predictions.

We might not always want perfect reproducibility. The key is in making that choice consciously.

I/O is side-effecting

Input is a side-effect

df = pandas.read_csv('/home/robin/data-290224-final.csv')

df <- readr::read_csv("/home/robin/data-290224-final.csv")

Output is a side-effect

result.to_csv('~/results/output.csv')

readr::write_csv(result, "~/results/output.csv")

A more ubiquitous source of side-effects is I/O (input/ output). This doesn’t just apply to third-party APIs or our own database, even the filesystem is non-local state as far as our programs are concerned.

It’s not uncommon to see pipelines or notebooks start like this.

The filepath is idiosyncratic and this code won’t be reproducible on other people’s machines unless they coincidentally have a user called robin with this file in the home directory. Indeed this probably won’t be reproducible on Robin’s machine at a future date unless care is taken to keep that file in place.

More generally, even if the filepath is dependable, there’s no guarantee that the content of the CSV file itself won’t change.

Even the writing of files is not reproducible. This function alone can’t guarantee that the destination directory ~/results exists.

These problems become all the more apparent with APIs or databases where network interruptions and other external change can jeopardise the reproducibility of your program.

Dependency-injection is explicit

import sqlalchemy

my_engine = create_engine(...)
my_conn = my_engine.connect()

library(DBI)

my_conn <- dbConnect(...)

Instead of relying on global state:

def users:
  return pd.read_sql_table('users', my_conn)

users()

users <- function() {
  dbReadTable(my_conn, "users")
}

users()

We can make dependencies explicit:

def users(connection):
  return pd.read_sql_table('users', connection)

users(my_conn)

users <- function(connection) {
  dbReadTable(connection, "users")
}

users(my_conn)

The problem isn’t that these side-effects exist; they’re necessary.

The problem is that when they’re implicit they tend to hide dependencies.

We can make them explicit with techniques like dependency injection. The function receives a database connection, for example, as an input instead of relying on it existing in the program’s global state.

Now the function declares its requirements. It draws attention to potential side-effects and forces us to think about them. Here we can see that the users method will need us to manage the connection’s lifecycle rather than taking it for granted.

The code is more modular. We can re-use this code to get a collection of users from more than one database within the same program.

Execution context is explicit

Command-line arguments

python -m pipeline input.csv output.parquet

R -f pipeline.r input.csv output.parquet

Environment variables

env DB="http://user:pw@localhost:1337" python -m pipeline

env DB="http://user:pw@localhost:1337" R -f pipeline.r

Configuration data

python -m pipeline configuration.yaml

R -f pipeline.r configuration.yaml

Likewise we can pull side-effects through to the very edges of our pipelines, passing in state (like data or configuration) only from the outermost execution context.

This means the pipeline no longer has to be concerned with coordinating state. We may want some graceful error handling, but the pipeline itself should be context-free and reproducible.

Extending this practice into the execution context itself is what leads to the Infrastructure as Code and GitOps/ DevOps movement where machines and services can themselves be provisioned reproducibly from declarative configuration.

Functional pipeline,
configured context

Pursuing a separation of these concerns - what Gary Bernhart has called functional core, imperative shell - leads us to a point where all of the dependencies are captured explicitly and their values gathered together into configuration.

Here we have a pipeline composed of pure-functions with all the necessary side-effects contained to explicitly configured contexts at each end.

This makes it easier to maintain reproducible code. When the code is changed you can see how the pieces fit together and what the consequences of a refactoring are on the rest of the pipeline’s code base. When the infrastructure changes you may be lucky enough to only need to change the configuration and not the code at all.

This also helps sub-divide the code into well-factored modules as each component explicitly declares its requirements making them easier to test in isolation. The preceding example makes it trivial to pass in a test database connection.

You’ll note that this hasn’t really solved the ultimate cause of our problems. As I explained at the outset, we shouldn’t expect to ever be able to fix the rest of the universe in place. The best we can do is hope to contain the unreliable bits, pushing them to the edges so we can carve out a space to pursue reproducibility. A small domain where we, like Maxwell’s Daemon, can reverse entropy and bring order to our world.

Lessons from software development

We can go a step further than the above and strive for reproducibility in our interaction with data outside of the pipeline.

In these final slides I’ll discuss some lessons the data community can learn from software development. In particular version control and continuous delivery.

Versions are values over time

We’ve seen that the functional approach leads us to immutability - data that is unchanging. Steps in your pipeline can only communicate with one another through their arguments and return values. We aren’t mutating the objects passing through the pipeline or the global state. The intermediate values are immutable.

But we need to change things to do useful work over time as the external context changes, for example as upstream data sources are updated. How do we cope when the upstream source is mutable? We can use versions to control external change.

Versions identify immutable states of data as it evolves over time.

It is impossible to step in the same river twice

Heraclitus c.a. 500 BC, possibly apocryphal

What is a river? A place? Some water?

As you stand in a stream you can feel the water flowing past your feet. You won’t step into that same water again yet we still think of that place as the river.

You can’t step in the same river twice. Time marches on. The river is formed of a different body water and you too will be a different person.

Here’s Heraclitus standing in a river…

… different gallons of water flow round his feet…

… what’s constant is the riverbed.

It’s helpful for us to distinguish between the water and the riverbed.

This is how we name rivers. The name refers to the river bed, not the water in it!

We can apply this as a metaphor for data pipelines. We distinguish the output from each instance of it.

alisonhorst/palmerpenguins v0.1.0

s3://alisonhorst/palmerpenguins.csv?versionId=c19a904

Ultimately the version is identified uniquely only by the values in the data. This becomes cumbersome so we need a more succinct version identifier. We can identity versions by a name formed of two parts. One is a location or label we can use as a reference. The other is the version or state in time. The named output and versioned instances.

You should certainly use a version control tool like git to manage your code and models but line-by-line diffs and patches would be a very inefficient way to store data versions. For data, the parallel here isn’t to source control but to software releases.

Data should be treated as an artifact, frozen in time. Artifact repositories will let you record provenance - for example which version of the source or which run of the pipeline was responsible.

You don’t need a bespoke tool, even a simple S3 bucket will let you update a key in place while it records the version history. Some databases support versioning natively or you can create your own snapshots.

This version history let’s us retrieve the exact conditions for a pipeline run so we can rollback and reproduce the results.

Be wary of any upstream source that can mutate over time without providing some means of identifying and distinguishing versions. You can of course defend against this to an extent by keeping a track of the data you receive with hashes or recording copies of API transactions in caches.

Be a good data citizen and surface versioning information about your own outputs to downstream consumers so that they might ensure reproducibility in their workflow.

Automation proves reproducibility

Builds should be:

deterministic
automated
ephemeral

Just because you can re-run your pipeline locally, it doesn’t prove that it’s reproducible. It’s not demonstrably reproducible until it’s running on ephemeral resources in a build system.

A build system runs your workflow recording the versions of the input data, source code, and output. The pipeline run itself is immutable and identified for posterity with a build number. Done correctly, we don’t need to reproduce the pipeline. This contract serves as a guarantee that you’ll simply get the same result.

This requires that the process of assembling your dependencies is automated and reproducible itself. No more hunting through messages or chasing colleagues to find the random excel file that makes the pipeline work. You can’t expect a build server to study the readme and figure it out for itself. Tacit knowledge must be explicitly codified. You can’t claim that everything’s fine just because “it works on my machine”. The build server is a shared consensus on configuration. A canonical source you can refer to to see how things work. Automation enables collaboration.

Building on an ephemeral stack further enforces the discipline by preventing you from relying on state. I’d argue that the success of containerisation comes as much from virtualisation as from the fact that Dockerfiles, for example, automate dependency management down to the operating system level.

Don’t just say it’s reproducible. Prove it!

Continuous Everything

Continuous integration: run, test and package code
Continuous training: fit parameters or run experiments
Continuous delivery: deploy models and applications for inference
Continuous evaluation: calculate and monitor performance metrics

Once you have versioned dependencies and an automated build system you can develop continuously.

A reproducible approach enables continuous cycles as it ensures you can always rebuild something from scratch and let it run idempotently (which is to say repeatedly with the same effect).

By having the discipline to avoid relying on state mutation you can stop worrying about it. When you don’t need to manage or even think about side-effects it’s easy to cancel things, restart them, or even scale horizontally (it’s easier to spin-up concurrent compute resources if you don’t need to coordinate state).

When you can run a system continuously you establish tighter feedback loops. You can understand the impact of changes sooner. Feedback loops help you to iterate quickly and catch issues early to prevent bugs. Repeated execution encourages us to streamline our architecture helping us become more efficient and reduce costs. Continuously exercising and testing your system helps to ensure reliability by giving weakness no space to hide.

There’s an overwhelming choice of Continuous Integration and Continuous Deployment tools - CI/CD - for software development like GitHub actions, Travis or Jenkins. There’s also tools and workflow systems specific to data and model development like Apache Airflow.

Notebooks as an anti-pattern

In closing I’d just like to urge caution about the use of Jupyter notebooks.

They’re an effective for exploratory work but I’ve found them to discourage reproducible data analysis.

I should start by saying that exploratory data analysis doesn’t need to be reproducible. We should have the freedom to explore the data interactively as we build up our understanding. We should expect to encounter dead-ends. We should be prepared to throw away code.

Read-Evaluate-Print-Loop

This is where REPL-driven development shines. In RStudio, Emacs, or VS Code you have two windows: one to edit code (here on the left) and the other to run it (on the right). You send code from the editor to the console in a read-evaluate-print-loop process: the code is read and evaluated by the interpreter with the result printed out to the console, we then loop back to a prompt ready for more input.

The REPL let’s us write code experimentally - we can execute each line to confirm it works as intended. You can tweak code at the REPL’s prompt until you’ve got it right. Once you find something you want to use, share or just keep for future reference you can copy it back up to the editor. This let’s us iterate quickly in a carefree, non-reproducible way before finding something worth making reproducible. We can explore methods without needing to worry about architecture.

The Jupyter notebook approach mixes these two concerns. You end-up with a long document full of half-discarded experiments, which invariably becomes dependent on a non-linear execution sequence that’s impossible to reproduce. People don’t tend to think or write code linearly, and we should embrace that not punish it. Editing is a fundamental programming skill.

Of course you can edit in Jupyter, but it doesn’t arise as naturally because the workflow doesn’t require it.

You can still do literate programming without notebooks. You can integrate code and documentation using alternate tools like quarto for creating articles, presentations or books reproducibly. You can start from your report and develop backwards, pulling the analysis into shape. For those that haven’t tried it yet I’d also recommend documentation-driven design. Before you write any code you start with a vignette explaining how to use it which helps to ensure your syntax and semantics are intuitive and well-factored. You can then pull the implementation into existence documenting each function with examples that are executed as part of the package build process. This let’s you do “specification by example” along with your unit tests.

By separating reproducible, literate programming from ephemeral, interactive exploration we get the best of both worlds and avoid the risks of one contaminating the other. We can re-run the build process from scratch in an isolated environment and can’t get ourselves twisted up with state mutation. If it’s too slow to rebuild and threatening our fast feedback cycle then we can cache chunks or write intermediate results out to disk.

Engineering reproducibility

Decay is a physical inevitability but maths is eternal
Pure functions on immutable data are reproducible
Keep the core of your data flow as pure as possible
Extract side-effects and inject dependencies explicitly
Use version control for code, models and data
Automate workflows for continuous development

You can’t escape entropy

Mathematics is eternal

Computation is impermanent

Engineering reproducibility in the face of entropy

Isn’t code reproducible?

Pure functions are reproducible

Side-effects aren’t reproducible

Non-local state makes functions sensitive to context

Explicit inputs/ outputs let us separate code and context

The context isn’t always apparent

We can make the context explicit

I/O is side-effecting

Dependency-injection is explicit

Execution context is explicit

Functional pipeline,configured context

Lessons from software development

Versions are values over time

Automation proves reproducibility

Continuous Everything

Notebooks as an anti-pattern

Read-Evaluate-Print-Loop

Engineering reproducibility

Functional pipeline,
configured context