plot(dnorm, -3, 3, main="The Standard Normal Distribution")
You can’t escape entropy
- Reproducible adj.
- (of a measurement, experiment etc) capable of being reproduced at a different time or place and by different people.
Have you ever written the perfect program?
Has it still run unchanged 6 months later?
Can your colleagues run it without you?
Just because your analysis is executable, it doesn’t mean the results are reproducible.
Data ages. Libraries change. Machines differ. Servers go down. Bits rot. Entropy is inescapable.
We can learn how to engineer reproducibility by drawing on techniques from functional programming and the MLOps movement. The code examples are provided in R but the lessons apply to any language.
Physics tells us that energy can’t be created or destroyed but that doesn’t make it an infinitely enduring resource.
We can only use energy when we transform it from one form to another.
We can turn the chemical energy stored in coal into heat by burning it. A fire turns water into steam. Steam expands to drive pistons becoming mechanical energy that we can use. We do work by transforming energy.
But once the water has become steam and expanded it can’t expand again unless it is first condensed by cooling.
Though we’ve not lost energy we have created entropy. Entropy is energy that’s not available to do work with.
In order to use energy we need order in the universe. As depicted on the left, we have an ordered division between a cold place A and a hot place B.
Once we’ve made use of the difference the heat evens out and now both places are merely warm, as depicted on the right.
The picture shows a thought experiment. Maxwell’s demon opens a door between two vessels, letting the faster gas molecules pass and closing it just in time to keep the slower ones separated. In doing so it appears to defy the second law of thermodynamics by creating order from disorder and reversing entropy.
In an isolated system, absent any magical demons, while the amount of energy remains the same, entropy is always increasing. We can create pockets of order but the overall trend is towards disorder. Decay is inevitable.
Eventually our universe will reach a state of complete disorder. A uniform cold vacuum with no exploitable differences. The heat death of the universe.
This is the way the world ends, not with a bang but a whimper - T. S. Eliot
Of course we don’t generally worry about this cosmic tragedy and I’m not saying that the work we do attempting to bring order to data is futile. Far from it.
This is merely a reminder that we should not have the hubris to expect our programs will work forever while our physical reality tends inexorably towards disorder. No-one, not even the greatest engineer or scientist, is exempt.
We have reason for hope though. Not least because we have a tool that exists beyond the physical constraints of reality…
Mathematics is eternal
\[f(x) = \frac{1}{\sigma \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]
The mathematics we use is pure and eternal. Functions are infinitely reproducible.
Here’s the probability density function of the normal distribution.
Pi and Euler’s number are real constants. Given the same mean and standard deviation this formula will always plot the same bell-curve and so always return the same probability density for any given value of x.
This is all very well, but - as beautiful as this formula is - it’s not much use on its own.
We can put that maths to use by using R to plot the density of the normal distribution. Here I’ve used the default mean of zero and standard deviation of 1 which gives us the “Standard Normal”.
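The formula translates almost symbol for symbol into a pure R function. The sketch below is only illustrative: normal_pdf is a hypothetical name, and in practice you would use the built-in dnorm, which it is compared against here.

```r
# Hypothetical helper: the density formula written out by hand.
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
}

x <- c(-1, 0, 2.5)

# Same inputs, same outputs: agrees with base R's dnorm up to floating point.
all.equal(normal_pdf(x), dnorm(x))
all.equal(normal_pdf(x, mu = 1, sigma = 2), dnorm(x, mean = 1, sd = 2))
```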
In practical terms we need to do computations to use maths and computers are certainly subject to entropy.
Computation sits on top of a whole stack of dependencies, from your code through packages and programming languages down to machine instructions and the bare metal itself. None of this is permanent.
Computation is impermanent
Consider the tale of left-pad… not even a dozen lines of JavaScript were removed from npm in 2016. Their removal brought front-end development to a screeching halt as build failures cascaded across the web.
Consider Gallery 404, a museum of obsolete digital art. None of these art works are reproducible any longer. Servers go down. Domains expire. Adobe Flash is long gone.
Some museums now require digital art to be supplied along with working hardware (and backup systems) so that it can continue running over the longer term.
Engineering reproducibility in the face of entropy
So, how can we guard against this inevitable bit rot?
How can we design and engineer our code to be reproducible?
It’s time to leave the metaphysics and consider some more practical guidance.
As I alluded to earlier, we can turn to maths for help. In code that manifests most directly in terms of functional programming. I’ll explain some lessons from that approach then go on to describe ideas from software engineering that have found a home in the data world under neologisms like “DataOps”, “MLOps” or “ModelOps”. Outside of the marketing hype we can think of this as just rigorous data engineering.
Let’s start though, with a question…
Isn’t code reproducible?
Which of these statements is reproducible?
Reading from a database:
users <- DBI::dbReadTable(connection, "users")
Drawing a sample at random:
coin_tosses <- sample(c("heads","tails"), 10, replace = TRUE)
Writing to a filesystem:
readr::write_csv(data, "output.csv")
None of them are. This was a trick question!
Reproducibility requires that doing the same thing gives you the same result.
None of those statements - even writing to disk - are reproducible.
Let’s explore why and what you can do about it.
We saw the definition at the beginning, but what does it really mean to be reproducible?
Pure functions are reproducible
The output of a pure function only depends on its input. The result doesn’t change if it’s calculated a second time or by another person; for the same input you always get the same output. You could replace a function call with its return value in your program. This is known as referential transparency. Indeed you could replace the function body with a lookup table that records the relevant output for each input.
A pure function has no side effects. Running it doesn’t change the state of the world; the only consequence is the output value it returns. Pure functions are idempotent: you can run them as many times as you like and get the same result.
Pure functions are composable. If you combine one pure function with another then the result will also be pure. A pipeline composed of functions is much like one giant function.
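A minimal sketch of these three properties in R. The function names are hypothetical and chosen only for illustration.

```r
# Two pure functions: the output depends only on the input.
twice <- function(x) x * 2
increment <- function(x) x + 1

# Composing pure functions gives another pure function.
twice_then_increment <- function(x) increment(twice(x))

# Referential transparency: over a known set of inputs, the function
# could be replaced by a table of precomputed outputs.
inputs <- c(1, 2, 3, 4, 5)
lookup <- vapply(inputs, twice_then_increment, numeric(1))

# Looking up the third entry is interchangeable with calling the function.
identical(lookup[3], twice_then_increment(3))
```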
Side-effects aren’t reproducible
This is all very well in theory but we can’t continue piping data around in circles ad nauseam. At some point we have to interact with the outside world - read from a database, deploy a website, email a report. In practice we need side-effects.
Side-effects are intended consequences that happen outside of a function or a pipeline’s output. We also use the term “side-effect” to refer to causes that exist outside of a function or a pipeline’s direct inputs, that is to say non-local state.
These side-effects are what make our pipelines useful allowing them to interact with the world. They’re also what cause our pipelines to become non-reproducible.
This mightn’t be very obvious if it’s the first time you’re hearing this, so let’s look at some examples.
Non-local state makes functions sensitive to context
counter <- 0
show_count <- function() {
  cat("Count is", counter) # concatenate and print
}
show_count()
Count is 0
Sometime later…
counter <- counter + 1
show_count()
Count is 1
Here we can see a side-effecting show_count() function. It implicitly depends on the counter variable that is defined in the global scope. This means that its result depends on the context in which it is run.
Any time you call show_count() you could get a different result. It isn’t reproducible. If the value of counter changes, then so does the result.
This function also has no return value. The call to cat or print is itself a side-effect. In this case a NULL is returned invisibly. We can’t use it in a reproducible pipeline.
This might seem a bit contrived, but you could instead imagine the counter as being a data frame that you’re mutating in place.
Let’s refactor this into a pure function.
Explicit inputs/outputs let us separate code and context
counter <- 0
describe_count <- function(count) {
  paste("Count is", count) # join the pieces into a single string
}
print(describe_count(counter))
[1] "Count is 0"
Sometime later…
counter <- counter + 1
print(describe_count(counter))
[1] "Count is 1"
We define describe_count in terms of the count value. Now the dependency on this variable is explicit. Given the same input, this function will always return the same output. It’s reproducible. It also returns an explicit output value: the output is now data, not a side-effect, so we can use it elsewhere. The call to print happens outside of the core logic of text rendering.
Again this example is a bit contrived but the same rationale motivates the broom package, which extracts tabular data from models so that you can manipulate results programmatically, where the base R summary method only prints its tabular output to the screen.
The code on this slide has essentially the same functionality as before but because it’s now reproducible it’s much easier to reason about. We’ve made the dependency on the context explicit rather than implicitly relying on global state. The focus of the function is on the value of the count, not the variable and its place in memory.
Context mutations like this are exactly what’s happening when you reassign a variable or values within a data frame.
This example might look trivial but state mutation is a pernicious source of subtle bugs. Mutating state in place may yield savings in computer memory, but it imposes costs on the human capacity to reason about the flow of data through your code as you try to keep track of a running program in your mind.
Pure functions operating on immutable data are reproducible. You lose these guarantees once your pipeline has side-effects and state mutations.
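R’s copy-on-modify semantics help here: assigning into a data frame inside a function changes a local copy, not the caller’s object. A minimal sketch with made-up data (add_total and orders are hypothetical names):

```r
# Returns a new data frame; the caller's object is left untouched.
add_total <- function(df) {
  df$total <- df$price * df$quantity # modifies the local copy only
  df
}

orders <- data.frame(price = c(2, 5), quantity = c(3, 1))
with_total <- add_total(orders)

names(orders)     # the original still has only price and quantity
names(with_total) # the new value carries the extra column
```

Because the original value survives, earlier steps of a pipeline can be re-run or inspected without wondering what a later step did to them.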
The context isn’t always apparent
toss_coins <- function(n) sample(c("heads","tails"), n, replace = TRUE)
toss_coins(5)
[1] "tails" "tails" "tails" "tails" "heads"
Sometime later…
toss_coins(5)
[1] "heads" "tails" "heads" "tails" "tails"
Some side effects are not as obvious.
The sample function depends not just on the arguments you pass it but also the state of a random number generator (RNG).
In software the RNG is not truly random but rather a pseudo-random process. It gives highly erratic results but follows a predictable process if you know the starting state. The starting state is seeded by some source that varies such as the date or a hardware source like /dev/random which collects noise from device drivers.
This is a side-effect for our function, causing its output to differ each time it’s run.
Indeed each time I render this document I get a different set of coin tosses. Although executable, this example (and as a consequence the whole document) isn’t reproducible.
We can make the context explicit
set.seed(1234) # set state deterministically
toss_coins(5)
[1] "tails" "tails" "tails" "tails" "heads"
Sometime later…
set.seed(1234) # reset the state again
toss_coins(5)
[1] "tails" "tails" "tails" "tails" "heads"
We can make this function call reproducible by fixing the initial state of the random number generator to a constant value with set.seed.
This level of purity may be helpful from an engineering perspective (for example this slide has a consistent checksum meaning it can be cached), but it would completely undermine certain analytical procedures, for example cross-validation, where we want to see the test/train split vary to ensure we’re not over-fitting a statistical model. We forgo reproducibility in resampling in an effort to achieve reproducibility on out-of-sample predictions.
We might not always want perfect reproducibility. The key is in making that choice consciously.
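One way to make that choice conscious is to turn the seed into an argument, so the caller decides whether a run is repeatable. This is only a sketch: toss_coins_seeded is a hypothetical variant of the function above, and calling set.seed inside it still resets the session-wide RNG state, which is itself a side-effect.

```r
toss_coins_seeded <- function(n, seed) {
  set.seed(seed) # caveat: resets the global RNG state for the whole session
  sample(c("heads", "tails"), n, replace = TRUE)
}

first  <- toss_coins_seeded(5, seed = 1234)
second <- toss_coins_seeded(5, seed = 1234)

# The same seed gives the same sequence of tosses.
identical(first, second)
```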
I/O is side-effecting
Input is a side-effect
df <- readr::read_csv("/home/robin/data-290224-final.csv")
Output is a side-effect
readr::write_csv(result, "~/results/output.csv")
A more ubiquitous source of side-effects is I/O (input/output). This doesn’t just apply to third-party APIs or our own database; even the filesystem is non-local state as far as our programs are concerned.
It’s not uncommon to see pipelines or notebooks start like this.
The filepath is idiosyncratic and this code won’t be reproducible on other people’s machines unless they coincidentally have a user called robin with this file in the home directory. Indeed this probably won’t be reproducible on Robin’s machine at a future date unless care is taken to keep that file in place.
More generally, even if the filepath is dependable, there’s no guarantee that the content of the CSV file itself won’t change.
Even the writing of files is not reproducible. This function alone can’t guarantee that the destination directory ~/results exists.
These problems become all the more apparent with APIs or databases where network interruptions and other external change can jeopardise the reproducibility of your program.
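A small sketch of how to reduce, though not eliminate, these risks: take the path as an argument and create the destination directory rather than assuming it exists. write_result is a hypothetical helper, and base R’s write.csv stands in for readr::write_csv to keep the example free of dependencies.

```r
write_result <- function(result, path) {
  # Make the precondition explicit instead of assuming the directory exists.
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  utils::write.csv(result, path, row.names = FALSE)
  invisible(path)
}

# A throwaway location so the example runs on any machine.
out <- file.path(tempfile("run-"), "results", "output.csv")
write_result(data.frame(id = 1:3), out)
file.exists(out)
```

The write is still a side-effect, but the function no longer depends on a particular user’s home directory.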
Dependency-injection is explicit
library(DBI)
my_conn <- dbConnect(...)
Instead of relying on global state:
get_data <- function() {
  dbReadTable(my_conn, "users")
}
get_data()
We can make dependencies explicit:
get_data <- function(connection) {
  dbReadTable(connection, "users")
}
get_data(my_conn)
The problem isn’t that these side-effects exist; they’re necessary.
The problem is that when they’re implicit they tend to hide dependencies.
We can make them explicit with techniques like dependency injection. The function receives a database connection, for example, as an input instead of relying on it existing in the program’s global state.
Now the function declares its requirements. It draws attention to potential side-effects and forces us to think about them. Here we can see that get_data will need us to manage the connection’s lifecycle rather than taking it for granted.
The code is more modular. We can re-use this code within the same program allowing us to get users from a test or staging database as well.
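For instance, a test can inject a throwaway in-memory database. This sketch assumes the RSQLite package is installed alongside DBI; the contents of the users table are made up for illustration.

```r
library(DBI)

get_data <- function(connection) {
  dbReadTable(connection, "users")
}

# Assumption: RSQLite is available to provide an in-memory test database.
test_conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(test_conn, "users", data.frame(id = 1:2, name = c("ada", "grace")))

# The same function works against whichever connection it is handed.
users <- get_data(test_conn)
nrow(users)

# The caller owns the connection's lifecycle, so the caller closes it.
dbDisconnect(test_conn)
```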
Execution context is explicit
Command-line arguments
R -f pipeline.r --args input.csv output.parquet
Environment variables
env DB="http://user:pw@localhost:1337" R -f pipeline.r
Configuration data
R -f pipeline.r --args configuration.yaml
Likewise we can pull side-effects through to the very edges of our pipelines, passing in state (like data or configuration) only from the outermost execution context.
This means the pipeline no longer has to be concerned with coordinating state. We may want some graceful error handling, but the pipeline itself should be context-free and reproducible.
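A sketch of what such a script might look like, assuming hypothetical helper names (read_context, mean_by_group, run_pipeline) and CSV files in place of Parquet so that only base R is needed. In R, commandArgs(trailingOnly = TRUE) returns the arguments that follow --args on the command line, and Sys.getenv reads environment variables.

```r
# Hypothetical helper: gather all external state in one place, at the edge.
read_context <- function(args = commandArgs(trailingOnly = TRUE),
                         db_url = Sys.getenv("DB", unset = NA)) {
  if (length(args) < 2) {
    stop("expected two arguments: <input> <output>")
  }
  list(input = args[1], output = args[2], db_url = db_url)
}

# Pure core: data in, data out, no knowledge of files or environment.
mean_by_group <- function(df) {
  aggregate(value ~ group, data = df, FUN = mean)
}

# Outer shell: the only place that touches the filesystem.
run_pipeline <- function(context) {
  df <- utils::read.csv(context$input)
  utils::write.csv(mean_by_group(df), context$output, row.names = FALSE)
  invisible(context$output)
}

# Usage with temporary files standing in for real inputs and outputs.
input <- tempfile(fileext = ".csv")
output <- tempfile(fileext = ".csv")
utils::write.csv(
  data.frame(group = c("a", "a", "b"), value = c(1, 3, 5)),
  input, row.names = FALSE
)
run_pipeline(read_context(args = c(input, output), db_url = NA))
result <- utils::read.csv(output)
result
```

Because read_context takes its sources as arguments with defaults, the same code runs from the command line and from a test without modification.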
Extending this practice into the execution context itself is what leads to the Infrastructure as Code and GitOps/DevOps movements, where machines and services can themselves be provisioned reproducibly from declarative configuration.
Functional pipeline, configured context
Pursuing a separation of these concerns - what Gary Bernhardt has called functional core, imperative shell - leads us to a point where all of the dependencies are captured explicitly and their values gathered together into configuration.
Here we have a pipeline composed of pure functions with all the necessary side-effects contained to explicitly configured contexts at each end.
This makes it easier to maintain reproducible code. When the code is changed you can see how the pieces fit together and what the consequences of a refactoring are on the rest of the pipeline’s code base. When the infrastructure changes you may be lucky enough to only need to change the configuration and not the code at all.
This also helps sub-divide the code into well-factored modules as each component explicitly declares its requirements making them easier to test in isolation. The preceding example makes it trivial to pass in a test database connection.
You’ll note that this hasn’t really solved the ultimate cause of our problems. As I explained at the outset, we shouldn’t expect to ever be able to fix the rest of the universe in place. The best we can do is hope to contain the unreliable bits, pushing them to the edges so we can carve out a space to pursue reproducibility. A small domain where we, like Maxwell’s demon, can reverse entropy and bring order to our world.
Lessons from software development
We can go a step further than the above and strive for reproducibility in our interaction with data outside of the pipeline.
In these final slides I’ll discuss some lessons the data community can learn from software development. In particular version control and continuous delivery.
Versions are values over time
We’ve seen that the functional approach leads us to immutability - data that is unchanging. Steps in your pipeline can only communicate with one another through their arguments and return values. We aren’t mutating the objects passing through the pipeline or the global state. The intermediate values are immutable.
But we need to change things to do useful work over time as the external context changes, for example as upstream data sources are updated. How do we cope when the upstream source is mutable? We can use versions to control external change.
Versions identify immutable states of data as it evolves over time.
It is impossible to step in the same river twice
Heraclitus c. 500 BC, possibly apocryphal
What is a river? A place? Some water?
As you stand in a stream you can feel the water flowing past your feet. You won’t step into that same water again yet we still think of that place as the river.
You can’t step in the same river twice. Time marches on. The river is formed of a different body of water and you too will be a different person.
Here’s Heraclitus standing in a river…
… different gallons of water flow round his feet…
… what’s constant is the riverbed.
It’s helpful for us to distinguish between the water and the riverbed.
This is how we name rivers. The name refers to the river bed, not the water in it!
We can apply this as a metaphor for data pipelines. We distinguish the output from each instance of it.
alisonhorst/palmerpenguins v0.1.0
s3://alisonhorst/palmerpenguins.csv?versionId=c19a904
Ultimately the version is identified uniquely only by the values in the data. This becomes cumbersome so we need a more succinct version identifier. We can identify versions by a name formed of two parts. One is a location or label we can use as a reference. The other is the version or state in time. That gives us the named output and its versioned instances.
You should certainly use a version control tool like git to manage your code and models but line-by-line diffs and patches would be a very inefficient way to store data versions. For data, the parallel here isn’t to source control but to software releases.
Data should be treated as an artifact, frozen in time. Artifact repositories will let you record provenance - for example which version of the source or which run of the pipeline was responsible.
You don’t need a bespoke tool, even a simple S3 bucket will let you update a key in place while it records the version history. Some databases support versioning natively or you can create your own snapshots.
This version history lets us retrieve the exact conditions for a pipeline run so we can roll back and reproduce the results.
Be wary of any upstream source that can mutate over time without providing some means of identifying and distinguishing versions. You can of course defend against this to an extent by keeping track of the data you receive with hashes or recording copies of API transactions in caches.
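As a sketch of the hashing defence, the snippet below records a checksum for a file and refuses to read it if the contents later change. content_version and read_versioned are hypothetical helpers; tools::md5sum ships with R.

```r
# Hypothetical helper: identify a file's contents by its MD5 checksum.
content_version <- function(path) unname(tools::md5sum(path))

# Refuse to read data whose contents differ from the recorded version.
read_versioned <- function(path, expected) {
  actual <- content_version(path)
  if (!identical(actual, expected)) {
    stop("Data at ", path, " has changed: expected ", expected, ", got ", actual)
  }
  utils::read.csv(path)
}

path <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(x = 1:3), path, row.names = FALSE)
recorded <- content_version(path)

df <- read_versioned(path, recorded) # contents match the recorded version

# Simulate an upstream change: same location, different contents.
utils::write.csv(data.frame(x = 1:4), path, row.names = FALSE)
changed <- !identical(content_version(path), recorded)
changed
```

The path plays the role of the riverbed and the checksum identifies the water: the same name can now be told apart across versions.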
Be a good data citizen and surface versioning information about your own outputs to downstream consumers so that they might ensure reproducibility in their workflow.
Automation proves reproducibility
Builds should be:
- deterministic
- automated
- ephemeral
Just because you can re-run your pipeline locally, it doesn’t prove that it’s reproducible. It’s not demonstrably reproducible until it’s running on ephemeral resources in a build system.
A build system runs your workflow recording the versions of the input data, source code, and output. The pipeline run itself is immutable and identified for posterity with a build number. Done correctly, we don’t need to reproduce the pipeline. This contract serves as a guarantee that you’ll simply get the same result.
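To make the idea of a recorded run concrete, here is a sketch of the kind of manifest a build might store next to its output. It is not a build system: build_manifest is a hypothetical helper, and a real setup would also record the source revision and a build number.

```r
# Hypothetical helper: capture what a pipeline run depended on.
build_manifest <- function(input_paths, packages) {
  list(
    r_version = R.version.string,
    packages = vapply(
      packages,
      function(p) as.character(utils::packageVersion(p)),
      character(1)
    ),
    inputs = tools::md5sum(input_paths) # checksum of each input file
  )
}

input <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(x = 1:3), input, row.names = FALSE)

manifest <- build_manifest(input, packages = c("stats", "utils"))
str(manifest)
```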
This requires that the process of assembling your dependencies is automated and reproducible itself. No more hunting through messages or chasing colleagues to find the random Excel file that makes the pipeline work. You can’t expect a build server to study the readme and figure it out for itself. Tacit knowledge must be explicitly codified. You can’t claim that everything’s fine just because “it works on my machine”. The build server is a shared consensus on configuration. A canonical source you can refer to in order to see how things work. Automation enables collaboration.
Building on an ephemeral stack further enforces the discipline by preventing you from relying on state. I’d argue that the success of containerisation comes as much from virtualisation as from the fact that Dockerfiles, for example, automate dependency management down to the operating system level.
Don’t just say it’s reproducible. Prove it!
Continuous Everything
- Continuous integration
- run, test and package code
- Continuous training
- fit parameters or run experiments
- Continuous delivery
- deploy models and applications for inference
- Continuous evaluation
- calculate and monitor performance metrics
Once you have versioned dependencies and an automated build system you can develop continuously.
A reproducible approach enables continuous cycles as it ensures you can always rebuild something from scratch and let it run idempotently (which is to say repeatedly with the same effect).
By having the discipline to avoid relying on state mutation you can stop worrying about it. When you don’t need to manage or even think about side-effects it’s easy to cancel things, restart them, or even scale horizontally (it’s easier to spin-up concurrent compute resources if you don’t need to coordinate state).
When you can run a system continuously you establish tighter feedback loops. You can understand the impact of changes sooner. Feedback loops help you to iterate quickly and catch issues early to prevent bugs. Repeated execution encourages us to streamline our architecture helping us become more efficient and reduce costs. Continuously exercising and testing your system helps to ensure reliability by giving weakness no space to hide.
There’s an overwhelming choice of Continuous Integration and Continuous Deployment (CI/CD) tools for software development, like GitHub Actions, Travis or Jenkins. There are also tools and workflow systems specific to data and model development, like Apache Airflow.
Notebooks as an anti-pattern
In closing I’d just like to urge caution about the use of Jupyter notebooks.
They’re effective for exploratory work but I’ve found them to discourage reproducible data analysis.
I should start by saying that exploratory data analysis doesn’t need to be reproducible. We should have the freedom to explore the data interactively as we build up our understanding. We should expect to encounter dead-ends. We should be prepared to throw away code.
Read-Evaluate-Print-Loop
This is where REPL-driven development shines. In RStudio, Emacs, or VS Code you have two windows: one to edit code (here on the left) and the other to run it (on the right). You send code from the editor to the console in a read-evaluate-print-loop process: the code is read and then evaluated by the R interpreter, with the result printed inline in the console; we then loop back to a prompt ready for more input.
The REPL lets us write code experimentally - we can execute each line to confirm it works as intended. You can tweak code in the REPL console until you’ve got it right. Once you find something you want to use, share or just keep for future reference you can copy it back up to the editor. This lets us iterate quickly in a carefree, non-reproducible way before finding something worth making reproducible. We can explore methods without needing to worry about architecture.
The Jupyter notebook approach mixes these two concerns. You end up with a long document full of half-discarded experiments, which invariably becomes dependent on a non-linear execution sequence that’s impossible to reproduce. People don’t tend to think or write code linearly, and we should embrace that, not punish it. Editing is a fundamental programming skill.
Of course you can edit in Jupyter, but it doesn’t arise as naturally because the workflow doesn’t require it.
You can still do literate programming without notebooks. You can integrate code and documentation using alternative tools like Quarto for creating articles, presentations or books reproducibly. You can start from your report and develop backwards, pulling the analysis into shape. For those that haven’t tried it yet I’d also recommend building packages in R for any non-trivial code as you can use tools like roxygen2 to do documentation-driven design. Before you write any code you start with a vignette explaining how to use it, which helps to ensure your syntax and semantics are intuitive and well-factored. You can then pull the implementation into existence. Likewise RDocs let you document each function with examples that are executed as part of the package build process. This lets you do “specification by example” along with your unit tests.
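As a sketch of what this looks like with roxygen2 comments, here is the describe_count function from earlier written as it might appear in a package source file. The tags shown (@param, @return, @examples, @export) are standard roxygen2 tags; the wording of the documentation is illustrative.

```r
#' Describe a count
#'
#' Renders a count as a short piece of text without printing it.
#'
#' @param count A single number.
#' @return A length-one character vector.
#' @examples
#' describe_count(3)
#' @export
describe_count <- function(count) {
  paste("Count is", count)
}
```

The line under @examples doubles as a lightweight specification: it is run when the package is checked, so it fails loudly if the function stops working.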
By separating reproducible literate programming from ephemeral interactive exploration we get the best of both worlds and avoid the risks of one contaminating the other. We can re-run the build process from scratch in an isolated environment and can’t get ourselves twisted up with state mutation. If it’s too slow to rebuild and threatening our fast feedback cycle then we can cache chunks or write intermediate results out to disk.
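Writing intermediate results out to disk can be as simple as the sketch below. cached is a hypothetical helper built on base R’s saveRDS and readRDS; the counter exists only to show that the slow step runs once. The cache is keyed by path alone, so it has to be cleared whenever the inputs or the code change.

```r
# Hypothetical helper: compute a value once and reuse it from disk afterwards.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  value <- compute()
  saveRDS(value, path)
  value
}

cache_path <- tempfile(fileext = ".rds")
calls <- 0
slow_step <- function() {
  calls <<- calls + 1 # demonstration only: count how often this really runs
  sum(1:100)
}

first  <- cached(cache_path, slow_step) # computes and writes the cache
second <- cached(cache_path, slow_step) # reads the cached value back
c(first, second, calls)
```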
Engineering reproducibility
- Decay is a physical inevitability but maths is eternal
- Pure functions on immutable data are reproducible
- Keep the core of your data flow as pure as possible
- Extract side-effects and inject dependencies explicitly
- Use version control for code, models and data
- Automate workflows for continuous development