Re-run, Repeat, Reproduce, Reuse, Replicate: Transforming Code into Scientific Contributions

Scientific code is different from production software. Scientific code, by producing results that are then analyzed and interpreted, participates in the elaboration of scientific conclusions. This imposes specific constraints on the code that are often overlooked in practice. We articulate, with a small example, five characteristics that a scientific code in computational science should possess: re-runnable, repeatable, reproducible, reusable, and replicable. The code should be executable (re-runnable) and produce the same result more than once (repeatable); it should allow an investigator to reobtain the published results (reproducible) while being easy to use, understand and modify (reusable), and it should act as an available reference for any ambiguity in the algorithmic descriptions of the article (replicable).

Replicability is a cornerstone of science. If an experimental result cannot be re-obtained by an independent party, it merely becomes, at best, an observation that may inspire future research (Mesirov, 2010; Open Science Collaboration, 2015). Replication issues have received increased attention in recent years, with a particular focus on medicine and psychology (Iqbal, Wallach, Khoury, Schully, & Ioannidis, 2016). One could think that computer science would mostly be shielded from such issues, since a computer program describes precisely what it does and is easily disseminated to other researchers without alteration.
But precisely because it is easy to believe that if a program runs once and gives the expected results it will do so forever, crucial steps to transform working code into meaningful scientific contributions are rarely undertaken (Sandve, Nekrutenko, Taylor, & Hovig, 2013; Schwab, Karrenbach, & Claerbout, 2000). Computer science is plagued by replication problems, in part, because it seems impervious to them. A program can fail as a scientific contribution in many different ways for many different reasons. Borrowing the terms coined by Goble (2016), for a program to contribute to science, it should be re-runnable (R1), repeatable (R2), reproducible (R3), reusable (R4) and replicable (R5). Let us illustrate this with a small example, a random walk (Hughes, 1995) written in Python, which we call the R0 version.
In this code, the random.choice function randomly returns either +1 or -1. The instruction "for i in xrange(10):" executes the next three indented lines ten times. Executed, this program would display:
    -1, 0, -1, 0, -1, 0, -1, 0, 1, 2 # with the steps being -1, +1, -1, +1, -1, +1, -1, +1, +1, +1
What could go wrong with such a simple program? Well…

Re-runnable (R1)
Have you ever tried to re-run a program you wrote some years ago? It can often be frustratingly hard. Part of the problem is that technology is evolving at a fast pace and you cannot know in advance how the system, the software and the libraries your program depends on will evolve. Since you wrote the code, you may have reinstalled or upgraded your operating system. The compiler, interpreter or set of libraries installed may have been replaced with newer versions. You may find yourself battling with arcane issues of library compatibility, thoroughly orthogonal to your immediate research goals, just to execute again code that worked perfectly before. To be clear, it is impossible to write future-proof code, and the best efforts can be stymied by the smallest change in one of the dependencies. At the same time, modernizing an unmaintained ten-year-old code can reveal itself to be an arduous and expensive undertaking, and a precarious one, since each change risks affecting the semantics of the program. Rather than trying to predict the future or painstakingly dusting off old code, an often more straightforward solution is to recreate the old execution environment. For this to happen, however, the dependencies in terms of systems, software and libraries must be made clear enough.
A re-runnable code is one that can be run again when needed, and in particular more than the one time that was needed to produce the results. It is important to notice that the re-runnability of a code is not an intrinsic property. Rather, it depends on the context, and becomes increasingly difficult as the code ages. Therefore, to be and remain re-runnable on other researchers' computers, a re-runnable code should describe, with enough detail to be recreated, an execution environment in which it is executable. As shown by Collberg and Proebsting (2016), this is far from being either obvious or easy. In our case, the R0 version of our tiny walker seems to imply that any version of Python would be fine. This is not the case: it uses the print statement and the xrange function, both specific to Python 2. The print statement, available in Python 2 (a version still widely used; support is scheduled to stop in 2020), has been replaced in Python 3 (first released in 2008, almost a decade ago) by a print function, while xrange has been replaced by range in Python 3. In order to try to future-proof the code a bit, we might as well target Python 3, as is done in the R1 version. Incidentally, it remains compatible with Python 2. But whichever version is chosen, the crucial step here is to document it.
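As an illustration of what a re-runnable version could look like, here is a minimal sketch in the spirit of the R1 step. It is not the authors' listing: the function name `walk` and the wording of the header comment are assumptions made for this example, and the version note is a placeholder to be filled in with the interpreter actually used.

```python
# Illustrative sketch only (hypothetical names, not the published R1 listing).
# Documenting the environment is the important part, for example:
#   "Tested with CPython 3.x" (replace with the exact version you ran)
import random


def walk(count=10):
    """Return the successive positions of a +1/-1 random walk of `count` steps."""
    x = 0
    positions = []
    for _ in range(count):              # range, not the Python 2 only xrange
        x += random.choice([-1, +1])    # each step is +1 or -1
        positions.append(x)
    return positions


if __name__ == "__main__":
    # print is used as a function, as required by Python 3
    print(", ".join(str(p) for p in walk()))
```

The walk is not seeded here, so each run prints a different sequence; only the ability to run the code is addressed at this stage.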

Repeatable (R2)
The code is running and producing the expected results. The next step is to make sure that you can produce the same output over successive runs of your program. In other words, the next step is to make your program deterministic, producing repeatable output. Repeatability is valuable. If a run of the program produces a particularly puzzling result, repeatability allows you to scrutinize any step of the execution of the program by re-running it with extra print statements, or inside a debugger. Repeatability is also useful to prove that the program did indeed produce the published results. Repeatability is not always possible or easy (Diethelm, 2012; Courtès & Wurmus, 2015). But for sequential programs not depending on analog inputs, it often comes down to controlling the initialization of the pseudorandom number generators (RNG).
For our program, that means setting the seed of the random module. We may also want to save the output of the program to a file, so that we can easily verify that consecutive runs do produce the same output: eyeballing differences is unreliable and time-consuming, and therefore won't be done systematically. Setting seeds should be done carefully. Using 439 as a seed in the previous program would result in ten consecutive +1 steps, which, although a perfectly valid random walk, lends itself to a gross misinterpretation of the overall dynamics of the algorithm. Verifying that the qualitative aspects of the results and the conclusions that are made are not tied to a specific initialization of the pseudorandom generator is an integral part of any scientific undertaking in computational science; this is usually done by repeating the simulations multiple times with different seeds.
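The three ideas of this section (seeding, saving the output to a file, and checking several seeds) can be sketched as follows. The function names `seeded_walk`, `save_walk` and `final_positions`, as well as the file format, are hypothetical choices made for this example and are not taken from the R2 listing.

```python
import random


def seeded_walk(count=10, seed=1):
    """Random walk whose steps are fully determined by `seed`."""
    random.seed(seed)                   # fix the RNG state before any draw
    x, positions = 0, []
    for _ in range(count):
        x += random.choice([-1, +1])
        positions.append(x)
    return positions


def save_walk(positions, path):
    """Write the walk to a text file so that two runs can be compared with a diff."""
    with open(path, "w") as handle:
        handle.write(", ".join(str(p) for p in positions) + "\n")


def final_positions(seeds, count=10):
    """Final position for each seed, a quick look at seed-to-seed variability."""
    return {seed: seeded_walk(count, seed)[-1] for seed in seeds}
```

Calling `seeded_walk(10, seed=1)` twice in the same environment should return the same list, while `final_positions(range(20))` gives a first impression of how much the outcome depends on the seed.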

Reproducible (R3)
The R2 code seems fine enough, but it hides several problems that come to light when trying to reproduce results. A result is said to be reproducible if another researcher can take the original code and input data, execute it, and re-obtain the same result (Peng, Dominici, & Zeger, 2006). As explained by Donoho, Maleki, Rahman, Shahram, and Stodden (2009), scientific practice must expect that errors are ubiquitous, and therefore be robust to them. Ensuring reproducibility is a fundamental step toward this: it provides other researchers the means to verify that the code does indeed produce the published results, and to scrutinize the procedures it used to produce them. As demonstrated by Mesnard and Barba (2016), reproducibility is hard.
For instance, the R2 program will not produce the same results all the time. It will, because it is repeatable, produce the same results over repeated executions. But it will not necessarily do so over different execution environments. The cause is to be found in a change that occurred in the pseudorandom number generator between Python 3.2 and Python 3.3. Executed with Python 2.7 to 3.2, the code will produce the sequence -1, 0, 1, 0, -1, -2, -1, 0, -1, -2. But with Python 3.3 to 3.6, it will produce -1, -2, -1, -2, -1, 0, 1, 2, 1, 0. With future versions of the language, it may change again. For the R3 version, we abandon the use of the random.choice function in favor of the random.uniform function, whose behavior is consistent across versions 2.7 to 3.6 of Python.
Because any dependency of a program, down to the most basic one, the language itself, can change its behavior from one version to another, executability (R1) and determinism (R2) are necessary but not sufficient for reproducibility. The exact execution environment used to produce the results must also be specified, rather than the broadest set of environments where the code can be effectively run. In other words, assertions such as "the results were obtained with CPython 3.6.1" are more valuable, in a scientific context, than "the program works with Python 3.x and above". With the increasing complexity of computational stacks, retrieving and deciding what is pertinent (CPU architecture? operating system version? endianness?) might be non-trivial. A good rule of thumb is to include more information than necessary rather than not enough, and some rather than none.
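One possible way to collect such information automatically is sketched below, using only the standard `platform` and `sys` modules. The function name `describe_environment` and the choice of fields are assumptions for this example; which fields are pertinent depends on the computation.

```python
import platform
import sys


def describe_environment():
    """Collect a few facts about the interpreter and the machine running the code."""
    return {
        "python_implementation": platform.python_implementation(),  # e.g. "CPython"
        "python_version": platform.python_version(),                # e.g. "3.6.1"
        "system": platform.system(),                                # operating system name
        "release": platform.release(),                              # operating system release
        "machine": platform.machine(),                              # CPU architecture
        "byteorder": sys.byteorder,                                 # endianness
    }


if __name__ == "__main__":
    for key, value in sorted(describe_environment().items()):
        print("{}: {}".format(key, value))
```

Storing this dictionary next to the results errs on the side of recording too much rather than too little, in line with the rule of thumb above.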
Recording the execution environment is only the first step.
The R2 program uses a random seed but does not keep a trace of it except in the code. Should the code change after the production of the results, someone provided with the last version of the code will not be able to know which seed was used to produce the results, and would need to iterate through all possible random seeds, an impossible task in practice.
This is why result files should come alongside their context, i.e. an exhaustive list of the parameters used as well as a precise description of the execution environment, as the R3 code does. The code itself is part of that context: the version of the code must be recorded. It is common for different results or different figures to have been generated by different versions of the code. Ideally, all results should originate from the same (and last) version of the code. But for long or expensive computations, this may not be feasible. In that case, the result files should contain the version of the code that was used to produce them.
This information can be obtained from the version control software. This also makes it possible, if some errors are found and corrected after some results have been obtained, to identify which ones should be recomputed. In R3, the code records the git revision, and whether the repository holds uncommitted changes when the computation starts.
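A minimal sketch of how the revision and the presence of uncommitted changes could be recorded is given below. It assumes that the `git` command-line tool is installed and that the code lives in a git repository; when that is not the case the sketch falls back to `None` rather than failing. The names `git_state` and `results_with_context` are hypothetical and do not come from the R3 listing.

```python
import datetime
import subprocess


def git_state(repo_dir="."):
    """Return (revision, dirty) for the repository, or (None, None) if unavailable."""
    try:
        revision = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=repo_dir, stderr=subprocess.DEVNULL
        ).decode("ascii").strip()
        status = subprocess.check_output(
            ["git", "status", "--porcelain"], cwd=repo_dir, stderr=subprocess.DEVNULL
        ).decode("utf-8", "replace")
    except (OSError, subprocess.CalledProcessError):
        return None, None               # git missing, or not inside a repository
    # Any line in the porcelain output means modified or untracked files
    return revision, bool(status.strip())


def results_with_context(walk, seed, repo_dir="."):
    """Bundle the data with the parameters and the code version that produced it."""
    revision, dirty = git_state(repo_dir)
    return {
        "data": walk,
        "seed": seed,
        "git_revision": revision,
        "git_dirty": dirty,
        "timestamp": datetime.datetime.now().isoformat(),
    }
```

Note that this sketch counts untracked files as uncommitted changes, which is a deliberately strict reading of "every file has been committed".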
Published results should obviously come from a version of the code where every change and every file has been committed. This includes pre-processing, post-processing and plotting code. Plotting code may seem mundane, but it is as vulnerable as any other piece of the code to bugs and errors. When it comes to checking that the reproduced data match those published in the article, however, figures can reveal themselves to be imprecise and cumbersome, and sometimes plain unusable. To avoid having to manually overlay pixelated plots, published figures should be accompanied by their underlying data (coordinates of the plotted points) in the supplementary data to allow straightforward numeric comparisons.
Another good practice is to make the code self-verifiable. In R3, a short unit test is provided that allows the code to verify its own reproducibility. Should this test fail, then there is little hope of reproducing the results. Of course, passing the test does not guarantee anything.
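Such a self-check could look like the sketch below. The reference walk is an assumption of this example: it is the sequence expected from CPython 3 for a seed of 1 when the sign of random.uniform(-1, +1) decides the step, and it should be regenerated from a trusted run rather than copied. The names `generate_walk`, `REFERENCE` and `self_test` are illustrative.

```python
import random


def generate_walk(count, seed):
    """Random walk driven by the sign of random.uniform(-1, +1)."""
    random.seed(seed)
    x, walk = 0, []
    for _ in range(count):
        x += 1 if random.uniform(-1, +1) > 0 else -1
        walk.append(x)
    return walk


# Reference recorded for count=10, seed=1 (assumed here, to be re-recorded
# in the environment that produced the published results).
REFERENCE = [-1, 0, 1, 0, -1, -2, -1, 0, -1, -2]


def self_test():
    """Return True if this environment reproduces the reference walk."""
    return generate_walk(count=10, seed=1) == REFERENCE


if __name__ == "__main__":
    if not self_test():
        raise SystemExit("Self-test failed: results are unlikely to be reproduced.")
```

Running the check before any computation turns a silent divergence into an explicit failure, although, as noted above, a passing test is no guarantee.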
It is obvious that reproducibility implies availability. As shown by Collberg and Proebsting (2016), code is often unavailable, or only available upon request. While the latter may seem sufficient, changes in email address, changes in career, retirement, a busy inbox or poor archiving practices can make a code just as unreachable. Code, input data and result data should be available with the published article, as supplementary data, or through a DOI link to a scientific repository such as Figshare or Zenodo. The codes presented in this article are available in the GitHub repository github.com/rougier/random-walk and at doi.org/10.5281/zenodo.848217.
To recap, reproducibility implies re-runnability, repeatability and availability, yet imposes additional conditions. Dependencies and platforms must be described as precisely and as specifically as possible. Parameter values and inputs should accompany the result files. The data and scripts behind the graphs must be published. Unit tests are a good way to embed self-diagnostics of reproducibility in the code. Reproducibility is hard, yet tremendously necessary.

Reusable (R4)
Making your program reusable means it can be easily used, and modified, by you and other people, inside and outside your lab. Ensuring your program is reusable is advantageous for a number of reasons.
For you, first. Because the you now and the you in two years are two different persons. Details on how to use the code, its limitations, its quirks, may be present to your mind now, but will probably escape you in six months (Donoho et al., 2009). Here, comments and documentation can make a significant difference. Source code reflects the results of the decisions that were made during its creation, but not the reasons behind those decisions. In science, where the method and its justification matter as much as the results, those reasons are precious knowledge. In that context, a comment on how a given parameter was chosen (optimization, experimental data, educated guess) or why a library was chosen over another (conceptual or technical reasons?) is valuable information.
Reusability of course directly benefits other researchers from your team and outside of it. The easier it is to use your code, the lower the threshold is for others to study, modify and extend it. Scientists constantly face the constraint of time: if a model is available, documented, and can be installed, run and understood all in a few hours, it will be preferred over another that would require weeks to reach the same stage. A reproducible and reusable code offers a platform both verifiable and easy to use, fostering the development of derivative works by other researchers on solid foundations. Those derivative works contribute to the impact of your original contribution.
Having more people examining and using your code also means that potential errors have a higher chance of being caught. If people start using your program, they will most likely report bugs or malfunctions they encounter. If you're lucky enough, they might even propose either bug fixes or improvements, hence improving the overall quality of your software. This process contributes to long-term reproducibility to the extent that people continue to use and maintain the program.
Despite all this, reusability is often overlooked, and it is not hard to see why. Scientists are rarely trained in software engineering, and reusability can represent an expensive endeavour if undertaken as an afterthought, for little tangible short-term benefit, for a codebase that might, after all, see only a single use. And, in fact, reusability is not as indispensable a requirement as re-runnability, repeatability and reproducibility. Yet, some simple measures can tremendously increase reusability, and at the same time strengthen reproducibility and re-runnability over the long term.
Avoid hardcoded or magic numbers. Magic numbers are numbers present directly in the source code, that do not have a name and therefore can be difficult to interpret semantically. Hardcoded values are values that cannot be changed through a function argument or a parameter configuration file. To be modified, they require editing the code, which is cumbersome and error-prone. In the R3 code, the seed and the number of steps are respectively hardcoded and magic.
Similarly, code behavior should not be changed by commenting/uncommenting code (Wilson et al., 2017). Modification of the behavior of the code, required when different experiments examine slightly different conditions, should always be explicitly set through parameters accessible to the end-user. This improves reproducibility in two ways: it allows those conditions to be recorded as parameters in the result files, and it makes it possible to define separate scripts to run, or configuration files to load, to produce each of the figures of the published paper. With documentation explaining which script or configuration file corresponds to which experiment, reproducing the different figures becomes straightforward.
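One common way to expose such parameters to the end-user is a small command-line interface, sketched here with the standard `argparse` module. The option names (`--count`, `--seed`, `--x0`), their defaults and the function name `parse_parameters` are assumptions made for this example, not part of the published code.

```python
import argparse


def parse_parameters(argv=None):
    """Read the experiment parameters from the command line, with named defaults."""
    parser = argparse.ArgumentParser(description="Random walk (illustrative sketch)")
    parser.add_argument("--count", type=int, default=10, help="number of steps")
    parser.add_argument("--seed", type=int, default=1, help="seed of the random generator")
    parser.add_argument("--x0", type=int, default=0, help="initial position")
    # vars() turns the namespace into a dictionary that can be saved with the results
    return vars(parser.parse_args(argv))


if __name__ == "__main__":
    print(parse_parameters())
```

Because the parameters come back as a dictionary, they can be written verbatim into the result file, and each figure of a paper can be tied to one documented command line.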
    # Copyright (c) 2017 Nicolas P. Rougier and Fabien C.Y. Benureau
    # Release under the BSD 2-clause license
    # Tested with CPython 3.6.2 / macOS 10.12.6 / 64 bits architecture
    import sys, subprocess, datetime, random

    def generate_walk(count, x0=0, step=1, seed=0):
        """ Random walk
            count: number of steps
            x0   : initial position (default 0)
            step : step size (default 1)
            seed : seed for the initialization of the random generator (default 0)
        """
        random.seed(seed)
        x = x0
        walk = []
        for i in range(count):
            # Move by `step` in the direction given by the sign of a uniform draw
            if random.uniform(-1, +1) > 0:
                x += step
            else:
                x -= step
            walk.append(x)
        return walk

    def generate_results(count, x0=0, step=1, seed=0):
        """Compute a walk and return it alongside its context"""
        # ...

Documentation is one of the most potent tools for reusability. Proper documentation on how to install and run the software often makes the difference between other researchers managing to use it or not. A comment describing what each function does, however evident, can avoid hours of head-scratching. Great code may need few comments. Scientists, however, are not always brilliant developers. Of course, bad, complicated code should be rewritten until it is simple enough to explain itself. But realistically, this is not always going to be done: there is simply not enough incentive for it. There, a comment that explains the intentions and reasons behind a block of code can be tremendously useful.
Reusability is not a strict requirement for scientific code. But it has many benefits, and a few simple measures can foster it considerably. To complement the R4 version provided here, we provide an example repository of a re-runnable, repeatable, reproducible and reusable random walk code. The repository is available on GitHub at github.com/benureau/r5 and at doi.org/10.5281/zenodo.848284.

Replicable (R5)
Making software reusable offers an additional way to find errors, especially if your scientific contribution is popular. Unfortunately, this is not always effective, and some recent cases have dramatically illustrated the considerable impact a bug can have in science (Eklund, Nichols, & Knutsson, 2016) or in our everyday life (Durumeric et al., 2014). This is why, as explained by Peng et al. (2006), the replication of important findings by multiple independent investigators is fundamental to the accumulation of scientific evidence.
Replicability is the implicit assumption made by any article that does not provide its source code: that the description it provides of the algorithms is sufficiently precise and complete to re-obtain the results it presents. While every published article should strive for replicability, it is seldom obtained. In fact, absent an explicit effort to make an algorithmic description replicable, there is little probability that it will be.
This is because most papers strive to communicate the main ideas behind their contribution in terms as simple and as clear as possible, so that readers may easily understand them and the results that are presented. Trying to ensure replicability in the main text adds a myriad of esoteric details that are not conceptually significant and clutter the explanations. Therefore, unless the writer dedicates an addendum or a section of the supplementary information to technical details specifically aimed at replicability, the information will not be there, because the incentives run against including it.
But even when those details are present, the best efforts may fall short because of an oversight, a typo or a difference between what is evident for the writer and for the reader (Mesnard & Barba, 2016). Minute changes in the numerical estimation of a common first-order differential equation can have a significant impact (Crook, Davison, & Plesser, 2013). Hence, a reproducible code plays an important role alongside its article: it is an objective catalog of all the implementation details.
A researcher seeking to replicate published results might first consider only the article. If she fails to replicate the results, she will consult the original code, and with it be able to pinpoint why her code and the code of the authors differ in behavior. Because of a mistake on their part? Hers? Or a difference in a seemingly innocuous implementation detail? A fine analysis of why a particular algorithmic description is lacking or ambiguous, or why a minor implementation decision is in fact crucial to obtain the published results, is of great scientific value. Such an analysis can only be done with access to both the article and the code. With only the article, the researcher will often be unable to understand why she failed to replicate the results, and will naturally be inclined to only report replication successes.
Replicability, therefore, does not negate the necessity of reproducibility. In fact, it often relies on it. To illustrate this, let us consider what could be the textual description of the random walker (as it would be written in an article describing it):
    "The model uses the Mersenne Twister generator initialized with the seed 1. At each iteration, a uniform number between -1 (included) and +1 (excluded) is drawn and the sign of the result is used for generating a positive or negative step."
This description, while somewhat precise, forgoes, as is common, the initialization of the variables (here the starting value of the walk: 0), and the technical details about which implementation of the RNG is used. It may look innocuous. After all, the Python documentation states that "Python uses the Mersenne Twister as the core generator. It produces 53-bit precision floats and has a period of 2**19937-1". Someone trying to replicate the work, however, might choose to use the RNG from the NumPy library. The NumPy library is extensively used in the science community, and it provides an implementation of the Mersenne Twister generator too. Unfortunately, the way the seed is interpreted by the two implementations is different, yielding different random sequences.
We are able to replicate exactly the behavior of the pure-Python random walker with NumPy by setting the internal state of the NumPy RNG appropriately, but only because we have access to specific technical details (the use of the random module of the standard Python library of CPython 3.6.1), or to the code itself.
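The state-copying idea can be sketched as follows. This is an illustration written for this text, not the authors' listing: it assumes that NumPy is installed and that its legacy `RandomState` generator is used, and the function names `python_walk` and `numpy_walk` are hypothetical. The import is guarded so that the pure-Python part remains usable without NumPy.

```python
import random

try:
    import numpy as np                  # optional dependency of this sketch
except ImportError:
    np = None


def python_walk(count, seed=1):
    """Walk driven by the standard library generator."""
    random.seed(seed)
    x, walk = 0, []
    for _ in range(count):
        x += 1 if random.uniform(-1, +1) > 0 else -1
        walk.append(x)
    return walk


def numpy_walk(count, seed=1):
    """Same walk drawn from NumPy, after copying the state of Python's generator."""
    random.seed(seed)
    # getstate() returns (version, internal state, gauss_next); the internal
    # state holds 624 Mersenne Twister words followed by a position index.
    _, internal, _ = random.getstate()
    keys = np.array(internal[:-1], dtype=np.uint32)
    rng = np.random.RandomState()
    rng.set_state(("MT19937", keys, internal[-1]))
    x, walk = 0, []
    for _ in range(count):
        x += 1 if rng.uniform(-1, +1) > 0 else -1
        walk.append(x)
    return walk
```

Seeding NumPy directly with the same integer would not be expected to give the same walk, since the two libraries derive their internal state from the seed differently; copying the state sidesteps that difference.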
But there are still more subtle problems with the description given above. If we look more closely at it, we can realize that nothing is said about the specific case of 0 when generating a step. Do we have to consider 0 to be a positive or a negative step? Without further information and without the original code, it is up to the reader to decide. Likewise, the description is ambiguous regarding the first element of the walk. Is the initialization value included (it was not in our codes so far)? This slight difference might affect the statistics of short runs.
All these ambiguities in the description of an algorithm pile up; some are inconsequential (the 0 case has null probability), but some may affect the results in important ways. They are mostly inconspicuous to the reader and, oftentimes, to the writer as well. In fact, the best way to ferret out the ambiguities, big and small, of an article is to replicate it. This is one of the reasons why the ReScience journal (Rougier et al., 2017) has been created. This journal targets computational research and encourages the explicit replication of already published research, promoting new and open-source implementations in order to ensure that the original research is reproducible.
Code is an integral part of any submission to the ReScience journal. During the review process, reviewers run the submitted code, may criticise its quality and its ease of use, and verify the reproducibility of the replication. The Journal of Open Source Software (Smith et al., 2017) functions similarly: testing the code is a fundamental part of the review process.

Conclusion
Throughout the evolution of a small random walk example implemented in Python, we illustrated some of the issues that may plague scientific code. The code may be correct and of good quality, but still many problems may reduce its contribution to scientific knowledge. To make these problems explicit, we articulated five characteristics that a code should possess to be a useful part of a scientific publication: it should be re-runnable, repeatable, reproducible, reusable and replicable.
Running old code on tomorrow's computers and software stacks may not be possible. But recreating the old code's execution environment may be: to ensure the long-term re-runnability of a code, its execution environment must be documented. For our example, a single comment went a long way to transform the R0 code into the R1 (re-runnable) one.
Science is built on verifying the results of others. This is harder to do if each execution of the code produces a different result. While for complex parallel workflows this may not be possible, in all instances where it is feasible the code should be repeatable. This allows future researchers to examine exactly how a specific result was produced. Most of the time, what is needed is to set or record the initial state of the pseudorandom number generator, as was done in the R2 (repeatable) version.
Even more care is needed to make a code reproducible. The exact execution environment, code and parameters used must be recorded and embedded in the result files, as the R3 (reproducible) version does. Furthermore, the code must be made available as supplementary data with the whole computational workflow, from preprocessing steps to plotting scripts.
Making code reusable is a stretch goal that can yield tremendous benefits for you, your team and other researchers. Taken into account during development rather than as an afterthought, simple measures can avoid hours of head-scratching for others, and for yourself in a few years. Documentation is paramount here, even if it is a single comment per function, as was done in the R4 (reusable) version.
Finally, there is the belief that an article should suffice by itself: the descriptions of the algorithms present in the paper should suffice to re-obtain (to replicate) the published results. For well-written papers that precisely dissociate conceptually significant aspects from irrelevant implementation details, that may be. But scientific practice should not assume the best of cases. Science assumes that errors can crop up everywhere. Every paper is a mistake or a forgotten parameter away from irreproducibility. Replication efforts use the paper first, and then the reproducible code that comes along with it whenever the paper falls short of being precise enough to be reimplemented.
In conclusion, the R3 (reproducible) form should be accepted as the minimum scientific standard (Wilson et al., 2017). This means that reproducibility should actually be checked by reviewers and publishers when code is part of a work worth being published. But this is hardly the case today.
Compared to psychology or biology, the replication issues of computational works have reasonable and efficient solutions. But articles such as this one will not, by themselves, ensure that these solutions are adopted. Just like in other fields, we have to modify the incentives for researchers to publish by adopting requirements, enforced domain-wide, on what constitutes an acceptable scientific computational work.