Benchmarking Autonomous Scattering Experiments Illustrated on TAS

With the advancement of artificial intelligence and machine learning methods, autonomous approaches are recognized to have great potential for performing more efficient scattering experiments. In our view, it is crucial for such approaches to provide thorough evidence about respective performance improvements in order to increase acceptance within a scientific community. Therefore, we propose a benchmarking procedure designed as a cost-benefit analysis that is applicable to any scattering method sequentially collecting data during an experiment. For a given approach, the performance assessment is based on how much benefit, given a certain cost budget, it is able to acquire in predefined test cases. Different approaches thus get a chance for comparison and can make their advantages explicit and visible. Key components of the procedure, i.e., cost measures, benefit measures, and test cases, are made precise for the setting of three-axes spectrometry (TAS) as an illustration. Finally, we discuss neglected aspects and possible extensions for the TAS setting and comment on the procedure’s applicability to other scattering methods. A Python implementation of the procedure to simplify its utilization by interested researchers from the field is also provided.


INTRODUCTION
Scattering experiments have so far been carried out in a manual or semi-automated way, i.e., experimenters had to organize the measuring process and determine what and where to measure (next). With the rise of artificial intelligence and machine learning techniques, it is natural to ask whether there are autonomous approaches that allow performing experiments in a more efficient way. In other words, it is worthwhile to see if, for a fixed cost budget like experimental time available, autonomous approaches can perform "better" experiments. In the following, by "autonomous approach" and related phrases, we refer to a decision-making algorithm that is combined with an automated communication and analysis infrastructure to create a closed loop with an instrument control system and thus is, after initialization, able to conduct measurements without human intervention.
Indeed, autonomous approaches have recently been developed for scattering experiments (Noack et al., 2020; Durant et al., 2021a; Durant et al., 2021b; Maffettone et al., 2021; Noack et al., 2021). A way to compare and assess the performance of different approaches (manual as well as autonomous) is, however, currently lacking. From our perspective, benchmarking and measuring performance is of the utmost importance at this stage for the establishment and progress of autonomous approaches in the field of materials analysis by scattering methods. For example, similar efforts have already developed a benchmark for oxygen evolution reaction catalyst discovery (Rohr et al., 2020) or evaluated the performance of Bayesian optimization (Frazier and Wang, 2016) across several materials science domains (Liang et al., 2021).
In this work, we propose a benchmarking procedure which is designed as a cost-benefit analysis and can be applied to any scattering method sequentially collecting data of a certain quantity of interest during an experiment. Key components of the procedure that mainly drive the performance assessment are cost measures, benefit measures, and test cases. Cost measures specify the type of cost that is to be minimized and benefit measures characterize how "success" is defined. Test cases describe particular scenarios depending on the scattering method and determine which aspects the approaches are tested on. We emphasize that benchmarking processes in the general field of machine learning are potentially fragile (Dehghani et al., 2021) and therefore need to be defined carefully. After the general formulation of the procedure, all of the mentioned components are made precise for the setting of three-axes spectrometry (TAS).
TAS is an established technique for materials analysis by inelastic neutron scattering (Shirane et al., 2002). For decades, three-axes spectrometers have measured the dynamic properties of solids, e.g., phonons and magnetic excitations, for a wide range in both energy transfer (E) and momentum (Q) space and provide the opportunity to detect weak signals with high resolution. Compared to techniques like time-of-flight spectroscopy (TOF), which profit from large detector assemblies, the sequential, point-by-point measurements in TAS experiments come with the cost of moving the three instrument axes (around a monochromator, a sample, and an analyzer) individually in real space, which is rather slow (seconds to minutes). Furthermore, instead of operating classically with a single detector, recent developments attempt to use multi-detectors for higher data acquisition speed at the expense of reduced measuring flexibility (Kempa et al., 2006; Lim et al., 2015; Groitl et al., 2016; Groitl et al., 2017). However, as a first step, we concentrate on classical TAS experiments with a single detector here.
Semi-automated TAS experiments are often systematically organized on a grid in Q-E space with predefined steps. This procedure is frequently inefficient as it collects a substantial amount of measurement points in the background, i.e., areas with erroneous signals most often coming from undesired scattering events inside the sample itself, the sample environment, or components of the instrument. If autonomous approaches, however, were able to collect information on intensities in signal regions and their shape faster, they could enable a more efficient (manual or autonomous) assessment of the investigated system's behaviour.
The manuscript is divided into sections as follows. Section 2 gives fundamental definitions of the key components. The benchmarking procedure is specified in Section 3 and Section 4 contains an illustration of the key components in a TAS setting. Finally, we provide a discussion in Section 5 and a conclusion in Section 6.

DEFINITIONS
This section provides definitions for the central components of the benchmarking procedure (Section 3), i.e., for test cases, experiments, cost measures, and benefit measures. For TAS experiments, these notions are made precise in Section 4.
Definition (Test case). A test case t is a collection of details necessary to conduct a certain scattering experiment.
Here, a scattering experiment is viewed as a sequential collection of data points, the meaning of which depends on the particular scattering method. Recall that an $N$-tuple, $N \in \mathbb{N}$, is a finite ordered list of $N$ elements.
Definition (t-Experiment). Let $t$ be a test case. A $t$-experiment $A = A_t$ is an $N$-tuple of data points
$$A = (z_1, \dots, z_N),$$
where $z_j$ denote data points and $|A| := N \in \mathbb{N}$ is the number of data points.
In order to measure costs and benefits of experiments in the context of a certain test case, we need a formalization of cost and benefit measures.
Definition (Cost/Benefit measure). Let $t$ be a test case. Both a cost measure $c = c_t$ and a benefit measure $\mu = \mu_t$ are real-valued functions of $t$-experiments $A$.

BENCHMARKING PROCEDURE
In this section, we use the notions from Section 2 to formulate the main outcome of this manuscript and suggest a step-by-step procedure for benchmarking scattering experiments. The result of a benchmark is a collection of sequences with benefit values (cf. Table 1) that allows evaluating the performance of experiments in the context of predefined test cases.
3) Specify a benefit measure $\mu$ and define $\mu^{(\ell)} := \mu_{t^{(\ell)}}$.
Table 1: Each row, for a test case $t^{(\ell)}$, consists of benefit values (measured with $\mu = \mu^{(\ell)}$) that can be achieved using milestone values $C^{(\ell)}_m$ in column $m$ as cost budgets.

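For each test case $t^{(\ell)}$, perform the following steps:
i) Conduct a $t^{(\ell)}$-experiment $A^{(\ell)}$ that is large enough to reach the largest milestone value, i.e., with $c^{(\ell)} := c_{t^{(\ell)}}$,
$$c^{(\ell)}(A^{(\ell)}) \geq \max_m C^{(\ell)}_m. \tag{3}$$
ii) Let
$$A^{(\ell)}_J := (z^{(\ell)}_1, \dots, z^{(\ell)}_J)$$
be the collection of the first $J$ data points in $A^{(\ell)}$. In particular, $A^{(\ell)}_{|A^{(\ell)}|} = A^{(\ell)}$. Furthermore, let
$$J^{(\ell)}(C) := \max \{ J \in \mathbb{N} : c^{(\ell)}(A^{(\ell)}_J) \leq C \}$$
be the maximum number of first data points in $A^{(\ell)}$ possible with a cost budget $C \in \mathbb{R}$. Then, for each milestone value $C^{(\ell)}_m$, compute benefit values using the first $J^{(\ell)}_m := J^{(\ell)}(C^{(\ell)}_m)$ data points:
$$\mu^{(\ell)}_m := \mu^{(\ell)}\big(A^{(\ell)}_{J^{(\ell)}_m}\big).$$
Summarizing, a benchmark requires the specification of test cases, a cost measure, a benefit measure, and a sequence of ascending milestone values for each test case. The result is a collection of sequences with benefit values (cf. Table 1) that can be achieved using the milestone values as cost budgets.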
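The per-test-case steps can be sketched in a few lines of Python. This is only an illustration and not the implementation from the repository mentioned in Section 5.3: the names `max_points_within_budget`, `benefit_sequence`, `toy_cost`, and `toy_benefit`, as well as the data layout of an experiment, are assumptions made for this example. The sketch also assumes that the cost of the first J data points never decreases when J grows.

```python
from typing import Callable, List, Sequence, Tuple

DataPoint = Tuple[float, float]            # hypothetical layout: (location, value)
Experiment = Sequence[DataPoint]
Measure = Callable[[Experiment], float]    # a cost or a benefit measure


def max_points_within_budget(experiment: Experiment, cost: Measure, budget: float) -> int:
    """Largest J such that the first J data points cost at most `budget`.

    Assumes the cost of a prefix never decreases when a point is appended.
    """
    j = 0
    while j < len(experiment) and cost(experiment[: j + 1]) <= budget:
        j += 1
    return j


def benefit_sequence(
    experiment: Experiment, cost: Measure, benefit: Measure, milestones: Sequence[float]
) -> List[float]:
    """Benefit values reachable when each milestone value is used as cost budget."""
    if cost(experiment) < max(milestones):
        raise ValueError("experiment is too small for the largest milestone value")
    return [
        benefit(experiment[: max_points_within_budget(experiment, cost, c)])
        for c in milestones
    ]


# Toy measures (assumptions): each point costs 2 units, the error shrinks with size.
def toy_cost(a: Experiment) -> float:
    return 2.0 * len(a)


def toy_benefit(a: Experiment) -> float:
    return 1.0 / (1.0 + len(a))


experiment = [(0.0, 1.0), (1.0, 2.0), (2.0, 0.5), (3.0, 0.1)]
row = benefit_sequence(experiment, toy_cost, toy_benefit, milestones=[2.0, 5.0, 8.0])
print(row)  # one row of the benchmark result for this toy test case
```

Each call to `benefit_sequence` yields one row of the benchmark result; repeating it for every test case gives the full collection of sequences.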
Using their results, different approaches can now be compared by, for example, plotting the achieved benefit values against the corresponding cost budgets for each test case $t^{(\ell)}$ and each approach. Section 4.4 and Figure 1 provide a demonstration for a particular test case from the TAS context. Also, in the context of the specified cost and benefit measure, the results could support strong claims of the following kind: "Approach A leads to better experiments in p% of all test cases compared to approaches B, C, etc."
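A claim of this kind could be computed from two benchmark results as sketched below. The text does not fix how "better" is decided, so this sketch makes an assumption: approach A wins a test case if its benefit value at the last milestone is strictly better than that of approach B. The function name `win_rate`, the dictionary layout, and all numbers are hypothetical.

```python
from typing import Dict, List


def win_rate(
    results_a: Dict[str, List[float]],
    results_b: Dict[str, List[float]],
    lower_is_better: bool = True,
) -> float:
    """Percentage of shared test cases in which A beats B at the last milestone.

    Assumption: comparing only the final benefit value; ties do not count as wins.
    """
    cases = sorted(set(results_a) & set(results_b))
    if not cases:
        raise ValueError("no common test cases to compare")
    wins = 0
    for case in cases:
        a, b = results_a[case][-1], results_b[case][-1]
        a_is_better = a < b if lower_is_better else a > b
        if a_is_better:
            wins += 1
    return 100.0 * wins / len(cases)


# Hypothetical benefit sequences (here: errors, so lower is better).
results_a = {
    "case1": [0.8, 0.4, 0.1],
    "case2": [0.9, 0.7, 0.5],
    "case3": [0.6, 0.3, 0.2],
    "case4": [0.7, 0.5, 0.3],
}
results_b = {
    "case1": [0.9, 0.6, 0.3],
    "case2": [0.8, 0.6, 0.4],
    "case3": [0.7, 0.5, 0.25],
    "case4": [0.9, 0.6, 0.3],
}
p = win_rate(results_a, results_b)
print(f"Approach A is better in {p:.0f}% of the test cases")
```

Other comparison rules, for example comparing at every milestone or comparing the area under the benefit curve, are equally conceivable and would only change the body of the loop.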

ILLUSTRATION ON TAS
In this section, the general formulation of the benchmarking procedure and its components is made precise for TAS experiments.

TAS Setting
In TAS experiments, the concept of a three-dimensional reciprocal space $\mathcal{Q}$ is essential. For a given sample (or material), an element of $\mathcal{Q}$ is denoted by $q = (h, k, l)^\top$. An element of the energy transfer space $\mathcal{E}$ is denoted by $\omega$. Furthermore, we call $(q, \omega)^\top \in \mathbb{R}^r$, $r = 4$, the Q-E variables. For details, we refer to Shirane et al. (2002).
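Since TAS experiments are carried out only along $n \leq r$, $n \in \mathbb{N}$, predefined directions in Q-E space, we introduce experiment variables $x = (x_1, \dots, x_n)^\top \in \mathbb{R}^n$ and a corresponding affine transformation $T : \mathbb{R}^n \to \mathbb{R}^r$ to Q-E variables, i.e.,
$$T(x) := Wx + b \tag{8}$$
for a full rank matrix $W \in \mathbb{R}^{r \times n}$ and an offset $b \in \mathbb{R}^r$. It follows that
$$T^{-1}(q, \omega) = (W^\top W)^{-1} W^\top \big( (q, \omega)^\top - b \big)$$
for $(q, \omega)^\top \in \mathbb{R}^r$. Note that $T^{-1}(T(x)) = x$ for each $x \in \mathbb{R}^n$, but $T(T^{-1}(q, \omega)) = (q, \omega)^\top$ only for $(q, \omega)^\top \in T(\mathbb{R}^n)$. Each experiment variable $x_k$ only ranges between respective limits of investigation $x_k^{\pm} \in \mathbb{R}$, i.e., we have that
$$x_k^- \leq x_k \leq x_k^+, \quad k = 1, \dots, n.$$
The corresponding domain of interest $\mathcal{X} \subseteq \mathbb{R}^n$ is defined as
$$\mathcal{X} := [x_1^-, x_1^+] \times \dots \times [x_n^-, x_n^+]. \tag{11}$$
The representation of a TAS instrument is divided into two structures, an instrument configuration and an instrument.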
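The relation between experiment variables and Q-E variables can be illustrated numerically. The following sketch assumes NumPy and uses the Moore-Penrose pseudoinverse as the left inverse of the full rank matrix W; the function name `make_transformation` and the example values of W and b (a scan along h and the energy transfer at fixed k = 0 and l = 1) are hypothetical.

```python
import numpy as np


def make_transformation(W, b):
    """Return the affine map T(x) = W x + b and a left inverse T_inv.

    Assumption: W has full column rank, so its pseudoinverse is a left inverse.
    """
    W = np.asarray(W, dtype=float)
    b = np.asarray(b, dtype=float)
    if np.linalg.matrix_rank(W) != W.shape[1]:
        raise ValueError("W must have full column rank")
    W_pinv = np.linalg.pinv(W)

    def T(x):
        return W @ np.asarray(x, dtype=float) + b

    def T_inv(qe):
        return W_pinv @ (np.asarray(qe, dtype=float) - b)

    return T, T_inv


# Hypothetical scan: x_1 varies h, x_2 varies the energy transfer (n = 2, r = 4).
W = [[1.0, 0.0],
     [0.0, 0.0],
     [0.0, 0.0],
     [0.0, 1.0]]
b = [0.0, 0.0, 1.0, 0.0]
T, T_inv = make_transformation(W, b)

x = np.array([0.5, 3.0])
on_plane = T(x)                              # a Q-E point inside T(R^n)
off_plane = np.array([0.5, 0.2, 1.0, 3.0])   # a Q-E point outside T(R^n)

print(T_inv(on_plane))       # recovers x
print(T(T_inv(off_plane)))   # projects back onto the scan plane, not off_plane
```

The second print statement shows why the round trip through the inverse only reproduces Q-E points that lie in the image of T.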
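Definition (Instrument configuration). The tuple
$$\Pi = (\mathrm{sm}, \mathrm{sc}, \mathrm{se})$$
for a scan mode $\mathrm{sm} \in \{$"constant $k_i$", "constant $k_f$"$\}$, a scan constant ($k_i$ or $k_f$) $\mathrm{sc} > 0$, and a vector of scattering senses $\mathrm{se} = (\mathrm{se}_{\text{mono}}, \mathrm{se}_{\text{sample}}, \mathrm{se}_{\text{ana}})^\top \in \{-1, +1\}^3$ is called a (TAS) instrument configuration.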
To measure intensities at a certain point in Q-E space, a TAS instrument needs to steer its axes to six related angles. However, it is enough to regard only a subset of four angles due to dependencies.
The lattice parameters of the monochromator and the analyzer are necessary for a well-defined translation of points in Q-E space to associated angles of instrument axes. The so-called angle map
$$\Psi : \operatorname{dom}(\Psi) \to [0, \pi)^4 \tag{14}$$
is induced by a sample, its orientation, and an instrument Ins and its configuration $\Pi$. Note that angular velocities $v_k$ are related to angles $\Psi_k(q, \omega)$. The domain $\operatorname{dom}(\Psi) \subseteq \mathcal{Q} \times \mathcal{E}$ denotes the set of points in Q-E space for which $\Psi$ is well-defined, i.e., points that are reachable by the instrument Ins. Furthermore, we define
$$\mathcal{X}_p := \{ x \in \mathcal{X} : T(x) \in \operatorname{dom}(\Psi) \}$$
as the set of points in $\mathcal{X}$ such that
$$\Psi(x) := \Psi(T(x))$$
is well-defined for $x \in \mathcal{X}_p$. The limits of investigation for the domain of interest $\mathcal{X}$ have to be set such that $\Psi : \mathcal{X}_p \to [0, \pi)^4$ becomes an injective function.
For a given configuration $\Pi$ and instrument Ins, we can specify a resolution function (Shirane et al., 2002)
$$\varphi_{\Pi,\mathrm{Ins}} : T(\mathcal{X}_p) \times T(\mathcal{X}_p) \to [0, \infty),$$
i.e., each point in $T(\mathcal{X}_p) \subseteq \mathbb{R}^r$ gives an individual resolution function defined over $T(\mathcal{X}_p)$. Of course, the exact resolution function depends on additional parameters such as distances between instrument components or beam divergence allowed by collimators. However, these parameters are fixed during an experiment and hence can be omitted here. If $\Pi$ and Ins are known from the context, we write $\varphi = \varphi_{\Pi,\mathrm{Ins}}$. Also, for $x, x' \in \mathcal{X}_p$, we define
$$\varphi(x, x') := \varphi(T(x), T(x')).$$
A sample and its orientation induce a so-called scattering function $s : T(\mathcal{X}) \to [0, \infty)$. Again, for $x \in \mathcal{X}$, we define
$$s(x) := s(T(x)).$$
Given a resolution function $\varphi$, we get an associated intensity function $i : \mathcal{X}_p \to [0, \infty)$ defined by a convolution of $s$ with $\varphi$, i.e.,
$$i(x) := \int_{\mathcal{X}_p} s(x') \, \varphi(x, x') \, \mathrm{d}x'.$$
For benchmarking, we assume that $i$ can be exactly evaluated. Of course, this is not possible in real experiments due to background and statistical noise.
For our suggestion of a benefit measure (Section 4.3), we additionally need an intensity threshold τ > 0.
The experimental time is defined as the sum of the cumulative counting time at measurement points in Q-E space and the cumulative time for moving the instrument axes. For simplicity, we assume here that the single counting time, i.e., the counting time at a single measurement point, denoted by $T_{\text{count}} \geq 0$, is constant for each point. Now, we can specify the notion of a TAS test case and a TAS experiment.
Definition (TAS test case). A tuple
$$t = (\text{sample}, \text{orientation}, W, b, \mathcal{X}, \Pi, \mathrm{Ins}, \tau, T_{\text{count}}) \tag{21}$$
for a sample and its orientation, an affine transformation induced by a full rank matrix $W \in \mathbb{R}^{r \times n}$ and an offset $b \in \mathbb{R}^r$ (Eq. 8), a domain of interest $\mathcal{X} \subseteq \mathbb{R}^n$ (Eq. 11), an instrument configuration $\Pi$, an instrument Ins, an intensity threshold $\tau > 0$, and a single counting time $T_{\text{count}} \geq 0$ is called a TAS test case.
Note that a TAS test case $t$ induces an intensity function $i : \mathcal{X}_p \to [0, \infty)$.
In the context of a certain test case, a TAS experiment is defined as a collection of intensities $i(x)$ at locations $x \in \mathcal{X}_p$.

Definition (TAS t-Experiment). Let $t$ be a TAS test case. A TAS $t$-experiment $A = A_t$ is an $N$-tuple of location-intensity pairs
$$A = \big( (x_1, i_1), \dots, (x_N, i_N) \big),$$
where $x_j \in \mathcal{X}_p$ denote measurement locations and $i_j = i(x_j) \geq 0$ are corresponding values of the intensity function $i$ induced by $t$.

Cost Measure
We need to align our proposition of a cost measure with the limited experimental time, which is the critical quantity in a TAS experiment. Recall that the experimental time is defined as the sum of the cumulative counting time and the cumulative time for axes movement (Section 4.1).
For the cumulative counting time, we define
$$c_{\text{count}}(A) := N \cdot T_{\text{count}}, \quad N = |A|,$$
where $T_{\text{count}} \geq 0$ denotes the constant single counting time.
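The cost measure representing the cumulative time for moving the instrument axes is defined as
$$c_{\text{axes}}(A) := \sum_{j=1}^{N-1} d(x_j, x_{j+1})$$
for a metric $d : \mathcal{X}_p \times \mathcal{X}_p \to [0, \infty)$,
$$d(x, x') := \max_{k=1,\dots,4} \frac{|\Psi_k(x) - \Psi_k(x')|}{v_k},$$
where $\Psi = (\Psi_1, \dots, \Psi_4)^\top$ denotes the angle map from Eq. 14 and $v = (v_1, \dots, v_4)^\top$ is the vector of the instrument's angular velocities (Eq. 13). Note that the metric $d$ is fully determined by the TAS test case $t$. Also, it is indeed a metric (Encyclopedia of Mathematics, 1999) since the angle map $\Psi$ was chosen to be injective.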
The measures $c_{\text{count}}$ and $c_{\text{axes}}$ can be used either individually or additively to form a cost measure representing the entire experimental time. For the latter, we finally define
$$c_{\text{total}}(A) := c_{\text{count}}(A) + c_{\text{axes}}(A).$$
In our opinion, this cost measure is the most suitable to reflect time costs in a general TAS experiment.
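The following Python sketch shows how such time costs could be evaluated for a given sequence of measurement locations. It rests on two assumptions that should be checked against the actual definitions: all axes move simultaneously, so the time for one move is the maximum over the axes of angular distance divided by angular velocity, and the angle map is available as a plain function. The angle map `toy_angle_map` is not a physical model, and all names and numbers are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Location = Tuple[float, ...]
Experiment = Sequence[Tuple[Location, float]]   # location-intensity pairs
AngleMap = Callable[[Location], Sequence[float]]


def c_count(experiment: Experiment, t_count: float) -> float:
    """Cumulative counting time for a constant single counting time."""
    return len(experiment) * t_count


def move_time(angles_a: Sequence[float], angles_b: Sequence[float],
              velocities: Sequence[float]) -> float:
    """Time for one move, assuming all axes move at once (slowest axis decides)."""
    return max(abs(a - b) / v for a, b, v in zip(angles_a, angles_b, velocities))


def c_axes(experiment: Experiment, angle_map: AngleMap,
           velocities: Sequence[float]) -> float:
    """Cumulative time for moving the axes between consecutive locations."""
    angles: List[Sequence[float]] = [angle_map(x) for x, _ in experiment]
    return sum(move_time(p, q, velocities) for p, q in zip(angles, angles[1:]))


def c_total(experiment: Experiment, angle_map: AngleMap,
            velocities: Sequence[float], t_count: float) -> float:
    """Entire experimental time: counting plus axes movement."""
    return c_count(experiment, t_count) + c_axes(experiment, angle_map, velocities)


def toy_angle_map(x: Location) -> Sequence[float]:
    """Injective toy map to four 'angles' (assumption, no physical meaning)."""
    return (x[0], 2.0 * x[0], x[1], 0.5 * x[1])


velocities = (1.0, 2.0, 0.5, 1.0)          # hypothetical angular velocities
experiment = [((0.0, 0.0), 1.2), ((1.0, 0.0), 0.4), ((1.0, 1.0), 3.1)]

print(c_count(experiment, t_count=10.0))
print(c_axes(experiment, toy_angle_map, velocities))
print(c_total(experiment, toy_angle_map, velocities, t_count=10.0))
```

Instrument-specific details, such as extra time for changing the scan constant, could be added as further terms inside `move_time` without changing the rest of the sketch.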

Benefit Measure
We propose a benefit measure that quantifies a type of weighted $L^2$ approximation error between a benchmark intensity function $i = i_t$ and an approximation $\hat{i} = \hat{i}(A)$ resulting from an experiment $A = A_t$. For example, $\hat{i}$ can be constructed with location-intensity pairs from $A$ by linear interpolation or other approximation methods. To compare benefit values across experiments relating to different test cases, benefit measures should be "normalized", i.e., we regard relative errors.
Let us define
$$\mu(A) := \frac{\| i - \hat{i}(A) \|}{\| i \|},$$
where $\| \cdot \|$ is a norm that allows controlling the error measurement. Note that the notion of a "benefit" refers to an "error" in this case, i.e., reducing the benefit measure $\mu = \mu_t$ leads to an increase in benefit. A suitable error norm $\| \cdot \|$ needs to reflect that a TAS experimenter is more interested in regions of signal than in the background. This suggests that we use $i$ itself in a suitable definition. However, an important constraint is that signal regions with different intensities are weighted equally since they might be equally interesting. For this, we use the intensity threshold $\tau > 0$ from the TAS test case $t$ (Eq. 21) and define
$$i_\tau(x) := \min\{ i(x), \tau \}$$
for $x \in \mathcal{X}_p$, i.e., we cut $i$ to a maximum intensity value of $\tau$. As $i_\tau$ is a nonnegative function, its normalization is a probability density function and can be used for weighting. Finally, we set
$$\| \cdot \| := \| \cdot \|_{L^2(\mathcal{X}_p, \rho_{i,\tau})}, \quad \rho_{i,\tau}(x) := \frac{i_\tau(x)}{\int_{\mathcal{X}_p} i_\tau(x') \, \mathrm{d}x'},$$
where
$$\| h \|^2_{L^2(\mathcal{X}_p, \rho)} := \int_{\mathcal{X}_p} |h(x)|^2 \rho(x) \, \mathrm{d}x$$
for a function $h$ and a density function $\rho$. Note that $\| \cdot \|_{L^2(\mathcal{X}_p, \rho)}$ can be approximated by numerical quadrature rules or estimated by a Monte Carlo approach.
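A Monte Carlo estimate of such a relative weighted error can be sketched as follows. The example is a simplification under stated assumptions: the domain is a one-dimensional interval, the weight at a point is the intensity cut at the threshold, and the approximation is a piecewise linear interpolant of the measured points. Since the same weight appears in numerator and denominator, its normalization constant cancels and is not computed. The intensity function `peak` and all names and parameter values are hypothetical.

```python
import bisect
import math
import random
from typing import Callable, List, Tuple


def linear_interpolant(points: List[Tuple[float, float]]) -> Callable[[float], float]:
    """Piecewise linear approximation through (x, intensity) pairs (constant outside)."""
    pts = sorted(points)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]

    def i_hat(x: float) -> float:
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        k = bisect.bisect_right(xs, x)      # xs[k - 1] <= x < xs[k]
        x0, x1, y0, y1 = xs[k - 1], xs[k], ys[k - 1], ys[k]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return i_hat


def relative_weighted_error(i, i_hat, lower, upper, tau, n_samples=20000, seed=0):
    """Monte Carlo estimate of ||i - i_hat|| / ||i|| with weight min(i, tau)."""
    rng = random.Random(seed)
    numerator = denominator = 0.0
    for _ in range(n_samples):
        x = rng.uniform(lower, upper)
        weight = min(i(x), tau)             # thresholded intensity as weight
        numerator += weight * (i(x) - i_hat(x)) ** 2
        denominator += weight * i(x) ** 2
    return math.sqrt(numerator / denominator)


def peak(x: float) -> float:
    """Hypothetical intensity: a Gaussian signal on a small constant offset."""
    return 10.0 * math.exp(-0.5 * ((x - 2.0) / 0.3) ** 2) + 0.05


coarse = [(x, peak(x)) for x in (0.0, 2.0, 4.0)]
dense = [(0.1 * k, peak(0.1 * k)) for k in range(41)]

err_coarse = relative_weighted_error(peak, linear_interpolant(coarse), 0.0, 4.0, tau=1.0)
err_dense = relative_weighted_error(peak, linear_interpolant(dense), 0.0, 4.0, tau=1.0)
print(err_coarse, err_dense)
```

In this toy setting, the densely sampled experiment yields a smaller error than the three-point experiment, which is the behavior a benefit measure of this type is meant to reward.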

Test Cases
A useful set of test cases represents the variety of different scenarios that can occur in a TAS experiment. It is the set of intensity functions that is particularly important here. Therefore, we suggest creating test cases such that the corresponding induced intensity functions are composed of one or more of the following structures known in the field:
• non-dispersive structures (e.g., crystal field excitations),
• dispersive structures (e.g., spin waves or acoustic phonons),
• (pseudo-)continua (e.g., spinons or excitations in frustrated magnets).
Particular intensity functions can be created by mathematical expressions, physical simulations, or experimental data.
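As a minimal illustration of an intensity function created by a mathematical expression, the sketch below combines a dispersive and a non-dispersive structure over two experiment variables (one momentum direction q and the energy transfer). The sinusoidal dispersion, the Gaussian line shape, and all parameter values are assumptions chosen for readability; they do not correspond to a particular material, and no resolution function is applied.

```python
import math


def dispersive_mode(q: float, omega: float, amplitude: float = 1.0,
                    bandwidth: float = 10.0, width: float = 0.5) -> float:
    """Gaussian ridge along a hypothetical dispersion omega(q) = bandwidth * |sin(pi q)|."""
    center = bandwidth * abs(math.sin(math.pi * q))
    return amplitude * math.exp(-0.5 * ((omega - center) / width) ** 2)


def flat_mode(q: float, omega: float, amplitude: float = 0.5,
              energy: float = 15.0, width: float = 0.5) -> float:
    """Non-dispersive structure: the same Gaussian line at every q."""
    return amplitude * math.exp(-0.5 * ((omega - energy) / width) ** 2)


def intensity(q: float, omega: float) -> float:
    """Toy intensity function composed of both structures."""
    return dispersive_mode(q, omega) + flat_mode(q, omega)


# Evaluate on a coarse grid, as a grid-based (semi-automated) scan would do.
grid = [(0.1 * a, 1.0 * e) for a in range(11) for e in range(21)]
in_signal = [point for point in grid if intensity(*point) > 0.1]
print(f"{len(in_signal)} of {len(grid)} grid points lie in signal regions")
```

Counting how few grid points fall into signal regions makes the inefficiency of grid scans mentioned in the introduction tangible for a given test case.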
As an example of an outcome of our benchmarking procedure, Figure 1 displays a comparative benchmark result of two approaches for a test case including an intensity function that reflects a soft mode of the ferroelectric phase transition at ∼60 K of a transverse optical phonon measured on SnTe (Weber and Heid, 2021).
This figure contains exemplifications of each benchmark component: experimental time (x-axis in Figure 1C) as cost measure (Section 4.2), approximation error (y-axis in Figure 1C) as benefit measure (Section 4.3), a phonon-type intensity function (Figures 1A,B) as part of the test case (Section 4.4), and milestone values (ticks on x-axis in Figure 1C).

DISCUSSION
In this section, we discuss some open aspects of the TAS setting from Section 4 and the applicability of the benchmarking procedure from Section 3 to other scattering methods. Finally, we refer to a repository containing a software implementation.

Neglected Aspects and Possible Extensions for TAS
The TAS setting does not comprise each detail occurring in a real experiment. Indeed, we neglected some aspects which we, however, see as acceptable deviations. Certainly the most prominent neglected aspect is background and statistical noise. We find it difficult to include them in the benchmarking procedure due to their non-deterministic nature since a comparison of approaches needs to be done in a deterministic setting. The benefit measure from Section 4.3, for instance, makes use of a "true" benchmark intensity function which would not be available in the presence of background or statistical noise.
We assumed that the single counting time is constant for each measurement point. However, since single counting times are determined by the physics of the sample and the requested statistics in real experiments, they might vary at different measurement locations. In our opinion, this variation is negligible in a first step, but can be taken into account in more complex benchmarking setups if desired.
Next, the cost measures from Section 4.2 are rather general and can be applied to any TAS instrument. They can, however, be arbitrarily extended by more details if benchmarking is to be done in the context of a specific instrument. For example, real TAS instruments may differ in more [PANDA] or less [ThALES (Boehm et al., 2015)] time-consuming procedures for moving $k_i$ values.
Furthermore, the intensity functions induced by the test cases described in Section 4.4 are derived in the context of fixed environmental parameters of the sample (such as temperature, external magnetic or electric field, etc.). Future work might extend the four-dimensional Q-E space with these parameters, i.e., r > 4, to allow for more complex intensity functions that would, however, require adjusted cost measures.
Finally, to compute all benefit values for Table 1, the benchmarking procedure requires an approach to perform an experiment that is large enough (Eq. 3). Hence, an autonomous stopping criterion is not tested although we consider it a crucial part of a fully autonomous approach.

Applicability to Other Scattering Methods
The benchmarking procedure is formulated in a modular way and depends only on abstract components like cost measures, benefit measures, and test cases. For TAS experiments, we specified these components in Section 4, but feel that the overall procedure is also applicable for scattering methods other than TAS such as diffraction, reflectivity, SAS/GISAS, or TOF. Indeed, experimenters for each of these methods measure costs and benefits in their own way and investigate different kinds of intensity functions.
Diffraction experiments (Sivia, 2011), for example, may have a setting similar to TAS experiments as the experimenter is interested in intensities defined over Q variables and many diffractometers need to move their components (sample, detector). Therefore, the cost and benefit measures presented above might also be useful in this context. The set of test cases, however, would need to be composed differently since regions of signal are mainly separated and small in shape. Also, additional aspects (such as shadowing, overlapping reflections, missing knowledge of symmetry, etc.) are to be taken into account.

Data Repository
Since the benchmarking procedure has an algorithmic structure, we decided to provide an implementation in the form of Python code that computes sequences of benefit values for given cost and benefit measures (cf. Table 1). Also, benchmark components for the TAS setting are already implemented. The repository along with instructions on how to run the code is publicly available. It also contains descriptions of test cases that can be complemented in the future.

CONCLUSION
In this manuscript, we have developed a benchmarking procedure for scattering experiments which is designed as a cost-benefit analysis and based on key components like cost measures, benefit measures, and test cases.
Although we have provided first suggestions for all these components in a TAS setting, the process of finding a suitable benchmark setting for the scattering community in general as well as the TAS community in particular is certainly not finished.
As an outlook for the TAS community, a useful next step could be the inclusion of non-constant single counting times since it has the potential of further savings of experimental time that we do not account for in the current setting. Also, extending the Q-E variables with environmental parameters that were assumed to be fixed would lead to a more comprehensive setting. Finally, we see the contribution to the dynamical set of test cases as another future task for the community.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://jugit.fzjuelich.de/ainx/base.

AUTHOR CONTRIBUTIONS
MG and MTP contributed to the conception and design of the study. All authors took over the implementation of the study. MTP wrote the first draft of the manuscript. CF, MG, MN, AS, and MTP wrote paragraphs of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version. GB and MTP wrote source code for implementing the benchmarking procedure.

FUNDING
MG and MTP received support through the project Artificial Intelligence for Neutron and X-Ray Scattering (AINX) funded by the Helmholtz AI unit of the German Helmholtz Association. MN is funded through the Center for Advanced Mathematics for Energy Research Applications (CAMERA), which is jointly funded by the Advanced Scientific Computing Research (ASCR) and Basic Energy Sciences (BES) within the Department of Energy's Office of Science, under Contract No. DE-AC02-05CH11231.