iid2022: A Workshop on Statistical Methods for Event Data in Astronomy

We review the iid2022 workshop on statistical methods for event data in X-ray and $\gamma$-ray astronomy and high-energy astrophysics, held in Guntersville, AL, on Nov. 15-18, 2022. New methods for faint-source detection, spatial point processes, variability and spectral analysis, and machine learning are discussed, and ideas for future development of advanced methodology are shared.


Statistical Challenges Arising in High-Energy Astrophysics
The science analysis of data in high-energy astrophysics differs from most fields of astronomy in important ways. The data, typically from space-based observatories, consist of energetic photons counted individually as they arrive at a detector. These datasets can often be viewed in tabular form as a sequence of events with four characteristics: arrival time, location in two dimensions, and energy. The analysis commonly proceeds in stages: sources are identified in the two-dimensional image, photons are extracted for individual sources or emitting regions, and one-dimensional analysis proceeds for the energy distribution and arrival times. These univariate distributions are often complicated: multi-component spectral emission processes are convolved with instrumental sensitivity, and temporal processes can depend on unpredictable variations in accretion onto compact objects. Common analysis procedures include:
1. Individual photons are examined in the image plane, often smoothed with knowledge of the telescope point spread function;
2. Sparse samples of individual events from faint sources are modeled along the one-dimensional energy axis (spectra) or temporal axis (light curves);
3. Richer samples of events are grouped into bins along the spectral or temporal axis and then subjected to statistical or astrophysical modeling.
Table 2 summarizes important statistical procedures developed in the high-energy astrophysical community over the past half century. The accomplishments are impressive, but the impact on the research community is mixed. Some methods, such as the Lomb-Scargle periodogram, are widely used, although there may be insufficient appreciation of the challenges of estimating reliable False Alarm Probabilities [VanderPlas 2018]. But other valuable statistical procedures, such as upper limits for source existence and flux [Kashyap et al. 2010] and Bayesian estimates of faint-source hardness ratios [Park et al. 2006], are not commonly used.
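To make the False Alarm Probability issue concrete, the sketch below is a minimal pure-Python Lomb-Scargle periodogram whose peak significance is checked by a simple bootstrap: shuffling the fluxes destroys any periodic signal, so the fraction of shuffled datasets whose strongest peak anywhere in the frequency grid exceeds the observed peak is an empirical False Alarm Probability. The simulated light curve, frequency grid, and all function names are ours for illustration, not taken from the workshop.

```python
import math
import random

def lomb_scargle(t, y, freqs):
    """Classical Lomb-Scargle periodogram, normalized by the sample variance."""
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    yc = [v - mean for v in y]
    powers = []
    for f in freqs:
        w = 2.0 * math.pi * f
        # Phase offset tau makes the sine and cosine terms orthogonal (Scargle 1982)
        tau = math.atan2(sum(math.sin(2 * w * ti) for ti in t),
                         sum(math.cos(2 * w * ti) for ti in t)) / (2 * w)
        c = [math.cos(w * (ti - tau)) for ti in t]
        s = [math.sin(w * (ti - tau)) for ti in t]
        yc_cos = sum(yi * ci for yi, ci in zip(yc, c))
        yc_sin = sum(yi * si for yi, si in zip(yc, s))
        cc = sum(ci * ci for ci in c)
        ss = sum(si * si for si in s)
        powers.append(0.5 * (yc_cos ** 2 / cc + yc_sin ** 2 / ss) / var)
    return powers

def bootstrap_fap(t, y, freqs, peak_power, n_boot=100, seed=1):
    """Empirical False Alarm Probability: fraction of flux-shuffled datasets
    whose strongest peak anywhere in the grid reaches the observed peak."""
    rng = random.Random(seed)
    yb = list(y)
    exceed = 0
    for _ in range(n_boot):
        rng.shuffle(yb)
        if max(lomb_scargle(t, yb, freqs)) >= peak_power:
            exceed += 1
    return exceed / n_boot

# Hypothetical irregularly sampled light curve: sinusoid (period 2.0) plus noise
rng = random.Random(0)
t = sorted(rng.uniform(0.0, 20.0) for _ in range(50))
y = [math.sin(2 * math.pi * ti / 2.0) + 0.3 * rng.gauss(0.0, 1.0) for ti in t]
freqs = [0.05 + 0.01 * k for k in range(80)]  # 0.05 to 0.84 cycles per time unit
power = lomb_scargle(t, y, freqs)
best = max(range(len(freqs)), key=lambda i: power[i])
fap = bootstrap_fap(t, y, freqs, power[best])
```

The bootstrap is only one of several FAP estimators discussed by VanderPlas (2018); it is simple but assumes exchangeable fluxes, an assumption that fails for red-noise light curves.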
Many have heeded the warning that likelihood ratio tests should not be used near the boundary of parameter values [Protassov et al. 2002], but there is inadequate recognition that likelihood ratios should be penalized for model complexity, as with the Bayesian Information Criterion.
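The penalty can be illustrated with a toy comparison of our own devising (the counts and the "line" placement are made up): adding a spurious emission-line component to a flat Poisson model always raises the likelihood, but the BIC penalty $k \ln n$ can still reject the extra parameter.

```python
import math

def poisson_loglike(counts, rates):
    """Poisson log-likelihood, dropping the model-independent log(n!) terms."""
    return sum(n * math.log(mu) - mu for n, mu in zip(counts, rates))

def bic(loglike, k, n):
    """Bayesian Information Criterion: -2 ln L plus a penalty of k ln n."""
    return -2.0 * loglike + k * math.log(n)

# Hypothetical binned counts from a flat background (~5 counts/bin), no real line
counts = [4, 6, 5, 3, 7, 5, 4, 6, 5, 5]
n = len(counts)

# Model 1: constant rate, one free parameter (MLE is the sample mean)
mu = sum(counts) / n
ll1 = poisson_loglike(counts, [mu] * n)
bic1 = bic(ll1, k=1, n=n)

# Model 2: constant rate plus a "line" confined to bin 4, two free parameters;
# the free line amplitude absorbs that bin's excess exactly
base = sum(c for i, c in enumerate(counts) if i != 4) / (n - 1)
rates2 = [base] * n
rates2[4] = float(counts[4])
ll2 = poisson_loglike(counts, rates2)
bic2 = bic(ll2, k=2, n=n)
# ll2 > ll1 always, yet bic1 < bic2: the simpler model is preferred
```

This is the BIC point only; near-boundary likelihood ratio calibration, the subject of Protassov et al. (2002), requires simulation-based methods such as posterior predictive p-values.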
There is also a general lack of awareness within the astronomical community of basic methods that are common in other fields. For example, multiple regression for count data [Cameron and Trivedi 2013 (11K citations)] is used extensively in econometrics and other areas, but astronomers often compare a response variable to single covariates in a sequential fashion. Aperiodic stochastic temporal behaviors (which might arise from accretion processes or magnetic activity) are analyzed using Fourier methods designed for periodic time series rather than with autoregressive modeling [Box et al. 2015 (63K citations)].
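As a concrete contrast with covariate-by-covariate comparison, the sketch below fits a log-linear Poisson regression with two covariates simultaneously by Newton-Raphson, the standard generalized linear model approach treated by Cameron and Trivedi. Everything here (data, coefficients, helper names) is a simulated example of ours, not a recipe from the cited text.

```python
import math
import random

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def poisson_regression(X, y, n_iter=25):
    """Log-linear Poisson regression E[y] = exp(X beta), fit by Newton-Raphson."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        mu = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]
        # Score vector X^T (y - mu) and Fisher information X^T diag(mu) X
        g = [sum((yi - mi) * row[j] for yi, mi, row in zip(y, mu, X))
             for j in range(p)]
        H = [[sum(mi * row[j] * row[k] for mi, row in zip(mu, X))
              for k in range(p)] for j in range(p)]
        beta = [b + s for b, s in zip(beta, solve(H, g))]
    return beta

def poisson_draw(rng, lam):
    """Knuth's method for simulating a Poisson variate."""
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod < limit:
            return k
        k += 1

# Hypothetical counts depending on two covariates simultaneously
rng = random.Random(42)
X = [[1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(500)]
true_beta = [1.0, 0.8, -0.5]
y = [poisson_draw(rng, math.exp(sum(b * x for b, x in zip(true_beta, row))))
     for row in X]
beta = poisson_regression(X, y)  # recovers all three coefficients in one fit
```

A sequential single-covariate analysis of the same data would conflate the two effects whenever the covariates are correlated; the joint fit estimates them together with proper uncertainties from the Fisher information.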

The iid2022 Workshop
These issues motivated the workshop iid2022: Statistical Methods for Event Data Illuminating the Dynamic Universe, held in Huntsville, Alabama, on November 15-18, 2022. The spirit of the workshop was to give participants an opportunity to review and learn about selected statistical methods, and also to make presentations based on their own research. Accordingly, the eight sessions had introductory talks by more senior scientists, followed by oral presentations by students and early-career scientists. The National Science Foundation provided support for twenty students and early-career scientists to attend the workshop, via a grant issued to the University of Alabama in Huntsville. Such support was essential to attract students who would not otherwise have had the opportunity to attend. Table 1 lists the presentations made at the workshop. The vast majority of attendees were astronomers, with a few notable exceptions such as Prof. Dale Zimmerman of the University of Iowa, who gave the keynote lecture, and biostatistics graduate student Jesus Vasquez of the University of North Carolina at Chapel Hill.

Past Accomplishments in Methodology
High-energy astronomy has its roots in the study of cosmic rays on mountaintops during the 1930s and the discovery of X-rays from the solar corona during the 1950s [Rossi 1948, Tousey et al. 1951]. The first detection of X-rays outside the Solar System involved a few thousand counts from the Galactic Plane obtained during a brief rocket flight [Giacconi et al. 1962]. Early analyses involved simple statistical procedures such as the running mean [Bowyer et al. 1964] or (mathematically incorrect) least squares procedures applied to Poisson-distributed data. The first use of the Poisson distribution to derive a cosmic source flux upper limit appears to be by Hearn (1968). As satellite observatories replaced sounding rockets, more specialized statistical procedures began to emerge, and their development accelerated in the early 21st century. Table 2 lists some of the important milestones classified by the scientific problem addressed. Some methods have had very broad impact with over a thousand citations by later studies. Altogether, the development and promulgation of analysis methods has been substantial and often quite successful.
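A Poisson flux upper limit of the kind Hearn introduced takes only a few lines today. The sketch below is our own minimal version (not Hearn's calculation): it finds the classical upper limit as the smallest mean rate for which observing the recorded number of counts, or fewer, would be improbable at the chosen confidence level.

```python
import math

def poisson_cdf(n, mu):
    """P(X <= n) for X ~ Poisson(mu), summed term by term."""
    term = math.exp(-mu)
    total = term
    for k in range(1, n + 1):
        term *= mu / k
        total += term
    return total

def upper_limit(n_obs, cl=0.95, step=1e-3):
    """Classical Poisson upper limit: the smallest mean rate mu for which
    observing n_obs or fewer counts has probability at most 1 - cl."""
    mu = 0.0
    while poisson_cdf(n_obs, mu) > 1.0 - cl:
        mu += step
    return mu

# With zero observed counts the 95% upper limit is -ln(0.05), about 3.0 counts;
# with three observed counts it is about 7.75 counts
```

The step-scan is crude but transparent; a root finder on the same condition gives the identical answer, and the Kashyap et al. (2010) discussion explains why such upper limits should be kept distinct from detection thresholds.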
In addition to procedures developed by practitioners within the field, methods for astronomy have been adopted from the wider arena of statistics. In early years, the textbook Data Reduction and Error Analysis for the Physical Sciences [Bevington 1969], promoting least squares procedures, had the greatest impact, not least because it included convenient Fortran codes that could be typed onto IBM cards and run on mainframe computers. It has largely been supplanted by more recent treatments [Bonamente 2022]. Bayesian inference has become an important tool for modeling astronomical data, as treated in texts like [Hilbe et al. 2017] and [Bailer-Jones 2017]. However, neither the classic works nor the newer volumes emphasize the low-count-rate problems encountered in high-energy astronomy. Some require a basic knowledge of probability and statistics, and this can limit their diffusion among astronomers, who often lack such courses in their undergraduate education.
Table 3 lists a few of the methods discussed in the iid2022 workshop that are directly relevant to high-energy data and science analysis:
- Time domain methods for X-ray and gamma-ray astronomy [Feigelson et al. 2022]
- Matching Bayesian and frequentist coverage probabilities when using an approximate data covariance matrix [Percival et al. 2022]
- The denoised, deconvolved, and decomposed Fermi γ-ray sky: an application of the D$^3$PO algorithm [Selig et al. 2015]
- Studies in Astronomical Time Series Analysis. VI. Bayesian Block Representations [Scargle et al. 2013]
- Change-point Detection and Image Segmentation for Time Series of Astrophysical Images [Xu et al. 2021]
Software implementations are combined with methodologies to allow quick adoption. In some cases, such as Baddeley's book for analyzing Poisson images and the variability detection procedures discussed by Feigelson, the codes are already available in the general-purpose R statistical software environment.
In other cases, such as Scargle's Bayesian Blocks and Xu's multidimensional change-point analysis, codes are written specifically for use in X-ray and γ-ray astronomy.
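To give a flavor of how such change-point methods work, the toy sketch below is ours: a single change point found by exhaustive scan, rather than the full dynamic-programming recursion of Bayesian Blocks or the multidimensional machinery of Xu et al. It keeps the split of a binned count series that maximizes the two-segment Poisson likelihood.

```python
import math

def poisson_changepoint(counts):
    """Scan all split points of a binned count series and return the split
    that maximizes the two-segment Poisson log-likelihood."""
    def seg_loglike(seg):
        # Profile log-likelihood of a constant-rate segment: N ln(N/M) - N
        N, M = sum(seg), len(seg)
        return N * math.log(N / M) - N if N > 0 else 0.0
    best_split, best_ll = None, float("-inf")
    for i in range(1, len(counts)):
        ll = seg_loglike(counts[:i]) + seg_loglike(counts[i:])
        if ll > best_ll:
            best_split, best_ll = i, ll
    return best_split, best_ll

# Hypothetical flare: the rate jumps from ~2 to ~10 counts per bin at bin 8
counts = [2, 1, 3, 2, 2, 1, 2, 3, 11, 9, 10, 12, 8, 11, 10, 9]
split, ll = poisson_changepoint(counts)  # the scan recovers the jump at bin 8
```

Bayesian Blocks generalizes exactly this segment likelihood, applying it recursively with a prior on the number of blocks so that an arbitrary number of change points can be found in O(N^2) time.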

Looking Towards the Future
Presentations at the iid2022 workshop demonstrate that the development of innovative procedures for analyzing high-energy astronomical data is proceeding in a vibrant fashion. But there are considerable difficulties in promulgating new methodology within the research community. We outline here challenges that can be readily identified and suggest directions for improvement in the coming years.
Statistics Education One of the main needs in high-energy astronomy is a more rounded background in statistics for its practitioners. Most graduate programs in astronomy or astrophysics require no statistics courses; training is often limited to a course on 'data analysis methods' that lacks a foundation in statistical principles. Astronomers should be familiar with the differences between nonparametric hypothesis testing and parametric modeling, Poisson and Gaussian distributions, least squares and likelihood-based modeling, and stationary and nonstationary processes. Wavelet transforms, local regression, autoregressive models, and Fourier approaches to time series analysis should be taught.
As both authors and teachers, it is our opinion that the typical high-energy data analyst should have a background that includes at least one undergraduate course using a statistics textbook such as Probability and Statistical Inference [Hogg, Tanis and Zimmerman 2023]. Such a background would help analysts understand in detail the main statistical methods available, while giving them the basic tools to undertake more complex tasks such as developing new statistical methods. At the graduate level, a course in methodology using textbooks like Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data [Ivezić et al. 2019] and Modern Statistical Methods for Astronomy with R Applications [Feigelson and Babu 2012] should be widely available in astronomy departments.
Integrate statistics into high-energy mission projects High-energy astrophysics missions have traditionally included costs for 'software development' to write pipelines for processing telemetry data through Level 1 and Level 2 data products. But it is also important to fund, at the early stages, the study of methods to be implemented in the pipeline and in off-line science analysis by individual scientists. Methods as simple as maximum-likelihood analysis of count data [Cash 1979] and as complex as information field theory for gamma-ray astronomy [Enßlin 2019] and four-dimensional change-point analysis [Xu et al. 2021] should be considered.
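The simplest of these, the Cash (1979) statistic, fits in a few lines. The sketch below is our own illustration with made-up counts; it writes the statistic in the form C = 2 Σ (m - n + n ln(n/m)), the data-offset variant (implemented as 'cstat' in XSPEC) that tends to chi-squared in the large-count limit, and fits a flat model amplitude by minimizing C over a grid.

```python
import math

def cash_statistic(counts, model):
    """C statistic for Poisson-distributed binned counts:
    C = 2 * sum(m - n + n*ln(n/m)). Zero when the model matches the data
    exactly; approaches chi-squared behavior for large counts."""
    c = 0.0
    for n, m in zip(counts, model):
        c += m - n
        if n > 0:
            c += n * math.log(n / m)
    return 2.0 * c

# Hypothetical binned counts fit with a flat model by minimizing C
counts = [3, 5, 4, 6, 2, 5]
grid = [2.0 + 0.05 * k for k in range(81)]  # trial amplitudes 2.0 to 6.0
best = min(grid, key=lambda a: cash_statistic(counts, [a] * len(counts)))
# best lands on the grid point nearest the sample mean, the Poisson MLE
```

In practice one would minimize C with a proper optimizer over a multi-parameter spectral model folded through the instrument response; the point here is only that the likelihood-based statistic, unlike chi-squared, remains valid for bins with very few counts.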
Centralized facilities like NASA's High Energy Astrophysics Science Archive Center and ESA's European Space Astronomy Centre should institute organized procedures to evaluate newer methodologies and bring them into their code libraries for use by the research communities. Some methods can be incorporated into important existing software tools such as XSPEC [Arnaud 1996] and SPEX [Kaastra 1996], while other methods would be stand-alone codes added to libraries such as HEASoft. Documentation and tutorials for training community scientists in methodology should accompany software releases.
Funding for methodology For two decades starting in 1990, NASA's Science Mission Directorate had an Applied Information Systems Research program that included development of statistical tools, machine learning procedures, computational methods, and algorithms for astronomical missions. But this program has changed focus, and there is now no avenue for the research community to obtain funds for the development of new methodology for high-energy astrophysics. A program is needed similar to the NASA Earth Science Division's Advanced Information Systems Technology Program, which includes development of advanced tools for data and science analysis. Several White Papers were submitted to the National Academies' Astro2020 Decadal Survey arguing for improved funding in astrostatistics and astroinformatics across all branches of the field.
Attitudes towards advances in methodology A major reason for the slow adoption of advanced, or even statistically acceptable, methods in high-energy astrophysics is the absence of any penalty for inaccurate or misleading analysis. This holds during review of mission plans, of individual observing proposals, and of the final published astrophysical literature. Incentives sometimes favor mundane analysis procedures: authors who present advanced statistical methods in an astrophysics paper might encounter a reviewer poorly prepared in statistics. The journals of the American Astronomical Society now have a Statistics Editor, and reviewers expert in statistical analysis can be sought in addition to a reviewer expert in the scientific topic; such a two-reviewer process is common for journals like Annals of Applied Statistics and Journal of Applied Statistics. The high-energy research community, which widely encourages improvements in telescope and detector capabilities, should also encourage improvements in data analysis capabilities that can increase the scientific return from any instrument or observing project.