About this Research Topic
For non-critical applications, approaches based on, e.g., neural networks comprising multiple high-dimensional layers of hidden nodes can yield valuable outcomes even in scenarios of high scientific complexity. Replacing time-consuming operations with quicker, simpler and, most importantly, automatic AI strategies based on Machine Learning (ML) from BIG DATA may therefore be tempting from a practitioner's perspective.
Nonetheless, despite extensive R&D efforts, AI still exhibits severe drawbacks, primarily concerning its black-box nature: most AI models are not readily interpretable and cannot provide insights into the causalities influencing the systems under study, which ultimately jeopardises their reliability, robustness and validity.
These limitations can be alleviated by integrating techniques already established in the field of chemometrics into a framework for the hybrid multivariate modelling of high-dimensional data.
Nowadays, multi-response instruments like optical spectrometers, hyphenated chromatographs or hyperspectral imagers are commonly used for characterising samples, objects or scenes affected by known and unknown phenomena that need to be identified and described for a comprehensive understanding of chemical, biological and biochemical issues.
On the one hand, traditional complex modelling (i.e., theory-driven mechanistic mathematical modelling via, for example, well-understood differential equations) may lead to a dysfunctional over-simplification of reality: it can only be utilised for extracting information about known or expected events, while unmodelled interferences might create serious alias errors.
On the other hand, pure soft modelling (data-driven ML modelling, typical of today's AI) can account for both known and unknown causalities (being, in principle, effective for handling the quantitative BIG measurements returned by modern instrumentation), but only as long as these are already observable in the analysed recordings. Usually, it does not allow a priori knowledge to be directly exploited in the analysis stage, it requires very informative (not just abundant) training data, and it yields solutions that are difficult to explain.
Fusing these two strategies would enable what we have called hybrid multivariate modelling above: a low-complexity modelling scheme guaranteeing interpretability and explainability (a minimal illustrative sketch follows the list below). Three aspects justify the need for such a combined scheme:
1) Modern instrumental platforms are usually very informative, but the raw data they produce have serious selectivity problems because they are influenced by several sources of variability, wanted or unwanted, at the same time;
2) These systematic variabilities are almost always induced by various laws of nature, whether we know them or not. Variation patterns due to known phenomena can be handled by explicit theory-driven modelling, whereas data-driven ML can deal with variation patterns associated with unknown phenomena.
3) In many domains, the number of these expected and unexpected phenomena or events is limited. This reduces both the number of known causalities to be described and the number of unknown ones whose effects need to be discovered from the collected measurements.
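To make the idea concrete, here is a minimal, hypothetical Python sketch of such a hybrid scheme. The specifics are our own illustrative assumptions, not part of this Topic's scope: a first-order kinetic law stands in for the "hard" theory-driven part, PCA for the "soft" data-driven part, and all data are simulated for the example.

```python
# Minimal sketch of hybrid multivariate modelling. Illustrative assumptions:
# a first-order kinetic law as the "known" mechanism and PCA as the
# data-driven stage; neither is prescribed by this Research Topic.
import numpy as np
from scipy.optimize import curve_fit
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50)                      # time axis (arbitrary units)

def first_order(t, c0, k):
    """Theory-driven part: known first-order decay c(t) = c0 * exp(-k t)."""
    return c0 * np.exp(-k * t)

# Simulated measurements: known kinetics plus an *unknown* systematic drift
drift = 0.05 * np.sin(0.5 * t)                  # unmodelled interference
Y = np.array([first_order(t, c0, 0.4) + a * drift
              + rng.normal(0, 0.005, t.size)
              for c0, a in zip(rng.uniform(0.8, 1.2, 30),
                               rng.uniform(0.5, 1.5, 30))])

# 1) Hard (deductive) stage: fit the known law to each measured profile
residuals = np.empty_like(Y)
for i, y in enumerate(Y):
    popt, _ = curve_fit(first_order, t, y, p0=(1.0, 0.5))
    residuals[i] = y - first_order(t, *popt)    # what theory cannot explain

# 2) Soft (inductive) stage: PCA reveals systematic structure left behind
pca = PCA(n_components=2).fit(residuals)
print("Variance of residuals captured by PC1:",
      round(pca.explained_variance_ratio_[0], 2))
# A dominant, structured PC1 loading flags an unknown but systematic
# phenomenon that a purely mechanistic model would alias into its parameters.
```

The point of the two-stage design is that the mechanistic fit absorbs the known causality first, so any systematic structure the data-driven stage finds in the residuals can be attributed to genuinely unknown phenomena rather than to aliasing of the known ones.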
With this article collection, we therefore aim to stimulate the development of human-interpretable machine learning with an eye for causalities: an eXplainable AI (XAI) that fully merges "hard" deduction and "soft" induction by combining prior theoretical knowledge and experience with statistical pattern recognition in massive measurement streams. This will provide a more rational ground for simpler, cheaper, more reliable and more understandable data science solutions.
Here is a tentative list of chemometric tools of potential relevance for this purpose:
• Cost-effective design of experiments for the smart selection of model training objects, ensuring that all expected sources of variation are systematically explored. This should be supplemented with random but representative and relevant sampling of sufficient statistical power to also reveal unexpected variation types;
• Disentanglement of known and unknown events influencing the variation of quantitative BIG DATA by fast and robust methods for the assessment of overwhelming streams of multichannel measurements:
o Quantitative BIG DATA simplification by knowledge-driven pre-processing – response linearisation according to known physicochemical laws, followed by quantification and compensation of expected physical, chemical and instrumental variability patterns (these steps are illustrated in the sketch after this list);
o Comprehensive discovery, characterisation and quantification of unexpected but systematic variability patterns;
o Detection and identification of non-random anomalies in the registered data.
• Alias error reduction by hybrid model post-processing;
• Low-dimensional subspace modelling of high-dimensional data for improved graphical exploration;
• Critical and creative model assessment by statistical validation and visual inspection.
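As a concrete, hypothetical illustration of several of the bullet points above (knowledge-driven linearisation, compensation of expected variability, subspace modelling and anomaly detection), here is a minimal Python sketch. Its specifics (the Beer-Lambert law as the linearising transform, mean-centring as the compensation step, PCA with Q residuals for anomaly flagging, and the simulated spectra) are assumptions chosen for brevity, not prescriptions:

```python
# Illustrative pipeline sketch; every modelling choice below is an assumption
# made for this example, not a recommendation of this Research Topic.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
wavelengths = np.linspace(400, 700, 200)        # nm

# Simulated transmittance spectra: two Gaussian absorbers, one spiked sample
def band(center, width):
    return np.exp(-0.5 * ((wavelengths - center) / width) ** 2)

A_true = (rng.uniform(0.2, 1.0, (40, 1)) * band(500, 30)
          + rng.uniform(0.1, 0.6, (40, 1)) * band(620, 25))
A_true[7] += 0.4 * band(450, 10)                # non-random anomaly
T = 10.0 ** (-A_true) * (1 + rng.normal(0, 0.002, A_true.shape))

# 1) Knowledge-driven linearisation (Beer-Lambert): absorbance = -log10(T)
A = -np.log10(np.clip(T, 1e-6, None))

# 2) Compensation of expected variability (here simply mean-centring)
Ac = A - A.mean(axis=0)

# 3) Low-dimensional subspace modelling for graphical exploration
pca = PCA(n_components=2).fit(Ac)
scores = pca.transform(Ac)                      # plot these for exploration

# 4) Anomaly detection via Q residuals (distance to the model subspace)
Q = ((Ac - pca.inverse_transform(scores)) ** 2).sum(axis=1)
print("Most anomalous sample:", int(np.argmax(Q)))  # should flag sample 7
```

In practice, mean-centring would be replaced by a physically motivated correction (e.g., scatter correction for spectra), and the Q threshold would be set by statistical validation rather than by inspection; the sketch only shows how the listed steps chain together.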
We hereby call on colleagues in chemometrics and other multivariate data modelling fields to contribute research articles on chemometric methods that can enhance contemporary AI and overcome some of its shortcomings. We also welcome comparisons of AI and chemometric approaches, as well as demonstrations of methods that chemometrics could and should "borrow" from the AI community. The focus is on BIG DATA generated by multichannel or imaging instrumentation.
Keywords: Chemometrics, AI, Big Data, Machine learning, Computer Science, Mathematical modelling
Important Note: All contributions to this Research Topic must be within the scope of the section and journal to which they are submitted, as defined in their mission statements. Frontiers reserves the right to guide an out-of-scope manuscript to a more suitable section or journal at any stage of peer review.