This article was submitted to Computational Methods in Chemical Engineering, a section of the journal Frontiers in Chemical Engineering
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
The number of sensors in the process industry is continuously increasing as they become faster, better and cheaper. Due to the rising amount of available data, its processing has to be automated in a computationally efficient manner. Such a solution should also be easy to implement and reproducible independently of the details of the application domain. This paper provides a suitable and versatile usable infrastructure that deals with Big Data in the process industry on various platforms, using efficient, fast and modern technologies for data gathering, processing, storing and visualization. Contrary to prior work, we provide an easy-to-use, easily reproducible, adaptable and configurable Big Data management solution with a detailed implementation description that does not require expert or domain-specific knowledge. In addition to the infrastructure implementation, we focus on monitoring both infrastructure inputs and outputs, including incoming process data as well as model predictions and performances, thus allowing for early interventions and actions if problems occur.
It has recently been recognized that machine learning and data analytics play a critical role in realizing long-term sustainability goals in the process industry. The European Union's 2030 climate and energy framework proposed rather ambitious key targets in the areas of emission reduction, renewable energy and energy efficiency.
Indeed, industrial digitization is expanding at a relatively fast pace, given that sensors are getting faster, better and cheaper.
As will be discussed in detail in the forthcoming related work section, several researchers have emphasized the need for Big Data analytics in the process industry and cyber-physical systems.
Indeed, configuring and deploying a Big Data infrastructure requires prior expertise and familiarity with these technologies. For this reason, our goal in this paper is to present a versatile usable Big Data infrastructure that can be reproduced and adapted without such expert knowledge.
This research has been conducted under the European Union's Horizon 2020 project COGNIPLANT (grant agreement No. 869931).
In the remainder of this section, we present related work and the main contributions of this paper.
Prior work has repeatedly emphasized the challenges of designing Big Data architectures for industrial settings.
The large variety of data volumes and types, as well as the different needs of each application domain, may lead to different Big Data solution architectures. Even for the same application domain, there might be several alternative solution concepts and services that could be used. For example, one may use a mixture of technologies, ranging from NoSQL databases like Cassandra or HBase, over data preparation utilities like Paxata, to distributed parallel computing systems like Hadoop and Spark.
Indeed, there is a large body of literature on the design of Big Data management and analysis solutions.
Further relevant work has been conducted specifically in the context of the process industry.
Summarizing, one common characteristic of the aforementioned literature is that implementation aspects and details of the proposed technologies and their orchestration are not discussed in depth. As a consequence, expert/domain knowledge is still required for implementing such Big Data solutions, rendering this an extremely challenging task for non-experts.
As discussed in the previous section, the current literature offers a broad range of Big Data solutions, also within the context of the process industry. However, they either focus on one part of the infrastructure without offering complete infrastructure compositions or, even if they do, expert knowledge is still required for implementing the proposed solutions. For this reason, our goal in this paper is to provide a unified and comprehensive Big Data Management Schema for tackling a broad range of machine learning tasks in industrial processes with possibly heterogeneous data sources. A detailed presentation and discussion of the selected technologies and their configuration/orchestration is also provided, thus significantly reducing the level of expert knowledge required. This way, the presented schema can easily be reproduced and adapted to various use-cases. In particular, our contributions can be summarized as follows:
1. We provide a detailed insight into state-of-the-art technologies for building a Big Data infrastructure;
2. We propose and describe the structure of AVUBDI, A Versatile Usable Big Data Infrastructure;
3. We use open-source tools for the structure of AVUBDI and enable a user-friendly environment for non-experts as well as experts;
4. We describe in detail the data flow and types within the proposed AVUBDI;
5. We describe in detail the monitoring approaches for industrial plants using AVUBDI.
In this section, we first discuss already existing approaches and state-of-the-art technologies for Big Data management. We also provide a comparative analysis of these methodologies with respect to their versatile usability. This presentation is organized into four topics, namely data management, storage, analysis and visualization. In the second part of this section, we present the structure and implementation of the proposed versatile usable Big Data infrastructure.
This subsection covers various state-of-the-art technologies needed for the implementation of a Big Data infrastructure, ranging from data management to visualization tools. As multiple technologies exist for the various parts of the infrastructure, their advantages and disadvantages are discussed with regard to our setting of a versatile usable Big Data infrastructure in the process industry.
As it is difficult to deal with high amounts of data in various and varying formats in traditional database systems, data lakes are often used instead. Data lakes are more complex to handle, as they may store unprocessed data in its raw format as well as unstructured data.
The management of data is crucial in Big Data infrastructures, as it is not feasible to process Big Data in one single application due to time or computational restrictions. Instead, it is reasonable to create various processing steps that exchange messages with the needed data in order to avoid bottlenecks, increase performance and provide the possibility for extensions and flexible adaptations. We refer to data management in our setting as 1) the gathering of data from specific sources, 2) the message routing within the infrastructure and 3) the data or message format specifications for further processing steps. Especially for the process industry, it is important to create a flexible data management, as process paths and/or information are likely to differ over time due to changing requirements.
Apache Kafka is a distributed, fault-tolerant publish-subscribe messaging system that routes messages between producers and consumers via topics. Its high throughput and low latency make it a widely used backbone for message routing in Big Data pipelines.
The next important step, after the definition of a message routing tool, is the assurance of an appropriate data format that is passed between the different processing steps. It is necessary to check the message content, as changing environments in industrial settings may lead to changing content, e.g., new features are added, and existing processing steps may have problems handling the new data format. For this reason, Confluent developed a tool called Schema Registry, which manages message schemas and is directly connected with Apache Kafka for the surveillance of schema evolution. It provides a serving layer for metadata, offers a RESTful interface for storing and retrieving Avro®, JSON Schema, and Protobuf schemas, and is able to hold the whole infrastructure in a consistent state.
In most use-cases, data has to be stored either temporarily or permanently in a database. The stored data covers a broad variety, including data that is later used for further processing, historical data, and results from prediction models. Various database approaches exist for different use-case requirements. Especially in the field of Big Data, it is necessary to select the most appropriate technology, given that several database technologies are excluded due to their poor performance with respect to capacity and access speed. Three popular candidates in this context are Cassandra, InfluxDB and CrateDB.
These three databases have overall very strong advantages that address the challenges met in Big Data applications rather well. Choosing the best one for an industrial application may not be a simple task. On the one hand, Cassandra is easily scalable and can be distributed across multiple servers, but it lacks the flexible schemas that InfluxDB offers. CrateDB incorporates both the distributed approach and flexible schemas, but it still has several issues regarding its implementation and its throughput.
Another important topic during the development of a Big Data infrastructure is the selection of the most appropriate data processing and analysis tool. The integrated tool has to be able to deal with large amounts of data in a near real-time manner and provide fast answers regarding analytical evaluations or predictions. This is especially important in industrial settings, so that downtimes of machines, poor product quality and emissions during production are minimized. Fulfilling such goals enables industries to improve their performance indicators.
Popular tools for data processing and analytical evaluations are Apache Storm, Apache Spark, Apache Flink, Apache Samza and Apache Drill.
Apache Storm is a real-time distributed processing system, which can process streams of data fast while still providing easy usage. It is highly scalable, offers low latency with guaranteed data processing and allows developers to implement their logic in virtually any programming language.
Apache Spark is a next-generation engine for Big Data analytics and alleviates key challenges of data preprocessing, iterative algorithms and interactive analytics, among others. The data can be processed through a general directed acyclic graph of operators using rich sets of transformations and actions. Apache Spark supports a variety of transformations and eases data preprocessing, especially for Big Data. Furthermore, Apache Spark provides an adapted library of machine learning algorithms for faster performance, the so-called MLLib.
Apache Flink is an open-source system for processing streaming and batch data, where real-time analytics are also supported. It includes continuous data pipelines, historic data processing, a.k.a. batch processing, and fault-tolerant dataflow pipelines. Similar to Apache Spark, Apache Flink also provides its own high-performance machine learning library, called MLFlink.
Apache Samza is a distributed system for stateful and fault-tolerant stream processing. It is able to scale to massive state sizes, e.g., hundreds of TB, due to the use of partitioned local state in combination with a low-overhead background change-log mechanism. Next to the processing of infinite data streams, it can also process finite datasets as a stream, e.g., files stored in HDFS.
Apache Drill is a distributed system for the interactive analysis of large datasets, designed to handle petabytes of data spread across numerous servers. Its goal is to respond to queries in a low-latency manner, and it is designed for scalability, including well-defined APIs and interfaces.
Comparison of analytical tools for Big Data. The categories are chosen based on their influence within a Big Data system and range from "++" (very good fulfillment) to "−" (not fulfilled). Additional information is given where appropriate.
Features | Storm | Flink | Spark | Samza | Drill | Hadoop |
---|---|---|---|---|---|---|
Real-Time | + | ++ | ++ | ∼ | + | ∼ |
Distributed Processing | + | + | + | + | + | ++ |
Running Analytics | + | ++ MLFlink | ++ MLLib | + | + | + |
Streaming Type | + Micro batches | ++ Event streaming | + Micro batches | + Micro batches | Mini batches | ∼ Batch and mini batches |
Latency | ++ | ++ | + | + | + | Component |
Throughput | ++ | ++ | + | ++ | + | Component |
Fault Tolerance | ++ Auto-restart | ∼ Checkpoint | ∼ Checkpoint | ∼ Checkpoint | ∼ | − Replication |
Message Delivery | ++ Exactly once | ++ Exactly once | ++ Exactly once | − At-least once | n/a | Component |
Documentation Community | + | − Small | ++ Big | ∼ Medium | + | ++ Big |
Data sources | + | + | ++ HDFS, Kafka | ++ Kafka, Kinesis, … | ++ schema-free | Component |
Scalability | ++ auto-scaling | ++ auto-scaling | + | ++ | + | Component |
The selection of the right tool is crucial. When focusing on industrial plants with a high number of equipped sensors producing continuous data streams, it is reasonable to look for tools supporting event streaming or micro-batches. Furthermore, to train prediction models based on historical data, the analytical tool should support both machine learning and batch processing. Based on the comparison above, we identified Apache Spark as the most suitable tool for our setting.
To get a good overview of the running process, it is useful to integrate a visualization tool into the Big Data infrastructure. Various approaches are possible for implementing visualizations, e.g., using R Shiny apps or self-built Python apps. Nevertheless, there are also some ready-to-use tools for basic and advanced visualizations. Four popular tools in this area are PowerBI, Tableau, Grafana and Chronograf.
PowerBI is a versatile platform for analyzing and visualizing data within live dashboards and reports, aiming at non-technical users. It supports a variety of sources, e.g., databases or files, and is based upon business analytics. PowerBI has a restricted free usage that is expandable by purchasing licenses. Similar to PowerBI, Tableau is also not free, can be used without code and is also broadly used in the business analytics field. It is very fast at processing Excel files and data groupings, but lacks the ability to handle complex needs, various sources of data and a powerful query builder. Grafana and Chronograf are both free tools for visualizing data, focusing on time-series data stored in databases. They are simple to use for developers with SQL skills, since queries are written directly by the developers themselves. This enables more elaborate and joined queries to visualize complex connections within the data. Chronograf supports InfluxDB as its source database, provides multiple visualization types, e.g., line plots or gauges, and enables the development of custom dashboards. Grafana is similar to Chronograf, but supports more databases and also enables the integration of additional visualization plug-ins.
All of the mentioned tools work on top of the infrastructure and use data stored either in databases or files. They are not directly connected with the message routing and only present information that was previously stored, even though the tools support live updates. Live updates denote the immediate forwarding of newly inserted information to the dashboards to be processed and visualized. The final selection of the visualization tool should be made in coordination with the dashboard developers. If non-technical people create or adapt dashboards, it is reasonable to use PowerBI or Tableau. With these tools, it is possible to gain insights into data, with the downside of less advanced visualizations. Chronograf and Grafana enable more elaborate visualizations and queries on databases to provide deeper insight into existing data and are therefore a good starting point. For more complicated or unsupported visualizations, it is recommended to develop custom applications using, e.g., R Shiny.
Orchestration and life-cycle management of the various services used are crucial but often pose a central problem. In this context, containerization has gained a lot of research interest over the last few years.
The two most prominent container orchestration technologies are Docker Swarm and Kubernetes, whose main features are compared below.
Comparison of orchestration technologies for Big Data infrastructure development and deployment. The feature categories are chosen with respect to the identified necessities in the infrastructure stack.
Features | Docker swarm | Kubernetes |
---|---|---|
User interface | 3rd party | Yes |
Scalability | Highly (manual) | Highly (automatic) |
Complexity | Low | High |
Logging | Yes | Yes |
Environment | Development | Production |
# Containers | Small (<100) | Big (>100) |
This subsection covers the technologies used for the development and structure of a versatile Big Data infrastructure. The selected technologies, which were presented earlier, are further described with a special focus on why they fit best for process industry Big Data infrastructures. In addition, tools for monitoring the complete infrastructure during and after the development stage are presented to gain further insight into the dataflow.
Apache Kafka is the core for routing batch and streaming data as messages between the different (processing) steps across our infrastructure. Among the technologies presented above, Apache Kafka fits our requirements best, combining high throughput and low latency with fault tolerance and a mature connector ecosystem.
Another advantage of using Apache Kafka is provided by pre-configured and customizable Kafka connectors, which are used for the gathering and storing of messages. Those connectors enable a quick development of source and sink connections between Apache Kafka messages and other systems, e.g., databases such as InfluxDB or Cassandra, and reduce development time and potential errors. To handle the format of messages between (processing) steps in the infrastructure, the Confluent Schema Registry is used. It interacts with Apache Kafka to monitor schemas and their evolution over time. Next to its communication with Kafka, it is also able to handle schemas provided to Kafka connectors for storage purposes. Integrating such a tool is important because process industry settings are likely to change, e.g., new sensors are added or a process path changes. For the general coordination of processes in our distributed applications, Zookeeper is used. It integrates easily with Apache Kafka and other services and therefore reduces integration problems. Zookeeper is used for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
To enable further usage of data, it is advisable to store it in a persistent analytical storage. The stored data ranges from raw data gathered by machines to (pre-)processed data within the infrastructure and predictions or decisions. A data storage provides reusability of data for further evaluations, visualizations and the offline training of new models using historical data. A versatile infrastructure has to deal with varying amounts of data; it is therefore the best choice to focus on data storage technologies developed for Big Data scenarios. As access to recently added data is likely more important, e.g., for analytical purposes, than access to older data, we selected InfluxDB as our main storage. In contrast to Cassandra, InfluxDB provides a schemaless database and is designed for time-series data. The schemaless aspect is particularly interesting, as it allows storing evolving data, e.g., when new sensors are added to machines.
Apart from storing the data, it is also required to enable the storage of models within the infrastructure. This topic is rarely covered in the related literature; for this purpose, we integrate MLFlow, which stores trained models together with their preprocessing pipelines and experiment metadata.
For data processing, analysis and predictions, the versatile usable Big Data infrastructure uses Apache Spark. Apache Spark is the best-fitting tool out of the ones presented above, as it supports both batch and (micro-batch) stream processing and ships with the MLLib machine learning library.
For the visualizations at the top layer of our infrastructure, we decided to use Grafana. As all parts of the infrastructure are in general open-source, it is a good idea to also include an open-source visualization tool such as Grafana or Chronograf. These tools also provide advanced possibilities compared to PowerBI and Tableau, e.g., due to their customizability and ability to present complex relations using queries. Furthermore, as the process industry typically works with time-series data, Grafana and InfluxDB are the most reasonable choices for a versatile usable Big Data infrastructure. Both tools are very similar in use, as they receive their data directly from databases and visualize information using an SQL statement for data selection and aggregation. The visualization tool can easily be exchanged, as it sits on top of the infrastructure and is only connected directly to the database(s).
For monitoring during the development of the infrastructure, and also for later checks, various tools are used for the structure's components. At the level of message routing, Kafdrop is used to monitor messages passing through the Kafka cluster. It displays information regarding brokers, topics, partitions and topic consumers and enables us to inspect messages and their content. The Kafka Connect UI is used to set up and manage Kafka Connector instances, which represent Kafka sink and source connectors as well as transformation connectors. It already offers a broad variety of sink and source connectors, e.g., for InfluxDB or Cassandra, but also accepts customized connector implementations. For the monitoring of message schemas, the Schema Registry UI is a good choice. It is a fully featured tool for the underlying Schema Registry that allows visualizing and exploring registered schemas and their evolution. For the monitoring of databases, Adminer is used. This database management tool supports various databases, allows insights into the database structure, including tables and their contents, and provides a possibility for executing queries.
Docker, as covered above, is used for the containerized deployment and orchestration of all infrastructure services.
Depending on the use-case, infrastructure scaling might be a necessity. In combination with Docker Swarm, the containerized architecture allows easy scaling of infrastructure services by adding host systems. It is important to note that the used services bind to specific ports of the host system, leading to an issue when trying to scale up services like Kafka, Zookeeper or Spark automatically without increasing the number of separate host systems. It is necessary to manually configure the cluster members of the mentioned technologies as separate service definitions in the Docker Compose file; node labels can then be used to control on which nodes the services are placed, as sketched below.
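A minimal sketch (not the project's actual Compose file) of how such role labels can steer service placement in swarm mode, assuming an illustrative Spark worker image and the role labels defined during deployment:

```yaml
# Hypothetical excerpt of a docker-compose.yml for swarm mode:
# pin two Spark worker replicas to nodes labeled for batch processing.
version: "3.7"
services:
  spark-worker:
    image: bde2020/spark-worker:3.1.1-hadoop3.2  # illustrative image/tag
    deploy:
      replicas: 2
      placement:
        constraints:
          - node.labels.role == batch_processing  # label set via "docker node update"
```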
This subsection covers the implementation of AVUBDI, including a general overview of the composition of the infrastructure and an insight into the data processing pipeline.
Big Data Management Schema–Infrastructure layout containing main services and information flow.
The data management layer gathers incoming data via Kafka connectors and routes it as messages between the individual processing steps. The storage layer persists raw data, preprocessed data and results in InfluxDB, while trained models and their pipelines are kept in MLFlow. The processing layer executes offline Spark Jobs for model training and online Spark Jobs for near real-time predictions. The visualization layer, realized with Grafana, presents the stored data and results in dashboards.
Technology pipeline.
Example docker-compose service definition for standalone Kafka service.
All docker containers are configured using YAML files, which allows for establishing the overall definition and orchestration of these services. The YAML files contain configuration definitions for the dockerized services, the containers and the connections to other containers; the referenced example shows such a service definition for a standalone Kafka service.
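Since the referenced figure is not reproduced here, the following is a minimal sketch of such a standalone Kafka service definition, assuming the Confluent community images; image tags, ports and hostnames are illustrative:

```yaml
version: "3.7"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:6.1.0  # illustrative tag
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:6.1.0  # illustrative tag
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # The advertised listener must match the hostname clients use to reach the broker
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1  # single-broker setup
```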
In this subsection, we would like to describe the necessary steps for deploying the previously described technology stack in our test environment. It is worth mentioning that the selected Docker platform simplifies the deployment of the required technologies and the orchestration of their operation. In particular, the required installation and deployment steps are listed below.
Step-by-step instructions for installing and deploying the AVUBDI infrastructure.
1. Prerequisites
   A. CentOS 8
   B. Internet connection (for pulling docker images and the git project repository; later deployments can be done internally)
   C. SFTP connection
2. Installation
   A. Install the yum-utils package (which provides the yum-config-manager utility) and set up the stable repository
      I. sudo yum install -y yum-utils
      II. sudo yum-config-manager --add-repo
   B. Install the latest version of the docker engine and containerd
      I. sudo yum install docker-ce docker-ce-cli containerd.io
   C. Start docker
      I. sudo systemctl start docker
   D. Install docker compose
      I. sudo curl -L "
   E. Make the docker compose binary executable
      I. sudo chmod +x /usr/local/bin/docker-compose
   F. Pull the git project repository
      I. sudo git clone
   G. Switch to the project directory
      I. cd AVUBDI
3. Deploy
   A. Deploy the infrastructure standalone on one hosting VM (standalone)
      I. Deploy the infrastructure services: docker-compose up -d --build
   B. Deploy the infrastructure cluster on multiple hosting VMs (swarm cluster)
      I. Open ports (2377, 7946 TCP and 7946, 4789 UDP) in the firewall to allow docker swarm communication across the different VM nodes
      II. Set up the swarm with the hosting VM as swarm manager: docker swarm init (returns the join token for adding further nodes)
      III. Scale the swarm by adding additional VM nodes: docker swarm join --token <join token> <manager address>:2377
      IV. Set the roles of the swarm nodes: docker node update --label-add role=<master, stream_processing, batch_processing or analytics> <node>
      V. Deploy the infrastructure services to the swarm (run on every swarm node): docker stack deploy --compose-file docker-compose.yml cogniplant
In this subsection, we provide additional necessary information for configuring certain parts of the proposed AVUBDI.
The first point of configuration is the data ingestion of the infrastructure: Kafka source connectors gather data from external systems, e.g., files or databases, and publish it to Kafka topics for further processing.
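As a hedged illustration of such a source connector configuration (using the FileStreamSource connector that ships with Kafka Connect; the file path and topic name are placeholders, not the project's actual settings):

```json
{
  "name": "fileSourceConnector",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/data/sensor_readings.txt",
    "topic": "rawSensorData"
  }
}
```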
For the persistence of, e.g., results or raw data, Kafka sink connectors are used.
Configuration of a Kafka Sink Connector for a local InfluxDB.
{
  "name": "influxDBSinkConnector",
  "config": {
    "connector.class": "io.confluent.influxdb.InfluxDBSinkConnector",
    "measurement.name.format": "Results",
    "influxdb.url": "<influxdb URL>",
    "topics": "storeInInfluxDB",
    "tasks.max": "1",
    "value.converter.schemas.enable": "true",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "influxdb.db": "database_1"
  }
}
Schemas, which are used to check data conformance and evolution in routed messages, are defined in the Schema Registry.
Configuration example of a schema in the Schema Registry.
{
  "type": "record",
  "name": "Schema_1",
  "namespace": "at.cogniplant.schemas",
  "fields": [
    {
      "name": "Value_1",
      "type": "string"
    }, {
      "name": "Value_2",
      "type": "string"
    }, {
      "name": "Value_3",
      "type": "double"
    }
  ]
}
Data processing is conducted within Spark Jobs, which are deployed on the Spark master. The Spark Jobs are separated according to their usage into offline and online Jobs. Offline Jobs cover the generation and training of new models and require higher computational resources. Models created in this environment use historical data stored, e.g., in InfluxDB, and are validated using cross-validation with a training/testing partition of 80/20. Online Jobs are used to process a continuous flow of data for, e.g., predictions.
Spark Jobs are defined using scripts, e.g., written in Scala or Python. The source of information is either a Kafka source connector (offline) or a subscription to Kafka topics for the continuous gathering of data (online). The received data is converted into a Spark Dataset object and reduced to its main features, e.g., by removing duplicated or non-informational columns. The filtered data is separated into feature and target columns using a Spark VectorAssembler and then further processed according to its usage. Trained and adapted models are stored together with their preprocessing pipeline as well as experimental results in MLFlow, so that processing steps are reproducible. Results of online Jobs are routed back to Kafka and persisted in InfluxDB via the corresponding sink connectors.
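A minimal sketch of such an online Job in PySpark follows; it assumes a pipeline model previously stored at an illustrative path, and the topic, field and column names (taken over from the configuration examples above where possible) are placeholders rather than the project's actual settings:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("OnlinePredictionJob").getOrCreate()

# Expected schema of the incoming messages (fields follow the Schema Registry example)
schema = StructType([
    StructField("Value_1", StringType()),
    StructField("Value_2", StringType()),
    StructField("Value_3", DoubleType()),
])

# Subscribe to the Kafka topic carrying the preprocessed sensor data
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "preprocessedData")
       .load())

# Parse the JSON payload of each message into typed columns
parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("m"))
          .select("m.*"))

# Load the previously trained pipeline (feature assembly + model) and predict
model = PipelineModel.load("/models/example_pipeline")
predictions = model.transform(parsed)

# Route the predictions back to Kafka; a sink connector persists them in InfluxDB
query = (predictions
         .selectExpr("to_json(struct(Value_1, Value_2, Value_3, prediction)) AS value")
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "kafka:9092")
         .option("topic", "storeInInfluxDB")
         .option("checkpointLocation", "/tmp/checkpoints/online_job")
         .start())

query.awaitTermination()
```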
For the deployment of Spark Jobs within a Docker container, the Job has to be saved as a fat JAR (Scala, Java) or script (Python) inside the data volume. Further operations, e.g., submitting the Job, have to be conducted inside the container for a successful deployment. In this case, creating a script for automated submissions is useful to avoid development overhead.
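Submitting the sketched script could, for instance, look as follows; the container name, master URL, paths and the Kafka package version are assumptions about the concrete setup:

```
docker exec -it spark-master spark-submit \
  --master spark://spark-master:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 \
  /data/jobs/online_prediction_job.py
```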
The configuration of data visualizations depends on the tool and use-case. Grafana and Chronograf are both able to provide dashboards for various use-cases, each of which is configured independently. The configuration is done by adding panels inside dashboards, in which users define queries for data retrieval from the database as well as visualization types. Queries are sent directly to pre-defined database connections, without the need for a direct integration into the infrastructure. Queries can either be written by hand or assembled with the graphical query editor.
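For instance, a Grafana panel showing the average prediction stored by the sink connector described above could use an InfluxQL query such as the following sketch (the field name and time range are illustrative):

```sql
SELECT mean("prediction") FROM "Results" WHERE time > now() - 24h GROUP BY time(10m)
```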
This section covers various use-cases utilizing our infrastructure in different settings with different targets. The goal is to demonstrate the functionality of the tool. The specifics of the considered use-cases are not relevant, since they merely serve to demonstrate 1) the monitoring of process parameters, 2) the monitoring of model predictions and 3) the surveillance of model performance. These three use-cases are just a sample of the usage possibilities for such an infrastructure in the process industry. Further possibilities include the integration of alarm systems, the simulation of new scenarios or the analysis of machine behavior using process mining. For the experiments, we used a virtual machine with the operating system CentOS Linux, an Intel(R) Xeon(R) Gold 6136 CPU @ 3.00 GHz (4 of 12 cores were used for the VM), 16 GB RAM and 128 GB disk space, 50% of which is used for the services in docker.
Even though the presented experiments have been conducted using the CentOS Linux operating system, it should be noted that Docker runs natively on both Linux and Windows. However, native Windows-based and Linux-based containers have to be configured individually. For simplicity and consistency, we focused solely on the deployment of Linux containers. It is important to note that for seamless Linux container stack development on Windows platforms, e.g., developer systems, the Windows Subsystem for Linux (WSL2) can be used.
The first scenario covers the monitoring and visualization of process parameters during production. The data is recorded by various sensors during an industrial process, gathered by Kafka Connectors as mini or micro-batches in near real-time and stored in its raw format in addition to being further preprocessed. The storage of raw data is essential for fault tolerance, replication possibilities and future analysis tasks. The preprocessing of raw data covers the filtering of interesting data, the flattening of nested structures, aggregations and the transformation into usable data formats, e.g., a Spark Dataset. The preprocessed data is stored in InfluxDB to respect its time-series characteristics and to simplify its handling in further analyses. Grafana dashboards are used for the visualization of the process parameters, see the figure below.
Visualization of recorded process parameters in Grafana using a stacked bar plot. The features of the mock-up scenario resemble temperature measurements in different temperature zones recorded over a month. Users are able to identify measurement stops as well as correlated sensor data.
The second use-case scenario of AVUBDI covers the monitoring of predictions using a machine learning model for a process criterion, e.g., the quality of a product or a temperature. Again, the feature data is recorded by industrial sensors, gathered, stored raw and further preprocessed. A machine learning pipeline is used for the prediction or decision part of the infrastructure. The pipeline receives the data and performs the following steps:
• Data cleaning, e.g., handling of null values,
• Feature filtering/selection based on training data,
• Formulating predictions/decisions using trained models,
• Storing the results in InfluxDB.
The machine learning models have been previously trained on historical data and stored as a pipeline for simple online prediction/decision-making. The processing of streaming data takes place in the online part of the infrastructure to provide insights into the target’s current development.
Visualization of model predictions in Grafana including line plot, textual representation in table and gauge with thresholds. The predicted value of the mock-up scenario represents the temperature prediction within a machine hall. The gauge indicates whether the temperature is within a suitable range using the mean predicted value of the past 10 min.
The third use-case scenario covers the monitoring of the model regarding its performance over time. The visualized data consists of (preprocessed) data gathered in industrial environments of a process, prediction values, and the measured, real values corresponding to the predictions for comparison. In industrial environments, it is not always possible to measure product or process criteria, which makes predictions necessary to get an overview of them. It is very likely, though, that values are measured in predefined time intervals to check whether the prediction models are (still) fitting. This use-case scenario is a mixture of the previous ones, where for each predicted value a real value is available.
In this scenario, we want to check whether the model performance is still suitable or whether the model has to be adapted or a new model has to be trained. The dashboard created in Grafana, see the figure below, visualizes the predicted values next to the measured real values, allowing users to assess the model fit over time.
Model performance visualization in Grafana. This mock-up scenario covers the visualization of predicted and real values when estimating the performance of a prediction model. The performance surveillance is necessary to initiate changes before problems arise and indicates whether a model still fits or has to be adapted.
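To support this surveillance, a panel could also plot the deviation between predicted and measured values directly; a minimal InfluxQL sketch, assuming both values are stored as fields of the same measurement (field names and time range are illustrative):

```sql
SELECT "prediction" - "real_value" AS "error" FROM "Results" WHERE time > now() - 30d
```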
In this article, we have presented a versatile usable Big Data infrastructure (AVUBDI), which is able to handle the full stack of historic and real-time data processing, i.e., gathering, transforming, analyzing and visualizing in a user-friendly manner. Our main goal is the utilization of AVUBDI to improve the supervision of production, through which performance indicators can be improved. In the first part of this paper, various open-source state-of-the-art technologies for data management, routing, storage, processing and visualization were presented and compared. The most promising of those technologies were used to describe the development of AVUBDI in more detail, where we focused on both the technology stack and the data pipeline. Topics ranging from data gathering and routing over storage and processing to visualization and monitoring were covered.
The presented Big Data infrastructure can easily be deployed and adapted to various use-cases in industrial environments due to its versatile and solid structure, focusing on a user-friendly environment for non-experts as well as experts. In addition, the usage of containerization enables a straightforward scaling and management of the infrastructure services. On the other hand, containerization slightly reduces the overall achievable performance due to the additional software and networking layers on top of the used services.
The proposed Big Data infrastructure is publicly available at the following link:
LS and MM have contributed equally to the design and implementation of the AVUBDI infrastructure as well as the scientific research and the documentation process. GC has contributed partially to the design of the AVUBDI infrastructure, the scientific research and the documentation process. MP has contributed partially to the scientific research and the documentation process.
This paper is supported by European Union’s Horizon 2020 research and innovation programme under grant agreement No 869931, project COGNIPLANT (COGNITIVE PLATFORM TO ENHANCE 360° PERFORMANCE AND SUSTAINABILITY OF THE EUROPEAN PROCESS INDUSTRY). It has also been supported by the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry of Science, Research and Economy, and the Province of Upper Austria in the frame of the COMET center SCCH.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.