The Nature of Industrial Data

The devil is in the data.

Industrial data is among the most complex forms of data. We experienced this first-hand while building a Near Real-Time (NRT) Industrial IoT contextual data streaming and processing engine.

We’re sharing our perspective on what constitutes industrial data, what makes it complex, and how we built a platform that overcomes these challenges and provides an abstract, seamless view of the data, ready for consumption.

There are five main properties of industrial data:

  1. Dispersed Data
  2. Data Type
  3. Detail
  4. Structure
  5. Size

We will visit each of them in detail below.

Dispersed Data

This is data stored or streamed across multiple locations, in multiple ways and formats. There are three main kinds of data in the industrial world: manufacturing data, ERP or enterprise data, and ecosystem data.

Manufacturing data contains the present and historical states of the machines (also called assets). It can be abstracted to represent a process performed by a set of machines working together.

Manufacturing data is often the most structured data and can be ingested in one of two ways:

  1. Streaming this data from a Distributed Control System (DCS) like DeltaV, an asset framework like OSIsoft PI, PLCs, etc. This is the real-time stream of sensor data representing process, condition, or other variables. It can arrive over one of the 400+ industrial connectivity protocols, including well-known names like MQTT, OPC, Allen-Bradley DF1, and Modbus.
  2. Ingesting data from historians and data lakes. Most industries keep historical data for many years for auditing and compliance purposes. Machine learning applications give this data a new use case: training models and surfacing previously unknown insights.
    This data lives in various historian systems, data lakes, and data warehouses, and it needs to be fetched efficiently and quickly for visualization, statistical analysis, or machine learning. (An illustrative shape of both ingestion paths is sketched right after this list.)
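To make the two paths concrete, here is a minimal sketch of a unified ingestion interface. The interface, record, and method names are hypothetical illustrations of the architecture described above, not any particular product's API.

```java
import java.time.Instant;
import java.util.stream.Stream;

/**
 * Illustrative shape for the two ingestion paths: a live subscription
 * (MQTT, OPC, etc.) and a bounded backfill from a historian or data lake.
 * Names are hypothetical, not a specific vendor's API.
 */
public interface IngestionSource {
    record TagReading(String tag, Instant ts, double value) {}

    /** Live feed from a DCS, PLC, or asset framework. */
    Stream<TagReading> live();

    /** Bounded backfill, e.g. for ML training or audits. */
    Stream<TagReading> backfill(Instant from, Instant to);
}
```

Concrete implementations would wrap the relevant connector (an MQTT client, an OPC UA session, a historian SDK) behind this shape so that downstream cleaning and contextualization stay the same.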

Enterprise or ERP data often contains meta-information about the process, for example, the batch number assigned to a production run. This data is necessary to create the full context and is fetched using enterprise connectors such as SAP HANA, IBM, and Salesforce.

Enterprise data is typically semi-structured, and the degree varies from one enterprise to another. Every enterprise is at its own stage of data maturity, and that level of maturity typically defines how easy or complex it is to fetch meaningful data from its existing enterprise systems or data lake.

There are many other software tools used for very specific purposes, like a Quality Management System (QMS) or a System of Record (SoR). We call them ecosystem software, as they facilitate the necessary compliance or bridge between various processes. Most often, a bi-directional communication channel with this ecosystem software is needed, using the connectors or APIs they expose.

The data obtained by any of these means is not necessarily in a format that can be readily ingested or even processed. Almost always, some cleaning and standardization are required before or during ingestion. This step, too, is specific to the enterprise, team, or particular process.
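As an illustration of that standardization step, here is a small sketch that parses a raw reading and normalizes its timestamp and unit. The field names and the single Fahrenheit-to-Celsius rule are assumptions for the example; the real rules vary per enterprise and process.

```java
import java.time.Instant;

/** Minimal, hypothetical cleaning/standardization step before ingestion. */
public class Standardizer {
    record RawReading(String tag, String timestamp, String value, String unit) {}
    record CanonicalReading(String tag, Instant ts, double value, String unit) {}

    static CanonicalReading standardize(RawReading raw) {
        double v = Double.parseDouble(raw.value().trim());
        String unit = raw.unit() == null ? "unknown" : raw.unit().trim().toLowerCase();
        if (unit.equals("degf")) {               // normalize Fahrenheit to Celsius
            v = (v - 32) * 5.0 / 9.0;
            unit = "degc";
        }
        // Timestamps are assumed to arrive as ISO-8601; real feeds often need per-source parsing.
        return new CanonicalReading(raw.tag().trim(), Instant.parse(raw.timestamp()), v, unit);
    }
}
```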

Data Type

It is easy to think that most captured sensor data is numeric, for example temperature and pressure. But there are many other types: a spectrometer emits spectral data per sample, and infrared and regular cameras produce video feeds.

There are also data fields like batch and lot numbers, which can be alphanumeric. This diversity of data types has to be accounted for at all times, and it makes things interesting because each type has different needs: memory, compute, storage, and use cases.

For instance, storing a temperature value at an update frequency of 500 ms for 24 hours takes 1.28 MB, while storing a spectral probe (with 1,000 wavelengths) at the same update frequency over the same period takes 1,536 MB, about 1,200x more. (How we store them is another blog post, for later.)

There is a similar 10x to 100x increase in processing time for spectral and video data compared to numeric data types. This difference is important because these data streams represent the same event, and to make meaning out of the event, the processing of both must complete and be analyzed in unison.

A key requirement, mostly for manufacturing companies, is GxP compliance. GxP stands for the family of “good practice” regulations (Good Manufacturing Practice being the most familiar), and among many things, it means that the data stays exactly the same and the results are predictable and repeatable.

This has a very interesting intersection with how data is stored. It implies that the precision of floating-point values must not change. In these situations, data needs to be stored with arbitrary precision, also called infinite precision. These requirements can limit the choice of storage systems. Let us look at this in a little more detail.

IEEE 754 is the standard for floating-point arithmetic. For many years it defined only binary representations; Java’s float and double are IEEE 754 binary32 and binary64, respectively. Binary floating-point numbers are very efficient for computers to calculate with, but because they work in binary and we work in decimal, there are some expectation mismatches.

This is what is called “limited-precision arithmetic”. For instance, 0.1 cannot be stored precisely in a double, and you get oddities like 0.1 + 0.2 turning out to be 0.30000000000000004. Doubles are not a good choice for financial calculations, for instance.

Then there is “arbitrary-precision arithmetic”, where precision can be effectively infinite (limited only by the memory of the computer). In 2008, IEEE updated 754 to add decimal floating-point formats: decimal32, decimal64, and decimal128.

Java has had an arbitrary-precision implementation in the BigDecimal class since well before the 2008 update, and the companion MathContext class specifies the context (precision and rounding mode) of an operation, including settings that mirror decimal32, decimal64, and decimal128. The whole idea is to support financial-style calculations with finite memory.
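A quick sketch of the difference in Java; the only assumption is that values arrive as strings (constructing a BigDecimal from a double would carry the binary rounding error along with it).

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class PrecisionDemo {
    public static void main(String[] args) {
        // binary64 (double): limited-precision binary arithmetic
        System.out.println(0.1 + 0.2);                         // 0.30000000000000004

        // BigDecimal: decimal arithmetic, exact when built from strings
        BigDecimal a = new BigDecimal("0.1");
        BigDecimal b = new BigDecimal("0.2");
        System.out.println(a.add(b));                          // 0.3

        // MathContext pins precision and rounding, mirroring IEEE 754-2008 decimal128
        System.out.println(a.add(b, MathContext.DECIMAL128));  // 0.3
    }
}
```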

Coming back to our situation: if we use the double data type, which is fixed-precision binary arithmetic, we are bound to encounter inconsistencies. So, what if we just switch to BigDecimal?

After all, Elasticsearch, Spark, Kafka, and friends all run on the JVM, right? Well, it turns out there is no support for BigDecimal in InfluxDB (see the open “BigDecimal Support” request) or Elasticsearch (see the open “Add BigDecimal data type” issue).

Spark does support it with the DecimalType class (org.apache.spark.sql.types.DecimalType); see, for example, [SPARK-26308][SQL] Avoid cast of decimals for ScalaUDF.
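For illustration, here is a minimal Spark schema that stores the value column as a DecimalType (backed by java.math.BigDecimal). The field names and the (38, 18) precision and scale are assumptions for the example, not recommendations.

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class TagSchema {
    /** Illustrative schema for a tag reading with a decimal value column. */
    public static StructType readingSchema() {
        return new StructType(new StructField[] {
            DataTypes.createStructField("tag",   DataTypes.StringType,    false),
            DataTypes.createStructField("ts",    DataTypes.TimestampType, false),
            // DecimalType(precision, scale); Spark caps precision at 38 digits
            DataTypes.createStructField("value", DataTypes.createDecimalType(38, 18), true)
        });
    }
}
```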

There are a few gotchas, though (see https://issues.apache.org/jira/browse/SPARK-18484 and http://ww25.carelesscoding.com/2019/04/09/spark-gotcha-2.html?subid1=20220616-0046-08e5-9ee0-63ad49acac7c).

In Python, arbitrary-precision arithmetic is available through libraries such as mpmath (https://mpmath.org/).

TimescaleDB (and its parent, Postgres) supports this via the NUMERIC data type. Similarly, MongoDB 3.4+ supports it via the BSON decimal type, which is an IEEE 754 decimal128 implementation. MySQL supports it via the DECIMAL and NUMERIC types.

Detail

This is the granularity at which data is fetched. Depending on the industry and the use case, data intervals can differ widely. A high-speed turbine might have hundreds of sensors streaming data at millisecond intervals, while the sensors along a power transmission line might emit data every few minutes or hours because they are battery operated and expected to run in harsh conditions for months and years.

While these are the two ends of the spectrum, there are numerous cases where the data stream falls in the 0.5 – 3 s range. Sometimes the speed of data limits which existing solutions can be used. For instance, one might run into a situation where the quotas imposed by cloud IoT hubs (or similar services) are limiting, and a more convoluted route becomes necessary.

This level of detail plays another role in visualization. Most often, people want to look at a large range of data, sometimes larger than what can fit in a browser or be observed by the naked eye. This can be achieved by sampling the incoming streams and storing a copy of the sampled data alongside the raw data.

Sampling thus becomes an integral part of processing. Note that sampling happens on a per-tag (per-sensor) basis, typically during ingestion. There are many sampling algorithms, like simple random sampling, stratified sampling, and reservoir sampling (all classes of survey sampling); a minimal reservoir sampler is sketched below.
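As an illustration, here is a minimal per-tag reservoir sampler (Algorithm R). It is a sketch of the textbook technique, not our production sampler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Keeps a uniform random sample of at most k values from a stream of unknown
 * length; in practice one sampler would be kept per tag during ingestion.
 */
public class ReservoirSampler<T> {
    private final List<T> reservoir = new ArrayList<>();
    private final Random rng = new Random();
    private final int k;
    private long seen = 0;

    public ReservoirSampler(int k) { this.k = k; }

    public void offer(T value) {
        seen++;
        if (reservoir.size() < k) {
            reservoir.add(value);                      // fill the reservoir first
        } else {
            long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
            if (j < k) reservoir.set((int) j, value);  // replace with probability k/seen
        }
    }

    public List<T> sample() { return List.copyOf(reservoir); }
}
```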

It is important to ensure that interesting sections like peaks, dips, and slopes are captured correctly, because they matter visually. Non-uniform sampling techniques like level-crossing and levels-and-peaks sampling are commonly used for this. In some industrial scenarios, the data may also be noisy, and appropriate noise correction might be necessary.

Signal quality is another meta attribute that is captured and associated with each data point of each tag. In a world of data-driven decisions, signal quality is a feature for determining confidence in the outcome. It is also abstracted to higher levels, where the overall data quality of a system is considered instead of individual signal quality.

Structure/Context

Data from the same or different sources can represent the same concepts yet be structured differently. Most often, data in different places has contextual relationships that become visible only in a unified, contextual representation of the data.

What I mean is that, at the lowest level, all data is flat. Different personas apply different lenses and make different interpretations based on their domain.

For example, the operations and reliability teams at an industrial site use OSI PI to store their process and condition data, while the Quality team might use a QMS to store the quality outcomes of various batches. 

Sometimes these teams even call the same thing by different names, a case of terminological heterogeneity in the data. Every enterprise has its own set of processes and its own ontology.

This ontology should be brought in and merged with the flat data to create context. This is what we call contextualization, and the resulting data is contextualized data.
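Here is a minimal sketch of that contextualization step, with a simple Map standing in for whatever asset model or ontology store an enterprise actually uses; the record fields and names are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

/** Joins a flat tag reading with an ontology lookup to produce contextualized data. */
public class Contextualizer {
    record TagReading(String tag, long epochMillis, double value) {}
    record AssetContext(String site, String line, String asset, String businessName) {}
    record ContextualizedReading(TagReading raw, AssetContext context) {}

    private final Map<String, AssetContext> ontology;

    public Contextualizer(Map<String, AssetContext> ontology) { this.ontology = ontology; }

    public Optional<ContextualizedReading> contextualize(TagReading reading) {
        // Readings whose tags are missing from the ontology stay flat and can be flagged upstream.
        return Optional.ofNullable(ontology.get(reading.tag()))
                       .map(ctx -> new ContextualizedReading(reading, ctx));
    }
}
```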

This data should be easily consumable by programmers, subject matter experts, shop floor teams, and other personas, each in their own terminology.

This act of bridging together different pieces of information from different times of an event, and being able to correctly determine the state of events based on the specific domain, is what we call ContexAlyzation.

Size 

Data type and detail are the factors that determine the size of the data generated. A bioreactor can have 12,000 sensors streaming numeric data at 10 ms granularity. From the sizing metric above, 24 × 60 × 60 × 2 data points come to about 1.2 MiB, so 24 × 60 × 60 × 100 data points come to about 60 MiB per sensor per day. For this bioreactor, that is 60 MiB × 12,000 ≈ 703 GiB per day, or roughly 8 MiBps.

As if that were not enough, compliance requires keeping this data for 7 years; that is about 1.8 PB. An enterprise might have hundreds of such bioreactors, streaming at ~1 GBps in aggregate. Compare this with a power transmission line, which could have a similar number of sensors but streaming at 15-minute intervals: there the data comes to 24 × 60 × 60 × (1/(50 × 60)) × 12,000 ≈ 328 GiB per day, or 3.8 MiBps.

Compliance storage for that comes to 0.8 PB. There could be thousands of such power transmission lines, putting the enterprise's aggregate stream at ~3.7 GiBps. Note that these are pure storage sizes. There is additional storage for sampled copies and meta-information, so actual storage is 1.8x – 2.2x the above estimate.

Similarly, the actual processing and ingest payloads are 1.3x – 1.8x the above GiBps values. These are two examples from different domains, but one can easily encounter both situations within the same enterprise.
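For the curious, here is a back-of-the-envelope sizing helper along the lines of the arithmetic above. The 8-bytes-per-sample figure and the absence of timestamp or index overhead are assumptions, so its output lands near, but not exactly on, the rounded figures quoted in the text.

```java
/** Rough daily-volume estimate for a fleet of sensors; constants are assumptions. */
public class SizingEstimate {
    static double gibPerDay(int sensors, double sampleIntervalSeconds, int bytesPerSample) {
        double samplesPerDayPerSensor = 86_400 / sampleIntervalSeconds;
        return samplesPerDayPerSensor * bytesPerSample * sensors / (1024.0 * 1024 * 1024);
    }

    public static void main(String[] args) {
        double bioreactorGiB = gibPerDay(12_000, 0.010, 8);                 // ~770 GiB/day
        double sevenYearPiB  = bioreactorGiB * 365 * 7 / (1024.0 * 1024);   // ~1.9 PiB retained
        System.out.printf("bioreactor: %.0f GiB/day, 7-year retention: %.1f PiB%n",
                bioreactorGiB, sevenYearPiB);
    }
}
```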

These numbers put the scale into perspective. Another key factor in industrial data is relevancy: most often, the most relevant data is the real-time data. Hence, all ingestion and processing of this data must happen in real time or near real time.

Once the immediate value has been extracted, the data is stored in layers so that only the most relevant data is pulled out first. Historical data is often used for reference benchmarks, as datasets for machine learning projects, or for auditing. Thus, storage and retrieval of such data is a prime characteristic of industrial data.

We will discuss the design principles behind building an Industrial Data Ingestion and Processing Pipeline in an upcoming blog.
