Data in Motion

Data-in-motion processing, in which data is operated on immediately as it arrives rather than first being stored in some kind of database, is an alternative to batch processing. It is typically implemented as an in-memory technology with extremely low end-to-end latency. Depending on the operations being performed and the available processing hardware, it is not unusual for the overall latency, from the time a new data element arrives to the time the results have been updated, to be in the sub-millisecond range.

Note that the performance described above is not just the ability to ingest large volumes of data at high rates, but it also includes the completion of all processing necessary to produce the “answer” or to update a predicted value. This capability allows the organization to react immediately to current situations or changing conditions. Since the results are updated continuously, an understanding of the exact current state is always available. Where predictive models exist, the system can also update both the models and the results instant by instant according to the latest available data. This moves the insight provided from what has happened previously to what is happening now while providing more accurate predictions of future situations by taking into account up-to-the-moment information. It also enables the agility necessary to accommodate evolving problem spaces.
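
As a minimal sketch of what "results updated continuously" can look like, the Python example below keeps a count, mean, and variance current as each value arrives, without storing earlier values. It is not taken from any particular stream processing product; the class name RunningStats and the sample readings are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Keeps count, mean, and variance current without storing past elements."""
    count: int = 0
    mean: float = 0.0
    _m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Welford's online update: constant work per arriving element.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of everything seen so far.
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for reading in [10.0, 12.0, 11.0, 13.0]:  # hypothetical arriving values
    stats.update(reading)
    # The current state is available after every single element.
    print(f"n={stats.count} mean={stats.mean:.2f} var={stats.variance:.2f}")
```

Because each update touches only a few numbers held in memory, the cost per element stays constant no matter how much data has already passed through.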

These characteristics are especially appropriate for applications that emphasize the immediate, such as feeding a real-time dashboard or triggering automated alerts or behaviors. One of many things to consider when evaluating whether data-in-motion processing would be useful for your application is how quickly your business needs to, or can, respond to the results. If a daily or monthly report is sufficient, you may not benefit from a low-latency solution.

In addition to ultra-low latency, data-in-motion systems can provide a different kind of capability: acting as a front-end processor to data-at-rest systems. Used in this way, a stream processing system can perform operations such as cleansing, normalizing, filtering, and aggregation. These kinds of operations can result in higher-value data, and less of it, being presented to the downstream systems.
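
The front-end role described above can be sketched as a chain of small stages, each consuming records one at a time. The Python example below is only an illustration under assumed conditions: the record shape (a sensor id plus a raw text reading), the function names, the valid range, and the window size are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, Optional[str]]  # (sensor id, raw reading); hypothetical shape
Reading = Tuple[str, float]


def cleanse(records: Iterable[Record]) -> Iterator[Reading]:
    """Drop unparseable readings and normalize the sensor id."""
    for sensor, raw in records:
        try:
            yield sensor.strip().lower(), float(raw)
        except (TypeError, ValueError):
            continue  # missing or non-numeric reading


def keep_in_range(readings: Iterable[Reading], low: float, high: float) -> Iterator[Reading]:
    """Filter out readings outside the plausible range."""
    for sensor, value in readings:
        if low <= value <= high:
            yield sensor, value


def aggregate(readings: Iterable[Reading], window: int) -> Iterator[Reading]:
    """Emit one (sensor, mean) pair for every `window` readings from a sensor."""
    pending: Dict[str, List[float]] = {}
    for sensor, value in readings:
        bucket = pending.setdefault(sensor, [])
        bucket.append(value)
        if len(bucket) == window:
            yield sensor, sum(bucket) / window
            bucket.clear()


raw_feed: List[Record] = [
    ("A ", "1.0"), ("a", "3.0"), ("B", None), ("a", "oops"),
    ("A", "999"), ("b", "2.0"), ("b", "4.0"),
]
summaries = list(aggregate(keep_in_range(cleanse(raw_feed), 0.0, 100.0), window=2))
print(summaries)  # seven raw records reduce to two summary records
```

Here the downstream system would receive two clean, aggregated records instead of seven raw ones, which is the "higher value, lower quantity" effect in miniature.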

As capable as it is, the data-in-motion concept is often best deployed as a complement to data-at-rest systems. Understanding the immediate state in the context of historical behavior is a very powerful combination, and is made even more so when this interaction can be handled automatically. This combination pairs the low latency of the in-memory data-in-motion system with the ability of the data-at-rest system to hold long periods of historical information. While there is certainly some crossover, it is appropriate for many use cases to think of using the data-at-rest system for data mining and model creation (i.e. determining what to be on the lookout for) and the data-in-motion system for real-time scoring (i.e. doing the looking).
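
The division of labor between model creation and real-time scoring can be sketched in a few lines of Python. This is a deliberately simple stand-in, assuming the "model" is just a mean and standard deviation computed from stored history and that scoring flags values more than three standard deviations away; the function names, threshold, and data are hypothetical.

```python
from statistics import mean, stdev
from typing import Iterable, Iterator, List, Tuple

Model = Tuple[float, float]  # (mean, standard deviation)


def build_model(history: List[float]) -> Model:
    """Data-at-rest step: summarize stored history (determine what to look for)."""
    # Assumes the history varies, so the standard deviation is non-zero.
    return mean(history), stdev(history)


def score(stream: Iterable[float], model: Model, threshold: float = 3.0) -> Iterator[Tuple[float, bool]]:
    """Data-in-motion step: score each arriving value against the model (do the looking)."""
    center, spread = model
    for value in stream:
        z = abs(value - center) / spread
        yield value, z > threshold


history = [20.0, 21.0, 19.0, 20.0, 22.0, 18.0]  # hypothetical stored readings
model = build_model(history)

arriving = [20.5, 35.0, 19.0]  # hypothetical live values
alerts = [value for value, unusual in score(arriving, model) if unusual]
print(alerts)
```

In a real deployment the model would be rebuilt periodically from the historical store and handed to the streaming side, which is the automatic interaction the paragraph above describes.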

InfoSphere Streams
As understanding of the benefits of data-in-motion stream processing becomes more widespread, the availability of platforms that support the paradigm is also growing. The initial entry into the space was a technology developed by IBM in cooperation with the US Government that eventually was productized as InfoSphere Streams. This particular product has some similarity to Complex Event Processing (CEP) systems, although there are also some substantial differences. Also falling into this space are relatively new open source stream processing systems such as S4, Storm, and Spark. A somewhat different offering, which has substantial overlap for operating on certain forms of machine data such as system logs, is Splunk.

Many considerations need to be taken into account in order to choose wisely when selecting a stream processing platform. There are a growing number of platforms that claim to accommodate the three V’s of big data (volume, variety, and velocity) and that may be sufficient for various applications that are not necessarily mission critical. But there are many use cases for which many other things should also be considered. Examples include:

The maturity of the platform – this reflects on how evolved, enhanced, and stable the resulting solutions are, as well as the availability of comprehensive development tools.
The support model (e.g. open source vs. commercial product, and the vibrancy of the user community)
How highly the system is optimized to be computationally efficient (i.e. does it make the most effective use of the available hardware with minimal overhead, or does it make up for relatively high resource needs by running on more nodes?)
Scalability – can it scale up to very large hardware, out to thousands of processors, and down to small, possibly embedded, elements operating in a federated system-of-systems?
Capability – in addition to basic business logic (e.g. if this happens then do that), can it also natively handle mathematical functions such as Fourier transforms on A/V data, sentiment and other forms of text analysis on unstructured natural language, geospatial analytics, time series calculations, complex event processing, SPSS modeling, and R statistics?
Does it have direct support for extension using either high performance or “convenient” standard languages (C++, Java, Python, Perl, etc.)?
Is the platform both modular and agile? Modularity allows large and/or complex applications to be decomposed into basic components that are more easily maintained, optimized, and reused. Agility refers to how easy it is to modify and/or enhance an existing system with minimal operational disruption.
How simple but expressive is the language or other paradigm used to program the system? Stream processing involves a different way of thinking than other programming models, and a language specialized for that purpose makes it easier to write high quality data-in-motion applications.
How robust, reliable, and capable is the runtime environment?
How rich and comprehensive are the available libraries and development tools?

In our experience, the evaluation of the points listed above, along with many others, has led to the selection of the InfoSphere Streams platform as the obvious choice for the majority of our clients’ mission critical solutions.

Resources to learn more
In addition to contacting us for a personalized conversation, you can find more information on stream processing in general and the InfoSphere Streams platform at the resources below:

Product website – including two forms of a quick start edition and links to whitepapers and other resources.
Streamsdev – Developer community website for Streams

Books:
            IBM InfoSphere Streams: Assembling Continuous Insight in the Information Revolution
            Addressing Data Volume, Velocity, and Variety with IBM InfoSphere Streams V3.0
            Fundamentals of Stream Processing: Application Design, Systems, and Analytics
Videos:
            Streams Developer Ed (7 part series of short topics)
            Combining data-in-motion scoring with data-at-rest model creation (2 minutes)
