(Richard Tibbetts is Chief Technology Officer at StreamBase Systems. Follow him on Twitter as @tibbetts)
Recently, IBM System S (aka InfoSphere Streams) has been getting a lot of attention in the press. As is the custom in the news business, writers try to focus on what is new or novel in the technology. Some have talked about the automatic query generation or the massive parallel scalability, which are among the contributions of the System S research project. Unfortunately, others have picked up on the distinction between structured and unstructured data processing and identified unstructured data processing as unique to System S. Philip Howard, writing at IT Director, is one example, and another comes in the comments on a StreamBase blog post.
The reality is that CEP engines have been processing unstructured data for a long time, and doing it well. For example, unstructured data processing is a major aspect of applications in the defense and intelligence space. As one architect at an intelligence agency put it, you can’t standardize messaging protocols when the other side doesn’t want you to listen.
Unstructured data processing means processing data that won’t fit into a standard relational model. The most unstructured data is media, such as text, audio, or video. Other kinds of unstructured or semi-structured data include XML or packet capture data. These data formats may come as raw streams, or as part of the data payload in events that also contain structured data.
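To make the last point concrete, here is a minimal sketch of an event that carries a couple of structured fields alongside a semi-structured XML payload, with a small extraction step that pulls meaning out of the payload. It is plain Python rather than any CEP product's API, and the names `Event`, `extract_subjects`, and the `<subject>` element are assumptions invented for illustration.

```python
from dataclasses import dataclass
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Event:
    """Hypothetical event: structured fields plus a raw payload."""
    source: str       # structured field
    timestamp: float  # structured field
    payload: bytes    # semi-structured body (XML in this sketch)


def extract_subjects(event: Event) -> List[str]:
    """Parse the XML payload and return the text of every <subject> element."""
    root = ET.fromstring(event.payload)
    # ElementTree supports a limited XPath subset; ".//subject" matches
    # <subject> elements at any depth below the root.
    return [node.text for node in root.findall(".//subject") if node.text]


event = Event(
    source="feed-7",
    timestamp=1250000000.0,
    payload=b"<msg><subject>shipping</subject><body>free text here</body></msg>",
)
subjects = extract_subjects(event)
```

The structured fields can be filtered and joined like any relational column, while the payload needs a format-specific step before anything downstream can reason about it.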
Specialized algorithms and libraries exist to extract meaning from this data, ranging from XPath queries and protocol analyzers to natural language processing (NLP) and machine vision. Different kinds of analysis require different algorithms, but they all share some characteristics. A system for real-time unstructured data processing requires four things:
- Unstructured Data Objects — Data processing begins with data representation. The platform must have an efficient mechanism for representing large data objects, integrated with memory management, persistence, and messaging systems.
- Extensible Language — Most unstructured data processing involves specialized algorithms. The language must be extensible so that these domain-specific algorithms can coexist with built-in functionality. Since the algorithms are often already available as libraries, it is important that these APIs support mainstream languages.
- Unstructured-Data Aware Authoring Tools — Authoring tools are a critical part of any modern platform, and they must be designed for real-time processing and for both structured and unstructured data.
- Clustering Capabilities — Unstructured data processing applications are often resource intensive. Spreading the load across large numbers of compute servers is a critical scaling technique. Being able to scale via multi-threading across a single large server is also required.
Many commercial CEP systems already have some or all of these capabilities; StreamBase, for example, has all of them. Unstructured data objects can be ingested over the network at very high speed. Advanced text processing plugins can be written in Java or C++, and can be developed and debugged in the CEP development, debugging, and testing tools (i.e., StreamBase Studio). And event processing applications that analyze structured data can be deployed and scaled across large clusters.
One can reasonably infer that System S has these capabilities, though for the moment it is difficult to find information on its authoring tools.
Now Philip Howard suggests that System S is more capable in unstructured data processing because it does not use SQL. This is a non sequitur. The additional power that SQL gives StreamBase does not harm developers of unstructured data processing applications. Instead, it increases the flexibility of their systems and the speed with which they can learn the platform. One system, developed using StreamBase, first extracts meaning from unstructured text, and then uses that extracted meaning to identify which analysts need to be alerted. The conditional alerting is expressed best in SQL, while the text processing uses unstructured facilities. Developers can build alerting logic using StreamSQL EventFlow and integrate the text processing algorithms without learning an entirely new language.
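The two-stage shape of such a system can be sketched in a few lines. This is not the StreamBase application described above, nor StreamSQL; it is a plain Python illustration under stated assumptions. The keyword counter stands in for a real text processing library, and `KNOWN_TOPICS`, `Analyst`, `extract_topics`, and `analysts_to_alert` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set

# Assumed vocabulary; a real system would use an NLP library here.
KNOWN_TOPICS = {"shipping", "port", "cargo"}


@dataclass
class Analyst:
    name: str
    topics: Set[str]   # topics this analyst follows
    min_mentions: int  # alert threshold per topic


def extract_topics(text: str) -> Dict[str, int]:
    """Stage 1 (unstructured): turn free text into structured topic counts."""
    counts: Dict[str, int] = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in KNOWN_TOPICS:
            counts[word] = counts.get(word, 0) + 1
    return counts


def analysts_to_alert(counts: Dict[str, int], analysts: List[Analyst]) -> List[str]:
    """Stage 2 (structured): a conditional filter over the extracted counts."""
    return [
        a.name
        for a in analysts
        if any(counts.get(t, 0) >= a.min_mentions for t in a.topics)
    ]


analysts = [
    Analyst("alice", {"cargo"}, 2),
    Analyst("bob", {"port"}, 2),
    Analyst("carol", {"shipping", "port"}, 1),
]
counts = extract_topics("Cargo left the port. Cargo manifest attached.")
alerted = analysts_to_alert(counts, analysts)
```

Once stage 1 has produced structured counts, stage 2 is an ordinary filter over rows, which is the kind of logic a SQL-style language expresses naturally.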
The IBM SPADE language is not based on SQL or any other language familiar to developers. This may have given IBM researchers a lot of freedom to experiment when designing the system, but it won’t help enterprise programmers learn the language quickly, and it certainly is not a silver bullet for unstructured data processing.