This post is a part of a series of best practices from leading StreamBase enterprise customers who have effectively deployed "real-time" business systems.
"Can StreamBase handle 300 million messages a day?" asked a senior IT manager the other day.
"Well, it's common for a single StreamBase server to handle 100,000 messages a second for applications like yours," said our field architect.
"OK, but can you handle 300 million messages a day?"
Why this bizarre miscommunication? 100,000 messages per second is more than 8.6 BILLION messages per day, nearly 29 times the volume the IT manager asked about, yet he was worried StreamBase wouldn't handle his volumes.
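The arithmetic behind that exchange is easy to check. The short sketch below converts between a per-second rate and a daily volume; the function names are illustrative only, and the result assumes, unrealistically, that traffic is spread evenly across all 24 hours.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_day(rate_per_second):
    """Daily volume if a per-second rate were sustained all day."""
    return rate_per_second * SECONDS_PER_DAY


def average_per_second(messages_per_day):
    """Average rate if a daily volume were spread evenly over the day."""
    return messages_per_day / SECONDS_PER_DAY


server_daily = per_day(100_000)
manager_average = average_per_second(300_000_000)

print(f"{server_daily:,} messages/day")          # 8,640,000,000 messages/day
print(f"{manager_average:,.0f} messages/second")  # 3,472 messages/second
```

An even spread is exactly the assumption that real-time traffic violates, which is why the average rate alone says little about whether a system will keep up.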
Experienced event processing architects think about scalability in a different way. So, for this best practice blog entry, I teamed up with one of our most senior field architects, Hayden Schultz, to describe the art of scalability design and how to think about scalability in real-time systems.
THE SITUATION: AGGREGATE DATA CAPACITY PLANNING IS COMMON PRACTICE
Many technologists think about capacity planning in terms of aggregate data volumes. That is, they predict how much data they need to process in a day and provision their network, database, and hardware accordingly.
THE PROBLEM: REAL-TIME DEMANDS DIFFERENT METRICS
Aggregate volume isn't the right way to plan real-time systems. Real-time systems often have an event-computation-to-action ratio of 100-to-one, 1,000-to-one, or even a million-to-one. Technologists need to think in terms of stream burst rates and what the right behavior is during bursts.
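To see why the burst rate matters more than the daily total, here is a minimal sketch of a message backlog, stepped one second at a time. The arrival rates, burst length, and capacity are invented for illustration and are not StreamBase measurements.

```python
def simulate_backlog(arrivals_per_second, capacity_per_second):
    """Return the number of unprocessed messages at the end of each second."""
    backlog = 0
    history = []
    for arrived in arrivals_per_second:
        # Whatever exceeds capacity this second waits in the buffer.
        backlog = max(0, backlog + arrived - capacity_per_second)
        history.append(backlog)
    return history


# Hypothetical traffic: 5 quiet seconds, a 2-second burst, 8 quiet seconds.
arrivals = [3_000] * 5 + [25_000] * 2 + [3_000] * 8
capacity = 10_000  # messages the system can process per second (assumed)

history = simulate_backlog(arrivals, capacity)
peak = max(history)

print(history)
# [0, 0, 0, 0, 0, 15000, 30000, 23000, 16000, 9000, 2000, 0, 0, 0, 0]
print(f"peak backlog: {peak:,} messages")                 # peak backlog: 30,000 messages
print(f"approx. worst delay: {peak / capacity:.1f} s")    # approx. worst delay: 3.0 s
```

In this made-up trace the average rate is well under capacity, yet a two-second burst leaves messages waiting for roughly three seconds. Whether that delay is acceptable is the design question that aggregate volume cannot answer.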
THE SOLUTION: THINK IN THREE SCALABILITY DIMENSIONS
There are three dimensions to scalability in real-time systems:
1) Burst behavior. Bursts are often so narrow that you don't need to design around them: performance degrades during the burst, but so briefly that it isn't noticeable. A stable real-time system is designed so that the maximum data rate it can process is greater than the average data rate. If a burst is reasonably narrow, response latency increases during the burst while incoming data is buffered; when the burst is over, the backlog is quickly processed and the system resumes its normal performance.
2) Consider conflation. Letting response latency increase during bursts can be the wrong approach for some systems, but designing for spikes 3-5X higher than the average can be expensive. A technique called conflation helps mitigate this issue. Working through events buffered during a burst often means processing events that have already been superseded by more recent ones, so the system spends its overtaxed resources computing results that are obsolete while the most current data ages and perhaps becomes obsolete in turn. Conflation means that new events overwrite unprocessed buffered events, so that only the most current data is processed. The data rate is reduced and obsolete events are never processed. Conflation may slightly increase average response time, but it can dramatically reduce latency during traffic spikes.
3) Computation reduction. Some systems have computational loads so heavy that a new result cannot be produced for each input event. These systems typically produce results at a fixed frequency. This is common when the result is sent to a graphical user interface (GUI). A typical GUI can't update nearly as quickly as many CEP applications can produce results, and even if it could, the updates would overwhelm the people using it. Sending results at a fixed rate solves this problem. If the rate of events entering the CEP system becomes too large, a common strategy is to reduce the computation frequency. When the burst is over, the computation frequency can be returned to its normal rate.
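Conflation and fixed-rate computation can be sketched together in a few lines. The example below is plain Python, not StreamBase code; the ConflatingBuffer class, the run function, and the ticker symbols are hypothetical names chosen for illustration. Each "tick" stands for one fixed publishing interval.

```python
class ConflatingBuffer:
    """Keeps only the latest unprocessed event per key."""

    def __init__(self):
        self._latest = {}  # key -> most recent unprocessed value

    def put(self, key, value):
        # A newer event for the same key replaces the older, unprocessed one.
        self._latest[key] = value

    def drain(self):
        # Hand over everything buffered and start again with an empty buffer.
        pending, self._latest = self._latest, {}
        return pending


def run(ticks, buffer, compute):
    """Produce one batch of results per tick instead of one per event."""
    results = []
    for events in ticks:
        for key, value in events:
            buffer.put(key, value)
        results.append({k: compute(v) for k, v in buffer.drain().items()})
    return results


# Two publishing intervals; the first contains a burst of IBM updates.
ticks = [
    [("IBM", 100.0), ("IBM", 100.5), ("MSFT", 30.0), ("IBM", 101.0)],
    [("MSFT", 30.2)],
]
results = run(ticks, ConflatingBuffer(), compute=lambda price: price)

print(results)
# [{'IBM': 101.0, 'MSFT': 30.0}, {'MSFT': 30.2}]
```

Five events arrive but only three computations run, and the superseded IBM prices are never processed. Lengthening the interval between ticks during a burst is one way to reduce the computation frequency as described in the third point.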
THE IMPLICATION: CORRECT SYSTEMS AT THE RIGHT COST
Planning system capacity in terms of burst rates, data reduction, and computation reduction helps balance correct system behavior against cost, and turns handling a billion messages a day into child's play.