Stream Processing
Stream processing is a key requirement
In big data applications, stream processing is a key requirement. As soon as an application computes something of value, the organization wants that result produced immediately and continuously, unlike batch processing, where computation happens only after a batch of input has been accumulated.
What is stream processing?
Stream processing is the act of continuously incorporating new data to compute a result. In stream processing, the input data is unbounded: it has no predetermined beginning or end. A stream is simply a series of events, for example credit card or UPI transactions, sensor readings from IoT devices, or clicks on a website.
Stream Applications
There are many problem statements and applications that can be designed over data streams. The most common are running-count applications, often surfaced as dashboards. For example, for an e-commerce company, a "live" dashboard with running hourly, daily, and weekly statistics of items sold, items returned, sales per category, and sales per town, city, or geography is of paramount importance.
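As a concrete illustration, here is a minimal Spark Structured Streaming sketch of such a running count. The `orders` Kafka topic, broker address, and field names (`customer_id`, `category`, `city`, `amount`, `event_time`) are assumptions made for this example, not details of any particular system; later sketches in this section reuse the `orders` stream defined here.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("sales-dashboard").getOrCreate()

# Illustrative order schema; all field names are assumptions.
schema = (StructType()
          .add("customer_id", StringType())
          .add("category", StringType())
          .add("city", StringType())
          .add("amount", DoubleType())
          .add("event_time", TimestampType()))

# Unbounded input: every new Kafka record extends the orders stream.
orders = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("o"))
          .select("o.*"))

# Running hourly statistics per category -- the kind of aggregate a
# "live" dashboard would display.
hourly_sales = (orders
                .groupBy(F.window("event_time", "1 hour"), "category")
                .agg(F.count("*").alias("items_sold"),
                     F.sum("amount").alias("revenue")))

query = (hourly_sales.writeStream
         .outputMode("update")   # emit counts as they are updated
         .format("console")
         .option("checkpointLocation", "/tmp/ckpt/dashboard")
         .start())
```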
Streaming and Batch Processing
Although streaming and batch processing sound different, they often work together. The outputs of batch computations usually form additional inputs while processing streams (joins between batch-processed data and input streams). Similarly, the raw streaming data written to data sinks usually becomes the input for subsequent batch processing. It is therefore of utmost importance that application logic is consistent across streaming and batch inputs.
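Both directions of this interplay can be sketched by reusing the `orders` stream from the dashboard example; the `/warehouse/customer_segments` table and the `customer_id` join key are hypothetical batch outputs assumed for illustration.

```python
# Batch output feeding the stream: segments computed by a nightly job.
segments = spark.read.parquet("/warehouse/customer_segments")

# Stream-static join: enrich each live order with batch-derived attributes.
enriched = orders.join(segments, on="customer_id", how="left")

# Stream feeding later batch jobs: archive the raw events to a sink.
archive = (orders.writeStream
           .format("parquet")
           .option("path", "/lake/raw_orders")
           .option("checkpointLocation", "/tmp/ckpt/archive")
           .start())
```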
Examples of Streaming Applications
- Notifications and Alerts
Keep an eye on a continuous stream of events and generate notifications and alerts for events of interest (see the sketch after this list).
- Real Time Reporting and Decision Making
Provide real-time dashboards as well as decision-making heuristics, for example to detect credit card fraud and decline transactions.
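The alerting case reduces to a filter over the stream. A minimal sketch, again reusing the `orders` stream, with an illustrative amount threshold standing in for a real fraud heuristic:

```python
# Flag suspiciously large orders; the threshold is purely illustrative.
alerts = orders.filter(F.col("amount") > 10000)

alert_query = (alerts.writeStream
               .format("console")   # in practice: a notification/paging sink
               .option("checkpointLocation", "/tmp/ckpt/alerts")
               .start())
```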
Advantages of Stream Processing
Low Latency
A stream processing application that keeps its state in memory can respond to real-time events faster than a batch job can.
Efficiency
In some scenarios a streaming application is more efficient than a batch application. For statistical computations that are incremental in nature, such as maintaining a running average, a streaming application can update the result in constant time per event, whereas a batch job must rescan all the accumulated data on every refresh.
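A plain-Python sketch of the difference: the incremental mean does constant work per event, while the batch version rescans everything it has seen each time it refreshes.

```python
# Streaming-style mean: O(1) work and O(1) memory per event.
class RunningMean:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, x: float) -> float:
        self.count += 1
        self.total += x
        return self.total / self.count

# Batch-style mean: every refresh rescans all accumulated values.
def batch_mean(values: list[float]) -> float:
    return sum(values) / len(values)

rm = RunningMean()
for x in [4.0, 8.0, 15.0, 16.0]:
    print(rm.update(x))   # same results as batch_mean, without any rescan
```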
Challenges of Stream Processing
Receiving events out of order
Timestamped events may be received out of order because of network delays, retransmissions, and so on.
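Declarative systems typically address this with watermarks: a bound on how late an event may arrive before it is dropped. A sketch using Structured Streaming's `withWatermark` on the `orders` stream from earlier (the 10-minute bound is an assumption):

```python
# Accept events up to 10 minutes late; anything later is dropped, which
# lets the engine finalize windows and release their state.
late_tolerant = (orders
                 .withWatermark("event_time", "10 minutes")
                 .groupBy(F.window("event_time", "1 hour"), "category")
                 .count())
```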
Maintaining state
Special application logic needs to be developed to maintain state across events (for example, counts per key) and to keep that state safe across failures.
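In a record-at-a-time design that logic is hand-rolled. A minimal sketch of the burden; persistence and recovery, deliberately omitted here, are the hard part:

```python
from collections import defaultdict

# Hand-rolled per-key state: the application must update this on every
# event AND somehow persist it so the counts survive a crash or restart.
counts: dict[str, int] = defaultdict(int)

def on_event(event: dict) -> None:
    counts[event["category"]] += 1
```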
Processing event exactly once
Despite machine failures and retries, it is critical that each event is processed exactly once.
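Structured Streaming's recipe for this, shown as a hedged sketch, combines a replayable source (Kafka offsets), a checkpoint directory recording progress and state, and an idempotent sink; the paths are placeholders. The watermarked `late_tolerant` aggregation from above is used because the file sink requires append mode:

```python
# Replayable source + checkpointed offsets/state + idempotent file sink
# together give end-to-end exactly-once results despite retries.
exactly_once = (late_tolerant.writeStream
                .outputMode("append")   # only finalized windows are written
                .format("parquet")
                .option("path", "/lake/hourly_counts")
                .option("checkpointLocation", "/tmp/ckpt/exactly_once")
                .start())
```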
Integrating with other storage systems
It is critical that stream processing systems can join their input with data stored in other systems, and write results back to them.
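Beyond the stream-static join sketched earlier, results often have to land in an external store. Structured Streaming's `foreachBatch` hands each micro-batch to ordinary batch-writer code; the JDBC URL, table, and credentials below are placeholders:

```python
def write_to_db(batch_df, batch_id):
    # Runs once per micro-batch; any store with a batch connector works.
    (batch_df.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://db:5432/analytics")  # placeholder
     .option("dbtable", "hourly_counts")
     .option("user", "app")
     .option("password", "...")
     .mode("append")
     .save())

db_query = (late_tolerant.writeStream
            .foreachBatch(write_to_db)
            .outputMode("update")
            .option("checkpointLocation", "/tmp/ckpt/db")
            .start())
```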
Design of a stream processing system
Architects face a few key choices and trade-offs when designing stream processing systems.
One record at a time v/s Declarative API
Processing one record at a time is the simplest approach to designing a stream processing system: the framework sends each event to the application and lets it react with custom code. This is most useful when applications want full control over the processing of data. With declarative APIs, applications tell the framework what to compute (in response to each event) but not how to compute it or how to recover from failures. Google Dataflow, Apache Kafka Streams, and Spark Structured Streaming are examples of declarative APIs.
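A plain-Python caricature of the record-at-a-time style; `event_source` is a stand-in for a real consumer loop:

```python
def event_source():
    # Stand-in for an unbounded consumer (e.g. polling a message queue).
    yield {"category": "books"}
    yield {"category": "toys"}
    yield {"category": "books"}

def handle(event: dict, state: dict) -> None:
    # The application owns processing, state, AND failure recovery.
    state[event["category"]] = state.get(event["category"], 0) + 1

state: dict = {}
for event in event_source():
    handle(event, state)
```

A declarative API, by contrast, is told only what to compute; the earlier `groupBy(...).count()` sketches are examples, and the engine decides how to parallelize, maintain state, and recover.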
Event Time v/s Processing Time
For systems supporting a declarative API, another concern is how each event is timestamped. Should the timestamp be assigned at the source (by the event generator) or by the streaming application at the time it processes the event? For some applications, such as IoT sensor readings and financial data, event time is what matters.
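A sketch of the distinction on the `orders` stream: `event_time` came from the producer's payload, while processing time is stamped on arrival (via `current_timestamp`, evaluated per micro-batch):

```python
timed = orders.withColumn("processing_time", F.current_timestamp())

# Windows over event_time reflect when things happened at the source;
# windows over processing_time reflect when this application saw them.
by_event_time = timed.groupBy(F.window("event_time", "1 minute")).count()
```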
Continuous v/s Micro-batch execution
With regard to execution, streaming systems have two choices: perform continuous execution one event at a time, or accumulate a few events into small micro-batches and process them together. Continuous execution systems have low overall latency (especially when the input rate is low) but low maximum throughput, because they incur a significant overhead per event. They are also inflexible under a variable flow of input events, and might therefore require more processing nodes than micro-batch systems.
Micro-batch execution systems have higher throughput and require fewer compute nodes, but at the cost of higher latency.
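In Structured Streaming this choice surfaces as the trigger. A hedged sketch of both modes; continuous mode has been experimental and restricted to map-like queries and certain sources and sinks:

```python
# Micro-batch execution: accumulate events, run a small job every ~5s.
micro = (orders.writeStream
         .format("console")
         .trigger(processingTime="5 seconds")
         .start())

# Continuous execution: per-record processing for millisecond latency;
# the interval below is a checkpoint interval, not a batch size.
cont = (orders.select("category").writeStream
        .format("console")
        .trigger(continuous="1 second")
        .start())
```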