Fast Data or Big Data: What's right for you?
Big data keeps getting bigger, fed by a constant stream of incoming data. In high-volume environments this data arrives at incredible rates, and it still has to be analyzed and stored effectively.
About a decade ago, it was hard to imagine that petabytes of historical and real-time data could be analyzed on commodity hardware. Today, it is commonplace to find huge Hadoop clusters built from thousands of nodes, and open source technologies make it possible for virtualized and commodity hardware to process millions of gigabytes of data affordably.
Fast Data: Its Association with Big Data
In a similar vein, zettabytes of data are arriving at breakneck speed in yet another revolution, termed “fast data.” Today, big data is generated at incredible speeds in the form of financial ticker data, click-stream data, sensor data, or log aggregation. More often than not, these events occur thousands to tens of thousands of times per second, a flow often referred to as the "fire hose."
When fast data is discussed in the context of big data, volume is no longer measured simply in gigabytes, terabytes, or petabytes. Instead, it is measured against time: megabytes per second, gigabytes per hour, or terabytes per day.
Here big data is not just big; it is also fast!
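To make the idea of measuring volume against time concrete, here is a minimal sketch that converts one data rate into another. The function name, the unit tables, and the 50 MB per second feed are illustrative assumptions, and decimal (SI) units are assumed throughout.

```python
# Assumption: decimal (SI) units, so 1 MB = 10**6 bytes and 1 TB = 10**12 bytes.
BYTES_PER_UNIT = {"MB": 10**6, "GB": 10**9, "TB": 10**12}
SECONDS_PER_PERIOD = {"second": 1, "hour": 3600, "day": 86400}


def convert_rate(value, from_unit, from_period, to_unit, to_period):
    """Convert a data rate such as MB/second into another such as TB/day."""
    bytes_per_second = value * BYTES_PER_UNIT[from_unit] / SECONDS_PER_PERIOD[from_period]
    return bytes_per_second * SECONDS_PER_PERIOD[to_period] / BYTES_PER_UNIT[to_unit]


# A hypothetical feed sustaining 50 MB per second.
print(convert_rate(50, "MB", "second", "GB", "hour"))  # 180.0
print(convert_rate(50, "MB", "second", "TB", "day"))   # 4.32
```

A feed that looks modest per second adds up to terabytes per day, which is why the rate, not only the stored total, drives the design.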
Getting Value from Fast Data
The benefits of fast data are lost if fresh, fast-moving data from the fire hose is simply dumped into HDFS, an analytic RDBMS, or flat files. Once stored this way, the data can no longer trigger alerts or actions in real time, and it no longer represents active data or the immediate status of something ongoing. The data warehouse, in contrast, remains a proven way of analyzing historical data and predicting the future.
Taking action on fast data as it arrives has traditionally been considered impractical and costly, if not impossible, especially on commodity hardware. As with big data, the value of fast data is being unlocked by open source streaming systems (such as Kafka and Storm), message queues, and NoSQL and NewSQL offerings.
Fast Data: Ways of Capturing Value
The best way to capture the value of incoming fast data is to react the instant it arrives. Processing incoming data in batches costs time, and with it the value of the data. Handling data that arrives at millions of events per second requires two technologies:
An effective streaming system that’s capable of delivering events as soon as they arrive
A data store that is capable of processing each item at the same speed, as it arrives
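The time lost to batching can be illustrated with a small calculation. The sketch below is a simplified model, not a benchmark: it assumes a batch job that runs at fixed intervals, ignores processing time, and uses made-up arrival times in milliseconds.

```python
def batch_delays_ms(arrival_times_ms, batch_interval_ms):
    """Return how long each event waits before a periodic batch job sees it."""
    delays = []
    for arrival in arrival_times_ms:
        # Ceiling division: index of the first batch run at or after arrival.
        run_index = -(-arrival // batch_interval_ms)
        delays.append(run_index * batch_interval_ms - arrival)
    return delays


def mean(values):
    return sum(values) / len(values)


# Hypothetical arrival times, in milliseconds since the start of the feed.
arrivals = [100, 1200, 2500, 4900, 5000]

batch = batch_delays_ms(arrivals, batch_interval_ms=5000)
streaming = [0] * len(arrivals)  # idealized: each event is handled on arrival

print(batch)            # [4900, 3800, 2500, 100, 0]
print(mean(batch))      # 2260.0
print(mean(streaming))  # 0.0
```

In this model an event waits about half the batch interval on average, so shrinking the interval toward zero is what a streaming system effectively does.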
Delivering Fast Data
Apache Kafka and Apache Storm are popular streaming systems that have made their presence felt in the last few years. Storm, originally created at BackType and open-sourced by Twitter after it acquired the company, reliably processes unbounded data streams at rates of millions of messages per second. Kafka, originally developed by LinkedIn's engineering team, serves as a distributed, high-throughput message queue. Though both systems address the need for fast data processing, Kafka stands apart.
Designed to address the perceived shortcomings of existing technologies, Kafka acts as an über-queue, offering distributed deployment, horizontal scalability, strong persistence, and multitenancy. A single Kafka cluster can often satisfy all of an organization's message queuing needs.
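The persistence and multitenancy described above can be pictured with a toy model. This is not Kafka's API: ToyLog, publish, and poll are hypothetical names for a single-process, in-memory sketch of an append-only log in which messages are kept after they are read and each consumer group tracks its own position.

```python
from collections import defaultdict


class ToyLog:
    """Append-only message log with independent per-group read offsets."""

    def __init__(self):
        self._messages = []               # messages are retained after reads
        self._offsets = defaultdict(int)  # group name -> next index to read

    def publish(self, message):
        self._messages.append(message)
        return len(self._messages) - 1    # offset of the stored message

    def poll(self, group, max_messages=10):
        start = self._offsets[group]
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


log = ToyLog()
for tick in ["AAPL 101.2", "AAPL 101.4", "MSFT 55.0"]:
    log.publish(tick)

print(log.poll("alerts", max_messages=2))  # ['AAPL 101.2', 'AAPL 101.4']
print(log.poll("archiver"))                # all three messages
print(log.poll("alerts"))                  # ['MSFT 55.0']
```

Because reading does not remove anything, several independent applications can consume the same stream at their own pace, which is the property that lets one cluster serve many teams.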
Processing of Fast Data
Traditional relational databases are limited in performance. While some can store large volumes of data at high rates, they often fall short when asked to enrich, validate, or act on data as it is ingested. NoSQL systems, in contrast, embrace clustering and deliver high performance, but they give up some of the safety and power of traditional SQL-based systems. They can satisfy the basic needs of fire hose processing, but they struggle to execute business logic and complex queries per event. In such cases, NewSQL solutions can meet the needs of both performance and transactional complexity.
An effective system for processing the fire hose should:
Offer the scalability and redundancy benefits of native shared-nothing clustering
Lean on in-memory storage and processing to achieve high per-node throughput
Allow processing at the time of ingestion, perform conditional logic, and query gigabytes or more to make informed decisions
Make strong guarantees about operations and keep them isolated from one another
These features allow users to write simpler code and focus on immediate business issues, rather than handle data divergence or concurrency problems. It is best to avoid systems that deliver strong consistency only at the cost of sharply reduced performance.
Regardless of your organizational needs, a smart combination of high-velocity data tools will go a long way toward replacing disparate and more fragile systems. So, get ready to:
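Some of the requirements listed above (in-memory storage, processing at ingestion, conditional logic, and strong guarantees) can be sketched with Python's built-in sqlite3 module. This is only a single-node stand-in for a clustered NewSQL store, and the table names, the ingest function, and the credit-check rule are assumptions made up for illustration.

```python
import sqlite3


def open_store():
    """Create an in-memory database with two illustrative tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE balances (account TEXT PRIMARY KEY, credits INTEGER NOT NULL)")
    conn.execute(
        "CREATE TABLE decisions (event_id INTEGER PRIMARY KEY, account TEXT, approved INTEGER)")
    conn.commit()
    return conn


def ingest(conn, event_id, account, cost):
    """Validate and act on one event; its writes commit or roll back together."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT credits FROM balances WHERE account = ?", (account,)).fetchone()
        approved = row is not None and row[0] >= cost  # conditional logic per event
        if approved:
            conn.execute(
                "UPDATE balances SET credits = credits - ? WHERE account = ?",
                (cost, account))
        conn.execute(
            "INSERT INTO decisions VALUES (?, ?, ?)", (event_id, account, int(approved)))
    return approved


conn = open_store()
conn.execute("INSERT INTO balances VALUES ('acct-1', 100)")
conn.commit()
print(ingest(conn, 1, "acct-1", 30))  # True
print(ingest(conn, 2, "acct-1", 90))  # False (only 70 credits remain)
print(ingest(conn, 3, "unknown", 1))  # False
```

If an event is replayed with an event_id that already exists, the insert fails and the balance update is rolled back with it, so the application code does not need its own logic to repair a half-applied event.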
Enable new services and methods that seemed impossible before
Offer enhanced customer experiences via real-time and personalized interactions
Effectively manage system resources
Enjoy increased visibility and predictability for achieving higher operational quality
All the best!