In a previous blog post titled “Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack” we talked about our process for selecting Cassandra as our data system of choice for supporting our real-time analytics platform.
We’ve been live with Cassandra in production for a couple of months now and shared some of the lessons and best practices for implementing it at the Los Angeles Cassandra Users Group on March 12, 2013. The slides from that presentation are available on SlideShare.
We wanted to expand on what we shared in the presentation itself and share some of our applied knowledge on how to put Cassandra to work in the field of real-time analytics.
Let’s start by helping you understand how an analytics system needs to be built.
Real-time Analytics and the CAP Theorem
For those of you who aren’t familiar with Brewer’s CAP theorem, it stipulates that it is impossible for any distributed computer system to simultaneously provide all three of the following guarantees: consistency, availability, and partition tolerance.
In the real world all distributed systems fall on a gradient with each of these three guarantees, but the kernel of truth is that there are trade-offs. A system with high partition tolerance and availability (like Cassandra) will sacrifice some consistency in order to do it.
When it comes to analytics, there’s a transitive application of the CAP theorem to analytic systems – we call it SCV:
Speed is how quickly you can return an appropriate analytic result from the time it was first observed – a “real-time” system will have an updated analytic result within a relatively short time of an observed event, whereas a non-real-time system might take hours or even days to process all of the observations into an analytic result.
Consistency is how accurate or precise (two different things) the analytic outcome is. A totally consistent result accounts for 100% of observed data with complete accuracy and some tunable degree of precision. A less consistent system might use statistical sampling or approximations to produce a reasonably precise but less accurate result.
Data Volume refers to the total amount of observed events and data that need to be analyzed. It starts to become a factor once the data exceeds what can fit into memory. Massive or rapidly growing data sets have to be analyzed by distributed systems.
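To make the consistency side of SCV concrete, here is a minimal Python sketch comparing an exact aggregate with a sampled one. The session-length data, the function names, and the sample size are all hypothetical and chosen purely for illustration; this is not how MarkedUp computes its metrics.

```python
import random


def exact_mean(values):
    """Fully consistent: every observation contributes to the result."""
    return sum(values) / len(values)


def sampled_mean(values, sample_size, seed=0):
    """Less consistent: estimate the mean from a random sample of observations."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    return sum(sample) / len(sample)


# Hypothetical observations: 100,000 session lengths in seconds.
rng = random.Random(42)
session_lengths = [rng.uniform(10, 600) for _ in range(100_000)]

exact = exact_mean(session_lengths)                        # reads all 100,000 values
approx = sampled_mean(session_lengths, sample_size=1_000)  # reads only 1,000 values
relative_error = abs(approx - exact) / exact

print(f"exact={exact:.1f}s approx={approx:.1f}s error={relative_error:.2%}")
```

The sampled version does about 1% of the work and returns a slightly different answer, which is the kind of consistency you give up in exchange for speed as data volume grows.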
If your working data set is never going to grow beyond 40-50GB over the course of its lifetime, then you can use an RDBMS like SQL Server or MySQL and have 100% consistent analytic results delivered to you in real-time – because your entire working set can fit into memory on a single machine and doesn’t need to be distributed.
Or if you’re building an application like MarkedUp Analytics, which has a rapidly growing data set and unpredictable burst loads, you’re going to need a system that sacrifices some speed or consistency in order to be distributed so it can handle the large volume of raw data.
Think about this trade-off carefully before you go about building a real-time analytics system.
What Data Needs to be Real-time?
“Egads, ALL DATA SHOULD BE ALWAYS REPORTED IN REAL-TIME!” shouted every software developer ever.
Hold your horses! Real-time analytics forces a trade-off against other important factors like accuracy / precision and data size. Therefore, real-time analytics isn’t inherently better for every conceivable use case.
Real-time analysis is important for operational metrics and anything else you or your users need to respond to in real-time:
Error rates or health monitoring;
Dynamic prices, like stock prices or ticket fares;
On-the-spot personalizations and recommendations, like the product recommendations you might see when browsing Netflix or Ebay.
In these scenarios, the exact price or the exact error rate isn’t as important as the rate of change or confidence interval, which can be computed in real-time.
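As a rough illustration of tracking a rate of change rather than an exact value, here is a small Python sketch that keeps a sliding window of recent samples. The `RateOfChange` class, the window size, and the sample error rates are hypothetical and only meant to show the idea.

```python
from collections import deque


class RateOfChange:
    """Tracks how fast a metric changes across a sliding window of samples."""

    def __init__(self, window_size):
        # Only the most recent `window_size` (timestamp, value) pairs are kept.
        self.samples = deque(maxlen=window_size)

    def observe(self, timestamp, value):
        self.samples.append((timestamp, value))

    def rate(self):
        """Change in value per second between the oldest and newest sample."""
        if len(self.samples) < 2:
            return 0.0
        (t0, v0), (t1, v1) = self.samples[0], self.samples[-1]
        if t1 == t0:
            return 0.0
        return (v1 - v0) / (t1 - t0)


# Hypothetical error-rate samples taken every 10 seconds.
monitor = RateOfChange(window_size=3)
for ts, error_rate in [(0, 0.01), (10, 0.01), (20, 0.02), (30, 0.05)]:
    monitor.observe(ts, error_rate)

current_rate = monitor.rate()  # slope over the last three samples
print(f"error rate is changing by {current_rate:.4f} per second")
```

Because only a handful of recent samples are held, the calculation stays cheap no matter how much historical data exists.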
Retrospective or batch analysis is important for product / behavior analysis – these are metrics that tell you how you should or shouldn’t do something, and they are data that you can’t or shouldn’t respond to in real-time.
You don’t want to redesign your product based on fluctuations during day-to-day use – you want to redesign it based on long-term trends over all of your cohorts, and it naturally takes a long time for that data to accrue and be analyzed / studied.
In this type of analysis it’s more important for the data to be comprehensive (large) and accounted for consistently.
Analytic speed is always a trade-off between data volume and consistency, and you have to be conscious of that when you design your system.
The one property you’re never going to want to intentionally sacrifice is data volume – data is a business asset. Data has inherent value. You want to design your analytic systems to consume and retain as much of it as possible.
At MarkedUp we use a blend of both real-time and retrospective analytics:
In this post and the next we’re going to focus on how we use Cassandra for real-time analytics – we use Hive and Hadoop for our retrospective analysis.
Why Cassandra for Real-time Analytics?
Cassandra is an excellent choice for real-time analytic workloads, if you value speed and data volume over consistency (which we do for many of our metrics.)
So what makes Cassandra attractive?
- Cassandra is highly available and distributed; it has high tolerance to individual node failures and makes it possible to add multi-data center support easily if data affinity or sovereignty is an issue. On top of that it’s easy to expand a Cassandra cluster with new nodes if necessary (although this shouldn’t be done frivolously since there is a high cost to rebalancing a cluster.)
- It has amazing write performance; we’ve clocked Cassandra writes taking up to 200µs on average for us, and that’s doubly impressive considering that most of our writes are big, heavily denormalized batch mutations.
- Batch mutations give us the ability to denormalize data heavily and update lots of counters at once – in Cassandra it’s generally a good idea to write your data to make it easy to read back out, even if that means writing it multiple times. Batch mutations make this really easy and inexpensive for our front-end data collection servers.
- Distributed counters were added to Cassandra at Twitter’s insistence, and they’re a major boon to anyone trying to build real-time analytic systems. Most of MarkedUp’s real-time analytics are implemented using counters, which provide a simple, inexpensive, and remarkably consistent mechanism to update metrics and statistics at write time. There are some trade-offs (namely the loss of idempotency) but they make up for it in simplicity and speed.
- Physically sorted columns are one of the Cassandra database implementation details worth learning, because with them you can create easily predictable and pre-sorted slices of data. This makes for really efficient storage of time-series data and other common types of analytic output. When you combine physically sorted columns with dynamic columns and slice predicates you can create lookup systems which retrieve large data sets in constant time.
- Dynamic columns are a Cassandra feature that takes getting used to, but they are enormously powerful for analytic workloads when coupled with sorted columns – they allow you to create flexible, predictable data structures that are easy to read and extend.
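The sketch below is a toy, in-memory Python model of how several of these features fit together: batch mutations, counters, sorted dynamic columns, and slice reads. It does not use a Cassandra driver, and the row key layout, the hour-bucket column names, and every class and function name are assumptions made for illustration rather than MarkedUp’s actual schema.

```python
import bisect
from collections import defaultdict


class CounterRow:
    """Toy model of one wide row: sorted column names, counter values."""

    def __init__(self):
        self.names = []   # column names, always kept in sorted order
        self.counts = {}  # column name -> counter value

    def increment(self, column, delta=1):
        if column not in self.counts:
            bisect.insort(self.names, column)  # dynamic column, inserted in order
            self.counts[column] = 0
        self.counts[column] += delta

    def slice(self, start, finish):
        """Return (column, count) pairs with start <= column <= finish, in order."""
        lo = bisect.bisect_left(self.names, start)
        hi = bisect.bisect_right(self.names, finish)
        return [(name, self.counts[name]) for name in self.names[lo:hi]]


class CounterStore:
    """Toy stand-in for a counter column family: row key -> CounterRow."""

    def __init__(self):
        self.rows = defaultdict(CounterRow)

    def batch_increment(self, mutations):
        """Apply several (row_key, column) increments for one event together."""
        for row_key, column in mutations:
            self.rows[row_key].increment(column)


def record_event(store, app_id, event_name, hour_bucket):
    # Denormalize: write the event into every row we will later read from.
    store.batch_increment([
        (f"{app_id}:events:all", hour_bucket),
        (f"{app_id}:events:{event_name}", hour_bucket),
    ])


store = CounterStore()
record_event(store, "app1", "launch", "2013-03-12T09")
record_event(store, "app1", "launch", "2013-03-12T10")
record_event(store, "app1", "crash", "2013-03-12T10")
record_event(store, "app1", "launch", "2013-03-13T08")

# Time-series read: one contiguous slice of pre-sorted columns for a single day.
march_12 = store.rows["app1:events:all"].slice("2013-03-12T00", "2013-03-12T23")

# Counters are not idempotent: replaying the same event is counted twice.
record_event(store, "app1", "crash", "2013-03-12T10")
crashes = store.rows["app1:events:crash"].slice("2013-03-12T10", "2013-03-12T10")
```

In this model a read for one day is a single range lookup over sorted column names, and the replayed crash event shows why retries against counters need care.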
We’re going to publish a series of posts about working with Cassandra for real-time analytics. Make sure you read part 2 where we go into detail on our read / write strategy with Cassandra for analytics!