In part 1 of this series, we talked about the speed-consistency-volume trade-offs that come with the implementation choices you make in data analytics, and why Cassandra is a great choice for real-time analytics. In this post, we’re going to dive a little deeper into the basics of the Cassandra data model, illustrate it with the help of MarkedUp’s own model, and follow up with a short discussion of our read and write strategies.
Once again, let’s start with the presentation deck from our LA Cassandra User Group meetup, available on SlideShare. Slides 12-18 are relevant for this post.
Cassandra Data Model
The Cassandra data model consists of a keyspace (analogous to a database), column families (analogous to tables in the relational model), keys and columns. Here’s what the basic Cassandra table (also known as a column family) structure looks like:
Figure 1. Structure of a column family in Cassandra
Cassandra’s API also refers to this structure as a map within a map, where the outer map key is the row key and the inner map key is the column key. In reality, a typical Cassandra keyspace for, say, an analytics platform might also contain what’s known as a super column family.
Figure 2. Structure of a super column family in Cassandra
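To make the "map within a map" idea concrete, here is a minimal in-memory sketch using plain Python dictionaries. It is only an analogy for the logical layout, not how Cassandra stores data, and every row key, column key and value in it is made up for illustration. A super column family simply adds one more level of nesting.

```python
# Column family: row key -> {column key -> value}.
# All names and numbers below are hypothetical examples.
daily_app_logs = {
    "App1": {"2013-01-01": 12, "2013-01-02": 7},
    "App2": {"2013-01-01": 3},
}

# Super column family: row key -> {super column key -> {column key -> value}}.
logs_by_level = {
    "App1": {
        "Crash": {"2013-01-01": 2},
        "Error": {"2013-01-01": 10, "2013-01-02": 7},
    },
}


def get_column(column_family, row_key, column_key, default=None):
    """Look up one value: the outer map is keyed by row, the inner map by column."""
    return column_family.get(row_key, {}).get(column_key, default)


def get_sub_column(super_cf, row_key, super_key, column_key, default=None):
    """Same idea for a super column family, with one extra level of nesting."""
    return super_cf.get(row_key, {}).get(super_key, {}).get(column_key, default)


print(get_column(daily_app_logs, "App1", "2013-01-02"))              # 7
print(get_sub_column(logs_by_level, "App1", "Error", "2013-01-01"))  # 10
```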
Evan Weaver’s blog post has a good illustration of the Twitter keyspace as a real-world example.
MarkedUp’s keyspace has column families such as DailyAppLogs (that count the number of times a particular log or event triggered per app) and Logs (that capture information about each log entry). These are also illustrated in figure 1 below.
The DataStax post about modeling a time series with Cassandra is particularly helpful when deciding on a schema design. We index our columns by date.
Note that since we use the RandomPartitioner, where rows are distributed by the MD5 hash of their keys, using dates as column keys helps store data points in sorted order within each row. Other analytics applications might prefer indexing by hours or even minutes if, for example, the exact time of day when app activity peaks needs to be measured and reported. The only drawback would be more data points and therefore more columns per row. With a limit of about 2 billion columns per row in Cassandra, though, it’s almost impossible to exceed the limit. Thus, the fact that Cassandra offers really wide rows leaves us with enough legroom.
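As a rough sketch of the granularity trade-off, the snippet below buckets a timestamp into a day, hour or minute column key and counts how many columns one row would collect per year at each granularity. The key formats and function names are assumptions chosen for illustration (ISO-style strings sort chronologically when compared as text); they are not MarkedUp's actual column key format.

```python
from datetime import datetime

# Hypothetical column key formats; each sorts chronologically as plain text.
KEY_FORMATS = {
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%dT%H",
    "minute": "%Y-%m-%dT%H:%M",
}

# Columns one row accumulates in a 365-day year at each granularity.
COLUMNS_PER_YEAR = {
    "day": 365,
    "hour": 365 * 24,
    "minute": 365 * 24 * 60,
}

MAX_COLUMNS_PER_ROW = 2_000_000_000  # the approximate per-row limit


def column_key(timestamp, granularity="day"):
    """Bucket a timestamp into the column key for the chosen granularity."""
    return timestamp.strftime(KEY_FORMATS[granularity])


def years_until_limit(granularity):
    """Whole years one row could keep growing before reaching the column limit."""
    return MAX_COLUMNS_PER_ROW // COLUMNS_PER_YEAR[granularity]


ts = datetime(2013, 1, 1, 14, 35)
for granularity in ("day", "hour", "minute"):
    print(granularity, column_key(ts, granularity), years_until_limit(granularity))
```

Even at minute granularity a row gains only 525,600 columns a year, so a single row stays far below the limit for a very long time.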
The row key in a Cassandra column family is also the “shard” key, which means that the columns for a particular row key are always stored contiguously and on the same node. If you are worried that some of your shards will grow faster than others, turning the nodes that store them into “hotspots”, you can further shard your rows by means of composite keys. For example, (App1, 1) and (App1, 2) can be two shards for App1.
The counters for all events of a particular type coming from apps using MarkedUp are recorded in the same shard. (“What about hotspots then?”, you might wonder. Well, Cassandra offers semi-automatic load balancing, so we rebalance if a node starts becoming a hotspot. Refer to the Cassandra wiki for more on load balancing.)
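One way to picture composite-key sharding is the sketch below, which assigns each event to one of a fixed number of shards for its app. The helper names, the choice of CRC32 as a stable hash and the shard count are all assumptions for illustration; this is not a scheme MarkedUp uses (as noted, our counters stay in a single shard).

```python
import zlib


def sharded_row_key(app_id, event_id, num_shards=2):
    """Return a composite row key (app_id, shard) for one event.

    CRC32 is used because it yields the same value in every process,
    so a given event always maps to the same shard.
    """
    shard = zlib.crc32(event_id.encode("utf-8")) % num_shards + 1
    return (app_id, shard)


def all_shard_keys(app_id, num_shards=2):
    """List every composite row key a reader must query for one app."""
    return [(app_id, shard) for shard in range(1, num_shards + 1)]


print(all_shard_keys("App1"))  # [('App1', 1), ('App1', 2)]
print(sharded_row_key("App1", "log-42") in all_shard_keys("App1"))  # True
```

The cost of this design is on the read side: a query for App1 has to fetch every shard returned by `all_shard_keys` and merge the results.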
MarkedUp’s Read/Write Strategy
Now that we have a better understanding of the Cassandra data model, let’s look at how we handle writes in MarkedUp. Logs from the Windows 8 apps that use MarkedUp arrive at random times every day. For incoming logs, we leverage the batch_mutate method.
As you might have guessed, a batch_mutate operation groups calls on several keys into a single call. Each incoming log, therefore, triggers updates or inserts in multiple column families, as shown in figure 3. For example, a RuntimeException in AppX on Jan 1, 2013 will update the DailyAppLogs CF under key AppX by incrementing the counter stored in the column whose key corresponds to Jan 1, 2013, and will update the Logs CF by inserting a new row with key LogId.
Figure 3. MarkedUp’s write strategy
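The write path can be sketched in memory as follows: one incoming log is turned into a single batch of mutations that touches both column families. This only imitates the grouping idea behind batch_mutate with Python dictionaries. It does not call Cassandra, and the log fields, the mutation layout and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def build_mutations(log):
    """Group every write triggered by one log entry: column family -> row key -> ops."""
    return {
        "DailyAppLogs": {
            log["app"]: {"increment": {log["date"]: 1}},
        },
        "Logs": {
            log["id"]: {"insert": {"app": log["app"], "message": log["message"]}},
        },
    }


def apply_batch(store, mutations):
    """Apply a whole batch to the in-memory store in one call."""
    for column_family, rows in mutations.items():
        for row_key, ops in rows.items():
            row = store[column_family].setdefault(row_key, {})
            for column_key, delta in ops.get("increment", {}).items():
                row[column_key] = row.get(column_key, 0) + delta  # counter column
            row.update(ops.get("insert", {}))                     # regular columns


store = defaultdict(dict)
log = {"id": "LogId1", "app": "AppX", "date": "2013-01-01",
       "message": "RuntimeException"}
apply_batch(store, build_mutations(log))

print(store["DailyAppLogs"]["AppX"])  # {'2013-01-01': 1}
print(store["Logs"]["LogId1"])        # {'app': 'AppX', 'message': 'RuntimeException'}
```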
MarkedUp’s read strategy leverages Cassandra’s get_slice query, which lets you read just the range of data a query needs, reducing waste (a ‘slice’ is a range of columns within a row). Because the columns of a row are stored contiguously, a query that counts across a wide range of columns needs very little disk I/O. Setting up a get_slice query is as simple as specifying which keyspace and column family you want to use and then setting up the slice predicate by defining which columns within the row you need.
The slice predicate itself can be set up in two ways. You can either specify exactly which columns you need, or you can specify a range of ‘contiguous’ columns using a slice range. Using column keys that can be sorted meaningfully is thus critical.
Figure 4 below illustrates the query “Get all Crash and Error logs for App1 between Date1 and DateN”. A get_slice query with a slice range can easily read the counters as a complete block from the AppLogsByLevel CF because the columns within each row are sorted by date.
Figure 4. MarkedUp’s read strategy
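The effect of a slice range can be imitated in a few lines: given a row whose column keys are sortable dates, find the start and end positions with a binary search and return everything in between as one contiguous block. This is a minimal in-memory sketch, not the Thrift API; the function name, the date format and the sample counters are hypothetical.

```python
from bisect import bisect_left, bisect_right


def slice_columns(row, start, finish):
    """Return (column key, value) pairs with start <= key <= finish, in key order."""
    keys = sorted(row)  # Cassandra keeps columns sorted; we sort here to imitate that
    low = bisect_left(keys, start)
    high = bisect_right(keys, finish)
    return [(key, row[key]) for key in keys[low:high]]


# Hypothetical crash counters for App1, keyed by date.
app1_crashes = {
    "2013-01-01": 4,
    "2013-01-02": 0,
    "2013-01-03": 9,
    "2013-01-05": 2,
}

window = slice_columns(app1_crashes, "2013-01-02", "2013-01-04")
print(window)                                # [('2013-01-02', 0), ('2013-01-03', 9)]
print(sum(count for _, count in window))     # 9
```

Note that the range works even though 2013-01-04 has no column: the slice is defined by key order, which is why sortable column keys matter.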
If you’ve read our previous blog post closely, you might be wondering whether the returned information is even correct, given that Cassandra compromises on consistency in favor of speed and volume (remember the SCV triangle?). Cassandra guarantees what is known as eventual consistency, which means that for a short window after a write is triggered (typically milliseconds), some nodes may still hold the stale value, although every replica will eventually be updated.
Luckily, Cassandra offers tunable consistency levels for queries. So, depending on your appetite for consistent output vis-a-vis speed, you can configure the desired consistency level for each operation by choosing how many replicas must respond (for example ONE, TWO, QUORUM or ALL). MarkedUp uses ONE for writes and TWO for reads, to keep the web front-end as fluid as possible.
In part 3 of this series, we’ll talk about some best practices for working with Cassandra and choosing a schema that fits your needs. Stay tuned for more!
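A handy rule of thumb for reasoning about these levels: a read is guaranteed to touch at least one replica that holds the latest write when the number of replicas acknowledging the write plus the number answering the read exceeds the replication factor. The sketch below applies that rule. The replication factors passed in are assumptions, since this post does not state the one MarkedUp runs with, and the function names are made up for illustration.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "TWO":
        return 2
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError("unsupported consistency level: %s" % level)


def read_overlaps_write(write_level, read_level, replication_factor):
    """True when every read is guaranteed to reach a replica holding the latest write."""
    writes = replicas_required(write_level, replication_factor)
    reads = replicas_required(read_level, replication_factor)
    return writes + reads > replication_factor


# Replication factors here are assumed values, not MarkedUp's configuration.
print(read_overlaps_write("ONE", "TWO", 2))        # True
print(read_overlaps_write("ONE", "TWO", 3))        # False
print(read_overlaps_write("QUORUM", "QUORUM", 3))  # True
```

With ONE for writes and TWO for reads, the overlap holds only for a replication factor of 2; at higher replication factors the combination trades a small chance of a stale read for lower latency.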