Introduction to Parameterized Session Events

One of the most commonly requested features for MarkedUp has been the ability to support parameterized session events, and as of the release of MarkedUp Analytics SDK for .NET 1.0 this feature is now publicly available to our customers.

Parameterized session events help you group and categorize related user events that occur inside your applications; this gives you the ability to maintain a high-level view of how people are using your application while also allowing you to expose more details if needed.

For example: I have a Hacker News Windows Store application that I’ve been working on in my spare time and I want to be able to track how many articles people actually view when they use it.

So, I embedded a standard MK.SessionEvent call inside my WinJS application to track how many people viewed an article:

And this allows me to see the total number of times someone has viewed an article using my app:

Hacker News application with a simple, unparameterized event

But what if I want more information – what if I want to know which articles users are viewing? Parameterized session events can help us with that.

I’m going to modify the call to MK.SessionEvent and include a new parameter: “ArticleUrl.”

“ArticleUrl” is the name of my parameter, and the value of this is the URL of whatever article a user is viewing. In C# / XAML you can just pass in an IDictionary<string, string> object for all of your parameters and values, but in WinJS you need to use a Windows.Foundation.Collections.PropertySet collection.

Once I’ve run the app with this custom parameter included, I can drill down into the ViewArticle event and see a new parameter:

Hacker News app with an ArticleUrl parameter for the ViewArticle session event

And if I click on the “ArticleUrl” parameter, guess what we see next? A list of all of the URLs for each event!

hacker-news-viewarticle-articleurl-parameter-values

With that simple change, our ViewArticle event becomes much more informative and valuable – we can even use the date picker to see which articles were most popular amongst our userbase during each month or week!
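To make the drill-down concrete, here is a minimal Python sketch of how parameterized events could be tallied on the reporting side. It is not MarkedUp's implementation: the EventLog class, its method names, and the example URLs are hypothetical, and only the ViewArticle event name and the ArticleUrl parameter come from the example above.

```python
from collections import Counter, defaultdict


class EventLog:
    """Hypothetical in-memory tally of session events and their parameters."""

    def __init__(self):
        # event name -> number of times the event fired
        self.totals = Counter()
        # event name -> parameter name -> parameter value -> count
        self.params = defaultdict(lambda: defaultdict(Counter))

    def record(self, event, parameters=None):
        self.totals[event] += 1
        for name, value in (parameters or {}).items():
            self.params[event][name][value] += 1


log = EventLog()
log.record("ViewArticle", {"ArticleUrl": "http://example.com/a"})
log.record("ViewArticle", {"ArticleUrl": "http://example.com/a"})
log.record("ViewArticle", {"ArticleUrl": "http://example.com/b"})

# High-level view: total number of ViewArticle events.
print(log.totals["ViewArticle"])
# Drill-down: how often each ArticleUrl value was seen.
print(log.params["ViewArticle"]["ArticleUrl"].most_common())
```

The event total stays available at the top level, and the per-parameter counts are only consulted when you drill down.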

Advanced Example: Distinct Session Event Parameters

Let’s take our usage of parameterized session events a step further. One of the features of my Hacker News application is the ability to cache articles to the local filesystem in order to conserve bandwidth. I’m considering extending this capability to enable the app to run entirely offline – so my question is: how often do people really look at cached articles?

Parameterized session events can help us answer this question.

I’m going to move my MK.SessionEvent call out of viewItem.js and into my caching layer – that way I can tell if the article I am serving is cached or not. Here’s what my code looks like for that:

I have two different methods that are called by my app when we need to view an article – one that checks if the content is cached and serves it if it is, and another that downloads the content from the Internet.

Whenever we serve the content from cache, I pass in a ps[“Cached”] = true parameter, and whenever we don’t, I pass in ps[“NotCached”] = true.

So after running the app a few times, it looks like most users don’t view cached articles very often:

Hacker News WinJS app - cached vs. uncached articles

I’m doing it this way so I can see directly in the ViewArticle view how many articles are served from cache versus those that aren’t, without having to drill down any further.

But the choice is yours – parameterized session events are designed to be flexible and allow you to use them however you wish.

Make sure you check out our tutorial on using parameterized events with MarkedUp Analytics, and please let us know if you have feedback!

Using Cassandra for Real-time Analytics: Part 2

In part 1 of this series, we talked about the speed-consistency-volume trade-offs that come along with implementation choices you make in data analytics and why Cassandra is a great choice for real-time analytics. In this post, we’re going to dive a little deeper into the basics of the Cassandra data model and illustrate them with the help of MarkedUp’s own model, followed by a short discussion of our read and write strategies.

Once again, let’s start with the presentation deck from our LA Cassandra Users Group meetup, which is available on SlideShare. Slides 12-18 are relevant for this post.

Cassandra Data Model

The Cassandra data model consists of a keyspace (analogous to a database), column families (analogous to tables in the relational model), keys and columns. Here’s what the basic Cassandra table (also known as a column family) structure looks like:

Cassandra Column Family structure

Figure 1. Structure of a column family in Cassandra

Cassandra’s API also refers to this structure as a map within a map, where the outer map key is the row key and the inner map key is the column key. In reality, a typical Cassandra keyspace for, say, an analytics platform, might also contain what’s known as a super column family.

 

Cassandra Super column family structure

Figure 2. Structure of a super column family in Cassandra
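The "map within a map" idea can be sketched with plain Python dictionaries. This is only an illustration of the shape of the data, not how Cassandra stores it: the DailyAppLogs name comes from this post, while the super_cf name, the column names, and all of the values are made up.

```python
# A column family modeled as a map within a map:
#   row key -> column key -> value
daily_app_logs = {
    "App1": {"2013-01-01": 12, "2013-01-02": 7},
    "App2": {"2013-01-01": 3},
}

# A super column family adds one more level of nesting:
#   row key -> super column key -> column key -> value
# The names and values here are invented for illustration.
super_cf = {
    "App1": {
        "2013-01-01": {"Crash": 2, "Error": 10},
        "2013-01-02": {"Crash": 1, "Error": 6},
    },
}

# Looking up a value walks the maps from the outside in.
print(daily_app_logs["App1"]["2013-01-02"])
print(super_cf["App1"]["2013-01-01"]["Crash"])
```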

 

Evan Weaver’s blog post has a good illustration of the Twitter keyspace as a real-world example.

MarkedUp’s keyspace has column families such as DailyAppLogs (that count the number of times a particular log or event triggered per app) and Logs (that capture information about each log entry). These are also illustrated in figure 1 below.

The Datastax post about modeling a time series with Cassandra is particularly helpful in deciding upon the schema design. We index our columns on the basis of dates.

Note that since we use the RandomPartitioner, where rows are ordered by the MD5 hash of their keys, using dates as column keys helps store data points in sorted order within each row. Other analytics applications might prefer indexing by hours or even minutes if, for example, the exact time of day when app activity peaks needs to be measured and reported. The only drawback would be more data points and more columns in the keyspace. With a limit of about 2 billion columns per row in Cassandra, though, it’s almost impossible to exceed the limit. Thus, the fact that Cassandra offers really wide rows leaves us with enough legroom.

The row key in a Cassandra column family is also the “shard” key, which implies that columns for a particular row key are always stored contiguously and on the same node. If you are worried that some of your shards will keep growing at a faster rate than others, resulting in “hotspot” nodes that store those shards, you can further shard your rows by means of composite keys. E.g., (App1, 1) and (App1, 2) can be two shards for App1.
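One way such composite keys could be produced is sketched below. This is an assumption-laden illustration rather than MarkedUp's scheme: the shard count, the item identifier, and the choice of CRC32 to pick a shard are all hypothetical.

```python
import zlib

NUM_SHARDS = 2  # assumed number of shards per app


def shard_key(app_id, item_id, num_shards=NUM_SHARDS):
    """Return a composite row key such as ("App1", 1) for a given item."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    shard = zlib.crc32(item_id.encode("utf-8")) % num_shards + 1
    return (app_id, shard)


def all_shard_keys(app_id, num_shards=NUM_SHARDS):
    """Every row key that must be read to reassemble one app's data."""
    return [(app_id, n) for n in range(1, num_shards + 1)]


print(shard_key("App1", "log-42"))
print(all_shard_keys("App1"))
```

The trade-off is that writes spread across several rows, while a read for the whole app has to fetch every shard and merge the results.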

The counters for all events of a particular type coming from apps using MarkedUp are recorded in the same shard. (“What about hotspots then?”, you might wonder! Well, Cassandra offers semi-automatic load balancing, so we load balance if a node starts becoming a hotspot. Refer to the Cassandra wiki for more on load balancing.)

MarkedUp’s Read/Write Strategy

Now that we have a better understanding of the Cassandra data model, let’s look at how we handle writes in MarkedUp. Logs from the Windows 8 apps that use MarkedUp arrive randomly on a daily basis. For incoming logs, we leverage the batch_mutate method.

As you might have guessed, a batch_mutate operation groups calls on several keys into a single call. Each incoming log, therefore, triggers updates or inserts in multiple column families, as shown in figure 3. For example, a RuntimeException in AppX on Jan 1, 2013 will update the DailyAppLogs CF with key AppX by incrementing the counter stored in the column key corresponding to Jan 1, 2013, as well as the Logs CF by inserting a new key LogId.

MarkedUp write strategy

Figure 3. MarkedUp’s write strategy
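The write path described above can be sketched in Python with in-memory stand-ins for the two column families. This is not a Cassandra client and does not use the real batch_mutate signature: build_mutations, apply_batch, and the log fields are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# In-memory stand-ins for two column families (not a Cassandra client).
daily_app_logs = defaultdict(lambda: defaultdict(int))  # app -> date -> counter
logs = {}                                               # log id -> log columns


def build_mutations(log):
    """Turn one incoming log into the list of changes it triggers."""
    return [
        ("increment", "DailyAppLogs", log["app"], log["date"], 1),
        ("insert", "Logs", log["id"], None, dict(log)),
    ]


def apply_batch(mutations):
    """Apply all mutations for one log in a single call."""
    for op, cf, row_key, column_key, value in mutations:
        if op == "increment" and cf == "DailyAppLogs":
            daily_app_logs[row_key][column_key] += value
        elif op == "insert" and cf == "Logs":
            logs[row_key] = value
        else:
            raise ValueError(f"unsupported mutation: {op} on {cf}")


incoming = {"id": "log-1", "app": "AppX", "date": "2013-01-01",
            "type": "RuntimeException"}
apply_batch(build_mutations(incoming))

print(daily_app_logs["AppX"]["2013-01-01"])
print(logs["log-1"]["type"])
```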

 

MarkedUp’s read strategy leverages Cassandra’s get_slice query, which allows you to read a wide range of data focused on the intended query, reducing waste (A ‘slice’ indicates a range of columns within a row). A query to count a wide range of columns can be performed in minimal disk I/O operations. Setting up a get_slice query is as simple as specifying which keyspace and column family you want to use and then setting up the slice predicate by defining which columns within the row you need.

The slice predicate itself can be set up in two ways. You can either specify exactly which columns you need, or you can specify a range of ‘contiguous’ columns using a slice range. Using column keys that can be sorted meaningfully is thus critical.

Figure 4 below illustrates the query “Get all Crash and Error logs for App1 between Date1 and DateN”. The get_slice_range query can easily read the counters as a complete block from the AppLogsByLevel CF because the CF is sorted by dates.

MarkedUp read strategy

Figure 4. MarkedUp’s read strategy
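Why sorted column keys matter for a slice range can be shown with a small Python sketch. It models one row as a sorted list and uses binary search to find the block of columns to return; the dates and counts are invented, and slice_range is a hypothetical helper, not a Cassandra call.

```python
from bisect import bisect_left, bisect_right

# One row of counters; column keys are ISO dates, kept in sorted order.
row = [
    ("2013-01-01", 4),
    ("2013-01-02", 9),
    ("2013-01-03", 1),
    ("2013-01-04", 6),
    ("2013-01-05", 2),
]
column_keys = [key for key, _ in row]


def slice_range(start, finish):
    """Return the contiguous block of columns with start <= key <= finish."""
    lo = bisect_left(column_keys, start)
    hi = bisect_right(column_keys, finish)
    return row[lo:hi]


selected = slice_range("2013-01-02", "2013-01-04")
total = sum(count for _, count in selected)
print(selected)
print(total)
```

Because the keys are already sorted, the range is located without scanning the whole row, and the result comes back as one contiguous block.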

 

If you’ve read our previous blog post closely, you might be wondering if the returned information is even correct, given the fact that Cassandra compromises on consistency in favor of speed and volume (remember the SCV triangle?). Cassandra guarantees what is known as eventual consistency, which means that at some given point (milliseconds away from the triggering of the write operation), some nodes may still have the stale value, although by the end of the operation, every node will have been updated.

Luckily, Cassandra offers tunable consistency levels for queries. So, depending on your appetite for consistent output vis-a-vis speed, you can configure the desired consistency level by choosing different levels of “quorum”. MarkedUp uses ONE for writes and TWO for reads, to keep the web front-end as fluid as possible.
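A common rule of thumb for tunable consistency is that a read is only guaranteed to see the latest write when the number of replicas written plus the number of replicas read exceeds the replication factor. The Python sketch below applies that rule; the replication factor of 3 is an assumption for illustration, and overlaps is a hypothetical helper.

```python
def overlaps(replication_factor, write_replicas, read_replicas):
    """True when every read is guaranteed to touch a replica that saw the write."""
    return write_replicas + read_replicas > replication_factor


# Assumed replication factor of 3; ONE = 1 replica, TWO = 2 replicas.
print(overlaps(3, write_replicas=1, read_replicas=2))
# Reading and writing at 2 of 3 replicas would overlap.
print(overlaps(3, write_replicas=2, read_replicas=2))
```

Under that assumption, ONE for writes and TWO for reads favors speed: reads are usually fresh, but freshness is not guaranteed on every request.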

In part 3 of this series, we’ll talk about some best practices of working with Cassandra and choosing a schema that fits your needs. Stay tuned for more!

Using Cassandra for Real-time Analytics: Part 1

In a previous blog post titled “Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack” we talked about our process for selecting Cassandra as our data system of choice for supporting our real-time analytics platform.

We’ve been live with Cassandra in production for a couple of months now and shared some of the lessons and best practices for implementing it at the Los Angeles Cassandra Users Group on March 12, 2013. You can see the presentation on slideshare if you’d like to view our slides.

We wanted to expand on what we shared in the presentation itself and share some of our applied knowledge on how to put Cassandra to work in the field of real-time analytics.

Let’s start by helping you understand how an analytics system needs to be built.

Real-time Analytics and the CAP Theorem

For those of you who aren’t familiar with Brewer’s CAP theorem, it stipulates that it is impossible for any distributed computer system to simultaneously provide all three of the following guarantees:

  1. Consistency;

  2. Availability; and

  3. Partition tolerance.

In the real world, all distributed systems fall on a gradient with each of these three guarantees, but the kernel of truth is that there are trade offs. A system with high partition tolerance and availability (like Cassandra) will sacrifice some consistency in order to do it.

When it comes to analytics, there’s a transitive application of the CAP theorem to analytic systems – we call it SCV:

analytics cap theorem

  1. Speed is how quickly you can return an appropriate analytic result from the time it was first observed – a “real-time” system will have an updated analytic result within a relatively short time of an observed event, whereas a non-real-time system might take hours or even days to process all of the observations into an analytic result.

  2. Consistency is how accurate or precise (two different things) the analytic outcome is. A totally consistent result accounts for 100% of observed data with complete accuracy and some tunable degree of precision. A less consistent system might use statistical sampling or approximations to produce a reasonably precise but less accurate result.

  3. Data Volume refers to the total amount of observed events and data that need to be analyzed. This starts to become a factor once the data exceeds what can fit into memory. Massive or rapidly growing data sets have to be analyzed by distributed systems.

If your working data set is never going to grow beyond 40-50GB over the course of its lifetime, then you can use an RDBMS like SQL Server or MySQL and have 100% consistent analytic results delivered to you in real-time – because your entire working set can fit into memory on a single machine and doesn’t need to be distributed.

Or if you’re building an application like MarkedUp Analytics, which has a rapidly growing data set and unpredictable burst loads, you’re going to need a system that sacrifices some speed or consistency in order to be distributed so it can handle the large volume of raw data.

Think about this trade off carefully before you go about building a real-time analytics system.

What Data Needs to be Real-time?

“Egads, ALL DATA SHOULD BE ALWAYS REPORTED IN REAL-TIME!” shouted every software developer ever.

Hold your horses! Real-time analytics forces a trade off between other important factors like accuracy / precision and data size. Therefore, real-time analytics isn’t inherently superior or better for every conceivable use case.

Real-time analysis is important for operational metrics and anything else you or your users need to respond to in real-time:

  • Error rates or health monitoring;

  • Dynamic prices, like stock prices or ticket fares;

  • On-the-spot personalizations and recommendations, like the product recommendations you might see when browsing Netflix or Ebay.

In these scenarios, the exact price or the exact error rate isn’t as important as the rate of change or confidence interval, which can be computed in real-time.

Retrospective or batch analysis is important for product / behavior analysis – these are metrics that tell you how you should or shouldn’t do something, and they are data that you can’t or shouldn’t respond to in real-time.

You don’t want to redesign your product based on fluctuations during day-to-day use – you want to redesign it based on long-term trends over all of your cohorts, and it naturally takes a long time for that data to accrue and be analyzed / studied.

In this type of analysis it’s more important for the data to be comprehensive (large) and accounted for consistently.

Analytic speed is always a trade-off between data volume and consistency – you have to be conscious of that when you design your system.

The one property you’re never going to want to intentionally sacrifice is data volume – data is a business asset. Data has inherent value. You want to design your analytic systems to consume and retain as much of it as possible.

At MarkedUp we use a blend of both real-time and retrospective analytics:

markedup analytics blend

In this post and the next we’re going to focus on how we use Cassandra for real-time analytics – we use Hive and Hadoop for our retrospective analysis.

Why Cassandra for Real-time Analytics?

Cassandra is an excellent choice for real-time analytic workloads, if you value speed and data volume over consistency (which we do for many of our metrics.)

So what makes Cassandra attractive?

    • Cassandra is highly available and distributed; it has high tolerance to individual node failures and makes it possible to add multi-data center support easily if data affinity or sovereignty is an issue. On top of that it’s easy to expand a Cassandra cluster with new nodes if necessary (although this shouldn’t be done frivolously since there is a high cost to rebalancing a cluster.)
    • It has amazing write performance; we’ve clocked Cassandra writes taking around 200µs on average for us, and that’s doubly impressive considering that most of our writes are big, heavily denormalized batch mutations.
    • Batch mutations give us the ability to denormalize data heavily and update lots of counters at once – in Cassandra it’s generally a good idea to write your data to make it easy to read back out, even if that means writing it multiple times. Batch mutations make this really easy and inexpensive for our front-end data collection servers.
    • Distributed counters were added to Cassandra at Twitter’s insistence, and they’re a major boon to anyone trying to build real-time analytic systems. Most of MarkedUp’s real time analytics are implemented using counters – they provide a simple, inexpensive, and remarkably consistent mechanism to update metrics and statistics at write time. There are some trade offs (namely the loss of idempotency) but they make up for it in simplicity and speed.
    • Physically sorted columns are one of the Cassandra database implementation details worth learning, because with them you can create easily predictable and pre-sorted slices of data. This makes for really efficient storage of time-series data and other common types of analytic output. When you combine physically sorted columns with dynamic columns and slice predicates you can create lookup systems which retrieve large data sets in constant time.
    • Dynamic columns are a Cassandra feature that takes getting used to, but they are enormously powerful for analytic workloads when coupled with sorted columns – they allow you to create flexible, predictable data structures that are easy to read and extend.
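The loss of idempotency mentioned in the counters bullet can be illustrated with a short Python sketch. It uses plain dictionaries rather than Cassandra, and the two helper functions are hypothetical; the point is only what happens when a client retries a write that had already succeeded.

```python
def set_column(row, column, value):
    """Idempotent write: applying it twice leaves the same state."""
    row[column] = value


def increment_counter(row, column, delta=1):
    """Counter write: every application changes the state."""
    row[column] = row.get(column, 0) + delta


regular_row, counter_row = {}, {}

# Simulate a client that retries after a timeout, so each write lands twice.
for _ in range(2):
    set_column(regular_row, "2013-01-01", 5)
    increment_counter(counter_row, "2013-01-01", 5)

print(regular_row["2013-01-01"])  # still 5
print(counter_row["2013-01-01"])  # 10: the retry was counted twice
```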

We’re going to publish a series of posts about working with Cassandra for real-time analytics. Make sure you read part 2 where we go into detail on our read / write strategy with Cassandra for analytics!

MetroAppSite: Free, Open Source Metro-Style Website Templates for Your Windows Store Apps

Getting customers to notice and discover your Windows Store apps is hard, but you can reach users who aren’t inside the Windows Store using simple websites designed to promote your apps.

In addition, if your Windows Store app requires access to the Internet you are required by Windows Store policy to publish and link to a privacy policy hosted online (section 4.1.1.)

We decided to make life a little easier for Windows Store developers and built MetroAppSite – a fully responsive Metro-style website that uses Twitter Bootstrap and other standard frameworks to help developers promote their Windows Store apps.

And like most of our customers, we’re a .NET shop, so we built an ASP.NET MVC4 version of MetroAppSite too!

Features

Here are some of the great features that you get with MetroAppSite:

Metro theming and branding

Give your promotional website the same Metro look-and-feel that your users experience when they download your app from the Windows Store.

We even include a Microsoft Surface screenshot carousel for you to use to show off your Windows Store app’s look-and-feel.

metro-branding-metroappsite

MetroAppSite uses BootMetro and Twitter Bootstrap to give Windows Store developers an easy-to-modify, brandable template they can use to their own ends.

Fully responsive and touch/mobile-friendly

MetroAppSite’s CSS and design is fully responsive and touch-optimized out of the box. It looks great in full-sized web browsers and on mobile devices too!

metroappsite-mobile

Integrates seamlessly with third party services like Google Analytics and UserVoice

Unfortunately there isn’t a MarkedUp Analytics for websites yet, but in the meantime we made it dead-simple to integrate MetroAppSite with Google Analytics so you can measure your pageviews and visitors.

uservoice-logo

Additionally, we added hooks to integrate UserVoice directly into your app’s site so you can collect feedback and support tickets from users easily and seamlessly. UserVoice is what we used for our customer support at MarkedUp and we’ve had a great experience with it!

Templated privacy policy in order to make it easy for you to satisfy Windows Store certification requirements

Writing privacy policies can be a pain, so we made it easy for you to generate a privacy policy for your app using PrivacyChoice.org. You can paste these right into MetroAppSite and meet Windows Store certification requirements easily and thoroughly.

Demo Sites

We created some simple MetroAppSite deployments for you so you can see what they look like in production:

Download

MetroAppSite is licensed under the Apache 2.0 license and is free for you to use in commercial or non-commercial projects.

Contribution

We happily accept pull requests via GitHub.

Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack

When we first made MarkedUp Analytics available on an invite-only basis back in September we had no idea how quickly the service would be adopted. By the time we completely opened MarkedUp to the public in December, our business was going gangbusters.

But we ran into a massive problem by the end of November: it was clear that RavenDB, our chosen database while we were prototyping our service, wasn’t going to be able to keep growing with us.

So we had to find an alternative database and data analysis system, quickly!

The Nature of Analytic Data

The first place we started was by thinking about our data, now that we were moving out of the “validation” and into the “scaling” phase of our business.

Analytics is a weird business when it comes to read / write characteristics and data access patterns.

In most CRUD applications, mobile apps, and e-commerce software you tend to see read / write characteristics like this:

Read and Write characteristics in a traditional application

This isn’t a controversial opinion – it’s just a fact of how most networked applications work. Data is read far more often than it’s written.

That’s why all relational databases and most document databases are optimized to cache frequently read items into memory – because that’s how the data is used in the vast majority of use cases.

In analytics though, the relationship is inverted:

analytics-readwrite-charactertistics

By the time a MarkedUp customer views a report on our dashboard, that data has been written to anywhere from 1,000 to 10,000,000 times since they viewed their report last. In analytics, data is written multiple orders of magnitude more frequently than it’s read.

So what implications does this have for our choice of database?

Database Criteria

Looking back to what went wrong with RavenDB, we determined that it was fundamentally flawed in the following ways:

  • Raven’s indexing system is very expensive on disk, which makes it difficult to scale vertically – even on SSDs Raven’s indexing system would keep indexes stale by as much as three or four days;
  • Raven’s map/reduce system requires re-aggregation once it’s written by our data collection API, which works great at low volumes but scales at an inverted ratio to data growth – the more people using us, the worse the performance gets for everyone;
  • Raven’s sharding system is really more of a hack at the client level which marries your network topology to your data, which is a really bad design choice – it literally appends the ID of your server to all document identifiers;
  • Raven’s sharding system actually makes read performance on indices orders of magnitude worse (has to hit every server in the cluster on every request to an index) and doesn’t alleviate any issues with writing to indexes – no benefit there;
  • Raven’s map/reduce pipeline was too simplistic, which stopped us from being able to do some more in-depth queries that we wanted; and
  • We had to figure out everything related to RavenDB on our own – we even had to write our own backup software and our own index-building tool for RavenDB; there’s very little in the way of a RavenDB ecosystem.

So based on all of this, we decided that our next database system needed to be capable of:

  1. Integrating with Hadoop and the Hadoop ecosystem, so we could get more powerful map/reduce capabilities;
  2. “Linear” hardware scale – make it easy for us to increase our service’s capacity with better / more hardware;
  3. Aggregate-on-write – eliminate the need to constantly iterate over our data set;
  4. Utilizing higher I/O – it’s difficult to get RavenDB to move any of its I/O to memory, hence why it’s so hard on disk;
  5. Fast setup time – need to be able to move quickly;
  6. Great ecosystem support – we don’t want to be the biggest company using whatever database we pick next.

The Candidates

Based on all of the above criteria, we narrowed down the field of contenders to the following:

  1. MongoDB
  2. Riak
  3. HBase
  4. Cassandra

Evaluation Process

The biggest factor to consider in our migration was time to deployment – how quickly could we move off of Raven and restore a high quality of service for our customers?

We tested this in two phases:

  1. Learning curve of the database – how long would it take us to set up an actual cluster and a basic test schema?
  2. Acceptance test – how quickly could we recreate a median-difficulty query on any of these systems?

So we did this in phases, as a team – first up was HBase.

HBase

HBase was highly recommended to us by some of our friends on the analytics team at Hulu, so this was first on our list. HBase has a lot of attractive features and satisfied most of our technical requirements, save the most important one – time to deployment.

The fundamental problem with HBase is that cluster setup is difficult, particularly if you don’t have much JVM experience (we didn’t.) It also has a single point of failure (edit: turns out this hasn’t been an issue since 0.9x,) is a memory hog, and has a lot of moving parts.

That being said, HBase is a workhorse – it’s capable of handling immensely large workloads. Ultimately we decided that it was overkill for us at this stage in our company and the setup overhead was too expensive. We’ll likely revisit HBase at some point in the future though.

Riak

One of our advisors is a heavy Riak user, so we decided it was worth exploring. Riak, on the surface, is a very impressive database – it’s heinously easy to set up a cluster and the HTTP REST API made it possible for us to test it using only curl.

After getting an initial 4-node cluster setup and writing a couple of “hello world” applications, we decided that it was time to move onto phase 2: see how long it would take to port a real portion of our analytics engine over to Riak.

I decided to use Node.JS for this since there are great Node drivers for both Raven and Riak and it was frankly a lot less work than C#. I should point out that CorrugatedIron is a decent C# driver for Riak though.

So, it took me about 6 hours to write the script to migrate a decent-sized data set into Riak – just enough to simulate a real query for a single MarkedUp app.

Once we had the data stuffed into our Riak cluster I wrote a simple map/reduce query using JavaScript and ran it – took 90 seconds to run a basic count query. Yeesh. And this map/reduce query even used key filtering and all of the other m/r best practices for Riak.

Turns out that atrocious map/reduce performance with the JavaScript VM is a well-known issue in Riak.

So, I tried a query using the embedded Erlang console using only standard modules – 50 seconds.

Given the poor map/reduce performance and the fact that we’d all have to learn Erlang, Riak was out. Riak is a pretty impressive technology and it’s easy to set up, but not good for our use case as is.

MongoDB

I’ve used MongoDB in production before and had good experiences with it. Mongo’s collections / document system is nearly identical to RavenDB, which gave it a massive leg up in terms of migration speed.

On top of that, Mongo has well-supported integration with Hadoop and its own aggregation framework.

Things were looking good for Mongo – I was able to use Node.JS to replicate the same query I used to test Riak and used the aggregation framework to get identical results within 3 hours of starting.

However, the issue with MongoDB was that it required us to re-aggregate all of our data regularly and introduced a lot of operational complexity for us. At small scale, it worked great, but under a live load it would be very difficult to manage Mongo’s performance, especially when adding new features to our analytics engine.

We didn’t write Mongo off, but we decided to take a look at Cassandra first before we made our decision.

Cassandra

We started studying Cassandra more closely when we were trying to determine if Basho had any future plans for Riak which included support for distributed counters.

Cassandra really impressed us from the get-go – it would require a lot more schema / data modeling than Riak or MongoDB, but its support for dynamic columns and distributed counters solved a major problem for us: being able to aggregate most statistics as they’re written, rather than aggregating them with map/reduce afterwards.
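The difference between aggregating statistics as they are written and aggregating them afterwards can be sketched in a few lines of Python. The events and the aggregate_afterwards helper are invented for illustration; this shows the idea only, not Cassandra counters or a real map/reduce job.

```python
from collections import Counter

events = [
    ("AppX", "2013-01-01"),
    ("AppX", "2013-01-01"),
    ("AppX", "2013-01-02"),
    ("AppY", "2013-01-01"),
]

# Aggregate-on-write: bump a counter as each event arrives.
counters = Counter()
raw_events = []
for app, date in events:
    raw_events.append((app, date))   # keep the raw event for later analysis
    counters[(app, date)] += 1       # the statistic is ready immediately


def aggregate_afterwards(stored):
    """Aggregate-on-read: rescan every stored event to rebuild the counts."""
    result = Counter()
    for app, date in stored:
        result[(app, date)] += 1
    return result


print(counters[("AppX", "2013-01-01")])
print(aggregate_afterwards(raw_events) == counters)
```

Both approaches arrive at the same counts; the first pays a small cost on every write, while the second has to iterate over the whole data set each time the numbers are needed.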

On top of that, Cassandra’s slice predicate system gave us a constant-time lookup speed for reading time-series data back into all of our charts.

But Cassandra didn’t have all of the answers – we still needed map/reduce for some queries (ones that can’t or shouldn’t be done with counters) and we also needed the ability to traverse the entire data set.

Enter DataStax Enterprise Edition – a professional Cassandra distribution which includes Hive, Hadoop, Solr, and OpsCenter for managing backups and cluster health. It eliminated a ton of setup overhead and complexity for us and dramatically shortened our timeline to going live.

Evaluating Long-Term Performance

Cassandra had MongoDB edged out on features, but we still needed to get a feel for Cassandra’s performance. eBay uses Cassandra for managing time-series data that is similar to ours (mobile device diagnostics) to the tune of 500 million events a day, so we were feeling optimistic.

Our performance assessment was a little unorthodox – after we had designed our schema for Cassandra we wrote a small C# driver using FluentCassandra and replayed a 100GB slice of our production data set (restored from backup on a new RavenDB XL4 EC2 machine with 16 cores, 64GB of RAM, and SSD storage) to the Cassandra cluster; this simulated four months’ worth of production data written to Cassandra in… a little under 24 hours.

We used DataStax OpsCenter to graph the CPU, Memory, I/O, and latency over all four of our writeable nodes over the entire migration. We set our write consistency to 1, which is what we use in production.

Here are some interesting benchmarks – all of our Cassandra servers are EC2 Large Ubuntu 12.04 LTS machines:

  1. During peak load, our cluster completed 422 write requests per second – all of these operations were large batch mutations with hundreds of rows / columns at once. We weren’t bottlenecked by Cassandra though – we were bottlenecked by our read speed pulling data out of RavenDB.
  2. Cassandra achieved a max CPU utilization of 5%, with an average utilization of less than 1%.
  3. The amount of RAM consumed remained pretty much constant regardless of load, which tells me that our memory requirements never exceeded the pre-allocated buffer on any individual node (although we’ve spiked it since during large Hive jobs.)
  4. Cassandra replicated the contents of our 100GB RavenDB data set 3 times (replication factor of 3 is the standard) and our schema denormalized it heavily – despite both of those factors (which should contribute to data growth) Cassandra actually compressed our data set down to a slim 30GB, roughly a tenth of the 300GB that three uncompressed copies would occupy! This is due to the fact that RavenDB saves its data as tokenized JSON documents, whereas everything is stored as byte arrays in Cassandra (layman’s terms.)
  5. Maximum write latency for Cassandra was 70731µs per operation with an average write latency of 731µs. Under normal loads the average write latency is around 200µs.
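The storage and latency figures in this list can be put on a common footing with a little arithmetic. The Python sketch below assumes the 30GB figure covers all three replicas, which the list does not state explicitly, so treat the ratios as illustrative.

```python
source_gb = 100          # RavenDB data set replayed into Cassandra
replication_factor = 3   # copies kept by the cluster
stored_gb = 30           # size reported after the migration

# Naive expectation if the data were simply copied three times.
naive_gb = source_gb * replication_factor

saved_vs_source = (source_gb - stored_gb) / source_gb
saved_vs_naive = (naive_gb - stored_gb) / naive_gb
reduction_factor = naive_gb / stored_gb

print(f"{saved_vs_source:.0%} smaller than the source data set")
print(f"{saved_vs_naive:.0%} smaller than three uncompressed copies")
print(f"{reduction_factor:.0f}x reduction versus three uncompressed copies")

# Latency figures from the list, converted from microseconds to milliseconds.
for label, micros in [("max", 70731), ("average", 731), ("normal load", 200)]:
    print(f"{label}: {micros / 1000:.3f} ms")
```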

Our performance testing tools ran out of gas long before Cassandra did. Based on our ongoing monitoring of Cassandra we’ve observed that our cluster is operating at less than 2% capacity under our production load. We’ll see how that changes once we start driving up the number of Hive queries we run on any given day.

We never bothered running this test with MongoDB – Cassandra already had a leg up feature-set wise and the performance improvements were so remarkably good that we just decided to move forward with a full migration shortly after reviewing the results.

Hive and Hadoop

The last major piece of our stack is our map/reduce engine, which is powered by Hive and Hadoop.

Hadoop is notoriously slow, but that’s ok. We don’t serve live queries with it – we batch data periodically and use Hive to re-insert it back into Cassandra.

Hive is our tool of choice for most queries, because it’s an abstraction that feels intuitive to our entire team (lots of SQL experience) and is easy to extend and test on the fly. We’ve found it easy to tune and it integrates well with the rest of DataStax Enterprise Edition.

Conclusion

It’s important to think carefully about your data and your technology choices, and sometimes it can be difficult to do that in a data vacuum. Cassandra, Hive, and Hadoop ended up being the right tools for us at this stage, but we only arrived at that conclusion after actually doing live acceptance tests and performance tests.

Your mileage may vary, but feel free to ask us questions in the comments!

Microsoft Surface Adoption Worldwide

We made MarkedUp Analytics privately available to some Windows 8 developers in September, and thus we’ve had a chance to watch the Windows 8 ecosystem grow since well prior to its official 10/26 launch.

As many of you may have read this past week, Windows 8 sold over 40,000,000 licenses in its first month since release. That’s huge!

However, what about the Surface RT tablet Microsoft released on the same day? How well has it sold since?

MarkedUp Analytics was installed into some of the biggest apps in the Windows Store a month prior to the launch of Microsoft Surface; that puts us in a good position to use our data to make some educated inferences as to how well the Surface has really fared in the device marketplace.

Surface and the Windows 8 OEM Landscape

Before we jump into the specifics of Microsoft Surface, let’s consider the Windows 8 OEM ecosystem.

Since 9/28, MarkedUp has observed 307 distinct PC device manufacturers in our global data set for Windows 8 apps.

OEMs like HP, Dell, and Samsung still have a significant presence in the Windows 8 market, and the majority of it comes from devices that have been upgraded from Windows 7 and XP.

These traditional PC manufacturers also had a small, but statistically significant head-start over Microsoft in terms of total market share, because developers and big enterprises have had early access to the full version of Windows 8 since 8/15.

Windows 8 Market Share by OEM

This chart represents total market share by OEM across all devices that have used an app with MarkedUp installed in it from 10/26 through 11/24/2012, spanning roughly one month since Windows 8 and Microsoft Surface officially launched.

According to our data set, Microsoft has only one device in market – the Surface RT tablet. Our data set showed that Microsoft had statistically 0.0% market share prior to 10/26*, the day Surface and Windows 8 officially went on sale.

Microsoft’s 7.77% market share on this chart is represented solely by the adoption of the Surface RT tablet, making Microsoft the 4th most popular OEM among Windows 8 users currently.

This number is also reflected in our analysis across all Windows 8 device models, rather than manufacturers:

Microsoft Surface Total Adoption v11-24-2012

MarkedUp has observed 11,385 distinct Windows 8 device models as of 11/24, and most of them are upgraded Windows 7 / Windows XP devices.

Microsoft Surface is by far the single most-used Windows 8 device from this cornucopia of hardware, occupying roughly 7.76% of the market.

The next most-used device model is the Samsung Sens Series laptop, like the Series 9 ultrathin notebook, with 3.31% market share, less than half of what the Surface RT has.

So with all of this market share data in mind, what’s the adoption rate for Microsoft Surface thus far?

Microsoft Surface Adoption Rate

So how quickly has the Surface RT tablet been adopted worldwide?

Well, we don’t have the absolute numbers since MarkedUp doesn’t have 100% market penetration across every unique Windows 8 device (working on it!) but we do have more than enough data to draw some inferences about the rate at which Surface RT tablets are being adopted.

The following chart shows the cumulative growth of the Surface RT’s installation base:

Microsoft Surface Daily Adoption v11-24-2012

As we mention in the callout on this chart, we decided that the best way to plot the growth of the Surface was to create an index and plot all of the cumulative growth relative to the index.

We set the index value 1 to be equal to the number of Surface RT tablets we saw activated on 10/26, the day it first went on sale. The final value on this chart has an index value of 120 for 11/24/2012, 29 days after the Surface went on sale initially – meaning that there were 120 times as many Surfaces activated by 11/24 as there were on 10/26.

So if Microsoft sold 10,000 Surfaces on day 1, then by the rate of growth on this chart they will have sold at least 1,200,000 units by 11/24.

Remember, this chart shows active devices that are being used and have consumed apps from the Windows Store, not devices that have been sold. The numbers on MarkedUp’s charts are effectively a floor for sales given that devices are sold before they’re used.
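
To make the index concrete, here is a minimal JavaScript sketch of how a cumulative index like this could be computed and then applied to a day-1 sales figure. The daily counts are made up for illustration, and the function names are ours rather than anything from MarkedUp’s reporting pipeline.

```javascript
// Cumulative activation index relative to launch day (illustrative data only).
function cumulativeIndex(dailyActivations) {
  const base = dailyActivations[0]; // activations on launch day map to index 1
  if (!(base > 0)) throw new Error("launch-day count must be positive");
  let running = 0;
  return dailyActivations.map((count) => {
    running += count; // cumulative devices seen so far
    return running / base;
  });
}

// Scale a hypothetical day-1 unit count by the index to get a floor estimate.
function projectedFloor(dayOneUnits, indexValue) {
  return dayOneUnits * indexValue;
}

const sampleDailyCounts = [100, 150, 250]; // made-up daily activation counts
console.log(cumulativeIndex(sampleDailyCounts)); // [ 1, 2.5, 5 ]
console.log(projectedFloor(10000, 120)); // 1200000
```

The second call reproduces the example above: 10,000 units on day 1 at an index of 120 gives a floor of 1,200,000 units.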

Microsoft Surface Adoption by Country

So we’ve shown you how quickly Surface RT tablets are being activated, but what about where they’re being activated?

Microsoft Surface Usage by Country v11-30-2012

MarkedUp has observed active Surface RT activity from users in 70 countries on 6 continents thus far, so the Surface appears to be making inroads on Microsoft’s promise of broad international distribution for Windows 8 and Windows Store app developers.

In the chart above we broke out the percentage of Surface RT distribution by country including the 10 largest markets; the subsequent 60 markets all trail off quickly.

The United States has an overwhelming 68.52% share of all Surface RT tablets activated thus far with the UK coming in at a distant second with 9.10% share.

Our numbers across all Windows 8 devices are slightly different, but the US and UK both have dominant leads in those figures too.

One factor that may skew MarkedUp’s numbers towards the English-speaking world is that many app publishers forgo full international distribution in the Windows Store, because targeting certain markets, including China and other countries with tighter content restriction laws, lengthens the Windows Store approval process and can even cause the app to be rejected outright.

So on that note, we strongly suspect that China in particular is under-represented on this chart given that it’s a massive market, but one that is more difficult for many app publishers to reach due to content restrictions.

Conclusions

Based on the data above, here is what we conclude:

  • The Microsoft Surface is the most heavily used ARM device in market for Windows 8 by a wide margin thus far and it is the single most-used device overall for Windows 8;
  • Surface’s growth appears to be strong, but it’s difficult to extrapolate the absolute number of units that have been sold without knowing what the total day 1 sales were;
  • Surface RT is being adopted in primarily English-speaking countries, but has broad international reach; and
  • The majority of devices in market for Windows 8 are upgrades from previous versions of Windows, not new devices that came with Windows 8 installed; we’ll see how this changes as we collect more data from the Holiday season. The fact that the Samsung Sens Series made a strong appearance on our device model breakout shows signs of a growing ecosystem of net new Windows 8 machines from non-Microsoft OEMs.

Thanks for reading! If you’re a Windows 8 developer and would like access to the beta of MarkedUp Analytics for Windows 8, click here!

Appendix

Here are some other interesting statistics from our OEM data set:

  • The remaining 24.48% OEM market share not shown on the OEM chart represents 296 long-tail, smaller OEMs including VMWare virtual machines and a number of motherboard manufacturers used in home-made PCs.
  • There are three different device architectures that Windows 8 supports: ARM, x86, and x64. Surface is the only major ARM device in market thus far, although there are more ARM (RT) tablets on the way. In our public Windows 8 launch data set, we’ve observed the following trend consistently since the Windows 8 launch on 10/26:
    1. x64, 64-bit Intel hardware, is used by roughly 70% of the daily active users for the entire Windows 8 ecosystem every day;
    2. x86, 32-bit Intel hardware, is used by roughly 20%; and
    3. ARM, the new architecture for lightweight tablets like the Surface, is used by the remaining 10% of daily active users.

*MarkedUp observed some Microsoft Surface RT devices appear as early as 10/18 in our data set, but not enough to be statistically significant. We suspect that they were preview devices given to select app partners, press, and others with early access.

How-to: Tracking Page Navigation in Your WinRT or WinJS Applications

Today MarkedUp released a new report type which shows how often users view all of the distinct pages inside your Windows Store applications on Windows 8; so how do you actually capture all of this page navigation information in your app using MarkedUp?

Easy! You can use a single method to automatically track all of the page navigation information inside of your C# / C++ / HTML5 Windows Store apps: RegisterNavigationFrame.

The RegisterNavigationFrame method allows MarkedUp to automatically detect how users navigate between pages inside your application and eliminates the need for developers to have to write their own handler code and call the PageEnter event manually.
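
To illustrate the idea (not the SDK itself), here is a small self-contained JavaScript sketch of the pattern: subscribe to a frame’s navigation notifications once, and every later page change is recorded without per-page code. The `createFrame`, `createTracker`, and `registerFrame` names are hypothetical stand-ins; they are not part of MarkedUp or WinJS, and the real method’s signature may differ.

```javascript
// Hypothetical stand-ins: nothing below is part of the MarkedUp SDK or WinJS.

// A minimal navigation frame that notifies listeners on each page change.
function createFrame() {
  const listeners = [];
  return {
    onNavigated(listener) { listeners.push(listener); },
    navigate(pageName) { listeners.forEach((fn) => fn(pageName)); },
  };
}

// A minimal tracker that records PageEnter events in memory.
function createTracker() {
  const events = [];
  return {
    events,
    pageEnter(pageName) { events.push({ type: "PageEnter", page: pageName }); },
  };
}

// Register once; every later navigation is tracked automatically.
function registerFrame(frame, tracker) {
  frame.onNavigated((pageName) => tracker.pageEnter(pageName));
}

const frame = createFrame();
const tracker = createTracker();
registerFrame(frame, tracker);
frame.navigate("home");
frame.navigate("article");
console.log(tracker.events.map((e) => e.page)); // [ 'home', 'article' ]
```

Without the single registration call, each page would have to report its own PageEnter event by hand, which is the handler code the method is meant to replace.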

Calling RegisterNavigationFrame in C# / XAML

If you’re using C# and XAML to develop your Windows Store applications, you’ll want to call MarkedUp.AnalyticClient.RegisterNavigationFrame inside the Application.OnLaunched method, as shown below:

Calling registerNavigationFrame in HTML5 / JavaScript

If you’re using HTML5 and JavaScript to develop your Windows Store applications, you’ll want to call MK.registerNavigationFrame inside the WinJS app.onactivated method, as shown below:

Pretty simple! If you add this single line of code to your Windows Store applications you’ll see data light up on your application page views report and on others that we’re actively developing right now!

Let us know in the comments if you have any questions.

New Feature: Viewing Application Page Views in MarkedUp

Since MarkedUp opened up its public beta in late September, our number one most requested feature has been to enable developers to view all of the page navigation data we capture in the RegisterNavigationFrame method. We’ve captured this data all along, but haven’t had a report for viewing it!

Specifically, this method captures all page navigation events that occur inside an application – whenever a user transitions from one page to another, we automatically capture a PageEnter event and push the results back to MarkedUp.

Now you can see the results using our “application page views” report under the engagement tab on MarkedUp!

Application Page Views report, under the engagement tab in MarkedUp

After you drill into the report, you’ll see a breakdown of your monthly traffic to each of the different pages in your app.

Here’s what the chart looks like for a standard WinRT app built using C# and XAML:

Application Page Views for WinRT application using MarkedUp

And here’s the same chart for WinJS:

Application Page Views for WinJS application using MarkedUp

If you click on any of the page names, you’ll have a chance to see a breakout chart with daily total page views and average views per-session:

Application Page views drill down

We listened to your feedback and prioritized this report ahead of the others that we’re always working on. If you have any other product suggestions for MarkedUp, submit them here!

5 Key Themes from Microsoft on the Future of Windows and WinRT from the //BUILD Keynote

Build Windows 8

This week I’m attending the //BUILD conference in Redmond, WA on Microsoft’s main campus alongside thousands of other .NET / Windows developers. The keynote ended about an hour ago and I wanted to publish my thoughts on some of the important takeaways from Ballmer’s talk.

Microsoft’s Points of Emphasis

1. “Microsoft can only win by training consumers to expect consistent behavior, availability, and synchronized data across all of their different devices”

WinRT isn’t just about tablets – it’s also about fundamentally changing the way desktop software is consumed and unifying mobile / desktop / tablet and probably console apps all under one consolidated platform.

The unification of these platforms is the future of Microsoft; training consumers to expect consistent behavior and access to data across all of their devices is the only way Microsoft will be able to dethrone Apple and Google in mobile / tablet and protect themselves in desktop / console in the long-run.

Ultimately, Microsoft is really the only company that can execute well on native software, services, and devices. They are playing to their strengths (ecosystem and platform) and are doing it well here.

2. “The fate of WinRT is in the hands of developers big and small.”

Microsoft desperately needs developers to make WinRT a success.

Microsoft, for the first time since Win32 emerged as the victor in the desktop wars of old, is in a position where it needs developers more than they need Microsoft.

The unified vision behind WinRT will not work without the buy-in of developers both big and small, from Facebook to the individual hobbyist developer.

Microsoft will do the hard work of putting devices in the hands of consumers that bring WinRT applications to the forefront (which I suspect is the real reason why the ARM-only Surface shipped so far ahead of the Intel one.) But it is totally reliant on developers to put the content in-store that consumers actually want to use.

3. “Microsoft and Nokia will work themselves to death to win the support of developers.”

Compounding key takeaway #2, Microsoft and Nokia both made commitments to put hardware (Nokia phones, Surfaces for us //BUILD attendees) into the hands of developers who build apps.

Having worked at Microsoft Developer Platform Evangelism throughout the entire WP7 push, I can tell you that this is no joke – Microsoft will find a way to arm its developers with hardware now that it’s all generally available.

But they’re not stopping there – Microsoft is going to continue to push training events, hackathons, webcasts, and everything it can possibly do to train developers and make it easier than ever to learn a new platform and actually ship an app on it.

I think this is tremendously positive and every developer who’s interested in the platform will have multiple opportunities to learn it on Microsoft’s dime.

4. “Don’t ship apps that don’t leverage the platform.”

Reading between the lines in some of the keynote speeches and the first couple of sessions I poked my nose in, you can interpret the following from Microsoft:

Developers who carbon copy their work byte-by-byte from previous platforms, including web apps, are doing themselves a disservice and will have their lunch eaten by developers who take advantage of charms, live tiles, and all of the other built-in features unique to Windows 8. Please take advantage of the platform if you’re going to build an app for it!

This echoes the same general theme I wrote about in an earlier post decrying Windows 8 developers shooting themselves in the foot with respect to Windows Store economics: Windows 8 is different so treat it differently than iOS / Android / Web.

Speaking more broadly, most consumers have never interacted with Metro much, aside from perhaps the Xbox launch screen.

If Metro and WinRT are going to take off with consumers then it will be due to the highly differentiated hardware and software capabilities of the platform, not price point or any other factors.

If developers don’t take advantage of those differentiated OS capabilities, then it limits the overall differentiation of Windows 8 from everything else in market, including prior versions of Windows.

5. “Windows Phone 8 is just as important to the success of Microsoft as Windows 8.”

Extending theme #1 a little bit further… Windows Phone 8 was heavily, heavily emphasized over Windows 8 itself during Ballmer’s talk… We’d heard hardly a peep about it prior to its official launch yesterday.

Here’s why: the success of Microsoft’s entire consumer software ecosystem rides on consumers adopting the Metro UI and getting used to the rest of Microsoft’s services ecosystem, which includes your apps in the Windows Store.

Windows 8 will move hundreds of millions of units regardless, due to the inertia of Microsoft’s desktop / laptop business alone. The rise of Microsoft Surface devices we’ve seen on our Windows 8 launch tracker also makes the future for WinRT tablets look a lot more promising than it was a week ago.

However, without a significant presence in mobile, Microsoft and Windows will always have the threat of a unified iOS / OS X ecosystem there to sweep the desktop market out from underneath it. Windows Phone 8 is not a side show – it’s part of the core front Microsoft is forming against Apple on consumer computing.

Parting Thoughts

This is a really exciting time to be a Windows developer. The opportunities for developers to build sustainable businesses around Windows 8 and Windows Phone 8 apps are huge and there for the taking.

On top of that, the ecosystem has never been more accessible – you can build native apps for Windows 8 and Windows Phone 8 with C# or C++, and I suspect we’ll eventually see WinJS apps make their way onto Windows Phone 8 too.

That’s why our team at MarkedUp is excited to be doing what we’re doing :)

Windows 8 Launch Tracker: Follow the Adoption of Windows 8 as it Happens

Have you been reading the news about Windows 8 lately?

It’s October 26th – Windows 8 is finally here, as is the Microsoft Surface WinRT tablet! And as the links above show, consumers and developers alike are really, really excited.

MarkedUp has been diligently helping Windows 8 developers measure how their users consume their apps since September, and so we have a unique opportunity to use our growing data set to help curious onlookers and technology enthusiasts track the adoption of Windows 8 as it happens.

So, without further ado, allow us to introduce you to our Windows 8 Launch Tracker!

Windows 8 Launch Tracker - Powered by MarkedUp Analytics

We’re going to update the statistics daily and help developers track how quickly Windows 8 is picked up by the community at large, using our entire data set.

Sampling Methodology

Our methodology for sampling the data displayed in the charts is straightforward: we take a seven-day rolling average of all active users and new installations detected from an app and calculate the rate of change between them.

There are some other things we do to try to prevent outliers from spiking the graph (i.e. apps that acquire a large number of users rapidly, usually popular titles ported from other platforms) but generally it’s all just rates of change against a moving average of new devices activated and daily active users.
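
As a rough illustration of this approach, the JavaScript sketch below computes a seven-day rolling average over a series of daily counts and then the day-over-day rate of change of the smoothed series. It assumes that “rate of change” means the relative change between consecutive averaged values, leaves out the outlier handling mentioned above, and uses made-up numbers; the function names are ours.

```javascript
// Trailing rolling average over a fixed window (default: seven days).
function rollingAverage(values, windowSize = 7) {
  const averages = [];
  let sum = 0;
  for (let i = 0; i < values.length; i++) {
    sum += values[i];
    if (i >= windowSize) sum -= values[i - windowSize]; // drop the day leaving the window
    if (i >= windowSize - 1) averages.push(sum / windowSize);
  }
  return averages;
}

// Relative change between consecutive points of a series.
function rateOfChange(series) {
  const rates = [];
  for (let i = 1; i < series.length; i++) {
    rates.push((series[i] - series[i - 1]) / series[i - 1]);
  }
  return rates;
}

const dailyActiveUsers = [10, 10, 10, 10, 10, 10, 10, 17, 24]; // made-up counts
const smoothed = rollingAverage(dailyActiveUsers);
console.log(smoothed); // [ 10, 11, 13 ]
console.log(rateOfChange(smoothed));
```

Smoothing first means a single unusually busy day moves the plotted rate far less than it would if raw daily counts were compared directly.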

You’ll notice a big surge on the 19th – that’s due to a trend that started on the 15th of October where the Windows Store approved nearly 20% of the current apps that are in market now (roughly 5000 apps in market), which subsequently led to a big surge in our numbers.

If the Windows Store goes through another sustained round of high-volume approvals, that will similarly spike our numbers again.

We’re working on refining our methodology for the “Windows 8 by chipset architecture” graph at the bottom since we expect it to change radically with the availability of new ARM devices.

We’re going to email out more detailed trends and analysis on the growth of the Windows 8 ecosystem; if you want more detailed reports on trends with Windows 8, sign up for our newsletter!