Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack

When we first made MarkedUp Analytics available on an invite-only basis back in September, we had no idea how quickly the service would be adopted. By the time we fully opened MarkedUp to the public in December, our business was going gangbusters.

But we ran into a massive problem by the end of November: it was clear that RavenDB, our chosen database while we were prototyping our service, wasn’t going to be able to keep growing with us.

So we had to find an alternative database and data analysis system, quickly!

The Nature of Analytic Data

We started by thinking about our data, now that we were moving out of the “validation” phase and into the “scaling” phase of our business.

Analytics is a weird business when it comes to read / write characteristics and data access patterns.

In most CRUD applications, mobile apps, and e-commerce software you tend to see read / write characteristics like this:

Read and Write characteristics in a traditional application

This isn’t a controversial opinion – it’s just a fact of how most networked applications work. Data is read far more often than it’s written.

That’s why all relational databases and most document databases are optimized to cache frequently read items into memory – because that’s how the data is used in the vast majority of use cases.

In analytics though, the relationship is inverted:

Read and Write characteristics in an analytics application

By the time a MarkedUp customer views a report on our dashboard, that data has been written to anywhere from 1,000 to 10,000,000 times since they viewed their report last. In analytics, data is written multiple orders of magnitude more frequently than it’s read.

So what implications does this have for our choice of database?

Database Criteria

Looking back to what went wrong with RavenDB, we determined that it was fundamentally flawed in the following ways:

  • Raven’s indexing system is very expensive on disk, which makes it difficult to scale vertically – even on SSDs Raven’s indexing system would keep indexes stale by as much as three or four days;
  • Raven’s map/reduce system requires re-aggregation once it’s written by our data collection API, which works great at low volumes but scales at an inverted ratio to data growth – the more people using us, the worse the performance gets for everyone;
  • Raven’s sharding system is really more of a hack at the client level which marries your network topology to your data, which is a really bad design choice – it literally appends the ID of your server to all document identifiers;
  • Raven’s sharding system actually makes read performance on indices orders of magnitude worse (has to hit every server in the cluster on every request to an index) and doesn’t alleviate any issues with writing to indexes – no benefit there;
  • Raven’s map/reduce pipeline was too simplistic, which stopped us from being able to do some more in-depth queries that we wanted; and
  • We had to figure out everything related to RavenDB on our own – we even had to write our own backup software and our own indexing-building tool for RavenDB; there’s very little in the way of a RavenDB ecosystem.

So based on all of this, we decided that our next database system needed to be capable of:

  1. Integrating with Hadoop and the Hadoop ecosystem, so we could get more powerful map/reduce capabilities;
  2. “Linear” hardware scale – make it easy for us to increase our service’s capacity with better / more hardware;
  3. Aggregate-on-write – eliminate the need to constantly iterate over our data set;
  4. Better I/O utilization – serve more of its I/O from memory; it’s difficult to get RavenDB to move any of its I/O off of disk, which is why it’s so hard on disk;
  5. Fast setup time – need to be able to move quickly;
  6. Great ecosystem support – we don’t want to be the biggest company using whatever database we pick next.
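
Criterion #3 is worth unpacking, since it drove most of what follows. The difference between aggregating on read and aggregating on write can be sketched in a few lines of JavaScript (an in-memory toy to show the shape of the idea, not our actual pipeline):

```javascript
// Aggregate-on-read: every report scans the raw events – effectively
// what constant re-aggregation forced on us at scale.
function countOnRead(events, appId) {
  return events.filter((e) => e.appId === appId).length; // O(total events)
}

// Aggregate-on-write: bump a counter as each event arrives, so serving
// a report is a single O(1) lookup. Cassandra's distributed counters do
// this server-side; this in-memory version just illustrates the shape.
const counters = new Map();
function recordEvent(event) {
  counters.set(event.appId, (counters.get(event.appId) || 0) + 1);
}
function countOnWrite(appId) {
  return counters.get(appId) || 0;
}

// Simulated write stream
const events = [{ appId: "app-1" }, { appId: "app-1" }, { appId: "app-2" }];
events.forEach(recordEvent);
console.log(countOnRead(events, "app-1"), countOnWrite("app-1")); // 2 2
```

The write path does a tiny bit of extra work per event, but the read path never has to iterate over the data set again – which is exactly the trade you want when writes outnumber reads by orders of magnitude.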

The Candidates

Based on all of the above criteria, we narrowed down the field of contenders to the following:

  1. MongoDB
  2. Riak
  3. HBase
  4. Cassandra

Evaluation Process

The biggest factor to consider in our migration was time to deployment – how quickly could we move off of Raven and restore a high quality of service for our customers?

We tested this in two phases:

  1. Learning curve of the database – how long would it take us to set up an actual cluster and a basic test schema?
  2. Acceptance test – how quickly could we recreate a median-difficulty query on any of these systems?

So we did this in phases, as a team – first up was HBase.

HBase

HBase was highly recommended to us by some of our friends on the analytics team at Hulu, so this was first on our list. HBase has a lot of attractive features and satisfied most of our technical requirements, save the most important one – time to deployment.

The fundamental problem with HBase is that cluster setup is difficult, particularly if you don’t have much JVM experience (we didn’t). It also has a single point of failure (edit: it turns out this hasn’t been an issue since 0.9x), is a memory hog, and has a lot of moving parts.

That being said, HBase is a workhorse – it’s capable of handling immensely large workloads. Ultimately we decided that it was overkill for us at this stage in our company and the setup overhead was too expensive. We’ll likely revisit HBase at some point in the future though.

Riak

One of our advisors is a heavy Riak user, so we decided it was worth exploring. On the surface, Riak is a very impressive database – it’s ridiculously easy to set up a cluster, and the HTTP REST API made it possible for us to test it using only curl.

After getting an initial 4-node cluster set up and writing a couple of “hello world” applications, we decided it was time to move on to phase 2: seeing how long it would take to port a real portion of our analytics engine over to Riak.

I decided to use Node.JS for this since there are great Node drivers for both Raven and Riak, and it was frankly a lot less work than C#. I should point out that CorrugatedIron is a decent C# driver for Riak, though.

So, it took me about 6 hours to write the script to migrate a decent-sized data set into Riak – just enough to simulate a real query for a single MarkedUp app.

Once we had the data stuffed into our Riak cluster, I wrote a simple map/reduce query in JavaScript and ran it – it took 90 seconds to complete a basic count query. Yeesh. And this map/reduce query even used key filtering and all of the other m/r best practices for Riak.
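
For the curious, the job looked roughly like this – a JSON payload POSTed to Riak’s /mapred endpoint. This is a hypothetical reconstruction; the bucket name and key prefix are illustrative, not our real schema:

```javascript
// Hypothetical reconstruction of the Riak map/reduce count job
// (bucket name and key prefix are illustrative).
const job = {
  inputs: {
    bucket: "sessions",
    // Key filtering narrows the input set before the map phase runs.
    key_filters: [["starts_with", "app-1234:"]],
  },
  query: [
    // Map: emit a 1 for every matching object.
    { map: { language: "javascript", source: "function(v) { return [1]; }" } },
    // Reduce: sum the 1s with Riak's built-in reduce function.
    { reduce: { language: "javascript", name: "Riak.reduceSum" } },
  ],
};
console.log(job.query.length); // 2 phases, POSTed to /mapred
```

Even with key filtering trimming the input set, every surviving object still passes through the JavaScript VM in the map phase, which is presumably where most of those 90 seconds went.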

It turns out that poor map/reduce performance under the JavaScript VM is a well-known issue in Riak.

So I tried the same query from the embedded Erlang console using only standard modules – 50 seconds.

Given the poor map/reduce performance and the fact that we’d all have to learn Erlang, Riak was out. Riak is a pretty impressive technology and it’s easy to set up, but not good for our use case as is.

MongoDB

I’ve used MongoDB in production before and had good experiences with it. Mongo’s collections / documents model is nearly identical to RavenDB’s, which gave it a massive leg up in terms of migration speed.

On top of that, Mongo has well-supported integration with Hadoop and its own aggregation framework.

Things were looking good for Mongo – I was able to use Node.JS to replicate the same query I used to test Riak and used the aggregation framework to get identical results within 3 hours of starting.
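
The pipeline I used was along these lines (collection and field names are illustrative – a sketch of the shape, not the exact query):

```javascript
// Hypothetical shape of the aggregation pipeline used to replicate the
// Riak test query (field names are illustrative, not our real schema).
const pipeline = [
  { $match: { appId: "app-1234" } },              // filter to one app
  { $group: { _id: "$day", hits: { $sum: 1 } } }, // count events per day
  { $sort: { _id: 1 } },                          // chronological order
];

// In the mongo shell this would run as something like:
//   db.sessions.aggregate(pipeline)
console.log(pipeline.map((stage) => Object.keys(stage)[0]).join(" -> "));
// $match -> $group -> $sort
```

This is a much friendlier developer experience than hand-written map/reduce – but note that the `$group` stage still iterates over every matched document each time the query runs, which is the re-aggregation cost discussed below.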

However, the issue with MongoDB was that it required us to re-aggregate all of our data regularly and introduced a lot of operational complexity for us. At small scale, it worked great, but under a live load it would be very difficult to manage Mongo’s performance, especially when adding new features to our analytics engine.

We didn’t write Mongo off, but we decided to take a look at Cassandra first before we made our decision.

Cassandra

We started studying Cassandra more closely when we were trying to determine whether Basho had any future plans for Riak that included support for distributed counters.

Cassandra really impressed us from the get-go – it would require a lot more schema / data modeling than Riak or MongoDB, but its support for dynamic columns and distributed counters solved a major problem for us: being able to aggregate most statistics as they’re written, rather than aggregating them with map/reduce afterwards.

On top of that, Cassandra’s slice predicate system gave us a constant-time lookup speed for reading time-series data back into all of our charts.
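
Conceptually, the model behind that constant-time lookup is the wide row: one row per app and period, one column per timestamp, with columns stored sorted by name – so a slice predicate becomes a contiguous range read rather than a scan. A small in-memory sketch of the idea (the row key and column layout here are illustrative):

```javascript
// In-memory sketch of a Cassandra-style wide row: column names (here,
// ISO timestamps) are kept sorted, so reading a time range is a
// contiguous slice. Schema is illustrative, not our production layout.
const row = new Map(); // columnName (ISO timestamp) -> value

function insertColumn(ts, value) {
  row.set(ts, value);
}

// Equivalent in spirit to a SlicePredicate with start/finish columns.
function sliceColumns(start, finish) {
  return [...row.keys()]
    .sort()
    .filter((ts) => ts >= start && ts <= finish)
    .map((ts) => [ts, row.get(ts)]);
}

insertColumn("2013-01-01T00:00", 12);
insertColumn("2013-01-02T00:00", 7);
insertColumn("2013-01-03T00:00", 31);
console.log(sliceColumns("2013-01-01T00:00", "2013-01-02T23:59"));
// [["2013-01-01T00:00", 12], ["2013-01-02T00:00", 7]]
```

In real Cassandra the sorting happens on disk at write time, so the slice never touches columns outside the requested range – that’s what keeps chart reads fast regardless of how much history a row accumulates.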

But Cassandra didn’t have all of the answers – we still needed map/reduce for some queries (ones that can’t or shouldn’t be done with counters) and we also needed the ability to traverse the entire data set.

Enter DataStax Enterprise Edition – a professional Cassandra distribution which includes Hive, Hadoop, Solr, and OpsCenter for managing backups and cluster health. It eliminated a ton of setup overhead and complexity for us and dramatically shortened our timeline to going live.

Evaluating Long-Term Performance

Cassandra had MongoDB edged out on features, but we still needed to get a feel for Cassandra’s performance. eBay uses Cassandra for managing time-series data that is similar to ours (mobile device diagnostics) to the tune of 500 million events a day, so we were feeling optimistic.

Our performance assessment was a little unorthodox – after we had designed our schema for Cassandra, we wrote a small C# driver using FluentCassandra and replayed a 100GB slice of our production data set (restored from backup onto a new RavenDB XL4 EC2 machine with 16 cores, 64GB of RAM, and SSD storage) against the Cassandra cluster; this simulated four months’ worth of production data written to Cassandra in… a little under 24 hours.

We used DataStax OpsCenter to graph the CPU, Memory, I/O, and latency over all four of our writeable nodes over the entire migration. We set our write consistency to 1, which is what we use in production.

Here are some interesting benchmarks – all of our Cassandra servers are EC2 Large Ubuntu 12.04 LTS machines:

  1. During peak load, our cluster completed 422 write requests per second – all of these operations were large batch mutations with hundreds of rows / columns at once. We weren’t bottlenecked by Cassandra though – we were bottlenecked by our read speed pulling data out of RavenDB.
  2. Cassandra achieved a max CPU utilization of 5%, with an average utilization of less than 1%.
  3. The amount of RAM consumed remained pretty much constant regardless of load, which tells me that our memory requirements never exceeded the pre-allocated buffer on any individual node (although we’ve spiked it since during large Hive jobs.)
  4. Cassandra replicated the contents of our 100GB RavenDB data set 3 times (a replication factor of 3 is standard) and our schema denormalized it heavily – despite both of those factors (which should contribute to data growth) Cassandra actually compressed our data set down to a slim 30GB. That’s roughly a tenth of the 300GB+ that three full replicas of the raw set would occupy – about a 10x storage saving! This is due to the fact that RavenDB saves its data as tokenized JSON documents, whereas everything in Cassandra is stored as byte arrays (in layman’s terms).
  5. Maximum write latency for Cassandra was 70,731µs per operation, with an average write latency of 731µs. Under normal loads the average write latency is around 200µs.

Our performance testing tools ran out of gas long before Cassandra did. Based on our ongoing monitoring, our cluster is operating at less than 2% capacity under production load. We’ll see how that changes once we start driving up the number of Hive queries we run on any given day.

We never bothered running this test with MongoDB – Cassandra already had a leg up feature-wise, and the performance results were so remarkably good that we decided to move forward with a full migration shortly after reviewing them.

Hive and Hadoop

The last major piece of our stack is our map/reduce engine, which is powered by Hive and Hadoop.

Hadoop is notoriously slow, but that’s ok. We don’t serve live queries with it – we batch data periodically and use Hive to re-insert it back into Cassandra.

Hive is our tool of choice for most queries, because it’s an abstraction that feels intuitive to our entire team (lots of SQL experience) and is easy to extend and test on the fly. We’ve found it easy to tune and it integrates well with the rest of DataStax Enterprise Edition.
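
To make that concrete, a typical batch job is essentially a SQL statement over a raw column family, with the result written back into a table mapped onto Cassandra. A hypothetical sketch of the flavor of thing we run (table and column names are illustrative, not our real schema):

```sql
-- Hypothetical Hive roll-up job (table and column names are illustrative).
-- Aggregates raw events into daily counts, overwriting the target table,
-- which is mapped back onto a Cassandra column family.
INSERT OVERWRITE TABLE daily_counts
SELECT
  app_id,
  to_date(occurred_at) AS day,
  COUNT(1)             AS hits
FROM raw_events
WHERE occurred_at >= '${hiveconf:start_date}'  -- timeboxed, e.g. last ~30 days
GROUP BY app_id, to_date(occurred_at);
```

Because it’s just SQL over Hadoop, anyone on the team can read, extend, or test a job like this without learning a new map/reduce API.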

Conclusion

It’s important to think carefully about your data and your technology choices, and sometimes it can be difficult to do that in a data vacuum. Cassandra, Hive, and Hadoop ended up being the right tools for us at this stage, but we only arrived at that conclusion after actually doing live acceptance tests and performance tests.

Your mileage may vary, but feel free to ask us questions in the comments!

25 Responses to “Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack”

  1. Ayende Rahien

    Aaron,
    I wanted to answer a bit about your RavenDB usage.

    To start with, RavenDB is _designed_ for read-often applications. The entire stack is built with that in mind, as are a lot of the optimizations. As you noted, for the vast majority of cases, that is what you would want.

    In RavenDB 2.0, we made major improvements to indexing. In particular, we use parallel IO to reduce a lot of the cost of going to disk all the time.

    Map/Reduce indexes also improved, and for large values, we can reduce in O(N/1024^3 * 3) instead of O(N).

    Sharding in RavenDB is actually not based on embedding the shard id in the document id. This is merely a default that we selected in order to make things blindingly obvious.
    You can change the sharding function at any time, and it fully supports biasing, adding / removing new nodes, re-sharding, etc.
    Sharding is meant to be used with some idea about locality, exactly because we can avoid hitting all of the servers.

    RavenDB comes with its own backup solution, as well as support for standard Enterprise backup tooling. I am not sure why you would want to build your own.

  2. Aaron

    Hi Ayende,

    Thanks for commenting! This post wasn’t intended to be a bash against RavenDB, but we really did run into some major problems with it.

    You make a lot of fantastic claims about RavenDB’s performance so we gave it the same sort of consideration as MongoDB: document-oriented and better at handling write-heavy loads than most RDBMS. In hindsight, we probably shouldn’t have used Raven given the lack of documentation on its performance under stress and its limitations – but you built a really interesting product and when we first prototyped MarkedUp we figured RavenDB would be suitable for validating our service (and indeed, it performed that job rather admirably.)

    That being said, let me address a few of your points:

    * We tried RavenDB 2.0 and didn’t notice any significant performance improvements during indexing (without a live load on it – just catching up on the old data set). This was on an Amazon XL4 instance with 64GB RAM, 16 cores, and 2 TB of SSD storage.

    Speaking of RavenDB 2.0, you guys did a lot of funky stuff on the Raven Client that forced us to rewrite major portions of our app – limiting the number of async calls made in a single session to 1 was the issue, IIRC. I ended up scrapping the project and had us roll back to RavenDB 1.0 until we migrated off of it since there were just too many things we had to change at once in order for our app to support RavenDB 2.0.

    Either make the new database server backwards compatible with older clients or make it trivially easy to upgrade the database client.

    * We had a conversation about sharding in the RavenDB user group and even investigated creating our own sharding strategy; what you’re proposing is unfeasible even for advanced customers, to be honest. I want to pick a shard key and be done with it – not have to implement the sharding strategy myself or have my data touched. It’s the same reason why very few Cassandra developers use order-preserving partitioners.

    * Raven’s backup solution in 2.0 is much better than it was in 1.0; we did notice and appreciate that. 1.0 didn’t include any support for automated backups, so we wrote our own. We also had to write our own version of RavenDb.Smuggler since Smuggler 1.0 would throw JSON parse exceptions and crash mid-job during imports.

    You have a really innovative database solution, but it needs more time to mature. The fact that someone built a really great Node.JS driver for Raven (which we’ve used) on their own initiative is a good sign that you’re starting to form an ecosystem around it.

  3. Khalid Abuhakmeh (@AquaBirdConsult)

    Very interesting post. I have been a user of RavenDB since 1.0 and through 2.0. I love the technology. RavenDB 2.0 is a great improvement over 1.0, but there are times when I do question whether I’m living too close to the edge. The Hibernating Rhinos team is great and responsive and always on top of requests and issues, so the thought rarely crosses my mind (but it does).

    I know this might come off as a weird request, but I wish you and your team would be so kind as to open a line of communication with the Hibernating Rhinos team and talk about the issues you’ve run into, and additionally offer advice on how you might solve those issues. You might not be on RavenDB now, but your expertise and experience might help the ecosystem grow to be a better one.

    Also, if you are looking to scale like crazy, check out http://www.citusdata.com/. Their demo is really, really impressive, and all queries are run through SQL, which may make it easier to transition, since you mentioned you are SQL guys.

    Thanks for this article, it is very enlightening.

    • Aaron

      Khalid,

      We tried to raise issues on the user group when we encountered them and often we were dismissed by Ayende or the Hibernating Rhinos team with the infamous words “send me a pull request with a failing unit test.”

      It’s not our job to fix bugs with Raven – it’s our job to build a great analytics product for our end-users. We’re happy to contribute back when we can – that’s why we wrote Hircine (index-building tool for Raven.) We’re also active FluentCassandra contributors (.NET driver for Cassandra.)

      I’m happy to answer questions about Raven and try to help other developers in the ecosystem, but I also think that we’re probably not a good use case for RavenDB. We’re far too write-heavy. Raven never really had any issues with read speed – it does that quite well.

      • Khalid Abuhakmeh (@AquaBirdConsult)

        Aaron,

        I understand completely where you are coming from. There are only 24 hours in a day, and you have to prioritize. I’ve had to give up time on things I loved to do to make time for things I have to do. I’m interested to see more on how you guys are using FluentCassandra.

        I think it is fair to say that RavenDB didn’t work out for you, and I don’t think anyone should attack you for that. MarkedUp looks like a great product and wish you guys the best of luck. I love seeing developers succeed.

        How long did it take you to do the conversion from RavenDB and how big is your team of developers?

        • Aaron

          Khalid,

          Yep, you nailed it. RavenDB is a really interesting take on databases, and with a couple more years and some more big use cases hopefully they’ll have some of these issues ironed out.

          It took us about two months to migrate, but mostly because the holidays happened right after we finished evaluating all of the databases we listed above.

          Once we settled on Cassandra and returned from New Year’s, it took us about 3-4 weeks to move everything over with a team of 3 people. I did the schema design and the majority of the coding for the service itself. The rest of the team wrote the migration tool, set up the clusters, ran all of the benchmarks, and researched Hive / Hadoop for some of the trickier queries we had to replace.

          • Khalid Abuhakmeh (@AquaBirdConsult)

            Any chance you could release small snippets of FluentCassandra code compared to RavenDB code? I’m curious to see how the two stack up from a developer-experience standpoint.

            Also, during your 3-4 weeks were there moments that worried you?

            Also did you get a chance to look at CitusDB?

            Who are your team members? I’d love to follow you guys on twitter and see what you guys are up to.

  4. Chris Marisic

    I’m calling flat out bullshit on some of your statements. Actually I’m upping it to calling bullshit on the majority of your entire arguments.

    “even on SSDs Raven’s indexing system would keep indexes stale by as much as three or four days;” http://stackoverflow.com/questions/14873687/mongo-vs-raven-evaluation/14883055#comment21023104_14883055 You directly state your system has continuous write volume measured in thousands per minute. Indexes are always stale because you’re always writing! Indexes as BASE aka eventually consistent, it’s lunacy to think you would have a nonstale index with a never ending write load! The fact the indexes are stale in this scenario is a gigantic FEATURE. It’s one of the entire core premises of using RavenDB to get immense read volume from it on a single server.

    “Raven’s sharding system is really more of a hack at the client level which marries your network topology to your data, which is a really bad design choice – it literally appends the ID of your server to all document identifiers;”

    RavenDB has the most intelligent sharding system in modern databases. It is the only database that knows which shards contain your data, this allows you to have properly affinitized data without needing to do cross server requests!

    “Raven’s sharding system actually makes read performance on indices orders of magnitude worse (has to hit every server in the cluster on every request to an index) and doesn’t alleviate any issues with writing to indexes – no benefit there;””

    This is more nonsense. If you got here it’s because you undid the entire core premise of the shardId being part of the documentId. You directly chose to break all affinity of your data. If you had used affinity the way it was meant, you would never have been in this scenario.

    “we even had to write our own backup software ”

    COMPLETE AND UTTER NONSENSE. I have had RavenDB applications hosted in production for over 2 years now. If by “write our own backup software” you mean you wrote a powershell file that calls RavenDB.Backup.exe, well congratulations on “your own software”!

    It makes me furious to see a person castigate technology with inherently false assertions.

  5. cuper

    How long does it take to finish the map reduce query on Cassandra, compared to 50 seconds by erlang on Riak?

    • Aaron

      Cuper,

      MapReduce with Hadoop is sloooooow. But that’s ok – since we use distributed counters to manage most of our time-series data now, we were able to eliminate 95% of use cases where we would need MapReduce.

      For reports that still need MapReduce, a Hive job that hits our entire data set can take as much as 17-30 minutes depending on what it does. That’s going to slow down as the data set grows, but we also timebox our M/R pretty aggressively – we only run it across the last 30 days’ worth of data or so.

    • Aaron

      I guess the point is: try to minimize your MapReduce needs as much as possible. Counters in HBase and Cassandra both do a great job with this.

  6. Aaron

    Hi Chris,

    You seem to be pretty emotionally invested in RavenDB, so I’m not sure how much good it will do me to answer your questions. I am not trying to “castigate” RavenDB in this post – all I did was explain what went wrong with it and why we chose to migrate to Cassandra.

    Nonetheless, I will do my best!

    “Indexes are always stale because you’re always writing! Indexes as BASE aka eventually consistent, it’s lunacy to think you would have a nonstale index with a never ending write load! The fact the indexes are stale in this scenario is a gigantic FEATURE.”

    I think I see what the issue is – what I mean by “the indexes were stale for four days” is that there were LITERALLY NO UPDATES to the index results for four days. All of our reports made it look like our service had gone down for four days, when really the issue was that RavenDB was stuck in a really long-running indexing phase where it didn’t commit any results back.

    If that sounds like acceptable behavior in a real application to you, then I don’t know what to say.

    Eventual consistency still has a finite window to it – results should only be inconsistent for X amount of time after they were written. In similar database systems, like MongoDB, X is usually 1-3 seconds. In Raven we’ve seen it take 1-4 days. That’s just not acceptable.

    “RavenDB has the most intelligent sharding system in modern databases. It is the only database that knows which shards contain your data, this allows you to have properly affinitized data without needing to do cross server requests!”

    This is patently false. I strongly recommend taking a database fundamentals course. Raven’s sharding model falls really far short of any other sharding system I’ve ever used.

    Every database sharding system works this way – consider consistent hashing and ring-based partitioning, which is what both Cassandra and Riak use for sharding data: http://docs.basho.com/riak/1.0.0/tutorials/fast-track/What-is-Riak/

    In that environment all of the row keys are evenly distributed across the cluster, are easy to rebalance when network topology changes, are easy to replicate safely across different physical nodes, and don’t require the client to have any knowledge of where the shard is.

    All of the Cassandra and Riak nodes are able to coordinate with each other to fulfill my request – any cross-server communication occurs via the internal gossip protocol over TCP (or protocol buffers in the case of Riak, IIRC), which is much more performant than the HTTP speeds you’re used to with Raven.

    And if data affinity for rows is really important (it usually isn’t) you can use an order-preserving partitioner to physically sort keys.

    This is technology that was invented by Facebook / Amazon and is actively used by Netflix, RackSpace, eBay, Twitter, and pretty much every other giant technology company you’ve ever heard of. They actually have to deal with these problems at scale and build best of breed solutions to tackle them – I doubt anyone in the RavenDB community has.

    That’s not to say that there’s anything “bad” about Raven per se, but it’s just not nearly as horizontally scalable as it proclaims to be.

    “This is more nonsense. If you got here it’s because you undid the entire core premise to the shardId being part of the doucmentId. You directly chose to break all affinity of your data. If you used affinity the way it was meant to you would have never been in this scenario.”

    Last I checked, RavenDB is an index-driven document database, not a key/value store. Why should the way I work with Raven RADICALLY change because I want to use sharding?

    Raven’s MapReduce indexes were what we used to generate time-series data – the index IS the data, in other words. Even when we were able to shard our system across app ids (the best shard key for us) Raven would still hit literally every server to serve an index, even when only one of them had the data that we wanted! There’s no way to specify a shard key for indexes, so Raven greedily hits all of the servers until it finds some results – it can either do that sequentially or in parallel.

    “If by “write our own backup software” you mean you wrote a powershell file that calls RavenDB.Backup.exe, well congratulations on “your own software”!”

    Shipping scheduled backups to S3 / Azure Blob Storage / offsite storage is pretty much standard practice for us; we even do that with Cassandra despite the fact that its replication system works really well (Raven’s didn’t – it got stuck at 80% replication or less once we started doing 50+ writes per second).

    Initially we tried using RavenDB.Backup and found that Raven was never able to complete the backup job on a hot server in 1.0. In RavenDB 2.0 the automated backups worked fine, but we were never able to take 2.0 into production due to the severity of the breaking changes in the RavenDB client.

    So we actually wrote our own system which used Smuggler to backup the files. We had a lot of issues getting Smuggler to restore properly but it actually worked.

    At a certain point, you just have to know when to give up on a technology and find something else. That’s what we did – not because we don’t like Raven, but because this is what actually happened to us.

    We actually know Raven quite well – we looked at the source code for the driver and the DB engine when we ran into issues, and we tried asking for help on the forums regularly. It’s a database that comes with a lot of inconsistent behavior and surprises (async not working with sharding, for instance). It needs a lot more time to mature.

    • Chris Marisic

      “I think I see what the issue is – what I mean by “the indexes were stale for four days” is that there were LITERALLY NO UPDATES to the index results for four days.”

      I have never once seen this in my life, and this was never reported by any user ever on RavenDB except using pre-release versions of RavenDB 2.0.

      “This is patently false. I strongly recommend taking a database fundamentals course. Raven’s sharding model falls really far short of any other sharding system I’ve ever used.”

      I patently disagree. RavenDB’s sharding model is superior because it completely eliminates the need for cross server talk if you intelligently shard.

      “They actually have to deal with these problems at scale and build best of breed solutions to tackle them”

      Ayende has just solved it superior to them. Just because companies are willing to sink big dollars into a project doesn’t mean they’re going to do it best.

      “Last I checked, RavenDB is an index-driven document database, not a key/value store. Why should the way I work with Raven RADICALLY change because I want to use sharding?”

      Once again, you show **fundamental misunderstanding** of RavenDB. There is absolutely no change to how you use RavenDB with indexing and querying. With a proper shard strategy, RavenDB fundamentally knows where your data is and will not issue queries to all servers, unless you specifically craft queries that can only be satisfied by all servers.

      ” Raven would still hit literally every server to serve an index, even when only one of them had the data that we wanted! ”

      I can’t definitively say this behavior is how it operates sharded map/reduce indexes, but that is certainly not the case with map indexes.

      “Shipping scheduled backups to S3 / Azure Blob Storage / offsite storage is pretty much standard practice for us; ”

      All it takes is powershell to do that, we do it in production exactly as such.

      “(Raven’s didn’t – got stuck at 80% replication or less once we started doing 50+ writes per second.)” And I’ve never seen anyone raise this concern either.

      “Initially we tried using RavenDB.Backup and found that Raven was never able to complete the backup job on a hot server in 1.0.” And I’ve never seen anyone raise this concern either.

      “We had a lot of issues getting Smuggler to restore properly but it actually worked.” And I’ve never seen anyone raise this concern either.

      You state you encountered 4 critical issues that never were posted by ANYONE in the RavenDB forums EVER, including your organization.

  7. HDFS

    While your HBase points seem sound in making your choice, the point of it having a Single Point of Failure is incorrect and ought to be removed (it is aiding in spreading FUD). HDFS no longer has a SPOF with its 2.x releases recently, and HBase itself has never had a SPOF since 0.90 at least. Hence, there is no more SPOF in a properly deployed Apache HBase setup. Would be nice if you can correct that :)

    • Aaron

      Good to know! I didn’t know that – I’ll go ahead and edit the post to reflect those changes.

      The HBase community is doing remarkable things and everyone I’ve talked to who is up and running with it really seems to like it.

  8. Gophen McTerdic

    I reckon that any product with a name like “Mongo” must be treated with suspicion. Is it a database system for developers with Downe’s syndrome?

  9. Akshay

    Great post and links to other posts. Learnt a lot. I’m a newbie to this field and just starting my career out, and all these posts are helping get a hold of the whats, whys and hows of Big Data/NoSQL. Especially love the comments by die-hard RavenDB users :)

  10. Stephan

    Really great post indeed and some points are in line with what we found out. We also thought that sharding was one of the weaker pieces. Our second concern was for instabilities with high volume batch writes, although I would not be able to tell if it is the Raven or the Lucene part.

    We really loved Raven, especially the Linq and Lucene integration. I think it is a great system for smaller data volumes. We are now moving on to test Couchbase 2 with ElasticSearch. Full text search is a key requirement for us.

    The Raven team would do much better with less heated arguments. Comments like “RavenDB’s sharding model is superior because it completely eliminates the need for cross server talk” or “Ayende has just solved it superior to them” does not demonstrate much openness to suggestions :-(

  11. Sean Cribbs

    I’m sorry you had a hard time with Riak. While our documentation and marketing probably says “use MapReduce” more than it should, in practice we discourage people from using it until they’ve tried to model their problems as regular key-value access. On the other hand, when people say “analytics” and “Riak” in the same sentence, we raise eyebrows because it usually ends up implying OLAP models, which are not Riak’s strong suit (Hadoop is good for this type of thing). Riak tries to be highly available for reads and writes, with consistent latencies, not to give rapid *bulk* access to large datasets.

    Either way, it sounds like you found a good combination of tools for your application. Kudos!

    • Aaron

      Sean,

      Riak is a very impressive piece of technology, but as you point out not the best fit for our use case. In a pure K/V scenario it would be my first choice – the Basho guys we spoke to were all very helpful and professional and we’re confident that it can scale well / be easy to manage.

      It was just an issue of right tool for the job more than anything else. I may be biased but I think you should strongly consider adding distributed counter support in the near future ;)

Comments are closed.