When should I use Apache Druid?

I started having some good side discussions about Druid and the most common question was “when should I use Druid?”. The good news is the Druid documentation under the
Latest Design answers this question directly:

Druid is likely a good choice if your use case fits a few of the following descriptors:

  • Insert rates are very high, but updates are less common.
  • Most of your queries are aggregation and reporting queries (“group by” queries). You may also have searching and scanning queries.
  • You are targeting query latencies of 100ms to a few seconds.
  • Your data has a time component (Druid includes optimizations and design choices specifically related to time).
  • You may have more than one table, but each query hits just one big distributed table. Queries may potentially hit more than one smaller “lookup” table.
  • You have high cardinality data columns (e.g. URLs, user IDs) and need fast counting and ranking over them.
  • You want to load data from Kafka, HDFS, flat files, or object storage like Amazon S3.

Obviously event based data works very well with Druid, this is why I believe orders are a really good match for this. Because you can tie three critical pieces together for each order: SKU, Customer data, and Shipping, it becomes very easy to execute all kinds of queries tieing these data points together.

While I am somewhat stuck on eCommerce, here is a list of other companies that also use Druid for very different use cases (link).  Here are a few of my favorites:

Airbnb – Druid powers slice and dice analytics on both historical and realtime-time metrics. It significantly reduces latency of analytic queries and help people to get insights more interactively.

eBay – eBay uses Druid to aggregate multiple data streams for real-time user behavior analytics by ingesting up at a very high rate(over 100,000 events/sec), with the ability to query or aggregate data by any random combination of dimensions, and support over 100 concurrent queries without impacting ingest rate and query latencies.

Hulu – At Hulu, we use Druid to power our analytics platform that enables us to interactively deep dive into the behaviors of our users and applications in real-time.

Monetate – Druid is a critical component in Monetate’s personalization platform, where it acts as the serving layer of a lambda architecture. As such, Druid powers numerous real-time dashboards that provide marketers valuable insights into campaign performance and customer behavior

Nielsen – Nielsen Marketing Cloud uses Druid as it’s core real-time analytics tool to help its clients monitor, test and improve its audience targeting capabilities. With Druid, Nielsen provides its clients with in-depth consumer insights leveraging world-class Nielsen audience data.

The original list is pretty large, it is fairly safe to say Druid has a place in many markets!


Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.