Product Recommendations made easy with Apache Druid Part 1

I have been playing with Apache Druid for a while now and I have to say I am very impressed with this package. Druid provides fast analytical queries, at high concurrency, on event-driven data, and can instantaneously ingest streaming data and provide sub-second queries to power interactive UIs. Apache Druid essentially does all of the heavy lifting of segmenting the data and putting it into high-performing indexes for super fast queries. You can stream the data directly into Druid using APIs or Apache Kafka, or you can simply upload massive amounts of data at intervals, appending to or replacing what is already there.

Because Druid does so much for you, you could actually run different campaigns using completely different data sources that are stored and indexed in Druid. Imagine running a campaign for “Hottest Items Last Fall” or “Season’s Top Sellers”. This would produce a product shelf similar to this one on your eCommerce site:

[Screenshot: example product shelf on an eCommerce site]

Those products could have been returned by Druid in real time, with the resulting SKUs sorted by order value or quantity sold, and even filtered by shopper attributes (age, gender, location).
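As a rough sketch, a shelf like that could be driven by a single Druid SQL query. Everything here is my own naming: I assume a datasource called order_lines with the flat row format described later in this post, including the optional shopper attributes.

-- Sketch: top ten SKUs by order value, filtered on a shopper attribute.
-- "order_lines" is a hypothetical datasource name; the columns match
-- the flat row format described later in this post.
SELECT
  sku,
  SUM(cost)     AS total_sales,  -- order value per SKU
  SUM(quantity) AS items_sold
FROM order_lines
WHERE region = 'Midwest'         -- optional shopper-attribute filter
GROUP BY sku
ORDER BY total_sales DESC
LIMIT 10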


Druid lets you store as many data sources as you want, so you could actually build dynamic components in CoreMedia that run the same campaigns on different data sources. This could be used for different brands and their SKUs, or even seasonal order data.


For my use case, this means you could essentially push order line item data into Druid and get fast queries for product shelves like “Top Sellers”, “Top Weekend Sales”, or even “This week’s hits” – all based on the order line sales and the timestamp of the order.
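To pin down one of those shelves, “Top Weekend Sales” could boil down to a day-of-week filter on the order timestamp. Here is a sketch, again against my hypothetical order_lines datasource; the exact weekday numbering returned by TIME_EXTRACT is worth verifying against the docs for your Druid version.

-- Sketch of a "Top Weekend Sales" query: SKUs ranked by weekend revenue.
-- Assumes TIME_EXTRACT's DOW numbering runs Monday = 1 .. Sunday = 7;
-- verify this against your Druid version before relying on it.
SELECT
  sku,
  SUM(cost) AS total_sales
FROM order_lines
WHERE TIME_EXTRACT(__time, 'DOW') IN (6, 7)  -- Saturday and Sunday
GROUP BY sku
ORDER BY total_sales DESC
LIMIT 10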

Pushing this line-item-level order information should be trivial for most order management systems. I started to ask myself what data I would actually need to satisfy a few use cases, so I wrote some down as one-liners (a query sketch for one of them follows the list):

  • Most products sold
  • Total sales
  • Highest total count sold on a day of the week
  • Highest total count sold in a month of the year
  • Highest total sales on a day of the week
  • Highest total sales in a week of the year
  • Top seller by region
  • Top seller among men
  • Top seller among women in a region
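Here is a sketch of the “top seller by region” use case in Druid SQL (order_lines is still my hypothetical datasource name). Note that Druid SQL, at least at the time of writing, has no window functions, so picking exactly one top SKU per region would happen client-side or via a native topN query.

-- Sketch: items sold per (region, sku) pair; the client picks the
-- top row per region. "order_lines" is a hypothetical datasource.
SELECT
  region,
  sku,
  SUM(quantity) AS items_sold
FROM order_lines
GROUP BY region, sku
ORDER BY items_sold DESC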

——————————————————————————————

When should I use Apache Druid?

Read about how Nielsen, Monetate, eBay, Airbnb, and others use Apache Druid.
——————————————————————————————

I then had to figure out the minimum amount of data needed to satisfy those use cases, and this is what I came up with:

"time", "order_id", "shopper_id", "sku", "price", "quantity", "cost", "shipping_info"

That is all pretty standard information you can get from a PO. What is not part of that is the customer demographic information. Because Druid performs best with flat data, we will most likely have to write a routine that combines order line data with customer attribute data. We could include fields like these (if they are known):

"age", "region", "gender"
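Since the goal is flat, denormalized rows, the combining routine could be nothing more than a join run in the order management system's database before ingestion. A sketch, using hypothetical source tables order_items and shoppers:

-- Sketch: flatten line items and shopper attributes into the row
-- format Druid will ingest. "order_items" and "shoppers" are
-- hypothetical source tables in the order management system.
SELECT
  oi.order_time          AS "time",
  oi.order_id,
  s.age,
  s.region,
  s.gender,
  oi.shopper_id,
  oi.sku,
  oi.price,
  oi.quantity,
  oi.price * oi.quantity AS cost
FROM order_items oi
LEFT JOIN shoppers s ON s.shopper_id = oi.shopper_id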

This would allow us to run many different queries against Druid and get the proper response. In the CoreMedia extension model, this should really be a returned list of SKUs that we can map to the current product catalog. Some error handling or SKU replacement code might be needed, especially if you are running against year-old data. Hopefully, for more current campaigns like “Hottest Weekend Products” or “What’s hot this month”, the data and SKUs are very much up to date. The resulting JSON sent in for each row would look like this:

{
  "time": "2019-06-30 03:53:35",
  "order_id": "id_055300006130",
  "age": "40",
  "region": "Midwest",
  "gender": "M",
  "shopper_id": "U_09080785",
  "sku": "PC_CHEF_CORP_MP_KNIFE_SIGNAL_RED_SKU",
  "price": 79.0,
  "quantity": 3,
  "cost": 237.0
}

Sending in each order line item separately allows Druid to dynamically build orders and return SKUs based on any time and date combination, bloom filters, numeric expressions, and of course grouping (e.g., total sales for a single SKU).
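For instance, because every row carries its order_id, order totals can be reassembled at query time with a plain GROUP BY (a sketch against my hypothetical order_lines datasource):

-- Sketch: rebuild order totals from individual line items
-- within an arbitrary time window.
SELECT
  order_id,
  SUM(cost) AS order_total
FROM order_lines
WHERE __time >= TIMESTAMP '2019-06-01 00:00:00'
  AND __time <  TIMESTAMP '2019-07-01 00:00:00'
GROUP BY order_id
ORDER BY order_total DESC
LIMIT 100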

I created a dataset with six months of order data, broken out by each line item as described above. It ended up being 431,148 line items for 4,323 SKUs across 300,000 orders.

I went ahead and created queries for each of those use cases and found Druid to be extremely fast (more on that in Part 2), even when running on my local machine. Check out the slideshow below for the various ways you can use SQL (or JSON) to query Druid. The real power comes from the way Druid can quickly return rows and run functions like TIME_EXTRACT. Each query essentially returns a list of SKUs, ordered descending by either total sales or items sold.
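As one more example of the pattern, a “This week’s hits” shelf reduces to a trailing time filter plus the usual grouping (again a sketch; order_lines is my hypothetical datasource name):

-- Sketch: SKUs ranked by items sold over the trailing seven days.
SELECT
  sku,
  SUM(quantity) AS items_sold,
  SUM(cost)     AS total_sales
FROM order_lines
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY sku
ORDER BY items_sold DESC
LIMIT 12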

[Slideshow: sample SQL and JSON queries against the Druid datasource]

Stay tuned for Part 2, where I show how easily these kinds of dynamic product shelves, based on sales and shopper data, can be integrated into CoreMedia Studio. I will also show a demonstration where Apache Druid is accessed in real time from our Studio, so the marketing person can easily preview this dynamic behavior. It is a little teaser showing how the authoring environment (Preview CAE) and the runtime environment can access the same Druid data, giving marketers the same view of products that shoppers would see.

 

I am really interested in hearing your thoughts on this; send me an email or leave a comment!

