Have you noticed that some business questions are easy to express in words, but not in code? For example, you might want to find “abandoned cart” moments: a user adds an item to a cart but then doesn’t checkout in the next hour. This is not trivial to describe in SQL unless there is already a field for this use case in a user session table (and it never seems to be there, does it?).
Questions like this one, which rely on order or duration between events, are hard to approach using relational data and SQL. They usually involve 50+ lines of window functions and/or self-joins with all the ensuing debugging. And when you are finally done with the query, you need to wait for your result, because these queries are computed inefficiently. SQL wasn’t designed for these kinds of sequence-oriented questions.
Meanwhile, these questions are everywhere and are increasingly important for understanding user behavior and business performance:
At Motif, we are creating sequence analytics — a toolkit for naturally working with sequence data. We have developed Sequence Operations Language (SOL) — a small but powerful DSL to easily query and wrangle ordered lists of events. It is not intended as a replacement for SQL (long live the king!), but rather as its complement. In fact, you can use them both in the same analysis because sequences are stored as a relational table of events with a thin layer of sequence metadata on top.
We got inspiration for SOL from many existing languages and systems: regular expressions, Cypher, SQL’s MATCH_RECOGNIZE, Python lists, SASE+, dplyr. Neither of them seemed sufficient by itself, but by combining the best parts of each, we believe we found a good balance:
- powerful pattern matching from regex
- intuitive readable syntax from Cypher
- vector operations and slicing from Python lists
- minimalism and chaining from dplyr.
The key idea: matching event patterns to define important events and sub-sequences (purple variables in the examples below) and then operating on them as arrays of events. For example, here is how you can approach example questions above using SOL:
Succinct and fairly readable even without knowing anything about the language, isn’t it?
SOL is great not just for querying sequences but also for modifying them through “replace”, “match split”/”combine” and “set” operations. This way you can wrangle your dataset on the fly into the shape most appropriate for exploring your question. Here are a few examples:
No introductory post would be complete without throwing in a framework or two, so here are a couple of interesting ones to contrast sequential and relational approaches for working with data. Here we provide just a quick taste and will elaborate in future posts.
State vs. history lenses
Complex systems can be described through their state or history. For example, the game of chess is usually specified through a position of pieces on the board (state), but can also be described through the sequence of moves players made in the game (history). Most analytics data comes in as “history” (events) but analytics in data warehouses today is done through the “state” lens — user events are filtered and aggregated into tables of pre-selected important moments and counts. SQL nudges to explore data through the “state” lens, while SOL is natural for working with “history”.
Cross-sectional, time series and panel data
The relational model usually operates on cross-sectional (user dimension tables) or time series (metrics tables) data, while the sequential model works with more complex panel data, which combines both.
At its core Motif is a sequence analytics company. Even the company name is about helping to uncover business motifs — important patterns in event sequences. We’ll be sharing more about our work in the near future, but if you have reached the end of this post, you are definitely ready to try sequence analytics yourself.