SQL is the cornerstone of today’s analytics. It is the preferred way to ask data questions and analytics tools seem to thrive or die by their ability to support it. SQL is even spreading beyond its traditional domain: the quick rise of dbt can be attributed to its ability to run data engineering—a domain of traditional programming languages—using good old SQL. But SQL is not a silver bullet, and you probably already know that.
There is a trove of linguistic research about the interconnectedness of human language and thinking (examples in the domains of color, orientation, time). It is a big idea behind the widely practiced cognitive-behavioral therapy. The most vivid examples might come from popular culture: the alien language in the blockbuster Arrival and Newspeak in the novel 1984 demonstrate how language limits of our ability to perceive the world and what is possible. What if SQL is similarly constraining us in data analysis? What if we have a big blind spot for the types of business questions we can answer using data?
Benn Stancil has explored a similar idea in Has SQL gone too far?. He noted the difficulty of computing funnels with SQL:
But lots of business concepts aren’t easily defined this way. Consider, for instance, a sales funnel in which prospects move through a series of stages, like “In trial,” “Contract negotiation,” and “Under legal review.” These stages often have an expected order, but not a strict one. Some prospects skip stages, some proceed through them in a different sequence, some stages happen concurrently, and sometimes, if a sale cycle goes sideways, steps get repeated.
He could have taken this insight much further: SQL is notoriously bad at analyzing sequential data while funnels are everywhere.
Today pretty much all analytics data comes in as streams of events. It is natural to operate on them through defining sequence patterns: a user did this, then they did that, and they came from this or that source. Yet, before any analysis is done, we usually “make raw data more accessible” by stripping away time and forcing user sessions into orderless data warehouse entries. Most teams have some sort of “raw_events” tables at the start of data pipelines with legends of a few sages who mastered them and only the most brave souls daring to use them for analysis.
Though we usually think of metrics as SQL scripts, many business metrics are better suited to be computed through sequence patterns rather than row aggregates, especially around user behavior: acquisition, onboarding, retention, attribution, custom user states, etc. For example, retention takes 50+ lines of SQL but can be naturally expressed as a sequence pattern: “number of users who did A and then repeated A next week” divided by “number of users who did A”. Try it yourself, and you’ll quickly see how this applies to a wide range of business metrics you truly care about.
More generally, we often need to find users in a specific cohort, segment or some custom “state”. These are usually hard to define in traditional relational tables using SQL but easy to express through user paths leading to these states. How long would it take you to compute the number of users who have abandoned carts on your e-commerce site? Unless it is already a defined metric in your data warehouse (and hey, good job!), probably hours to days, if you are using SQL. In a sequence-focused language it would be as simple as: number of sessions with “item added to cart” NOT followed by “item removed from cart” or “checked out”. The more time you spend doing this kind of analysis in SQL, the clearer it becomes that something is off and there must be a better way.
Now let’s look at the most important business questions: going beyond reporting to establish, or at least hypothesize about, causal relationships between changes to the product and their effects on the business. The most fundamental property of causation is that a cause has to come before its effect. Shouldn’t we then strive to preserve the order of events in our data? Sequences are a better medium for encoding and exploring causation than a set of disparate metrics and dimensions!
Finally, everybody complains that data is “dry” and hard to comprehend for most people. Humans need to think hard to understand statistics but readily grasp, relate to and remember stories. How often do you hear about the importance of “telling a story with data”? Stories are defined by heroes and plots. Sequences bring exactly that by capturing user journeys. Metrics — not so much.
Hopefully by now I have convinced you of the tremendous, yet mostly untapped, potential of sequence analysis. What if we could ask sequence questions as easily and quickly as we count rows in SQL? What if understanding “how many users did A and then B within 30 minutes, but never C in between A and B” took less than a minute instead of days? This requires extending SQL into a dedicated sequences paradigm with new data concepts and operations.