Why Can’t You “Pull Data Real Quick”?
Every person working with business data knows that you can’t “just pull data real quick”. Tweets on the topic go viral:
If you’re reading this, chances are you’ve got a traumatic personal example you’re thinking of right now. Taking longer to answer questions means fewer iterations in a typical data deep dive, and thus — shallower insights. It also precludes data organizations from approaching many business questions altogether: when cost becomes too high, ideas fall below the prioritization threshold and get cut.
Why do we settle for such a low bar? As the tweet above suggests, the usual suspect is the lack of properly organized data. However, is this the whole story? Can we do better even with raw, less-structured data? Let’s look deeper at the typical friction points for answering business questions with data on a fictional, yet surprisingly familiar story.
Prologue: Important assignment
Imagine you are an analyst at a video streaming company. Leadership wants to overhaul the homepage and you are tasked with understanding the performance of its various elements — search, banners and curated video suggestions — to guide the page revamp. You are excited to help your company be more data-driven and get straight to work.
Chapter 1: Hidden data
First, you check canonical tables in the data warehouse, hoping you can use them directly. Unfortunately, they don’t store items displayed on the homepage, so you ask other analysts for suggestions. One custom table looks promising and you play with it for 15 minutes before discovering that it only contains homepage views for US users. At this point you decide to create your own custom rollup table from raw logs (user session events). Turns out there are 3 separate tables with session logs: one for the website, one for the Android app and one for the iOS one. They use different schemas and most field names don’t make sense like “payload_list” and “cassandra_items”. Are these the homepage elements you are looking for? You spend half an hour trying to find the engineer who logged these events but learn that the person has already left the company. You query and aggregate logs in different ways, hoping to discover some patterns and guess the fields’ meaning, until somebody remembers a Google doc by an intern from the previous year, who documented event dimensions when she was doing her own data deep dive. With its help you are ready to analyze homepage performance!
The problem of getting lost in the labyrinth of custom data tables and columns is well recognized in large tech companies. With the proliferation of easy-to-use data engineering tools it is now spreading to smaller and non-tech-core businesses. The easier it is to model in the warehouse, the easier it is to be non-intentional with that modeling, and this is how someone’s good intentions indirectly lead to you spending half of your day wondering which tables to use.
A typical approach to solve this problem is to build or buy a data catalog service (for example: MetaphorData, SelectStar). Such services are up and coming, but they require additional manual work to fill out and keep up-to-date metadata, so it’s too early to say that this is a solved problem.
Chapter 2: Missing data
You start writing queries when you realize that you can’t identify the order in which elements were displayed on the homepage. This information is critical because items lower on a page naturally get fewer clicks regardless of their relevance. Looks like folks who created homepage logs did not think about including this piece of data. You ask the discovery engineering team to update logs. They are busy and will be able to make changes in the next sprint in 2 weeks. After that you’ll need to wait a few more weeks for the mobile app versions with log changes to roll out to users. This is too long, so you decide to simplify your analysis for now and make an assumption that video suggestions are distributed uniformly across all possible order positions. This is a questionable assumption, but it allows you to proceed and analyze homepage performance!
Unfortunately, not logging data, which turns out to be necessary for later analysis, is common. Usually it is due to insufficient initial investment in a logging framework, to logging being distributed across many engineering teams or to logging being separated from data consumers.
Typical approaches to solve this problem involve integrating a dedicated logging service (for example: Segment, Snowplow, mParticle) or relying on auto-logging, which captures rendering of every UI element and every user input: click, swipe, typing, etc. Both improve on the status quo but still require careful implementation and maintenance.
Chapter 3: Unreliable data
You get to work and write a query to compute click-through-rates of each homepage element. The numbers look suspiciously low. You slice numbers by different user dimensions and time ranges to try to understand what’s going on and notice that the number of clicks was close to zero a month ago. You ask around and somebody remembers that there was a bug, which could have affected logging and which was fixed last month. After spending some time to track the exact timing and impact of the bug, you figure out how to remove parts of the data from the analysis to work around it. Now the click-through-rates are reasonable. You plot a few graphs and schedule a meeting with your team to review your findings!
Most analytics teams grapple with data regressions. They can happen for many reasons: production code changes / infrastructure migrations without updating logging, production outages affecting logging in unrecoverable ways, data pipeline logic errors and failures going undetected for days, etc. They’re doubly painful when you find them after you’ve already done extra work to find the data you need.
Typical approaches to solve this problem consist of building or integrating a data monitoring solution (for example: Datafold, BigEye, MonteCarlo) and creating automated data backfills (cascading, if possible).
Chapter 4: Ambiguous questions
The meeting doesn’t go as planned. Colleagues question several aspects of the analysis and make hypotheses about hidden variables affecting the results, such as video category and new vs. mature users having different click-through-rates. But most importantly, questions are raised about the validity of the click-through-rate metric itself. What if optimizing for it would lead to the clickbait trap? People debate using session watch time, retention or revenue as a better guiding metric. Somebody brings up the possibility of users visiting the homepage multiple times on the same visit and the challenge of attributing any non-immediate metric to individual page elements. As the meeting time runs out, you suggest “for the sake of time” to go with the compromise metric of the abandonment rate (watching less than 1 minute) of the next video after a click. Folks quickly agree and rush to their next meetings. You are ready to finish the performance analysis!
The most important business questions are rarely about computing a well-defined simple metric. They usually come in the form of ambiguous things like “why has our conversion funnel degraded over the past quarter?” or “how can we improve our marketing campaigns?” They need to be broken down into approachable data questions and hypotheses, which can be verified or falsified through data analysis. Navigating through such data exploration journeys requires strong understanding of the business, product, logging, logic and statistics. There are usually only a few people qualified to do this work and existing tooling would take them days or weeks to answer even one meaty question.
Today, there are no technical solutions to this challenge. Some startups claim to be building ML tools, which will replace senior data scientists. Naturally, almost everyone in the data world is skeptical. Instead, companies deprioritize using data to answer most big strategic questions and go with good old leadership hunches.
Chapter 5: Unaskable questions
You work on updating your query but it takes you a surprisingly long time. Slicing by video category and user lifecycle requires joining with video and user dimension tables and grouping them into appropriate buckets. But computing the abandonment rate of the next video after a click turns out to be the most challenging. You need to identify ordered pairs of clicks and video watches and limit them to a small gap in time. There is no simple SQL command for that, so you create your own approach using custom window functions. After several hours of debugging and tweaking the query it grows to 100 lines but you are now ready to get the final numbers!
Any data practitioner will tell you that many well-formed data questions turn out to be hard to input into a computer in a way it understands, so it can crunch the numbers and provide you with an answer. Dashboards are notoriously inflexible. SQL often requires mental gymnastics, resulting in 100+ lines of brittle code, and complex workarounds for important use cases like sequential analysis.
A typical approach to solve this problem is to focus on simpler questions, which are better suited for SQL, and to maintain a small number of gigantic SQL queries for complex questions, praying that they won’t break or need extensions. Neither is a true “solution” in the traditional sense.
Chapter 6: Slow queries
You hit “Run query” and stretch in your seat waiting for the magic moment. A minute later you check your email while the query is still running. Then you go to get a coffee. Finally, half an hour later you get: “ERROR: Query has timed out.” Must be those additional joins and heavy window functions. You get advice from a few other analysts and try some query changes but it still times out. You switch to the overnight compute environment for large queries and call it a day. The next morning the numbers are ready. You send them to the team but are asked for a small tweak. You realize that this might happen again and that running ad-hoc queries overnight doesn’t scale. A colleague suggests sampling the data. You try it and realize that sampling requires extra care to avoid biasing the results. A couple more hours of work and you have finally custom-designed a query, which works for your use case. Nobody can understand it well at this point, but it works!
Growing companies usually engage in an ongoing battle with growing data query times. Growing user bases and logging more granular data per user keep pushing the limits of query engines.
Typical approaches to solve this problem include adding more hardware for parallelizable queries, building and maintaining custom rollup tables for important analytics use cases, analyzing on shorter data time spans or sampling data, which analysts implement manually for each case.
Chapter 7: Lost insights
You schedule the final meeting with the cross-functional team working on the homepage redesign to share the results. This time you are prepared and have slides with graphs ready to address most of the questions. Everybody seems convinced except for the group working on the highlighted videos from big video creators. Results show that their section on the homepage performs the worst and they raise new questions about the validity of the analysis. Due to the lack of alignment, the leaders of the meeting suggest “taking it offline” and coming back next week. In the meeting the following week the leaders announce that it probably makes sense to keep the architecture of the homepage as is for now and just freshen up the look. The shift in direction is explained by the increasing number of reports of inappropriate videos, which calls for a change in the organization’s priority.
Many analytics teams report challenges in fitting their data insights into the company’s decision-making processes. This challenge is not about answering a business question per se, but rather about the ability of an organization to take action based on the answer. This can happen due to internal politics / incentives or due to analyst voices drowning in the ocean of information washing over the heads of decision-makers.
A typical approach to solve this problem is… to ignore it. The topic is often too sensitive, can lead to conflict and hurt promotion chances, so it is rarely addressed.
Epilogue: Can we ever “pull data real quick”?
After several weeks of intense work and no visible outcome you start to wonder: is it ever possible to run a quick data analysis? Maybe you should switch to building ML models instead…
The outlook seems gloomy. The pain points in the story above are widespread while the progress to solve them is slow.
We have experienced firsthand this pain while working at small and large tech companies and came up with a categorically different approach for running data analysis. We believe the time has come to build an analytics tool that thinks deeply and tackles directly friction points for answering business questions with data. That’s why we’ve started Motif Analytics.
If this problem intrigues you, we’d love to get in touch and show you what we’re working on: firstname.lastname@example.org. We’re currently in stealth, looking for early partners and actively hiring for positions in design, data engineering, frontend engineering and analytics.