Conviva’s Approach to Experience-Centric Operations & Stateful Analytics

Your business needs Experience-Centric Operations

While Observability and Application Performance Monitoring (APM) tools have transformed how engineering and operations teams function over the last ten years, there are several challenges that every VP of Ops and CTO still face with no clear solution in sight:

Being surprised by customer issues on social media, during customer support calls, or through the CEO, even when their tools say everything is fine.
Needing an army of expert engineers to debug whenever a major incident happens.
Exploding costs of observability tools to monitor their growing infrastructure.

Previously in our series, we defined Experience-Centric Operations and discussed why existing tools can’t handle the challenges. In this final entry in our series, we reveal how Conviva’s approach to build a platform for Experience-Centric Operations involves two key technologies: 1) A thin SDK (we call Sensor), which is a new approach to collecting data from application endpoint and 2) Time-State Technology, which is a new abstraction and system for efficient, real-time, and stateful data processing at scale.

Thin SDK Strategy Enables Light-Weight and Accurate Monitoring of Experience

The first challenge on the way to connecting user experience to system performance is the need to deploy an SDK, which is the last thing any app development team wants to do.

We get this. We’ve lived this reality for 15 years as we currently manage about 7 billion SDK instances across thousands of apps and device models, web browsers, phones, tablets, smart TVs, set-top boxes, game consoles, and even cars. From this experience, we can say with confidence that it’s not possible to avoid collecting data from the app when you’re trying to measure true user experience. What we also know, however, is that we can do much better than all the RUM and product analytics SDKs out there.

The number one lesson we’ve learned regarding SDKs is that they have to be extremely simple and do as little work as possible. This means we must resist doing any computation or manipulation of data in the SDK. This is counterintuitive because at first glance, it seems like the cheaper option, since we’re using the device’s CPU cycles instead of our backend, and the easier option, since the metrics are directly applied to the data. In reality, though, this makes the SDK heavier, which slows down the app, and more prone to obsolescence, since the metric implementation will fall out of date and become inaccurate for weeks or months at a time, before an update can be applied. If you’re computing 20 metrics across a dozen device types, that’s 240 metric implementations that could be outdated and even erroneous.

To solve this, our SDK only collects events without enforcing any semantic schema. We ingest the events exactly as they are exposed by the device OS and your app. This means developers don’t have to instrument each event like they would if they wanted to process the data in Adobe or Google Analytics or if they were sending custom events to a RUM tool. Instead, they can simply connect their app to the Conviva platform through our SDK, and we can pull in all event data without sampling. The magic here is that the developers can then control which events they want to pull in, map and rename events, and create any stateful (or stateless) metric from these events—all in the backend.

This approach does mean we are taking on a lot more processing load on our backend than other systems. Dynamic event mapping and stateful metric computation were not actions existing tools could do in real-time at the scale of millions of endpoints, but they were the actions we needed to accomplish to solve Experience-Centric Operations. It turns out no big-data technology (open source or proprietary) could handle what we needed, so we had to create it completely from scratch.

Time-State Technology Enables Experience-Centric Operations

Let’s restate the problem facing businesses in the digital environment: they need the ability to do stateful metric computation in the big-data backend in real-time at the scale of millions of endpoints, and it must be cost-effective enough to be a practical solution for operations.

We tried to solve this problem with every major big-data platform—Spark, Flink, Kafka, Druid, Clickhouse, etc. None of them can address this issue, and in particular, they are not capable of being the core engine to compute stateful metrics from raw event streams in real-time with high scale and efficiency.

This realization led us to dive in and really understand why this is the case, which led us to a key insight: the problem is not the systems themselves; the problem is the abstraction that every one of them is built on—the tabular abstraction dating back decades, which is at the core of every database and big-data system. The tabular abstraction represents every point of data as rows and columns in a table, with operators for manipulating the data. We found that the tabular abstraction is not good for stateful analytics.

This realization gave us the freedom to innovate at a foundational level, which ultimately led us to create a new abstraction, called Timeline, and our Time-State Technology, which represents all event stream data as timelines with a set of timeline operators to compute stateful metrics.

Timelines offer a more efficient way to write queries compared to the conventional approaches which we tried and found lacking, including streaming systems, time-series databases, and time-based extensions to SQL. As a result, Timelines can model dynamic processes more directly than any of those pre-existing approaches and express the requirements of time-state analytics easily and intuitively. To support this new abstraction, we’ve also codified the concept of Timeline Algebra, which defines operations over three basic Timeline Types: state transitions, discrete events, and numerical values. If you’re interested in the details of Time-State Technology, read more here.

Conclusion: Our results speak for themselves

To give you a taste of the performance gains we are getting with our Time-State Technology, the illustration shows a benchmark of our performance compared to the state-of-the-art big-data platforms. This is our first iteration of the Time-State model, and we have many ideas to further optimize, but we have already seen approximately 10x better performance compared to the SQL-based solutions, which is a key enabler for making Experience-Centric Operations a reality.