Mastering the Art of Scale


In the fast-paced world of digital streaming, the ability to scale and adapt is crucial for delivering high-quality user experiences. With millions of users engaging with dozens of apps at a time, operations teams need the scale and agility to monitor and optimize every individual user experience in real time. It’s a massive challenge of scale.

As audiences continue to grow, traditional monitoring and alerting systems often struggle to keep up and cannot deliver real-time data analysis. Conviva is changing the game with its Operational Data Platform, which delivers stateful insights in real time and at internet scale.

Conviva has pioneered a new data analytics paradigm called Time-State Analytics that enables companies to do experience-centric operational analytics at census level, at scale, at 10X lower cost, and 10X faster than they can today.

The Need for Real-Time Insights

When viewers don’t have an ideal experience, they quickly switch to other streaming providers, which leaves video streaming providers no margin for error. Traditional monitoring and alerting systems have been unable to deliver in an environment that demands real-time, actionable insights from high-cardinality data at internet scale.

Operations teams need scale and real-time insights to monitor the streaming infrastructure and optimize the user experience as it happens. Conviva empowers these operators with its experience-centric observability platform, which delivers stateful insights in real time and at internet scale.

The Power to Compute at Internet Scale

As the trusted provider for 12 out of the 15 biggest streaming services globally, we are constantly testing and expanding our scale to meet the growing needs of consumer applications. One of the leading streaming providers chose Conviva for holistic, real-time user experience measurement for an exclusive live-streaming event they were hosting.

As part of the readiness exercise for this event, our engineering team recently scale-tested our platform, simulating some of the common behaviors of 50+ million viewers and computing stateful metrics with actionable real-time insights across all of these viewers. Across the streaming industry, a peak concurrency of 50 million viewers is a scale that few services can support.

We validated that, during a large-scale live event with 50+ million viewers watching a live stream, Conviva’s platform could seamlessly scale to ingest 300 GB of telemetry data per minute (about 75 million unique events per minute), compute 12 billion stateful metrics per minute for real-time and near-real-time use, and serve 20,000 metric queries per minute across APIs and dashboards, so that our customers can get actionable insights on their viewers’ experience in real time. This massive scale and processing capability empower operations teams to quickly identify and remediate issues while ensuring viewers enjoy seamless streaming experiences.
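
The per-minute figures above imply some per-second and per-event rates, and the short calculation below derives them. It is only back-of-the-envelope arithmetic: it assumes decimal gigabytes (1 GB = 10^9 bytes) and treats 50 million as a lower bound for "50+ million" viewers. The derived averages are ratios of the quoted numbers, not figures reported by Conviva.

```python
# Figures quoted for the scale test, all per minute.
VIEWERS = 50_000_000              # lower bound for "50+ million" (assumption)
INGEST_BYTES_PER_MIN = 300e9      # 300 GB, assuming 1 GB = 1e9 bytes
EVENTS_PER_MIN = 75_000_000
METRICS_PER_MIN = 12_000_000_000
QUERIES_PER_MIN = 20_000


def per_second(per_minute: float) -> float:
    """Convert a per-minute rate to a per-second rate."""
    return per_minute / 60


derived = {
    "ingest_gb_per_sec": per_second(INGEST_BYTES_PER_MIN) / 1e9,
    "events_per_sec": per_second(EVENTS_PER_MIN),
    "metrics_per_sec": per_second(METRICS_PER_MIN),
    "queries_per_sec": per_second(QUERIES_PER_MIN),
    # Averages implied by the quoted totals.
    "avg_bytes_per_event": INGEST_BYTES_PER_MIN / EVENTS_PER_MIN,
    "metrics_per_event": METRICS_PER_MIN / EVENTS_PER_MIN,
    "events_per_viewer_per_min": EVENTS_PER_MIN / VIEWERS,
}

for name, value in derived.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the platform ingests about 5 GB and 1.25 million events every second, an average event is roughly 4 KB, and each event feeds on the order of 160 metric computations.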

This test validated that Conviva’s Experience-centric Observability Platform can deliver actionable real-time insights during the world’s most significant streaming events and handle the scale of the world’s most popular consumer applications.

Key Takeaways and Learnings from Handling This Scale

Here are some high-level details of how we simulated an audience of 50+ million viewers.

    • We leveraged 1000+ compute nodes and networking resources from the public cloud to simulate 50 million concurrent users watching a live event.
    • This simulated traffic was then routed to our Operational Data Platform so that it could compute stateful metrics from the telemetry data and deliver stateful analytics on high-cardinality live and historical data.
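
To make "stateful metrics" concrete, here is a minimal sketch of one such metric: the time each session spends in each player state, derived from a stream of state-change events. This is an illustration only, not Conviva's implementation. The `StateEvent` schema, the state names, and the `time_in_state` helper are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StateEvent:
    """Hypothetical telemetry record: a session entered `state` at `timestamp`."""
    session_id: str
    timestamp: float  # seconds since session start
    state: str        # e.g. "buffering" or "playing" (assumed names)


def time_in_state(events, end_time):
    """Return {session_id: {state: seconds spent in that state}}.

    Each state lasts until the session's next event; the final state
    is assumed to last until `end_time`.
    """
    by_session = defaultdict(list)
    for event in events:
        by_session[event.session_id].append(event)

    result = {}
    for session_id, session_events in by_session.items():
        session_events.sort(key=lambda e: e.timestamp)
        totals = defaultdict(float)
        # Pair every event with its successor to get the state's duration.
        for current, following in zip(session_events, session_events[1:]):
            totals[current.state] += following.timestamp - current.timestamp
        last = session_events[-1]
        totals[last.state] += max(0.0, end_time - last.timestamp)
        result[session_id] = dict(totals)
    return result


events = [
    StateEvent("s1", 0.0, "buffering"),
    StateEvent("s1", 2.0, "playing"),
    StateEvent("s1", 10.0, "buffering"),
    StateEvent("s1", 11.0, "playing"),
]
print(time_in_state(events, end_time=20.0))
```

The metric is stateful because the value depends on the order of events and on how long each state lasted, so it cannot be computed by counting events independently.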

Here are some high-level details of how we hardened our platform to ensure reliability during such massive live streaming events:

    • Carried out comprehensive capacity planning to cover a range of scenarios.
    • A robust, scalable platform architecture with well-contained constraints enabled our engineers to scale vertically and horizontally as needed.
    • Our engineering teams use Prometheus and Grafana to monitor our platforms. Each component owner identified key metrics that serve as leading indicators, so that our teams can quickly detect and remediate issues without impacting our customers.
    • Enhanced monitoring of these leading indicators at every component level, with automated alerts that quickly detect critical bottlenecks in the platform.
    • Enhanced the platform with multiple levels of redundancy, in the form of multiple geographically distributed cloud regions, to eliminate any single point of failure.
    • Validated with happy-path and negative test cases that our platform is fault-tolerant, doesn’t require sampling, and offers high-fidelity insights by operating on the full census.

AI Alerts for Actionable Insights

Besides stateful analytics on real-time and historical data, Conviva’s platform also gives the operations team accurate AI-based alerts, so that they can quickly filter out noise and focus only on what truly impacts end-customer experiences.

Conviva’s AI Alerts provide unparalleled levels of actionability by scanning over 2 billion metrics per minute to identify potential anomalies. When the platform’s AI subsystem detects an anomaly, it doesn’t just trigger an alert but also provides a detailed analysis of the events that led to the anomaly.

For example, during one live event, our AI-based alerts scanned the cohorts in Conviva’s computed metrics and detected an anomaly: some viewers were experiencing a high number of exits before the video started. This detection triggered an alert with specific insights that the operations team could quickly triage.

Our Explainable AI explains not only why an alert fired but also which metrics were impacted, assigns the alert a specific threshold level (Info, Warning, etc.), and helps identify potential root causes. Such insight enables the engineering and operations teams to lower their mean time to detect (MTTD) and mean time to recover (MTTR). This intelligence enables operational teams to deploy resources effectively, mitigating issues before they widely impact user experience.
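
As a rough illustration of a cohort scan like the one described above, the sketch below flags cohorts whose current exits-before-video-start rate deviates sharply from their own recent history. It uses a simple z-score test, which is a stand-in technique chosen for brevity and not Conviva's AI method. The cohort names, sample rates, and the thresholds behind the Info and Warning levels are all assumptions.

```python
from statistics import mean, stdev
from typing import Optional

# Hypothetical thresholds, in standard deviations above the baseline.
INFO_Z = 2.0
WARNING_Z = 3.0


def classify(history: list[float], current: float) -> Optional[str]:
    """Return "Warning", "Info", or None for one cohort's metric value."""
    if len(history) < 2:
        return None  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return None  # flat baseline, z-score undefined
    z = (current - mean(history)) / sigma
    if z >= WARNING_Z:
        return "Warning"
    if z >= INFO_Z:
        return "Info"
    return None


def scan(cohorts: dict[str, tuple[list[float], float]]) -> dict[str, str]:
    """Scan every cohort and return only those that raise an alert."""
    alerts = {}
    for name, (history, current) in cohorts.items():
        level = classify(history, current)
        if level is not None:
            alerts[name] = level
    return alerts


# Exits-before-video-start rate per cohort: (recent history, current value).
baseline = [0.020, 0.022, 0.019, 0.021, 0.018]
cohorts = {
    "device=tv,region=east": (baseline, 0.050),
    "device=mobile,region=west": (baseline, 0.0235),
    "device=web,region=north": (baseline, 0.021),
}
print(scan(cohorts))
```

Running the scan per cohort rather than on the global average is what lets a problem confined to one device or region surface, even when the overall metric looks healthy.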

Real-Time Operational Platform for Seamless Streaming

In the streaming industry, just seconds of delay or buffering can result in thousands of lost viewers. With such high stakes, Conviva’s real-time operational platform, which computes metrics and triggers alerts on the full census (rather than on a sample), empowers operational teams with the insights they need to deliver optimal experiences to every user.

Conviva’s platform doesn’t just provide a wealth of data; it offers operations teams a depth of insight that is unmatched in the industry. Operations teams use Conviva to gain full context and stateful analytics on high-cardinality live and historical data, and to drill down on any metric into the regions, devices, and individual users experiencing an issue.

For example, during one key live streaming event, Conviva reported that just 0.03% of users (but still thousands of viewers) were experiencing a specific quality of experience (QoE) problem. The streaming providers were able to quickly identify the root cause and improve the experience for those users. This level of insight is only possible because we generate metrics on the full census rather than relying on sampling. That delivers instant error detection even for outlier cases, which is essential for delivering exceptional user experiences and maintaining an edge in this competitive streaming market.
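
The arithmetic below shows why a cohort this small is easy to miss when telemetry is sampled. The 0.03% figure comes from the example above; the audience size and the sample rates are hypothetical values chosen only for illustration.

```python
def affected_viewers(audience: int, affected_fraction: float) -> float:
    """Number of viewers in the affected cohort."""
    return audience * affected_fraction


def expected_affected_in_sample(
    audience: int, affected_fraction: float, sample_rate: float
) -> float:
    """Expected number of affected viewers that a uniform sample observes."""
    return affected_viewers(audience, affected_fraction) * sample_rate


AFFECTED_FRACTION = 0.03 / 100        # 0.03% of users, from the example
HYPOTHETICAL_AUDIENCE = 10_000_000    # assumption, not a reported figure

for sample_rate in (1.0, 0.10, 0.01):  # full census, 10% sample, 1% sample
    observed = expected_affected_in_sample(
        HYPOTHETICAL_AUDIENCE, AFFECTED_FRACTION, sample_rate
    )
    print(f"sample rate {sample_rate:.0%}: ~{observed:,.0f} affected viewers observed")
```

Under these assumptions, a 1% sample would observe only about 30 of the roughly 3,000 affected viewers. Once those are split further by region and device, too few remain per cohort to separate a real problem from noise.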

Uncompromising Commitment to Excellence

What sets Conviva apart is its uncompromising commitment to customer excellence. During the scale-up exercise as well as during live events, our engineering team partnered closely with our customers to ensure that our platform could offer high-fidelity insights.

As the streaming landscape continues to evolve, Conviva is at the forefront, leading the way to a new era of operational data and alerting. Our platform offers a canvas for innovation and a foundation for success, empowering operations teams to deliver the highest quality user experiences possible.

Ready to master the art of scale and redefine the future of digital experiences? Get started today.