Video Coding for Machines: The Need for Compression
September 26, 2024 / Posted By: Fabien Racapé, Hyomin Choi

More than 82% of all Internet traffic is video, and the vast majority of that traffic consists of 2D video streams viewed by end users on flat displays like smartphones, TVs, and tablets. Looking ahead, we expect the proportion of video traffic to increase with new devices, formats, and modalities, including video destined for head-mounted displays, high-resolution content beyond 4K, and immersive video streams augmented with haptic or other sensory information. For these applications to be successfully deployed, efficient compression technologies are essential, and MPEG standards will play a critical role in enabling them.

While much of the aforementioned video content is directed at human consumption, there is a rapidly expanding set of applications and use cases in which the video may never be seen by human eyes and where machines and algorithms are the intended consumers. For example, consider a next-generation factory assembly line with several video sensors that “watch” items pass by and send the video feed to a central analysis unit, which stops the assembly line when an unauthorized object is observed. These detailed and complex operations can all be perceived and processed without any human intervention.

Separately, the number of video surveillance cameras is expected to reach 1 billion globally by the end of 2026. This trend is bolstered by the remarkable success of AI algorithms and models at solving vision and perception tasks like object recognition and tracking at near-human levels, and by the widespread deployment of sensors capable of capturing and transmitting video streams, including home security cameras, factory automation sensors, and IoT devices. These low-cost devices typically lack the compute resources, or the battery capacity, to run the complex AI models the vision tasks require, so these operations are often better suited to edge compute servers or cloud environments. In this growing ecosystem, the need for new compression mechanisms arises.

For clarification, the term Video Coding for Machines (VCM) refers to two distinct concepts. First, VCM refers to the high-level concept of compressing video streams for machine consumption; second, it refers to the particular MPEG approach that addresses these needs. Both usages are explored in this blog.

Why Do We Need Video Compression for Machines?

As these use cases become more prevalent, it is entirely conceivable that the total volume of video traffic meant for machines and algorithms could rival that of video consumed by humans. A critical consideration is that algorithms and the human eye care about very different things. In the assembly line example above, the vision task only needs the parts of the frame that can indicate whether the detected object is unusual in shape, dimensions, or geometry; the rest of the frame is of no interest. The human eye would need very different information to contextualize and make sense of what is in the frame. By discarding this ancillary, human-specific information, we would need far fewer bits to support the successful execution of the machine task. Good compression algorithms and systems for machine vision will therefore be key to allowing networks to support this demand for bandwidth.

A common theme in the use cases we outlined is the “off-loading” of the compute, or the AI vision model that runs on the video, to a different entity at the network edge, or even in the cloud. With this flexibility, we can explore interesting trade-offs between power consumption, network bandwidth, and latency.

We can distinguish two approaches to this coding. In the first, the vision task is fully offloaded: the sensor device encodes and transmits the video to a remote server, which runs the AI vision model on the decoded video. The second approach splits the AI vision model into two parts, one that sits at the sensor device and one that runs on the remote server. This follows the so-called “split inference” architecture, which involves the transmission of intermediate features. In both cases, efficient compression of the data transmitted from the sensor device to the remote server is crucial to build viable network-assisted computer vision applications.
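To make the split-inference idea concrete, here is a minimal sketch of how an off-the-shelf model could be divided between a sensor device and a remote server. The choice of a torchvision ResNet-50 and the split point after `layer1` are purely illustrative assumptions, not part of any MPEG specification:

```python
import torch
from torch import nn
from torchvision.models import resnet50, ResNet50_Weights

# Illustrative split point: the sensor device runs the early layers,
# the server runs the rest. Layer names follow torchvision's ResNet-50.
model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()

device_part = nn.Sequential(           # runs on the sensor device
    model.conv1, model.bn1, model.relu, model.maxpool, model.layer1
)
server_part = nn.Sequential(           # runs on the remote server
    model.layer2, model.layer3, model.layer4,
    model.avgpool, nn.Flatten(1), model.fc
)

with torch.no_grad():
    frame = torch.rand(1, 3, 224, 224)   # stand-in for a captured frame
    features = device_part(frame)        # intermediate features to transmit
    logits = server_part(features)       # task result computed remotely

print(features.shape)   # e.g. torch.Size([1, 256, 56, 56])
print(logits.shape)     # torch.Size([1, 1000])
```

The intermediate `features` tensor is what would be compressed and sent over the network in the split-inference case; in the fully offloaded case, the encoded video itself is transmitted instead.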

Video and Feature Coding Standardization Efforts

These two approaches are both being actively studied for new standards in the MPEG Video Working Group.

Video and Feature Coding for Machines Diagram. Source: InterDigital

Video Coding for Machines

The first approach described above is being standardized in MPEG as Video Coding for Machines (VCM). As a standard, VCM saves energy at the sensor device but may require more bandwidth to transmit content than split inferencing. Video codecs can be made more efficient in this setting: because not every detail of the video is needed, only the information relevant to accomplishing the vision task is selectively coded.
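One simple way to picture selective coding (a toy illustration, not the actual VCM coding tools) is to keep only regions of interest and flatten everything else before handing the frame to a conventional encoder; a uniform background costs almost no bits. The `mask_outside_rois` helper and the ROI source below are hypothetical:

```python
import numpy as np

def mask_outside_rois(frame, rois, background=128):
    """Keep only regions of interest; flatten everything else to a constant.

    A uniform background compresses to almost nothing with a standard codec,
    so bits are spent only where the vision task needs detail. The ROI list
    would come from a lightweight detector or the task itself (hypothetical).
    """
    out = np.full_like(frame, background)
    for x, y, w, h in rois:
        out[y:y + h, x:x + w] = frame[y:y + h, x:x + w]
    return out

# Stand-in for a captured frame and a single detected region of interest.
frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
preprocessed = mask_outside_rois(frame, rois=[(600, 300, 400, 400)])
# `preprocessed` would then be fed to a conventional encoder (e.g. HEVC/VVC).
```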

Feature Coding for Machines

The second approach, Feature Coding for Machines (FCM), is also being studied and standardized in MPEG. Here, the features, i.e. the intermediate data produced by the device-side part of the model, are tailored to the target vision task(s) and compressed into bitstreams smaller than the original video before being transmitted over the network. This approach saves bandwidth but incurs a slightly higher energy cost from running part of the split AI model at the sensor device.
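As an illustration of what feature compression means in practice, here is a deliberately naive feature codec: uniform quantization of the intermediate tensor followed by general-purpose entropy coding. Real FCM designs use far more sophisticated, often learned, codecs; the function names and bit depth below are illustrative assumptions:

```python
import zlib
import numpy as np
import torch

def compress_features(features, n_bits=6):
    """Naive feature codec: uniform quantization + general-purpose entropy coding."""
    x = features.numpy()
    lo, hi = float(x.min()), float(x.max())
    levels = (1 << n_bits) - 1
    q = np.round((x - lo) / (hi - lo + 1e-8) * levels).astype(np.uint8)
    payload = zlib.compress(q.tobytes())
    side_info = {"shape": x.shape, "lo": lo, "hi": hi, "n_bits": n_bits}
    return payload, side_info

def decompress_features(payload, side_info):
    """Reconstruct an approximation of the features on the server side."""
    levels = (1 << side_info["n_bits"]) - 1
    q = np.frombuffer(zlib.decompress(payload), dtype=np.uint8)
    x = q.reshape(side_info["shape"]).astype(np.float32)
    x = x / levels * (side_info["hi"] - side_info["lo"]) + side_info["lo"]
    return torch.from_numpy(x)

# e.g. the output of the device-side part of a split model
features = torch.rand(1, 256, 56, 56)
payload, side = compress_features(features)
recon = decompress_features(payload, side)
print(len(payload), "bytes vs", features.numel() * 4, "bytes raw")
```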

In practice, either approach can be chosen depending on the use case, device type, hardware capabilities, network environment, and task requirements.

InterDigital's Approach to Video and Feature Coding

At InterDigital, our research investigates both approaches and their tradeoffs. Our open-source software CompressAI-Vision enables researchers to design and compare the performance of split and remote inferencing pipelines. We are also actively involved in MPEG standards, contributing software and serving as core experiment coordinators. Notably, InterDigital’s CompressAI-Vision was adopted as an evaluation platform for the MPEG-FCM standardization effort.

The Future of Video Coding for Machines

Some of the approaches we discussed can be applied to use cases that are already being defined and developed by industry, like smart cities, intelligent traffic management, vehicle-to-everything communications, and connected/intelligent farming, to name just a few. Many more are likely to emerge that have not yet been envisioned. Our research in this area aims to build efficient, flexible, and robust tools and algorithms that will allow these applications and ecosystems to be built and scaled. As a complement, our active participation in standards helps reduce the complexity of deploying these systems by designing standard interfaces and protocols, all of which supports the widespread deployment and eventual success of these future-looking tools.