Role

Research, WebRTC Implementation

Timeline

1 year

Team

4 Researchers

Tools

WebRTC, TensorFlow, JavaScript, PhpStorm

Project Overview

During the COVID-19 pandemic, video communications became an essential mode of information exchange. Our research team focused on addressing a critical challenge in this space: reducing latency in video streaming while maintaining quality of experience for users.

Our approach was novel: rather than compressing the entire video frame equally, we developed a human-centric methodology that prioritizes the human figure within each frame and selectively transmits that data. This research presented an opportunity to examine existing industry-level video communication tools, as well as to explore advanced image segmentation techniques for condensing video streams.

[Figure: Project timeline showing Clarify, Ideate, Develop, Implement, and Deliver phases]

Challenge

The conventional video streaming pipeline involves encoding data, transmitting it over a network, and decoding it on the receiving end. This process introduces significant latency, especially in bandwidth-constrained environments.

We identified several key challenges:

  • Standard video codecs compress all parts of the frame equally, without prioritizing important elements
  • Network transmission represents a major bottleneck in the streaming pipeline
  • Existing optimization techniques focus primarily on compression rather than selective transmission
  • Real-time applications require minimal latency while maintaining sufficient visual quality

These challenges led us to rethink the fundamental approach to video streaming by asking: What if we could selectively transmit only the most important elements of a video frame?

Conventional Video Streaming Pipeline

[Figure: Conventional video streaming pipeline]

Research & Approach

We began by analyzing the standard video codec workflow, which compresses video data uniformly without prioritizing any particular portion of the feed. Our hypothesis was that in video conferencing scenarios, the human figure is significantly more important than the background.

Our approach involved:

  • Utilizing WebRTC architecture as the foundation for our streaming solution
  • Implementing TensorFlow's PoseNet for real-time human pose detection and extraction
  • Developing a system that would transmit only the essential data related to human figures
  • Creating a reconstruction method that would animate the human figure on the receiving end

This methodology shifts the typical latency distribution in the pipeline: it increases local computation while drastically reducing data transmission requirements.
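To make that tradeoff concrete with illustrative numbers (ours, not measurements from the project): a pose of 17 keypoints, each stored as two 16-bit coordinates, occupies 17 × 2 × 2 = 68 bytes, while an encoded video frame at typical conferencing bitrates runs to kilobytes. The network payload shrinks by orders of magnitude, in exchange for running pose detection locally on every frame.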

[Figure: Research methodology]

Implementation

Our implementation focused on three key components:

1. WebRTC Integration

We utilized WebRTC's peer-to-peer architecture to establish real-time communication channels. This involved:

  • Using the getUserMedia() method to access local media streams
  • Creating dual peer connections for asynchronous streaming
  • Exchanging ICE candidates between the peers to establish a local connection

// Create two peer connections (looped back locally for testing);
// `configuration` holds the RTCPeerConnection options, e.g. ICE servers
const pc1 = new RTCPeerConnection(configuration);
const pc2 = new RTCPeerConnection(configuration);

// Exchange ICE candidates in both directions to establish the connection
pc1.onicecandidate = e => {
  if (e.candidate) {
    pc2.addIceCandidate(e.candidate);
  }
};
pc2.onicecandidate = e => {
  if (e.candidate) {
    pc1.addIceCandidate(e.candidate);
  }
};
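
The connection also needs media access and an offer/answer exchange. Below is a minimal sketch, assuming an async context, the same `configuration` object, and an RTCDataChannel as the transport for the pose payload (the transport choice is our assumption, and `reconstructPose` is a hypothetical handler):

// Local camera feed: input for pose detection, never transmitted directly
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
document.querySelector('#localVideo').srcObject = stream;

// One plausible transport for the compact pose payload
const poseChannel = pc1.createDataChannel('pose');
pc2.ondatachannel = e => {
  e.channel.onmessage = msg => reconstructPose(msg.data); // hypothetical handler
};

// Offer/answer exchange, done in-process since both peers are local
const offer = await pc1.createOffer();
await pc1.setLocalDescription(offer);
await pc2.setRemoteDescription(offer);
const answer = await pc2.createAnswer();
await pc2.setLocalDescription(answer);
await pc1.setRemoteDescription(answer);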

2. Pose Detection & Feature Extraction

We used TensorFlow's PoseNet model to detect and extract human poses from video frames:

  • Identifying 17 keypoints per person (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles)
  • Computing feature vectors with minimal byte requirements
  • Filtering to handle both single and multiple users in frame
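
As a concrete illustration of this step: the `@tensorflow-models/posenet` calls below are the library's real API (with `estimateMultiplePoses` available for the multi-user case), while the encoding, packing each of the 17 keypoints into two 16-bit integers for roughly 68 bytes per frame, is our illustrative choice rather than the project's exact format.

import * as posenet from '@tensorflow-models/posenet';

// Load the model once; reuse it for every frame
const net = await posenet.load();

// Estimate a single pose from the current video element
const video = document.querySelector('#localVideo');
const pose = await net.estimateSinglePose(video, { flipHorizontal: false });

// Pack the 17 keypoints into a compact binary feature vector:
// 17 keypoints × 2 coordinates × 2 bytes = 68 bytes per frame
function encodePose(pose) {
  const buf = new Uint16Array(pose.keypoints.length * 2);
  pose.keypoints.forEach((kp, i) => {
    buf[2 * i] = Math.round(kp.position.x);
    buf[2 * i + 1] = Math.round(kp.position.y);
  });
  return buf.buffer;
}

// Send over the data channel from the WebRTC sketch above
poseChannel.send(encodePose(pose));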

3. Animation Reconstruction

On the receiving end, we developed a system to reconstruct human figures using the transmitted pose data:

  • Prioritizing key poses for efficient frame-by-frame animation
  • Generating continuous movements from discrete pose data
  • Rendering the human figure within a simplified or static background
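
A minimal sketch of the receiving side, using the 68-byte encoding above; the linear interpolation and canvas rendering here are illustrative stand-ins for the project's animation method:

// Decode the binary payload back into keypoint coordinates
function decodePose(buffer) {
  const coords = new Uint16Array(buffer);
  const keypoints = [];
  for (let i = 0; i < coords.length; i += 2) {
    keypoints.push({ x: coords[i], y: coords[i + 1] });
  }
  return keypoints;
}

// Interpolate between the two most recently received poses so the figure
// moves continuously even though poses arrive as discrete frames
function interpolate(prev, next, t) {
  return prev.map((kp, i) => ({
    x: kp.x + (next[i].x - kp.x) * t,
    y: kp.y + (next[i].y - kp.y) * t,
  }));
}

// Draw the keypoints over a static background on each animation frame
function draw(ctx, keypoints) {
  ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height);
  for (const kp of keypoints) {
    ctx.beginPath();
    ctx.arc(kp.x, kp.y, 4, 0, 2 * Math.PI);
    ctx.fill();
  }
}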

Solution & Results

[Figure: Solution architecture with annotations]

Video Input Processing → Pose Detection & Feature Extraction → Data Transmission (Minimized Payload) → Animation Reconstruction → Rendering & Display

Our implementation demonstrated several key advantages:

  • Significant reduction in data transmission requirements (up to 90% less data compared to standard video codecs)
  • Decreased end-to-end latency in bandwidth-constrained environments
  • Maintained focus on the most important visual elements (human figures)
  • Successful real-time implementation through local connections

While our current implementation has been tested primarily through local connections, the results indicate substantial potential for reducing latency in real-world video conferencing applications, particularly in situations with limited bandwidth.

Future Directions

Our research has opened several promising avenues for future exploration:

  • Expanding implementation to cross-device communication over the internet
  • Refining pose detection algorithms for improved accuracy and efficiency
  • Developing more sophisticated animation techniques for natural-looking reconstruction
  • Integrating facial expression recognition for enhanced communication fidelity
  • Testing in various bandwidth-constrained environments to quantify performance improvements
  • Exploring additional applications beyond video conferencing (e.g., remote education, telemedicine)

This work represents a fundamental shift in how we approach video streaming, with potential implications for a wide range of real-time communication applications.