Enabling QoE-based Scheduling for Video Teleconferencing
via PSNR Time Series Prediction
Liangping Ma (a), Yuriy Reznik (a), Rahul Vanam (a), and Gregory Sternberg (b)
(a) InterDigital Communications, Inc., San Diego, CA 92121, USA;
(b) InterDigital Communications, Inc., King of Prussia, PA 19406, USA
ABSTRACT
The ultimate goal of network resource allocation for video teleconferencing is to optimize the Quality of
Experience (QoE) of the video. The IPPP video coding structure with macroblock intra refresh is widely used for
video teleconferencing. With such a video coding structure, the loss of a frame generally causes error propagation
to subsequent frames. A resource allocation decision of a communication network determines the QoE, given that
other conditions such as viewing conditions are fixed. Therefore, to optimize the QoE, a communication network
needs to be able to accurately predict the QoE for each of its resource allocation decisions and then select the
decision corresponding to the best QoE. In our previous work, we reduced the QoE prediction problem to one of
predicting the per-frame PSNR time series. The accuracy of the proposed per-frame PSNR prediction method
was demonstrated, however, only for low-resolution video sequences. In this paper, we show via simulations
that the per-frame PSNR prediction method achieves good performance for higher-resolution video sequences as
well.
Keywords: QoE, video, prediction, scheduling, network.
1. INTRODUCTION
In our previous work [1], we proposed a QoE prediction scheme for video teleconferencing and evaluated the
scheme on Common Intermediate Format (CIF) video sequences, which have a resolution of 352 × 288 pixels.
This resolution is low, especially considering that state-of-the-art mobile devices, such as smartphones,
support much higher resolutions. This motivates us to evaluate the proposed scheme [1] with
higher-resolution videos.
For completeness, we give the background for our previous work [1]. Video teleconferencing is
generally real time and widely uses the IPPP video coding structure, where the first frame is an intra-coded frame
and each P frame uses the frame immediately preceding it as the reference for motion-compensated prediction.
To meet the stringent delay requirement, the encoded video is typically delivered over RTP/UDP,
which is lossy in nature. When a packet loss occurs, the associated video frame as well as the subsequent
frames are affected, which is called error propagation. Packet loss information can be fed back to the
video sender or multipoint control unit (MCU) (which may perform transcoding) via protocols such as the RTP
Control Protocol (RTCP) to trigger video-side adaptation, such as the insertion of an intra-coded frame to stop
error propagation. However, the feedback delay is at least about a round trip time (RTT). To further alleviate
error propagation, macroblock intra refresh (encoding some macroblocks of each video frame in the intra
mode) is often used.
We realize that a video frame generally is mapped to one or multiple packets (or slices in the case of
H.264/AVC) and thus a packet loss does not necessarily lead to the loss of a whole frame. We focus on whole
frame losses, and leave the more general packet losses for future work.
Although there is no difference between the P frames in the IPPP video coding structure, the impact of
dropping a P frame can be dramatically different from P frame to P frame. For the purpose of illustration,
each time, we drop a frame, apply a simple error concealment technique (which is frame copy), and evaluate the
MSE of the immediately affected 10 decoded frames consisting of the dropped frame and the 9 frames thereafter.
The reason for looking at only 10 frames instead of the whole video is that RTCP is typically used in video
teleconferencing, and the video receiver can inform the video sender of a packet loss if a packet gets lost. For
Applications of Digital Image Processing XXXVI, edited by Andrew G. Tescher,
Proc. of SPIE Vol. 8856, 88560I · © 2013 SPIE · CCC code: 0277786X/13/$18
doi: 10.1117/12.2026910
a frame rate of 30 frames per second, 10 frames correspond to an RTCP feedback delay of 1/3 second. One
potential cause of the variation in the impact of dropping a P frame is the difference in the
sizes of the P frames. For a communication network, the impact should be evaluated against the network
resource that would be consumed. Therefore, we normalize the impact by the frame size. As an example, Fig. 1
shows the average MSE per 1000 bytes of data as a function of the dropped frame for the RaceHorses video with
a resolution of 832 × 480 pixels, encoded in H.264/AVC with a quantization parameter (QP) of 30. The normalized
MSE corresponding to dropping frame 30 is 1429, while that corresponding to dropping frame 39 is 449, a
ratio of more than 3 to 1. This presents an opportunity for a communication network to intelligently drop
certain video packets in the event of network congestion to optimize the video quality.
[Figure 1: plot of MSE per KB of data (vertical axis) versus frame dropped (horizontal axis).]
Figure 1. The impact of dropping a P frame: the average MSE per 1000 bytes of data for the immediately affected 10 frames
as a function of the dropped frame for the RaceHorses video.
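The normalization used for Fig. 1 can be illustrated with a short sketch. It is not the simulation code behind the figure; the function name `normalized_impact`, the toy MSE values, and the frame size are hypothetical and chosen only to show the arithmetic.

```python
def normalized_impact(mse_values, frame_size_bytes, window=10):
    """Average MSE over the affected window, per 1000 bytes of the dropped frame.

    mse_values: per-frame MSE of the dropped frame and the frames after it
                (hypothetical input, e.g. measured after frame-copy concealment).
    frame_size_bytes: encoded size of the dropped frame.
    """
    if frame_size_bytes <= 0:
        raise ValueError("frame_size_bytes must be positive")
    affected = list(mse_values)[:window]
    if not affected:
        raise ValueError("mse_values must not be empty")
    average_mse = sum(affected) / len(affected)
    return average_mse / (frame_size_bytes / 1000.0)


# Toy numbers (not taken from the paper): 10 affected frames, 2500-byte frame.
toy_mse = [120.0, 110.0, 100.0, 90.0, 80.0, 70.0, 60.0, 50.0, 40.0, 30.0]
impact = normalized_impact(toy_mse, frame_size_bytes=2500)
```

A larger value means that dropping the frame costs more quality for each kilobyte of network resource saved.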
The QoE prediction scheme that we proposed [1] has low delay, computational complexity, and communication
overhead, enabling a network to allocate network resources so as to optimize the QoE. Specifically, with such a
scheme, the network knows the resulting QoE for each resource allocation decision (e.g., dropping certain frames
in the network), so that it can perform optimal resource allocation by selecting the decision corresponding to the best
QoE. Because of the predictive nature of the problem and the fact that the prediction has to be done within a
communication network, we need a QoE model that is feasible to compute, which motivates us to consider those
(e.g., [2, 3]) that use the per-frame PSNR time series as the input. A per-frame PSNR time series is a sequence of
PSNR values, each indexed by its corresponding frame number. The QoE prediction problem then reduces to
one of predicting the per-frame PSNR. The proposed QoE prediction scheme is jointly implemented by the video
sender (or MCU) and the communication network. Simulation results show that the proposed per-frame PSNR
prediction method can achieve an average error much less than 1 dB.
To predict the per-frame PSNR, we can look at the channel distortion and the source distortion separately.
The challenge for PSNR prediction is the channel distortion. We briefly review related work on channel distortion.
Some aspects of the channel distortion model in the seminal work [4] are adopted in our work. An additive
exponential model proposed in [5] is shown to have good performance. However, the determination of the model
requires some information (the motion reference ratio) of the predicted video frames to be known a priori. This
is possible only if the encoder generates all the video frames up to the predicted frame, introducing significant
delay. For example, to predict the channel distortion 10 frames ahead, assuming a frame rate of 30 frames per
second, the delay will be 333 ms. A model taking into account the cross-correlation among multiple frame losses
is proposed for channel distortion [6]. However, in the parameter estimation, the whole video sequence needs to
be known in advance, making it infeasible for real-time applications. Pixel-level channel distortion prediction
models are proposed for optimizing the video encoder [7]; although accurate, they are overkill for the problem
we look at. Thus, in this paper, we consider the simpler frame-level distortion prediction.
The remainder of the paper is organized as follows. Section 2 describes the QoE prediction scheme, Section
3 gives the simulation results on the per-frame PSNR prediction for higher-resolution videos, and Section 4
concludes the paper.
2. QOE PREDICTION
For completeness, we introduce the QoE prediction scheme we proposed previously [1].
2.1 Choosing QoE Models
Subjective video quality testing is the ultimate method to measure the video quality perceived by the human
visual system (HVS). However, subjective testing requires playing the video to a group of human subjects under
stringent testing conditions [8] and collecting the ratings of the video quality, which is time consuming, expensive,
and unable to provide real-time assessment results a posteriori, not to mention predicting the video quality.
Alternatively, QoE models can be constructed by relating QoS metrics to video QoE [2, 3, 9, 10]. The ITU
recommendation G.1070 [10] considers the packet loss rate rather than the packet loss pattern in modeling the QoE,
which is insufficient, as indicated by our example in Section 1 where the pattern is a single frame loss. The model
in [9] has the same problem. The ITU recommendation G.1070 also requires extensive offline subjective testing to
construct a large number of QoE models, and the extraction of certain video features (e.g., degree of motion) [11] during
prediction for the desired accuracy, making it unsuitable for real-time applications.
The QoE model proposed in [2] uses the statistics extracted from the per-frame peak signal-to-noise ratio (PSNR)
time series, which are QoS metrics, as the model input. Some of the statistics are the minimum, maximum,
standard deviation, the 90% and the 10% percentiles, and the difference in PSNR between two consecutive
frames. Although the average PSNR of a video sequence is generally considered a flawed video quality metric,
the model in [2] is shown to outperform Video Quality Metric (VQM) [12] and Structural Similarity (SSIM) [13] in
terms of correlation to subjective testing results. With the choice of such QoE models, the QoE prediction
problem reduces to one that predicts the per-frame PSNR time series.
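As an illustration of the kind of input such a QoE model consumes, the sketch below extracts the statistics listed above from a per-frame PSNR time series. It is not the model of [2]: the function names and the toy series are hypothetical, the percentile uses simple linear interpolation, and summarizing the consecutive-frame difference by its maximum absolute value is an assumption made here.

```python
import statistics


def percentile(values, q):
    """Percentile with linear interpolation between order statistics, q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def psnr_series_features(psnr):
    """Summary statistics of a per-frame PSNR time series (values in dB)."""
    diffs = [abs(b - a) for a, b in zip(psnr, psnr[1:])]
    return {
        "min": min(psnr),
        "max": max(psnr),
        "std": statistics.pstdev(psnr),
        "p10": percentile(psnr, 10),
        "p90": percentile(psnr, 90),
        # Assumption: the consecutive-frame difference is summarized by its maximum.
        "max_consecutive_diff": max(diffs) if diffs else 0.0,
    }


# Toy series (not from the paper): a frame loss at index 2 followed by recovery.
toy_psnr = [34.0, 33.5, 26.0, 28.0, 30.0, 31.5, 32.5, 33.0, 33.5, 34.0, 34.0]
features = psnr_series_features(toy_psnr)
```

A feature vector of this kind would then be mapped to a QoE score by the chosen model; that mapping is outside the scope of this sketch.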
2.2 The Proposed QoE Prediction Approach
Before discussing various approaches to QoE prediction, we note that the pattern of packet losses is important,
because the video quality, or specifically the statistics of the per-frame PSNR time series, depends on not only how
many frame losses have occurred, but also where they have occurred in the video sequence.
There are three approaches to QoE prediction. In a sender-only approach, the per-frame PSNR time series
for each frame loss pattern is obtained by simulation at the video sender. However, the number of frame loss
patterns grows exponentially with the number of video frames. Even if the amount of computation is not an issue,
the resulting per-frame PSNR time series need to be sent to the communication network, generating excessive
communication overhead.
In a network-only approach, the network decodes the video and finds out the channel distortion for different
packet loss patterns. However, the video quality depends on not only the channel distortion, but also the
distortion from source coding. Lacking access to the original video, the network cannot know
the source distortion, which makes the QoE prediction inaccurate. Also, this approach does not scale when
the network serves a very large number of video teleconferencing sessions simultaneously. Finally, this approach
may not be suitable when the video packets are encrypted.
We propose a joint approach that involves both the video sender (or MCU) and the network. The video sender
obtains the channel distortion for single frame losses, and passes the results along with the source distortion to
the network. The network knows the QoE model, and for each resource allocation decision, it calculates the total
distortion for each frame (and hence the per-frame PSNR time series) by utilizing the linearity assumption for
multiple frame losses [4, 5]. This approach eliminates virtually all the communication overhead in the sender-only
approach, takes into account source distortion absent in the network-only approach, and does not need to do
video encoding/decoding in the network.
[Figure 2: block diagram of the video sender. Blocks: Video Encoder (delay t1), Video Decoder (delay t2), Channel Distortion Model (delay t3), and Annotation, whose output goes to the Network. Signals: F(n), G(n), ds(n), d0(n), α(n−m), γ(n−m).]
Figure 2. System architecture of the video sender.
2.3 Per-frame PSNR Prediction
As mentioned before, frame-level channel distortion prediction is appropriate for the problem we focus on. We
first look at the video sender side, whose architecture is shown in Fig. 2. Let the number of pixels in a frame
be $N$. Let $F(n)$, a vector of length $N$, be the $n$th original frame, and let $F(n, i)$ denote pixel $i$ of $F(n)$. Let $\hat{F}(n)$
be the reconstructed frame without frame loss corresponding to $F(n)$, and $\hat{F}(n, i)$ be pixel $i$ of $\hat{F}(n)$. Original
video frame $F(n)$ is fed into the video encoder, which generates packet $G(n)$ after a delay of $t_1$ seconds. Packet
$G(n)$ may represent multiple NAL units, which we together call a packet for convenience. Packet $G(n)$ is then
fed into the video decoder to generate the reconstructed frame $\hat{F}(n)$, which takes $t_2$ seconds. Note that in a
typical video encoder, this reconstruction is already in place. Let the distortion due to source coding for $F(n)$
be $d_s(n)$. Then,
$$ d_s(n) = \frac{1}{N} \sum_{i=1}^{N} \left( F(n, i) - \hat{F}(n, i) \right)^2, \tag{1} $$
which is readily available at the video encoder.
As mentioned earlier, the construction of the channel distortion model in [5] requires some information (the
motion reference ratio) of the predicted video frames to be known in advance, which results in significant
delay. To address this problem, we propose using the current packet $G(n)$ and the previously generated packets
$G(n-1), \ldots, G(n-m)$ to train a channel distortion model. In Fig. 2, D represents a delay of one inter-frame
time interval. The training takes $t_3$ seconds. Note that $t_3 \ge t_2$, because the Channel Distortion Model needs
to decode at least one frame. The values of the parameters for the model are then sent to the Annotation
block for annotation. Also annotated is the source distortion $d_s(n)$. The annotated packet is then sent to the
communication network.
We now look at the details of the channel distortion model. Prior results show that a linearity model performs
well in practice [4, 5]. For each frame loss, we define a function $h(k, l)$ [5], which models how much distortion the loss
of frame $k$ causes to frame $l$ for $l \ge k$:
$$ h(k, l) = d_0(k)\, \frac{e^{-\alpha(k)(l-k)}}{1 + \gamma(k)(l-k)}, \tag{2} $$
where $d_0(k)$ is the channel distortion for frame $k$, resulting from the loss of frame $k$ only and the error concealment,
and $\alpha(k)$ and $\gamma(k)$ are parameters dependent on frame $k$. In this paper, we consider a simple error concealment
scheme, namely, frame copy. Hence the distortion due to the loss of frame $k$ (and only frame $k$) is
$$ d_0(k) = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{F}(k, i) - \hat{F}(k-1, i) \right)^2. \tag{3} $$
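Equations (1) and (3) are both mean squared errors between two frames, which the following sketch makes explicit. Frames are represented as flat lists of pixel values; the function names and the four-pixel toy frames are hypothetical and serve only as an illustration.

```python
def mse(frame_a, frame_b):
    """Mean squared error between two frames given as equal-length pixel lists."""
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)


def source_distortion(original, reconstructed):
    """Eq. (1): distortion between the original frame and its loss-free reconstruction."""
    return mse(original, reconstructed)


def frame_copy_distortion(reconstructed_k, reconstructed_prev):
    """Eq. (3): distortion when frame k is lost and replaced by a copy of frame k-1."""
    return mse(reconstructed_k, reconstructed_prev)


# Toy four-pixel frames (not real video data).
original_k = [100, 102, 98, 101]
recon_k = [101, 102, 97, 101]
recon_prev = [99, 100, 97, 105]

ds_k = source_distortion(original_k, recon_k)
d0_k = frame_copy_distortion(recon_k, recon_prev)
```

Both quantities need only data that the encoder already holds, namely the original frame and the reconstructed frames.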
In (2), $\gamma(k)$ is called leakage, which describes the efficiency of loop filtering in removing the artifacts introduced
by motion compensation and transformation [4]. The term $e^{-\alpha(k)(l-k)}$ captures the error propagation in the case
of pseudo-random macroblock intra refresh. In [4], a linear function $(1 - (l-k)\beta)$, where $\beta$ is the intra refresh
rate, is proposed. We do not use this linear function, because the macroblock intra refresh scheme in [4] is cyclic,
while the one used in our simulation software (JVT JM 16.2 [14]) is pseudo-random. The linear model states that
the impact vanishes after $1/\beta$ frames (the intra refresh update interval for the cyclic scheme), which is not the
case for the pseudo-random scheme, as suggested by our simulation results. An exponential model such as the
one in [5] is better. However, the model in [5] fails to capture the impact of loop filtering, while our model does
capture it. The values of $\alpha(k)$ and $\gamma(k)$ can be obtained by fitting simulation data with methods such as least
squares or least absolute value. In Fig. 2, the video sender drops packet $G(n-m)$ from the packet sequence
$G(n), G(n-1), \ldots, G(n-m)$, performs video decoding, measures the channel distortions, and finds the value of
$\alpha(n-m)$ (denoted $\hat{\alpha}(n-m)$) and the value of $\gamma(n-m)$ (denoted $\hat{\gamma}(n-m)$) in (2), with the substitution
$k = n-m$, that minimize the error between the measured distortions and the predicted distortions.
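The fitting step can be sketched as follows. The paper does not specify the solver, so this example assumes an exhaustive grid search that minimizes the squared error, which is one simple way to obtain a least-squares fit of (2). The function names, the grid, and the synthetic "measured" distortions are hypothetical; in practice the measurements would come from decoding with packet $G(n-m)$ dropped.

```python
import math


def h_model(d0, alpha, gamma, lag):
    """Eq. (2) evaluated at lag = l - k frames after the lost frame."""
    return d0 * math.exp(-alpha * lag) / (1.0 + gamma * lag)


def fit_alpha_gamma(measured, d0, grid):
    """Grid-search least-squares fit of (alpha, gamma).

    measured[j] is the measured channel distortion j frames after the loss,
    so measured[0] corresponds to the lost frame itself.
    """
    best = None
    for alpha in grid:
        for gamma in grid:
            error = sum(
                (h_model(d0, alpha, gamma, lag) - value) ** 2
                for lag, value in enumerate(measured)
            )
            if best is None or error < best[0]:
                best = (error, alpha, gamma)
    return best[1], best[2]


# Synthetic measurements generated from known parameters (assumption for the demo).
grid = [i / 100.0 for i in range(101)]  # candidate values 0.00, 0.01, ..., 1.00
true_alpha, true_gamma, d0_value = 0.10, 0.25, 50.0
measured = [h_model(d0_value, true_alpha, true_gamma, lag) for lag in range(11)]

alpha_hat, gamma_hat = fit_alpha_gamma(measured, d0_value, grid)
```

A gradient-based or closed-form solver could replace the grid search; the grid is used here only because it keeps the sketch dependency-free.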
We next look at the network side. We assume that the network has packets $G(n), G(n-1), \ldots, G(n-L)$
available. Let $I(k)$ be the indicator function, being 1 if frame $k$ is dropped and 0 otherwise. A packet loss
pattern can be characterized by a sequence of $I(k)$'s. For convenience we denote a pattern by a vector
$P := (I(n), I(n-1), \ldots, I(0))$. The channel distortion of frame $l \ge n-L$ resulting from $P$ is then predicted as
$$ \hat{d}_c(l, P) = \sum_{k=0}^{l} I(k)\, \hat{h}(k, l), \tag{4} $$
where the linearity assumption for multiple frame losses in [4, 5] is used, and
$$ \hat{h}(k, l) = d_0(k)\, \frac{e^{-\hat{\alpha}(k-m)(l-k)}}{1 + \hat{\gamma}(k-m)(l-k)}. \tag{5} $$
We realize that the model in (4) can be improved, for example, by considering the cross-correlation of frame
losses [6]. However, as mentioned earlier, the model in [6] is not suitable for real-time applications, and its complexity
is very high. The simple model in (4) proves to be reasonably accurate [4, 5].
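A minimal sketch of the network-side computation in (4) and (5) is given below. The data layout is an assumption: the annotated values $d_0(k)$, $\hat{\alpha}(k-m)$, and $\hat{\gamma}(k-m)$ are held in dictionaries keyed by frame index, and the set `dropped` lists the frames with $I(k) = 1$. All names and numbers are hypothetical.

```python
import math


def h_hat(k, l, d0, alpha_hat, gamma_hat, m):
    """Eq. (5): predicted distortion in frame l caused by the loss of frame k (l >= k)."""
    lag = l - k
    alpha = alpha_hat[k - m]
    gamma = gamma_hat[k - m]
    return d0[k] * math.exp(-alpha * lag) / (1.0 + gamma * lag)


def predict_channel_distortion(l, dropped, d0, alpha_hat, gamma_hat, m):
    """Eq. (4): sum the contributions of every dropped frame k <= l."""
    return sum(
        h_hat(k, l, d0, alpha_hat, gamma_hat, m) for k in dropped if k <= l
    )


# Toy annotations (not from the paper), with m = 2 and frames 5 and 7 dropped.
m = 2
d0 = {5: 40.0, 7: 20.0}
alpha_hat = {3: 0.0, 5: 0.0}   # indexed by k - m, as in Eq. (5)
gamma_hat = {3: 1.0, 5: 1.0}
dropped = {5, 7}

dc_frame_4 = predict_channel_distortion(4, dropped, d0, alpha_hat, gamma_hat, m)
dc_frame_5 = predict_channel_distortion(5, dropped, d0, alpha_hat, gamma_hat, m)
dc_frame_7 = predict_channel_distortion(7, dropped, d0, alpha_hat, gamma_hat, m)
```

Because the contributions simply add, evaluating a new loss pattern requires no decoding in the network, only a few multiplications per dropped frame.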
In order to predict the per-frame PSNR for a particular packet loss pattern $P$, the network needs to know
the source distortion as well. The total distortion prediction can be represented as
$$ \hat{d}(l, P) = \hat{d}_c(l, P) + \hat{d}_s(l), \tag{6} $$
where $\hat{d}_s(l) = d_s(l)$ for $n \ge l \ge n-L$, and $\hat{d}_s(l) = d_s(n)$ for $l > n$, and where we have applied the assumption
that the channel distortion and the source distortion are independent, which has been shown to be quite accurate [15].
Note that the source distortion estimate $\hat{d}_s(l)$ for $n \ge l \ge n-L$ is precise and readily available at the video
sender and is included in the annotation of the $L+1$ packets $G(n), \ldots, G(n-L)$.
The PSNR prediction for frame $l \ge n-L$ with packet loss pattern $P$ is then
$$ \widehat{\mathrm{PSNR}}(l, P) = 10 \log_{10}\!\left( \frac{255^2}{\hat{d}(l, P)} \right). \tag{7} $$
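Equations (6) and (7) translate directly into code. The sketch below is illustrative only: the function names and numbers are hypothetical, the source distortions are held in a dictionary keyed by frame index, and a zero total distortion is mapped to infinity as a convention chosen here.

```python
import math


def source_distortion_estimate(l, n, ds):
    """Source distortion used in Eq. (6): ds[l] if frame l is known, else ds[n]."""
    return ds[l] if l <= n else ds[n]


def predict_psnr(channel_distortion, source_distortion, peak=255.0):
    """Eqs. (6) and (7): total distortion, then PSNR in dB for 8-bit video."""
    total = channel_distortion + source_distortion
    if total <= 0:
        return float("inf")  # convention chosen here for a distortion-free frame
    return 10.0 * math.log10(peak ** 2 / total)


# Toy values (not from the paper): 255^2 = 65025.
ds = {8: 60.0, 9: 65.025}
ds_future = source_distortion_estimate(12, 9, ds)  # frame 12 > n = 9, so ds[9] is reused

psnr_no_loss = predict_psnr(0.0, ds_future)        # about 30 dB
psnr_with_loss = predict_psnr(585.225, ds_future)  # about 20 dB
```

In this toy case, a channel distortion nine times the source distortion lowers the predicted PSNR by about 10 dB.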
The per-frame PSNR time series is then $\{\widehat{\mathrm{PSNR}}(l, P)\}$, where $l$ is the time index. The time series is a
function of $P$. Thus, to generate the best time series, the network chooses the optimal $P$ among those that are
feasible under the resource constraint. Note that part of $P$, i.e., $I(n-L-1), I(n-L-2), \ldots, I(0)$, is already
determined, because a frame between 0 and $n-L-1$ has been either delivered or dropped. The variables subject
to optimization are the remaining part of $P$, i.e., $I(n-L), \ldots, I(n)$. We define the prediction length $\lambda$ as the
number of frames to be predicted. That is, if the $n$th frame is to be dropped, then the predictor predicts frames
$n$ through $n+\lambda$. Note that it is not necessary to predict many frames, since it takes the video encoder not more
than one RTT to receive feedback about a frame loss.
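The selection of $P$ can be sketched as an exhaustive search over the undetermined decisions $I(n-L), \ldots, I(n)$. Everything specific in this example is an assumption made for illustration: the resource constraint is modeled as a byte budget, the QoE score is supplied as a callable, and the toy score is a simple penalty per dropped frame rather than a QoE model applied to a predicted PSNR time series.

```python
from itertools import product


def choose_pattern(frames, sizes, budget_bytes, qoe_of_pattern):
    """Return the feasible set of dropped frames with the highest QoE score.

    frames: indices n-L, ..., n whose drop decisions are still open.
    sizes: encoded size in bytes of each frame.
    budget_bytes: bytes the network can deliver (hypothetical constraint).
    qoe_of_pattern: callable mapping a frozenset of dropped frames to a score.
    """
    best_score, best_dropped = None, None
    for bits in product((0, 1), repeat=len(frames)):
        dropped = frozenset(f for f, bit in zip(frames, bits) if bit)
        sent_bytes = sum(sizes[f] for f in frames if f not in dropped)
        if sent_bytes > budget_bytes:
            continue  # infeasible under the resource constraint
        score = qoe_of_pattern(dropped)
        if best_score is None or score > best_score:
            best_score, best_dropped = score, dropped
    return best_dropped, best_score


# Toy setup (not from the paper): three open frames, 6000 bytes queued, 5000 allowed.
frames = [10, 11, 12]
sizes = {10: 3000, 11: 1000, 12: 2000}
penalty = {10: 9.0, 11: 2.0, 12: 5.0}  # stand-in for the predicted quality loss


def toy_qoe(dropped):
    return -sum(penalty[f] for f in dropped)


best_dropped, best_score = choose_pattern(frames, sizes, 5000, toy_qoe)
```

The search visits $2^{L+1}$ patterns, which is manageable only for small $L$, such as the $L = 5$ used in the simulations.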
3. SIMULATION RESULTS
We evaluate the performance of the proposed per-frame PSNR prediction method via simulation. We consider
both single frame losses and multiple frame losses. The 832 × 480 RaceHorses video sequence is used. For
m = 10, L = 5, and λ = 8, Fig. 3 shows the prediction for frames l ≥ 49 if frame 49 is dropped, and Fig. 4 for
frames l ≥ 45 if frames 45 and 48 are dropped.
[Figure 3: plot of PSNR (dB) versus frame number, with curves labeled Actual and Predicted.]
Figure 3. The per-frame PSNR prediction for a single frame loss at frame 49.
[Figure 4: plot of PSNR (dB) versus frame number, with curves labeled Actual and Predicted.]
Figure 4. The per-frame PSNR prediction for two frame losses at frames 45 and 48.
For single frame losses, we drop a frame, evaluate the prediction errors, and then drop the next frame and
so on. The prediction error is measured by the absolute per-frame PSNR prediction error, which is defined as
the absolute value of the difference between the actual per-frame PSNR and the predicted value, both in dB.
We plot the cumulative distribution function (CDF) of the absolute per-frame PSNR prediction error in Fig. 5.
We consider prediction lengths of 8 (blue dashed lines) and 5 (red solid lines). The mean values of the absolute
prediction error are 0.74 dB and 0.52 dB for prediction lengths 8 and 5, respectively.
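The error metric and its empirical CDF can be computed as in the following sketch. The function names and the toy PSNR values are hypothetical and do not reproduce the simulation data.

```python
def absolute_errors(actual_db, predicted_db):
    """Absolute per-frame PSNR prediction errors, in dB."""
    if len(actual_db) != len(predicted_db):
        raise ValueError("series must have the same length")
    return [abs(a - p) for a, p in zip(actual_db, predicted_db)]


def empirical_cdf(errors, threshold_db):
    """Fraction of errors that do not exceed threshold_db."""
    if not errors:
        raise ValueError("errors must not be empty")
    return sum(1 for e in errors if e <= threshold_db) / len(errors)


# Toy values (not from the paper).
actual = [30.0, 28.0, 29.0, 31.0]
predicted = [30.5, 27.0, 29.0, 31.5]

errors = absolute_errors(actual, predicted)
mean_abs_error = sum(errors) / len(errors)
cdf_at_half_db = empirical_cdf(errors, 0.5)
```

Evaluating `empirical_cdf` over a range of thresholds yields a curve of the kind plotted in the CDF figures.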
For multiple losses, we consider a particular frame loss pattern: two frame losses with a gap of 2 frames in
between. The CDF of the absolute per-frame PSNR prediction error is shown in Fig. 6. The mean values of the
absolute prediction error are 0.88 dB and 0.63 dB for prediction lengths 8 and 5, respectively.
[Figure 5: plot of CDF versus per-frame PSNR prediction error (dB), with curves labeled L=8 and L=5.]
Figure 5. The CDF of per-frame PSNR prediction error for single frame losses.
[Figure 6: plot of CDF versus per-frame PSNR prediction error (dB), with curves labeled L=5 and L=8.]
Figure 6. The CDF of per-frame PSNR prediction error for two frame losses with a gap of 2 frames in between.
4. CONCLUSION
In our previous work, we proposed a QoE prediction scheme that allows a communication network to optimize
resource allocation for video teleconferencing. By using QoE models that take the per-frame PSNR time series
as the input, the QoE prediction problem reduces to a per-frame PSNR prediction problem. Simulation results
showed that the proposed per-frame PSNR prediction method achieves an average prediction error well below
1 dB for the relatively low-resolution CIF videos. In this paper, we show that similar performance is achieved for
higher (4×) resolution videos.
Acknowledgements
The authors would like to thank Dr. Robert A. DiFazio and Mr. Christopher Wallace of InterDigital Innovation
Labs for helpful comments.
REFERENCES
1. L. Ma, T. Xu, G. Sternberg, A. Balasubramanian, and A. Zeira, “Model-based QoE prediction to enable
better user experience for video teleconferencing,” in IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), May 2013.
2. C. Keimel, T. Oelbaum, and K. Diepold, “Improving the prediction accuracy of video quality metrics,”
in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas,
March 2010, pp. 2442–2445.
3. C. Keimel, M. Rothbucher, H. Shen, and K. Diepold, “Video is a cube: Multidimensional analysis and video
quality metrics,” IEEE Signal Processing Magazine, pp. 41–49, Nov. 2011.
4. K. Stuhlmüller, N. Färber, M. Link, and B. Girod, “Analysis of video transmission over lossy channels,”
IEEE Journal on Selected Areas in Communications, vol. 18, no. 6, pp. 1012–1032, June 2000.
5. U. Dani, Z. He, and H. Xiong, “Transmission distortion modeling for wireless video communication,” in
IEEE Global Telecommunications Conference (GLOBECOM), Dec 2005.
6. Y. J. Liang, J. G. Apostolopoulos, and B. Girod, “Analysis of packet loss for compressed video: Effect of
burst losses and correlation between error frames,” IEEE Trans. Circuits and Systems for Video Technology,
vol. 18, no. 7, pp. 861–874, July 2008.
7. Z. Chen and D. Wu, “Prediction of transmission distortion for wireless video communication: Algorithm
and application,” Journal of Visual Communication and Image Representation, vol. 21, no. 8, pp. 948–964,
Nov. 2010.
8. Subjective Video Quality Assessment Methods for Multimedia Applications, ITU-T Recommendation P.910,
Sep. 1999.
9. M. Venkataraman and M. Chatterjee, “Inferring video QoE in real time,” IEEE Network, pp. 4–13, Jan./Feb.
2011.
10. Opinion Model for Video-Telephony Applications, ITU-T Recommendation G.1070, 2007.
11. Jose Joskowicz and J. Carlos López Ardao, “Enhancements to the opinion model for video-telephony
applications,” in Proceedings of the 5th International Latin American Networking Conference (LANC), 2009, pp.
87–94.
12. M. Pinson and S. Wolf, “A new standardized method for objectively measuring video quality,” IEEE
Trans. Broadcasting, vol. 50, no. 3, pp. 312–322, Sep. 2004.
13. Z. Wang, L. Lu, and A. C. Bovik, “Video quality assessment based on structural distortion measurement,”
Signal Processing: Image Communication, vol. 19, no. 2, pp. 121–132, Feb 2004.
14. ITU, “H.264/AVC reference software,” Online, Oct 2012, iphome.hhi.de/suehring/tml/download/.
15. Zhihai He, Jianfei Cai, and Chang Wen Chen, “Joint source channel rate-distortion analysis for adaptive
mode selection and rate control in wireless video coding,” IEEE Trans. Circuits and Systems for Video
Technology, vol. 12, no. 6, pp. 511–523, June 2002.