MODEL-BASED QOE PREDICTION TO ENABLE BETTER USER EXPERIENCE FOR
VIDEO TELECONFERENCING
Liangping Ma* Tianyi Xu† Gregory Sternberg‡ Anantharaman Balasubramanian* Ariela Zeira*
* InterDigital Communications, Inc., San Diego, CA 92121, USA
† Dept. of Electrical and Computer Engineering, University of Delaware, Newark, DE 19716, USA
‡ InterDigital Communications, Inc., King of Prussia, PA 19406, USA
ABSTRACT
The ultimate goal of network resource allocation for video
teleconferencing is to optimize the Quality of Experience
(QoE) of the video. We consider the IPPP video coding
structure with macroblock intra refresh, which is widely used
for video teleconferencing. Generally, the loss of a current
frame causes error propagation to subsequent frames due to
the video coding structure. Therefore, to optimize the QoE,
a communication network needs to be able to accurately
predict the consequence of each of its resource allocation
decisions. We propose a QoE prediction scheme by considering
QoE models that use the per-frame PSNR time series
as the input, thus reducing the QoE prediction problem to
a per-frame PSNR prediction problem. The QoE prediction
scheme is jointly implemented by the video sender (or MCU)
and the communication network. Simulation results show
that the proposed per-frame PSNR prediction method is
accurate, with an average error well below 1 dB.
Index Terms— QoE, video, prediction, scheduling, network.
1. INTRODUCTION
In real-time video applications such as video teleconferencing,
the IPPP video coding structure is widely used, where
the first frame is an intra-coded frame, and each P frame uses
the frame immediately preceding it as the reference for
motion-compensated prediction. To meet the stringent delay
requirement, the encoded video is typically delivered over
RTP/UDP, which is lossy in nature. When a packet
loss occurs, the associated video frame as well as the
subsequent frames are affected, which is called error propagation.
Packet loss information can be fed back to the video
sender or multipoint control unit (MCU) (which may perform
transcoding) via protocols such as RTP Control Protocol
(RTCP) to trigger the insertion of an intra-coded frame to stop
error propagation. However, the feedback delay is at least
about a round trip time (RTT). To further alleviate error propagation,
macroblock intra refresh (encoding some macroblocks
of each video frame in the intra mode) is often used.
We recognize that a video frame is generally mapped to one
or multiple packets (or slices in the case of H.264/AVC), and
thus a packet loss does not necessarily lead to the loss of a
whole frame. In this paper, we focus on whole frame losses,
and leave the more general packet losses for future work.
Although there is no difference in the video coding
scheme for the P frames, the impact of a frame loss can
be drastically different from frame to frame. As an example,
Fig. 1 shows the average loss in PSNR over the subsequent
frames if a P frame is dropped in the network for the
Foreman-CIF sequence encoded in H.264/AVC with a quantization
parameter (QP) of 30, where dropping frame 20 alone
leads to a loss of 2.7 dB, while dropping frame 22 alone leads
to a loss of 5.9 dB. This presents an opportunity for a communication
network to intelligently drop certain video packets in
the event of network congestion to optimize the video quality.
[Figure 1: average loss in PSNR (dB) versus frame number.]
Fig. 1. The impact of a frame loss on the average PSNR of
subsequent frames for the Foreman-CIF video sequence.
In this paper, we focus on video Quality of Experience
(QoE), and propose a QoE prediction scheme that is low in
delay, computational complexity, and communication overhead,
enabling a network to allocate its resources so as
to optimize the QoE. Specifically, with such a scheme, the
network knows the resulting QoE for each resource allocation
decision (e.g., dropping certain frames in the network), so
it can allocate resources optimally by choosing the decision
corresponding to the best QoE. Because the problem is one of
prediction and the prediction has to be
done within a communication network, we need a QoE model
that is feasible to compute, which motivates us to consider
those (e.g., [1][2]) that use the per-frame PSNR time series
as the input. A per-frame PSNR time series is a sequence of
PSNR values, each indexed by its corresponding frame number.
The QoE prediction problem then reduces to one of
predicting the per-frame PSNR. The proposed QoE prediction
scheme is jointly implemented by the video sender (or MCU)
and the communication network. Simulation results show that
the proposed per-frame PSNR prediction method achieves
an average error well below 1 dB.
We briefly review related work on channel distortion, the
prediction of which is the main challenge in predicting the per-frame PSNR.
Some aspects of the channel distortion model in the seminal
work [3] are adopted in our work. An additive exponential
model proposed in [4] is shown to have good performance.
However, the determination of the model requires some
information (the motion reference ratio) of the predicted video
frames to be known a priori. This is possible only if the
encoder generates all the video frames up to the predicted
frame, introducing significant delay. For example, to predict
the channel distortion 10 frames ahead, assuming a frame rate
of 30 frames per second, the delay will be 333 ms. In [5], a
model taking into account the cross-correlation among multiple
frame losses is proposed for channel distortion. However,
in the parameter estimation, the whole video sequence needs
to be known in advance, making it infeasible for real-time
applications. Pixel-level channel distortion prediction models
are proposed for optimizing the video encoder [6]; although
accurate, they are more detailed than the problem we consider
requires. Thus, in this paper, we consider the simpler frame-level
distortion prediction.
The remainder of the paper is organized as follows. Section 2
describes the QoE prediction scheme, Section 3 gives
the simulation results on the per-frame PSNR prediction, and
Section 4 concludes the paper.
2. QOE PREDICTION
2.1. Choosing QoE Models
Subjective video quality testing is the ultimate method to
measure the video quality perceived by the human visual
system (HVS). However, subjective testing requires playing
the video to a group of human subjects in stringent testing
conditions [7] and collecting the ratings of the video quality,
which is time-consuming, expensive, and unable to provide
real-time assessment results, let alone predict the
video quality.
Alternatively, QoE models can be constructed by relating
QoS metrics to video QoE [1][2][8][9]. ITU-T Recommendation
G.1070 [9] considers the packet loss rate rather than the
packet loss pattern in modeling the QoE, which is insufficient,
as indicated by our example in Section 1 where the pattern is
a single frame loss. The model in [8] has the same problem.
ITU-T Recommendation G.1070 also requires extensive offline
subjective testing to construct a large number of QoE
models, and the extraction of certain video features (e.g., degree of
motion) [10] during prediction for the desired accuracy, making
it unsuitable for real-time applications.
The QoE model proposed in [1] uses the statistics extracted
from the per-frame peak signal-to-noise ratio (PSNR) time
series, which are QoS metrics, as the model input. Some
of the statistics are the minimum, maximum, standard deviation,
the 90% and the 10% percentiles, and the difference
in PSNR between two consecutive frames. Although the average
PSNR of a video sequence is generally considered a
flawed video quality metric, the model in [1] is shown to
outperform Video Quality Metric (VQM) [11] and Structural
Similarity (SSIM) [12] in terms of correlation to subjective
testing results. With the choice of such QoE models, the QoE
prediction problem reduces to one that predicts the per-frame
PSNR time series.
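The statistics listed above can be sketched in a few lines of Python. This is only an illustration of the kind of model input described, not the feature extraction of [1]: the function names, the nearest-rank percentile convention, and the choice to summarize the consecutive-frame difference by its maximum absolute value are all assumptions.

```python
import math
import statistics


def percentile(values, q):
    """Nearest-rank percentile for 0 < q <= 100 (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def psnr_features(psnr):
    """Summarize a per-frame PSNR time series (values in dB)."""
    # Absolute PSNR change between consecutive frames.
    diffs = [abs(b - a) for a, b in zip(psnr, psnr[1:])]
    return {
        "min": min(psnr),
        "max": max(psnr),
        "std": statistics.pstdev(psnr),
        "p10": percentile(psnr, 10),
        "p90": percentile(psnr, 90),
        # Assumption: the largest frame-to-frame change is used as the summary.
        "max_consecutive_diff": max(diffs) if diffs else 0.0,
    }


if __name__ == "__main__":
    series = [36.0, 35.5, 30.0, 31.0, 33.0, 34.5, 35.0, 35.5, 36.0, 36.0]
    print(psnr_features(series))
```

A QoE model of this family would then map such a feature vector to a quality score; that mapping is not reproduced here.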
2.2. The Proposed QoE Prediction Approach
Before discussing various approaches to QoE prediction, we
note that the pattern of packet losses is important, because the
video quality, or specifically the statistics of the per-frame PSNR
time series, depends not only on how many frame losses have
occurred, but also on where they have occurred in the video
sequence.
There are three approaches to QoE prediction. In a
sender-only approach, the per-frame PSNR time series for
each frame loss pattern is obtained by simulation at the video
sender. However, the number of frame loss patterns grows
exponentially with the number of video frames. Even if
the amount of computation is not an issue, the resulting
per-frame PSNR time series need to be sent to the communication
network, generating excessive communication overhead.
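To make the exponential growth concrete: if each frame is either delivered or dropped, a window of n frames admits 2^n loss patterns. The short Python sketch below (function names are illustrative, not from the paper) enumerates the patterns for a small window and counts them for larger ones.

```python
import itertools


def loss_patterns(num_frames):
    """Iterate over all patterns; each is a tuple of indicators (1 = dropped)."""
    return itertools.product((0, 1), repeat=num_frames)


def num_loss_patterns(num_frames):
    """Each frame is either delivered or dropped, so there are 2**n patterns."""
    return 2 ** num_frames


if __name__ == "__main__":
    print(list(loss_patterns(2)))
    for n in (10, 30, 60):
        print(n, num_loss_patterns(n))
```

Even a one-second window at 30 frames per second already yields more than a billion patterns, each with its own per-frame PSNR time series to simulate and transmit.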
In a network-only approach, the network decodes the
video and finds out the channel distortion for different packet
loss patterns. However, the video quality depends not only on
the channel distortion, but also on the distortion from source
coding. Due to lack of access to the original video, it is
impossible for the network to know the source distortion,
making the QoE prediction inaccurate. Also, this approach
does not scale when the network serves a very large
number of video teleconferencing sessions simultaneously.
Finally, this approach may not be suitable when the video
packets are encrypted.
We propose a joint approach that involves both the video
sender (or MCU) and the network. The video sender obtains
the channel distortion for single frame losses, and passes the
results along with the source distortion to the network. The
network knows the QoE model, and for each resource allocation
decision, it calculates the total distortion for each frame
(and hence the per-frame PSNR time series) by utilizing the
linearity assumption for multiple frame losses [3][4]. This
approach eliminates virtually all the communication overhead in
the sender-only approach, takes into account the source distortion
absent in the network-only approach, and does not need to do
video encoding/decoding in the network.
[Figure 2: block diagram of the video sender. Blocks: Video Encoder (delay by t1), Video Decoder (delay t2), Channel Distortion Model (delay by t3), Annotation, and Network, with m delay units D. Signals: F(n), G(n), ds(n), d0(n), α(n−m), γ(n−m).]
Fig. 2. System architecture of the video sender.
2.3. Per-frame PSNR Prediction
As mentioned before, frame-level channel distortion prediction
is appropriate for the problem we focus on. We first look
at the video sender side, whose architecture is shown in Fig. 2.
Let the number of pixels in a frame be $N$. Let $F(n)$, a vector
of length $N$, be the $n$th original frame, and let $F(n, i)$ denote
pixel $i$ of $F(n)$. Let $\hat{F}(n)$ be the reconstructed frame
without frame loss corresponding to $F(n)$, and $\hat{F}(n, i)$ be pixel
$i$ of $\hat{F}(n)$. The original video frame $F(n)$ is fed into the video
encoder, which generates packet $G(n)$ after a delay of $t_1$ seconds.
Packet $G(n)$ may represent multiple NAL units, which
we collectively call a packet for convenience. Packet $G(n)$ is
then fed into the video decoder to generate the reconstructed
frame $\hat{F}(n)$, which takes $t_2$ seconds. Note that in a typical
video encoder, this reconstruction is already in place. Let the
distortion due to source coding for $F(n)$ be $d_s(n)$. Then,
$$ d_s(n) = \frac{1}{N} \sum_{i=1}^{N} \left( F(n, i) - \hat{F}(n, i) \right)^2, \tag{1} $$
which is readily available at the video encoder.
As mentioned earlier, the construction of the channel distortion
model in [4] requires some information (the motion
reference ratio) of the predicted video frames to be known in
advance, which results in significant delay. To address this
problem, we propose using the current packet $G(n)$ and the
previously generated packets $G(n-1), ..., G(n-m)$ to train a
channel distortion model. In Fig. 2, D represents a delay of one
inter-frame time interval. The training takes $t_3$ seconds. Note
that $t_3 \geq t_2$, because the Channel Distortion Model needs
to decode at least one frame. The values of the model parameters
are then sent to the Annotation block, which annotates the
packet with them and with the source distortion $d_s(n)$. The
annotated packet is then sent to the communication network.
We now look at the details of the channel distortion
model. Prior results show that a linearity model, in which the
distortions from individual frame losses add up, performs
well in practice [3][4]. For each frame loss, we define the function
$h(k, l)$ [4], which models how much distortion the loss
of frame $k$ causes to frame $l$ for $l \geq k$:
$$ h(k, l) = d_0(k) \frac{e^{-\alpha(k)(l-k)}}{1 + \gamma(k)(l-k)}, \tag{2} $$
where $d_0(k)$ is the channel distortion for frame $k$, resulting
from the loss of frame $k$ only and the error concealment,
and $\alpha(k)$ and $\gamma(k)$ are parameters dependent on frame $k$. In
this paper, we consider a simple error concealment scheme,
namely, frame copy. Hence the distortion due to the loss of
frame $k$ (and only frame $k$) is
$$ d_0(k) = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{F}(k, i) - \hat{F}(k-1, i) \right)^2. \tag{3} $$
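Both the source distortion in (1) and the frame-copy channel distortion in (3) are per-pixel mean squared errors, differing only in which pair of frames is compared. A minimal Python sketch follows, assuming each frame is given as a flat sequence of pixel values; the function names are illustrative and not taken from the paper or from any codec API.

```python
def mean_squared_error(frame_a, frame_b):
    """Average squared pixel difference between two equal-length frames."""
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and of equal length")
    total = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b))
    return total / len(frame_a)


def source_distortion(original, reconstructed):
    """d_s(n) as in (1): original frame vs. its loss-free reconstruction."""
    return mean_squared_error(original, reconstructed)


def frame_copy_distortion(reconstructed_k, reconstructed_prev):
    """d_0(k) as in (3): with frame copy, a lost frame k is replaced by the
    reconstruction of frame k-1, so the error is measured between the two."""
    return mean_squared_error(reconstructed_k, reconstructed_prev)


if __name__ == "__main__":
    original = [10, 20, 30, 40]      # toy 4-pixel frame F(n)
    rec_k = [12, 20, 27, 40]         # loss-free reconstruction of frame k
    rec_prev = [12, 22, 27, 44]      # loss-free reconstruction of frame k-1
    print(source_distortion(original, rec_k))
    print(frame_copy_distortion(rec_k, rec_prev))
```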
In (2), $\gamma(k)$ is called leakage, which describes the
efficiency of loop filtering in removing the artifacts introduced
by motion compensation and transformation [3]. The term
$e^{-\alpha(k)(l-k)}$ captures the error propagation in the case of
pseudo-random macroblock intra refresh. In [3], a linear
function $(1 - (l - k)\beta)$, where $\beta$ is the intra refresh rate,
is proposed. We do not use this linear function, because
the macroblock intra refresh scheme in [3] is cyclic, while
the one used in our simulation software (JVT JM 16.2 [13])
is pseudo-random. The linear model states that the impact
vanishes after $1/\beta$ frames (the intra refresh update interval
for the cyclic scheme), which is not the case for the
pseudo-random scheme, as our simulation results suggest. An
exponential model such as the one in [4] is better. However,
the model in [4] fails to capture the impact of loop filtering,
while our model does capture it. The values of $\alpha(k)$ and
$\gamma(k)$ can be obtained by fitting simulation data with methods
such as least squares or least absolute value. In Fig. 2,
the video sender drops packet $G(n-m)$ from the packet
sequence $G(n), G(n-1), ..., G(n-m)$, performs video decoding,
measures the channel distortions, and finds the value
of $\alpha(n-m)$ (denoted $\hat{\alpha}(n-m)$) and the value of $\gamma(n-m)$
(denoted $\hat{\gamma}(n-m)$) in (2), with the substitution $k = n-m$,
that minimize the error between the measured distortions and
the predicted distortions.
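The fitting step can be illustrated with a small least-squares search over candidate parameter values. The sketch below is not the authors' fitting procedure: the exhaustive grid search, the grid values, and the function names are assumptions chosen to keep the example self-contained, and a numerical optimizer could be substituted.

```python
import math


def h(d0, alpha, gamma, lag):
    """Model (2): distortion at lag = l - k frames after the lost frame k."""
    return d0 * math.exp(-alpha * lag) / (1.0 + gamma * lag)


def fit_alpha_gamma(d0, measured, alpha_grid, gamma_grid):
    """Return the (alpha, gamma) pair on the grid with least squared error.

    measured[j] is the measured channel distortion at lag j + 1.
    """
    best = None
    for alpha in alpha_grid:
        for gamma in gamma_grid:
            err = sum(
                (h(d0, alpha, gamma, j + 1) - y) ** 2
                for j, y in enumerate(measured)
            )
            if best is None or err < best[0]:
                best = (err, alpha, gamma)
    return best[1], best[2]


if __name__ == "__main__":
    # Synthetic "measurements" for m = 10 lags, generated from known values.
    d0 = 50.0
    measured = [h(d0, 0.1, 0.2, lag) for lag in range(1, 11)]
    alphas = [i / 20 for i in range(11)]   # 0.00, 0.05, ..., 0.50
    gammas = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
    print(fit_alpha_gamma(d0, measured, alphas, gammas))
```

Replacing the squared error with an absolute error in `fit_alpha_gamma` gives the least-absolute-value variant mentioned above.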
We next look at the network side. We assume that the network
has packets $G(n), G(n-1), ..., G(n-L)$ available. Let
$I(k)$ be the indicator function, equal to 1 if frame $k$ is dropped
and 0 otherwise. A packet loss pattern can be characterized
by a sequence of $I(k)$'s. For convenience, we denote a pattern
by a vector $P := (I(n), I(n-1), ..., I(0))$. The channel distortion
of frame $l \geq n-L$ resulting from $P$ is then predicted
as
$$ \hat{d}_c(l, P) = \sum_{k=0}^{l} I(k)\, \hat{h}(k, l), \tag{4} $$
where the linearity assumption for multiple frame losses in
[3][4] is used, and
$$ \hat{h}(k, l) = d_0(k) \frac{e^{-\hat{\alpha}(k-m)(l-k)}}{1 + \hat{\gamma}(k-m)(l-k)}. \tag{5} $$
We recognize that the model in (4) can be improved, for example,
by considering the cross-correlation of frame losses [5].
However, as mentioned earlier, the model in [5] is not suitable
for real-time applications, and its complexity is very high. The
simple model in (4) proves to be reasonably accurate [3][4].
In order to predict the per-frame PSNR for a particular
packet loss pattern $P$, the network also needs to know the
source distortion. The total distortion prediction can
be represented as
$$ \hat{d}(l, P) = \hat{d}_c(l, P) + \hat{d}_s(l), \tag{6} $$
where $\hat{d}_s(l) = d_s(l)$ for $n \geq l \geq n-L$, and $\hat{d}_s(l) = d_s(n)$
for $l > n$, and where we have applied the assumption that the
channel distortion and the source distortion are independent,
which is shown to be fairly accurate [14]. Note that the source
distortion estimate $\hat{d}_s(l)$ for $n \geq l \geq n-L$ is exact and
readily available at the video sender, and is included in the
annotation of the $L+1$ packets $G(n), ..., G(n-L)$.
The PSNR prediction for frame l ≥ n − L with packet
loss pattern P is then
$$ \widehat{\mathrm{PSNR}}(l, P) = 10 \log_{10}\left( 255^2 / \hat{d}(l, P) \right). \tag{7} $$
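The network-side computation in (4) through (7) can be sketched as follows. This is an illustrative Python sketch rather than the authors' implementation: the dictionaries keyed by frame number and the function names are assumptions, and `alpha_hat[k]` and `gamma_hat[k]` stand for the annotated estimates that the paper writes as $\hat{\alpha}(k-m)$ and $\hat{\gamma}(k-m)$.

```python
import math


def h_hat(d0, alpha_hat, gamma_hat, lag):
    """Predicted distortion (5) caused lag = l - k frames after a lost frame."""
    return d0 * math.exp(-alpha_hat * lag) / (1.0 + gamma_hat * lag)


def predict_psnr(l, dropped, d0, alpha_hat, gamma_hat, source_distortion):
    """Predicted PSNR (7) of frame l for a set of dropped frame numbers.

    d0, alpha_hat and gamma_hat map a frame number k to its annotated values.
    """
    # (4): linear superposition over every dropped frame k <= l.
    d_channel = sum(
        h_hat(d0[k], alpha_hat[k], gamma_hat[k], l - k)
        for k in dropped
        if k <= l
    )
    # (6): channel and source distortion are assumed independent and additive.
    d_total = d_channel + source_distortion
    if d_total == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(255 ** 2 / d_total)


if __name__ == "__main__":
    d0 = {36: 100.0}
    alpha_hat = {36: 0.1}
    gamma_hat = {36: 0.2}
    d_s = 65.025  # toy source distortion corresponding to about 30 dB
    for frame in (36, 38, 40, 44):
        print(frame, round(predict_psnr(frame, {36}, d0, alpha_hat, gamma_hat, d_s), 2))
```

The toy numbers are made up for illustration; in the scheme described here they would come from the packet annotations.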
The per-frame PSNR time series is then $\{\widehat{\mathrm{PSNR}}(l, P)\}$,
where $l$ is the time index. The time series is a function of $P$.
Thus, to generate the best time series, the network chooses the
optimal $P$ among those that are feasible under the resource
constraint. Note that part of $P$, i.e., $I(n-L-1), I(n-L-2), ..., I(0)$,
is already determined, because each frame between
0 and $n-L-1$ has been either delivered or dropped. The
variables subject to optimization are the remaining part of $P$,
i.e., $I(n-L), ..., I(n)$. We define the prediction length $\lambda$ as
the number of frames to be predicted. That is, if the $n$th frame
is to be dropped, then the predictor predicts frames $n$ through
$n + \lambda$. Note that it is not necessary to predict many frames,
since it takes the video encoder about one RTT to
receive feedback about a frame loss, after which it can insert an
intra-coded frame to stop error propagation.
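The selection step can be sketched as a search over the undetermined part of $P$. In the Python sketch below, the scoring rule (the minimum predicted PSNR) is only a stand-in for a QoE model such as [1], and the feasibility test is a hypothetical resource constraint; both are assumptions made for illustration.

```python
def choose_pattern(predicted_series, is_feasible, score=min):
    """Pick the feasible loss pattern whose predicted PSNR series scores best.

    predicted_series maps a pattern, given as a tuple of indicators for
    frames n-L..n (1 = dropped), to its predicted per-frame PSNR list.
    """
    feasible = [p for p in predicted_series if is_feasible(p)]
    if not feasible:
        raise ValueError("no feasible pattern under the resource constraint")
    return max(feasible, key=lambda p: score(predicted_series[p]))


if __name__ == "__main__":
    # Toy predictions for a window of two frames (values in dB).
    predicted = {
        (0, 0): [35.0, 35.0, 35.0],
        (1, 0): [32.0, 33.0, 34.0],
        (0, 1): [35.0, 29.0, 31.0],
    }

    # Hypothetical constraint: congestion forces at least one frame drop.
    def must_drop_one(pattern):
        return sum(pattern) >= 1

    print(choose_pattern(predicted, must_drop_one))
```

Because only $L+1$ indicators are free, the search space has at most $2^{L+1}$ patterns, which keeps exhaustive enumeration practical for small $L$.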
3. SIMULATION RESULTS
We evaluate the performance of the proposed per-frame
PSNR prediction method via simulation. We consider both
single frame losses and multiple frame losses. The Foreman-CIF
video sequence is used. For $m = 10$, $L = 5$, and $\lambda = 8$,
Fig. 3(a) shows the prediction for frames $l \geq 36$ if frame 36 is
dropped, and Fig. 3(b) for frames $l \geq 67$ if frames 67 and 70
are dropped. More simulation results are shown in Fig. 4. We
plot the cumulative distribution function (CDF) of the absolute
per-frame PSNR prediction error, i.e., the absolute value
of the difference between the actual per-frame PSNR and the
predicted value, both in dB. We consider prediction lengths of
8 (blue dashed lines) and 5 (red solid lines). Fig. 4(a) is
for single frame losses. Fig. 4(b) is for multiple frame losses;
in particular, we consider two frame losses with a gap of
2 frames in between. We also calculate the mean value of the
absolute prediction error. For single frame losses, it is 0.66 dB
and 0.51 dB for prediction lengths 8 and 5, respectively. For
multiple frame losses, it is 0.60 dB and 0.46 dB for prediction
lengths 8 and 5, respectively.
[Figure 3: two panels, (a) and (b), each plotting the actual and predicted PSNR (dB) versus frame number.]
Fig. 3. The per-frame PSNR prediction for (a) a single frame
loss at frame 36; (b) two frame losses at frames 67 and 70.
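The error metrics used in this section, the mean absolute prediction error and its empirical CDF, can be computed as in the Python sketch below. The function names and the sample values are illustrative only and are not the simulation data behind the reported figures.

```python
def absolute_errors(actual, predicted):
    """Absolute per-frame PSNR prediction errors, in dB."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return [abs(a - p) for a, p in zip(actual, predicted)]


def mean_absolute_error(actual, predicted):
    errors = absolute_errors(actual, predicted)
    return sum(errors) / len(errors)


def empirical_cdf(errors, x):
    """Fraction of errors that are less than or equal to x."""
    return sum(1 for e in errors if e <= x) / len(errors)


if __name__ == "__main__":
    actual = [30.0, 32.0, 34.0, 36.0]      # made-up PSNR values (dB)
    predicted = [30.5, 31.0, 34.0, 37.5]
    errors = absolute_errors(actual, predicted)
    print(mean_absolute_error(actual, predicted))
    print(empirical_cdf(errors, 1.0))
```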
[Figure 4: two panels, (a) and (b), each plotting the CDF of the absolute per-frame PSNR prediction error (dB) for λ=8 and λ=5.]
Fig. 4. The CDF of the per-frame PSNR prediction error for
(a) single frame losses; (b) two frame losses with a gap of
2 frames in between.
4. CONCLUSION
We propose a QoE prediction scheme that allows a communication
network to optimize resource allocation for video
teleconferencing. By using QoE models that take the per-frame
PSNR time series as the input, the QoE prediction
problem reduces to a per-frame PSNR prediction problem.
Simulation results show that the proposed per-frame PSNR
prediction method achieves an average prediction error well
below 1 dB.
Acknowledgement The authors would like to thank Dr.
Zhifeng Chen, Dr. Rahul Vanam, and Dr. Yuriy Reznik of
InterDigital for their help and insightful discussion.
5. REFERENCES
[1] C. Keimel, T. Oelbaum, and K. Diepold, “Improving the
prediction accuracy of video quality metrics,” in IEEE
International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), Dallas, Texas, March 2010,
pp. 2442–2445.
[2] C. Keimel, M. Rothbucher, H. Shen, and K. Diepold,
“Video is a cube: Multidimensional analysis and video
quality metrics,” IEEE Signal Processing Magazine, pp.
41–49, Nov. 2011.
[3] K. Stuhlmüller, N. Färber, M. Link, and B. Girod,
“Analysis of video transmission over lossy channels,”
IEEE Journal on Selected Areas in Communications,
vol. 18, no. 6, pp. 1012–1032, June 2000.
[4] U. Dani, Z. He, and H. Xiong, “Transmission
distortion modeling for wireless video communication,”
in IEEE Global Telecommunications Conference
(GLOBECOM), Dec. 2005.
[5] Y. J. Liang, J. G. Apostolopoulos, and B. Girod, “Analysis
of packet loss for compressed video: Effect of burst
losses and correlation between error frames,” IEEE
Trans. Circuits and Systems for Video Technology, vol.
18, no. 7, pp. 861–874, July 2008.
[6] Z. Chen and D. Wu, “Prediction of transmission distortion
for wireless video communication: Algorithm and
application,” Journal of Visual Communication and Image
Representation, vol. 21, no. 8, pp. 948–964, Nov.
2010.
[7] Subjective Video Quality Assessment Methods for Multimedia
Applications, ITU-T Recommendation P.910,
Sep. 1999.
[8] M. Venkataraman and M. Chatterjee, “Inferring video
QoE in real time,” IEEE Network, pp. 4–13, Jan./Feb.
2011.
[9] Opinion Model for Video-Telephony Applications,
ITU-T Recommendation G.1070, 2007.
[10] J. Joskowicz and J. C. López Ardao, “Enhancements
to the opinion model for video-telephony applications,”
in Proceedings of the 5th International Latin
American Networking Conference (LANC), 2009, pp.
87–94.
[11] M. Pinson and S. Wolf, “A new standardized method
for objectively measuring video quality,” IEEE Trans.
Broadcasting, vol. 50, no. 3, pp. 312–322, Sep. 2004.
[12] Z. Wang, L. Lu, and A. C. Bovik, “Video quality assessment
based on structural distortion measurement,”
Signal Processing: Image Communication, vol. 19, no.
2, pp. 121–132, Feb. 2004.
[13] ITU, “H.264/AVC reference software,” Online, Oct.
2012, iphome.hhi.de/suehring/tml/download/.
[14] Z. He, J. Cai, and C. W. Chen, “Joint
source channel rate-distortion analysis for adaptive
mode selection and rate control in wireless video coding,”
IEEE Trans. Circuits and Systems for Video Technology,
vol. 12, no. 6, pp. 511–523, June 2002.