The Vault

Model Based QOE Prediction To Enable Better User Experience For Video Teleconferencing
Research Paper / Feb 2014

MODEL-BASED QOE PREDICTION TO ENABLE BETTER USER EXPERIENCE FOR VIDEO TELECONFERENCING Liangping Ma? Tianyi Xu† Gregory Sternberg‡ Anantharaman Balasubramanian? Ariela Zeira? ? InterDigital Communications, Inc., San Diego, CA 92121, USA † Dept. of Electrical and Computer Engineering, University of Delaware, Newark, DE 19716, USA ‡InterDigital Communications, Inc., King of Prussia, PA 19406, USA ABSTRACT The ultimate goal of network resource allocation for video teleconferencing is to optimize the Quality of Experience (QoE) of the video. We consider the IPPP video coding structure with macroblock intra refresh, which is widely used for video teleconferencing. Generally, the loss of a current frame causes error propagation to subsequent frames due to the video coding structure. Therefore, to optimize the QoE, a communication network needs to be able to accurately predict the consequence of each of its resource allocation decisions. We propose a QoE prediction scheme by consid- ering QoE models that use the per-frame PSNR time series as the input, thus reducing the QoE prediction problem to a per-frame PSNR prediction problem. The QoE prediction scheme is jointly implemented by the video sender (or MCU) and the communication network. Simulation results show that the proposed per-frame PSNR prediction method is fairly accurate, with an average error well below 1dB. Index Terms— QoE, video, prediction, scheduling, net- work. 1. INTRODUCTION In real-time video applications such as video teleconferenc- ing, the IPPP video coding structure is widely used, where the first frame is an intra-coded frame, and each P frame uses the frame immediately preceding it as the reference for mo- tion compensated prediction. To meet the stringent delay re- quirement, the encoded video is typically delivered by the RTP/UDP protocol, which is lossy in nature. When a packet loss occurs, the associated video frame as well as the subse- quent frames will be affected, which is called error propaga- tion. Packet loss information can be fed back to the video sender or multipoint control unit (MCU) (which may per- form transcoding) via protocols such as RTP Control Protocol (RTCP) to trigger the insertion of an intra-coded frame to stop error propagation. However, the feedback delay is at least about a round trip time (RTT). To alleviate error propagation, additionally, macroblock intra refresh – encoding some mac- roblocks of each video frame in the intra mode – is often used. We realize that a video frame generally is mapped to one or multiple packets (or slices in the case of H.264/AVC) and thus a packet loss does not necessarily lead to the loss of a whole frame. In this paper, we focus on whole frame losses, and leave the more general packet losses for future work. Although there is no difference in the video coding scheme for the P frames, the impact of a frame loss can be drastically different from frame to frame. As an exam- ple, Fig. 1 shows the average loss in PSNR for the subse- quent frames if a P frame is dropped in the network for the Foreman-CIF sequence encoded in H.264/AVC with a quan- tization parameter (QP)= 30, where dropping frame 20 alone leads to a loss of 2.7dB, while dropping frame 22 alone leads to a loss of 5.9dB. This presents an opportunity for a commu- nication network to intelligently drop certain video packets in the event of network congestion to optimize the video quality. 0 20 40 60 80 1000 2 4 6 8 10 Frame Number Av er ag e Lo ss in P SN R (dB ) Fig. 1. The impact of a frame loss on the average PSNR of subsequent frames for the Foreman-CIF video sequence. In this paper, we focus on video Quality of Experience (QoE), and propose a QoE prediction scheme that is low in delay, computational complexity and communication over- head to enable a network to allocate network resources so as to optimize the QoE. Specifically, with such a scheme, the network knows the resulting QoE for each resource allocation decision (e.g., dropping certain frames in the network) so that it can do optimal resource allocation by choosing the decision corresponding to the best QoE. Because of the prediction na- ture of the problem and the fact that the prediction has to be done within a communication network, we need a QoE model that is feasible to compute, which motivates us to consider 2815978-1-4799-0356-6/13/$31.00 ©2013 IEEE ICASSP 2013 those (e.g., [1][2]) that use the per-frame PSNR time series as the input. A per-frame PSNR time series is a sequence of PSNR values, each indexed by its corresponding frame num- ber. The QoE prediction problem then reduces to one of pre- dicting the per-frame PSNR. The proposed QoE prediction scheme is jointly implemented by the video sender (or MCU) and the communication network. Simulation results show that the proposed per-frame PSNR prediction method can achieve an average error much less than 1dB. We briefly review related work on channel distortion, which is the challenge for predicting the per-frame PSNR. Some aspects of the channel distortion model in the seminal work [3] are adopted in our work. An additive exponential model proposed in [4] is shown to have good performance. However, the determination of the model requires some in- formation (the motion reference ratio) of the predicted video frames to be known a priori. This is possible only if the encoder generates all the video frames up to the predicted frame, introducing significant delay. For example, to predict the channel distortion 10 frames ahead, assuming a frame rate of 30 frames per second, the delay will be 333 ms. In [5], a model taking into account the cross-correlation among multi- ple frame losses is proposed for channel distortion. However, in the parameter estimation, the whole video sequence needs to be known in advance, making it infeasible for real time applications. Pixel-level channel distortion prediction models are proposed for optimizing the video encoder [6], which, although accurate, are an overkill for the problem we look at. Thus, in this paper, we consider the simpler frame-level distortion prediction. The remainder of the paper is organized as follows. Sec- tion 2 describes the QoE prediction scheme, Section 3 gives the simulation results on the per-frame PSNR prediction, and Section 4 concludes the paper. 2. QOE PREDICTION 2.1. Choosing QoE Models Subjective video quality testing is the ultimate method to measure the video quality perceived by the human visual system (HVS). However, subjective testing requires playing the video to a group of human subjects in stringent testing conditions [7] and collecting the ratings of the video quality, which is time consuming, expensive, and unable to provide realtime assessment results, not to mention predicting the video quality. Alternatively, QoE models can be constructed by relating QoS metrics to video QoE [1][2][8][9]. The ITU recommen- dation G.1070 [9] considers the packet loss rate rather than the packet loss pattern in modeling the QoE, which is insufficient as indicated by our example in Section 1 where the pattern is a single frame loss. The model in [8] has the same problem. The ITU recommendation G.1070 also requires extensive of- fline subjective testing to construct a large number of QoE models, and extracting certain video features (e.g., degree of motion) [10] during prediction for desired accuracy, making it unsuitable for real-time applications. The QoE model proposed in [1] uses the statics extracted from the per-frame peak signal-to-noise ratio (PSNR) time series, which are QoS metrics, as the model input. Some of the statistics are the minimum, maximum, standard devi- ation, the 90% and the 10% percentiles, and the difference in PSNR between two consecutive frames. Although the av- erage PSNR of a video sequence is generally considered a flawed video quality metric, the model in [1] is shown to outperform Video Quality Metric (VQM) [11] and Structural Similarity (SSIM) [12] in terms of correlation to subjective testing results. With the choice of such QoE models, the QoE prediction problem reduces to one that predicts the per-frame PSNR time series. 2.2. The Proposed QoE Prediction Approach Before discussing various approaches to QoE prediction, we note that the pattern of packet losses is important, because the video quality, or specifically statistics of the per-frame PSNR time series, depends on not only how many frame losses have occurred, but also where they have occurred in the video se- quence. There are three approaches to QoE prediction. In a sender-only approach, the per-frame PSNR time series for each frame loss pattern is obtained by simulation at the video sender. However, the number of frame loss patterns grows exponentially with the number of video frames. Even if the amount of computation is not an issue, the resulting per- frame PSNR time series need to be sent to the communication network, generating excessive communication overhead. In a network-only approach, the network decodes the video and finds out the channel distortion for different packet loss patterns. However, the video quality depends on not only the channel distortion, but also the distortion from source coding. Due to lack of access to the original video, it is im- possible for the network to know about the source distortion, making the QoE prediction inaccurate. Also, this approach becomes unscalable when the network serves a very large number of video teleconferencing sessions simultaneously. Finally, this approach may not be suitable when the video packets are encrypted. We propose a joint approach that involves both the video sender (or MCU) and the network. The video sender obtains the channel distortion for single frame losses, and passes the results along with the source distortion to the network. The network knows the QoE model, and for each resource alloca- tion decision, it calculates the total distortion for each frame (and hence the per-frame PSNR time series) by utilizing the linearity assumption for multiple frame losses [3][4]. This ap- proach eliminates virtually all the communication overhead in 2816 the sender-only approach, takes into account source distortion absent in the network-only approach, and does not need to do video encoding/decoding in the network. Į Ȗ Į Ȗ Video Encoder (delay by t1) D D Video Decoder (delay t2) D... Delay by t1+t2 + - Channel Distortion Model (delay by t3) Delay by t3 AnnotationF(n) G(n) delayed by t1 sec ds(n) delayed by (t1 + t2) sec d0(n), Į(n-m), Ȗ(n-m) Network m delay units F(n) delayed by (t1 + t2) sec Delay by t3-t2 Fig. 2. System architecture of the video sender. 2.3. Per-frame PSNR Prediction As mentioned before, frame-level channel distortion predic- tion is appropriate for the problem we focus on. We first look at the video sender side, whose architecture is shown in Fig. 2. Let the number of pixels in a frame be N . Let F (n), a vector of length N , be the nth original frame, and F (n, i) denote pixel i of F (n). Let Fˆ (n) be the reconstructed frame with- out frame loss corresponding to F (n), and Fˆ (n, i) be pixel i of Fˆ (n). Original video frame F (n) is fed into the video encoder, which generates packet G(n) after a delay of t1 sec- onds. Packet G(n) may represent multiple NAL units, which we call together a packet for convenience. Packet G(n) is then fed into the video decoder to generate the reconstructed frame Fˆ (n), which takes t2 seconds. Note that in a typical video encoder, this reconstruction is already in place. Let the distortion due to source coding for F (n) be ds(n). Then, ds(n) = N∑ i=1 (F (n, i)− Fˆ (n, i))2/N, (1) which is readily available at the video encoder. As mentioned earlier, the construction of the channel dis- tortion model in [4] requires some information (the motion reference ratio) of the predicted video frames to be known in advance, which results in significant delay. To address this problem, we propose using the current packet G(n) and the previously generated packetsG(n−1), ..., G(n−m) to train a channel distortion model. In Fig. 2, D represents a delay of an inter-frame time interval. The training takes t3 seconds. Note that t3 ≥ t2, because the Channel Distortion Model needs to decode at least one frame. The values of the parameters for the model are then sent to the Annotation block for an- notation. Also annotated is the source distortion ds(n). The annotated packet is then sent to the communication network. We now look at the details of the channel distortion model. Prior results show that a linearity model performs well in practice [3][4]. For each frame loss, we define func- tion h(k, l) [4], which models how much distortion the loss of frame k causes to frame l for l ≥ k h(k, l) = d0(k) e−α(k)(l−k) 1 + γ(k)(l − k) (2) where d0(k) is the channel distortion for frame k, resulting from the loss of frame k only and the error concealment, and α(k) and γ(k) are parameters dependent on frame k. In this paper, we consider a simple error concealment scheme, namely, frame copy. Hence the distortion due to the loss of frame k (and only frame k) is d0(k) = N∑ i=1 (Fˆ (k, i)− Fˆ (k − 1, i))2/N. (3) In (2), γ(k) is called leakage, which describes the ef- ficiency of loop filtering to remove the artifacts introduced by motion compensation and transformation [3]. The term e−α(k)(l−k) captures the error propagation in the case of pseudo-random macroblock intra refresh. In [3], a linear function (1 − (l − k)β), where β is the intra refresh rate, is proposed. We do not use this linear function, because the macroblock intra refresh scheme in [3] is cyclic, while the one used in our simulation software (JVT JM 16.2 [13]) is pseudo-random. The linear model states that the impact vanishes after 1/β frames (the intra refresh update interval for the cyclic scheme), which is not the case for the pseudo- random scheme as suggested by our simulation results. An exponential model such as the one in [4] is better. However, the model in [4] fails to capture the impact of loop filtering, while our model does capture it. The values of α(k) and γ(k) can be obtained by methods such as least squares or least absolute value via fitting simulation data. In Fig. 2, the video sender drops packet G(n − m) from the packet sequence G(n), G(n− 1), ..., G(n−m), performs video de- coding, measures the channel distortions, and finds the value of α(n−m) (defined as αˆ(n−m)) and the value of β(n−m) (defined as γˆ(n−m)) in (2) with the substitution k = n−m that minimize the error between the measured distortions and the predicted distortions. We next look at the network side. We assume that the net- work has packetsG(n), G(n−1), ..., G(n−L) available. Let I(k) be the indicator function, being 1 if frame k is dropped and 0 otherwise. A packet loss pattern can be characterized by a sequence of I(k)’s. For convenience we denote a pattern by a vector P := (I(n), I(n− 1), ..., I(0)). The channel dis- tortion of frame l ≥ n−L resulting from P is then predicted as dˆc(l, P ) = l∑ k=0 I(k)hˆ(k, l), (4) where the linearity assumption for multiple frame losses in 2817 [3][4] is used, and hˆ(k, l) = d0(k) e−αˆ(k−m)(l−k) 1 + γˆ(k −m)(l − k) . (5) We realize that the model in (4) can be improved, for exam- ple, by considering the cross-correlation of frame losses [5]. However, as mentioned earlier, the model in [5] is not suitable for real time applications, and its complexity is very high. The simple model in (4) proves to be reasonably accurate [3][4]. In order to predict the per-frame PSNR for a particular packet loss pattern P , the network needs to know about the source distortion as well. The total distortion prediction can be represented as dˆ(l, P ) = dˆc(l, P ) + dˆs(l), (6) where dˆs(l) = ds(l) for n ≥ l ≥ n − L, and dˆs(l) = ds(n) for l > n, and where we have applied the assumption that the channel distortion and the source distortion are independent, which is shown to be pretty accurate [14]. Note that the source distortion estimation dˆs(l) for n ≥ l ≥ n − L is precise and readily available at the video sender and is included in the annotation of the L+ 1 packets G(n), ..., G(n − L). The PSNR prediction for frame l ≥ n − L with packet loss pattern P is then P̂SNR(l, P ) = 10 log10(255 2/dˆ(l, P )). (7) The per-frame PSNR time series is then {P̂SNR(l, P )}, where l is the time index. The time series is a function of P . Thus, to generate the best time series, the network chooses the optimal P among those that are feasible under the resource constraint. Note that, part of P , i.e., I(n−L− 1), I(n−L− 2), ..., I(0), is already determined, because a frame between 0 and n − L − 1 has been either delivered or dropped. The variables subject to optimization are the remaining part of P , i.e., I(n − L), ..., I(n). We define the prediction length λ as the number of frames to be predicted. That is, if the nth frame is to be dropped, then the predictor predicts frames n through n + λ. Note that, it is not necessary to predict many frames, since it takes the video encoder not more than one RTT to receive feedback about a frame loss. 3. SIMULATION RESULTS We evaluate the performance of the proposed per-frame PSNR prediction method via simulation. We consider both single frame losses and multiple frame losses. The Foreman CIF video sequence is used. For m = 10, L = 5, and λ = 8, Fig. 3(a) shows the prediction for frames l ≥ 36 if frame 36 is dropped, and Fig. 3(b) for frames l ≥ 67 if frames 67 and 70 are dropped. More simulation results are shown in Fig. 4. We plot the cumulative distribution function (CDF) of the abso- lute per-frame PSNR prediction error, i.e., the absolute value 10 15 20 25 30 35 4024 26 28 30 32 34 36 38 Frame number PS NR (d B) Actual Predicted 20 30 40 50 60 7024 26 28 30 32 34 36 38 Frame number PS NR (d B) Actual Predicted (a) (b) Fig. 3. The per-frame PSNR prediction for (a) a single frame loss at frame 36; (b) two frame losses at frames 67 and 70. of the difference between the actual per-frame PSNR and the predicted value, both in dB. We consider prediction length of 8 (blue dashed lines) and of 5 (red solid lines). Fig. 4(a) is for single frame losses. Fig. 4(b) is for multiple frame losses, and in particular we consider two frame losses with a gap of 2 frames in between. We also calculate the mean value of the absolute prediction error. For single frame losses, it is 0.66dB and 0.51dB for prediction lengths 8 and 5, respectively. For multiple frame losses, it is 0.60dB and 0.46dB for prediction lengths 8 and 5, respectively. 0 1 2 3 40 0.2 0.4 0.6 0.8 1 Absolute Per−frame PSNR Prediction Error (dB) CD F λ=8 λ=5 0 1 2 3 40 0.2 0.4 0.6 0.8 1 Absolute Per−frame PSNR Prediction Error (dB) CD F λ=8 λ=5 (a) (b) Fig. 4. The CDF of per-frame PSNR prediction error for (a) single frame losses; (b) two frame losses with a gap of 2 frames in between. 4. CONCLUSION We propose a QoE prediction scheme that allows a commu- nication network to optimize resource allocation for video teleconferencing. By using QoE models that take the per- frame PSNR time series as the input, the QoE prediction problem reduces to a per-frame PSNR prediction problem. Simulation results show that the proposed per-frame PSNR prediction method achieves an average prediction error well below 1dB. Acknowledgement The authors would like to thank Dr. Zhifeng Chen, Dr. Rahul Vanam, and Dr. Yuriy Reznik of InterDigital for their help and insightful discussion. 2818 5. REFERENCES [1] C. Keimel, T. Oelbaum, and K. Diepold, “Improving the prediction accuracy of video quality metrics,” in IEEE International Conference on Acoustics, Speech, and Sig- nal Processing (ICASSP), Dallas, Texas, March 2010, pp. 2442–2445. [2] C. Keimel, M. Rothbucher, H. Shen, and K. Diepold, “Video is a cube: Multidimensional analysis and video quality metrics,” IEEE Signal Processing Magzine, pp. 41–49, Nov. 2011. [3] K. Stuhlmuller, N. Farber, M. Link, and B. Girod, “Analysis of video transmission over lossy channels,” IEEE Journal on Selected Areas in Communications, vol. 18, no. 6, pp. 1012–1032, June 2000. [4] U. Dani, Z. He, and H. Xiong, “Transmission distortion modeling for wireless video communica- tion,” in IEEE Global Telecommunications Conference (GLOBECOM), Dec 2005. [5] Y. J. Liang, J. G. Apostolopoulos, and B. Girod, “Anal- ysis of packet loss for compressed video: Effect of burst losses and correlation between error frames,” IEEE Trans. Circuits and Systems for Video Technology, vol. 18, no. 7, pp. 861–874, July 2008. [6] Z. Chen and D. Wu, “Prediction of transmission distor- tion for wireless video communication: Algorithm and application,” Journal of Visual Communication and Im- age Representation, vol. 21, no. 8, pp. 948–964, Nov. 2010. [7] Subjective Video Quality Assessment Methods for Mul- timedia Applications, ITU-T Recommendation-P.910, Sep. 1999. [8] M. Venkataraman and M. Chatterjee, “Inferring video QoE in real time,” IEEE Network, pp. 4–13, Jan./Feb. 2011. [9] Opinion Model for Video-Telephony Applications, ITU- T Recommendation G.1070, 2007. [10] Jose Joskowicz and J. Carlos Lpez Ardao, “Enhance- ments to the opinion model for video-telephony appli- cations,” in Proceedings of the 5th International Latin American Networking Conference (LANC), 2009, pp. 87–94. [11] M. Pinson and S. Wolf, “A new standardized method for objectively measuring video quality,,” IEEE Trans. Broadcasting, vol. 50, no. 3, pp. 312322, Sep. 2004. [12] Z. Wang, L. Lu, and A. C. Bovik, “video quality as- sessment based on structural distortion measurement,” Signal Processing: Image Communication, vol. 19, no. 2, pp. 121–132, Feb 2004. [13] ITU, “H.264/AVC reference software,” Online, Oct 2012, iphome.hhi.de/suehring/tml/download/. [14] Zhihai He, Jianfei Cai, and Chang Wen Chen, “Joint source channel rate-distortion analysis for adaptive mode selection and rate control in wireless video cod- ing,” IEEE Trans. Circuits and Systems for Video Tech- nology, vol. 12, no. 6, pp. 511–523, June 2002. 2819