
Error-resilient video coding for wireless video telephony applications
Research Paper / Jan 2012


Rahul Vanam and Yuriy Reznik

InterDigital Communications, LLC, 9710 Scranton Road, San Diego, CA 92121 USA

ABSTRACT

In this paper, we present an error resilient video coding scheme for wireless video telephony applications
that uses feedback to limit error propagation. In conventional feedback-based error resilient schemes, error
propagation can significantly degrade visual quality when feedback delay is on the order of a few seconds.
We propose a coding structure based on multiple description coding that mitigates error propagation during
feedback delay, and uses feedback to adapt its coding structure to effectively limit error propagation. We
demonstrate the effectiveness of our approach at different error rates when compared to conventional coding
schemes that use feedback.

Keywords: Error resilience, error concealment, video coding, RTCP feedback, mobile video telephony.

1. INTRODUCTION

Thanks to advances in wireless networks and improvements in the processing and graphics capabilities of
mobile devices, mobile video telephony is becoming part of our daily lives.1 Yet some technical
challenges in the design of mobile video phones remain. One such challenge is the lossy nature of
wireless networks, as well as of the other communication links connecting one user to the other.

We provide a simple illustrative example of such a system in Figure 1. In this case, video from user A
is sent to user B using the RTP transport and RTCP control protocol.2 Packet loss could occur either
at the local link between the phone (UE) and the base station (eNB), in the Internet, or at the remote
wireless link. This loss is eventually noticed by user B’s application, and information about packet loss can
be communicated back to user A by means of an RTCP receiver report (RR).2,3 However, receiver reports
are sent only periodically, usually once every 1-5 seconds, so that they do not generate a significant
amount of traffic by themselves.2 Hence, by the time a sender knows that the receiver did not receive some
video packets, it is too late to retransmit them. Instead, the sender is usually instructed to send an I- or
IDR-frame to stop error propagation caused by lost packets. Additionally, in order to reduce visual artifacts
caused by lost packets in periods between receiver reports, the sender must employ video coding techniques
that are resilient to packet loss.

In this paper, we offer a brief review of several existing approaches to error-resilient video coding and
propose a new approach, which is specifically designed to accommodate long notification delays in
RTP/RTCP-based systems.

1.1 Prior art

The problem of error-resilient video coding is well known, and prior research has produced a number of
practical techniques for solving it. Recent surveys of such algorithms can be found in Refs. 4–6. Below we
list a few general classes of such techniques.

Conventional methods for reducing error propagation. Random intra macroblock insertion, intra
slices, and slice interleaving are among the best-known practical techniques for error resilience.4 Such schemes
break the coding dependency of macroblocks or slices in consecutive video frames, thereby limiting error
propagation.7 The recursive optimal per-pixel estimate (ROPE) algorithm8 estimates the overall distortion
due to quantization, error propagation, and error concealment, and uses rate-distortion optimization to
choose the best intra or inter mode for each macroblock. Stockhammer et al.9 describe a multi-decoder
distortion estimation method that improves error resilience. Both methods8,9 perform well but have high
computational complexity. All these schemes assume no feedback and offer better resilience of encoded
video at the expense of a moderate increase in bitrate.

[Figure 1 diagram: User A (UE, wireless link, eNB, GW) → Internet → (GW, eNB, wireless link, UE) User B; RTCP RR sent back every 1-5 sec.]

Figure 1. Mobile video communication system employing RTP transport and RTCP feedback. The transmission path
includes the local wireless link, base station (eNB), gateway (GW), and the Internet.

Contact information:
R.V.: E-mail: rahul.vanam@interdigital.com
Y.R.: E-mail: yuriy.reznik@interdigital.com

Feedback-based schemes. If feedback is available, it can be used to direct the video encoder either to
encode the next frame as an IDR/I-frame, or to encode it using the most recent correctly transmitted frame
as the reference. The former approach is called intra refresh and the latter is called reference picture
selection (RPS).10 Feedback-based methods may also be combined with hierarchical P-frame coding structures,
since in such cases it is sufficient to repair frames that belong to the “base layer”.11 Most such techniques
are only effective when the notification delay is relatively small (on the order of hundreds of milliseconds).
The longer the notification delay, the longer the part of the video sequence that is affected by the error. In
practice, video decoders usually employ error concealment techniques, but even with state-of-the-art
concealment, 1-5 seconds of delay before refresh can cause significant and visible artifacts (so-called “ghosting”).
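As a rough illustration of the scale involved, the hypothetical helper below counts how many frames are encoded before feedback can trigger a refresh. It assumes a constant integer frame rate and ignores jitter: at 25 fps a 100 ms delay costs 3 frames, while a 5 s delay at 30 fps costs 150 frames.

```python
def frames_before_refresh(delay_ms: int, fps: int) -> int:
    """Frames encoded while waiting delay_ms milliseconds for feedback.

    Hypothetical helper. Assumes a constant integer frame rate and ignores
    encoding and network jitter. Integer ceiling division avoids
    floating-point rounding surprises.
    """
    if delay_ms < 0 or fps <= 0:
        raise ValueError("delay_ms must be >= 0 and fps > 0")
    return (delay_ms * fps + 999) // 1000


if __name__ == "__main__":
    for fps in (25, 30):
        for delay_ms in (100, 1000, 5000):
            print(f"{fps} fps, {delay_ms} ms -> {frames_before_refresh(delay_ms, fps)} frames")
```

Every one of these frames may inherit an error in an IPPP stream, which is why delays of seconds, rather than hundreds of milliseconds, change the design problem.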

Multiple description coding (MDC)-based schemes. MDC encoders produce several descriptions
(subsets of packets), such that reception of any description is sufficient for meaningful reconstruction of
video. The more descriptions that are received, the higher the quality of the reconstruction.12 Simple
examples of techniques in this class include temporal or spatial subsampling of the original video and
coding of each subsampled set as a separate video stream. A survey and classification of MDC-based video
coding schemes can be found in Ref. 13.

Feedback-based schemes for MDC. Several feedback-based techniques have been proposed for correcting
errors in MDC-encoded video. These include (a) RPS,14,15 (b) error concealment,15 and (c) retransmission
with fast decoding.15 In the RPS method, the sender, on receiving a loss notification, predicts the next
frame from a correctly transmitted frame,14,15 and may additionally use correctly received portions of
the corrupted reference frame.15 In the error concealment method, the encoder, on receiving feedback, applies
error concealment to the erroneous frame and uses the concealed frame to predict future frames. This approach
requires the encoder to know the error concealment method used at the decoder, which is usually not the case
in practice. The retransmission approach15 is very similar to that of Ref. 11, except that it uses an MDC
structure. All these techniques, however, were proposed for systems with very short notification delays
(1-2 frames)15 and do not appear practical when this delay is long.

1.2 Contributions

In this paper, we propose a novel approach, which we call Inhomogeneous Temporal Multiple Description
Coding (IHTMDC) for video. We account for long feedback delay in the design of our approach, which has
not been considered by most prior methods. Our approach reduces error-propagation distortion while waiting
for feedback and, on receiving it, adapts its coding structure to limit error propagation. We call this
adaptation mechanism Cross-Description RPS (CDRPS), and show that in the presence of long feedback
delay it is more efficient than existing RPS-based methods.14,15 In the experimental section, we compare
different coding structures at different packet error rates, and show that our approach outperforms
conventional methods at higher error rates.

Figure 2. (a) Conventional “IPPP” coding structure, and (b) homogeneous temporal MDC.

Figure 3. Inhomogeneous Temporal Multiple Description Coding (IHTMDC) structure with interleaving factor k = 4.

1.3 Outline

The remainder of this paper is organized as follows. In Section 2, we describe our approach. Details of our
experiments and results are provided in Section 3. Conclusions and outlook for future work are given in
Section 4.

2. DESCRIPTION OF THE PROPOSED SCHEME

In this section, we first describe the coding structure of a conventional video codec and its generalization
to temporal MDC. We then describe our proposed scheme and show its relation to conventional (single-description)
and multiple description schemes. We also describe mechanisms for adapting this scheme
using delayed feedback, allowing it to limit error propagation.

2.1 Conventional and MDC structures

We show the coding structure employed by the majority of today’s real-time video codecs in Figure 2(a). It
consists of an intra- or IDR-frame followed by temporally predicted P-frames, and is commonly referred to as
the “IPPP” structure. The disadvantage of this scheme is its continuous chain of dependencies between frames,
which makes it susceptible to error propagation.

One way to break this dependency is to create two or more sub-sequences of frames, which are not cross-
referencing each other. We illustrate this approach in Figure 2(b), where we use two uniformly sampled
sub-sequences to produce two independent encodings or descriptions of video. This is a very simple example
of an MDC scheme for video, which we will call homogeneous temporal MDC (HMDC).

2.2 Inhomogeneous temporal MDC

We now propose a modification of the temporal MDC method in which the temporal distances between adjacent
frames in each description are not equal. We call this approach Inhomogeneous Temporal MDC (IHTMDC),
and illustrate it with an example in Figure 3. In this figure, consecutive frames i and (i+1) of a description
are five frames apart in the original sequence, while frames (i+1) and (i+2) are one frame apart.

Our motivation for using this scheme is to preserve as much of the correlation between frames as possible
while still generating separate descriptions, which results in the hybrid coding structure shown in Figure 3.



2.2.1 Connection to a single description and HMDC

We characterize IHTMDC by an interleaving factor k. In our example in Figure 3, this factor is set to
k = 4. Different coding structures can be derived from IHTMDC by varying k. For example, when
k = 1, IHTMDC turns into the homogeneous temporal MDC scheme shown in Figure 2(b). Similarly, if we
set k = ∞, IHTMDC effectively becomes the single-description IPPP coding structure shown in Figure 2(a).

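To make the role of k concrete, here is a minimal Python sketch of the frame-to-description mapping and the resulting reference choice. It assumes two descriptions built from alternating runs of k consecutive frames, which is our reading of Figure 3 (for k = 4 it reproduces the five-frame and one-frame spacings described in Section 2.2), and a single previous-frame reference per description. `description` and `reference` are hypothetical helper names, not part of x264 or the authors' encoder.

```python
from typing import Optional


def description(n: int, k: Optional[int]) -> int:
    """Description (0 or 1) of frame n.

    Assumed layout: runs of k consecutive frames alternate between the two
    descriptions. k=None stands for k = infinity (a single description).
    """
    if k is None:
        return 0
    if k < 1:
        raise ValueError("k must be a positive integer or None")
    return (n // k) % 2


def reference(n: int, k: Optional[int]) -> Optional[int]:
    """Previous frame of the same description, used as the P-frame reference.

    Returns None for the first frame of a description (coded as I/IDR).
    """
    d = description(n, k)
    for m in range(n - 1, -1, -1):
        if description(m, k) == d:
            return m
    return None


if __name__ == "__main__":
    for k in (1, 4, None):
        refs = [reference(n, k) for n in range(12)]
        print(f"k={k if k is not None else 'inf'}: {refs}")
```

With k = 4, frame 8 references frame 3 (five frames back) and frame 9 references frame 8; with k = 1 every frame references the frame two positions back, as in HMDC; and with k = None (infinity) each frame references its immediate predecessor, as in IPPP.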
2.2.2 Effects of packet loss

In the IPPP coding structure, a packet loss corrupts all successive frames. In IHTMDC, on the other hand,
the error propagates through only one of the descriptions, as illustrated in Figure 4(a). Moreover, the
IHTMDC structure allows the decoder to better conceal successive frames of a corrupted description by
using neighboring frames belonging to the uncorrupted description, thereby limiting error propagation drift to
at most k consecutive frames.
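Under the same two-description assumption (alternating runs of k frames), the sketch below lists which frames inherit an error when a single frame is lost and no feedback arrives, and measures the longest run of consecutive corrupted frames in display order. The helper names are hypothetical, and concealment is ignored; the point is only the dependency structure.

```python
from typing import List, Optional


def description(n: int, k: Optional[int]) -> int:
    """Description (0 or 1) of frame n: runs of k frames alternate; None means k = infinity."""
    return 0 if k is None else (n // k) % 2


def corrupted_frames(lost: int, n_frames: int, k: Optional[int]) -> List[int]:
    """Frames whose prediction chain passes through the lost frame (no feedback, no concealment)."""
    d = description(lost, k)
    return [n for n in range(lost, n_frames) if description(n, k) == d]


def longest_run(frames: List[int]) -> int:
    """Longest run of consecutive indices in a sorted list (worst error burst in display order)."""
    best = run = 0
    prev = None
    for n in frames:
        run = run + 1 if prev is not None and n == prev + 1 else 1
        best = max(best, run)
        prev = n
    return best


if __name__ == "__main__":
    for k in (1, 4, None):
        bad = corrupted_frames(lost=5, n_frames=40, k=k)
        print(f"k={k if k is not None else 'inf'}: {len(bad)} corrupted, longest run {longest_run(bad)}")
```

For a loss at frame 5 in a 40-frame clip, IPPP (k = None) corrupts all 35 remaining frames in one run, whereas with k = 4 the corrupted frames come in runs of at most 4, separated by intact frames of the other description that a decoder can use for concealment.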

2.2.3 Effects of interleave factor k on overall distortion

When considering transmission over a lossy channel, the overall (end-to-end) distortion of received video can
be approximately expressed as:

$$ D_{ETE}(k) \approx D_{Q}(k) + D_{T}(k), \tag{1} $$

where $D_{ETE}$, $D_{Q}$, and $D_{T}$ denote the end-to-end, source coding, and transmission-induced distortions,
respectively. Specific conditions under which (1) holds true and related discussion can be found in Ref. 16.

Assuming that (1) holds true, we may conjecture that for a given source and a given channel there may
exist an optimal choice of parameter k for our proposed coding scheme:

$$ k^{*} = \arg\min_{k \in \mathbb{Z}^{+}} D_{ETE}(k). \tag{2} $$

Intuitively, with no transmission errors, single-description coding (k = ∞) is most desirable since it yields
the least coding distortion. However, in the presence of packet loss, k = ∞ may not be a good choice, since
errors propagate through the rest of the video, yielding a larger $D_{ETE}$. Using a smaller k increases $D_{Q}$,
but it also makes the bitstream less sensitive to transmission errors, as errors propagate through only one of
the descriptions, which can result in a smaller $D_{ETE}$. Therefore, the optimal choice of k should depend on
the packet error rate.
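Over a finite set of candidate interleaving factors, Equation (2) reduces to a simple search. A minimal sketch, assuming that $D_{Q}(k)$ and $D_{T}(k)$ have already been estimated for each candidate k (for example, as MSE from trial encodings and loss simulations; how they are obtained is not specified here), could look like this. `optimal_k` is a hypothetical name.

```python
from typing import Mapping


def optimal_k(d_q: Mapping[float, float], d_t: Mapping[float, float]) -> float:
    """k* = argmin_k [D_Q(k) + D_T(k)], following Eqs. (1)-(2).

    Searches only the k values present in both maps. Distortions must be in an
    additive unit such as MSE (not PSNR in dB). Use float("inf") as the key
    for single-description coding (k = infinity).
    """
    candidates = set(d_q) & set(d_t)
    if not candidates:
        raise ValueError("d_q and d_t share no k values")
    # Sorting first makes ties resolve deterministically toward the smaller k.
    return min(sorted(candidates), key=lambda k: d_q[k] + d_t[k])
```

Because the comparison is additive, PSNR values in dB would need to be converted to MSE before being passed in; the approximation in (1) is what justifies summing the two terms at all.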

To test this conjecture, in Section 3 we study the effect of k on rate-distortion performance under
different packet error rates.

2.3 Adapting IHTMDC in response to feedback

We now discuss how RTCP feedback can be used to limit error propagation in IHTMDC-coded video. There
are at least two possible solutions:

• Intra refresh: Encode the next frame belonging to the corrupted description as an IDR/I-frame, as
illustrated in Figure 4(b).

• Cross-description RPS (CDRPS): In this approach, the encoder decides, based on rate-distortion
optimization, whether to encode the next frame belonging to the corrupted description as an intra/IDR
frame, or to encode it using the nearest frame from the uncorrupted description as the reference. The
latter approach is illustrated in Figure 4(c).
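The paper does not spell out the rate-distortion criterion behind the CDRPS decision. A common choice, assumed here, is the Lagrangian cost J = D + λR evaluated for each candidate coding of the next frame in the corrupted description. The sketch below only compares pre-computed (D, R) estimates; `Candidate` and `choose_mode` are hypothetical names, and the numbers in the usage example are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    mode: str          # e.g. "intra" or "cross-description inter"
    distortion: float  # estimated distortion D (e.g., sum of squared errors)
    rate: float        # estimated rate R in bits


def choose_mode(candidates: Sequence[Candidate], lam: float) -> Candidate:
    """Return the candidate with the smallest Lagrangian cost J = D + lam * R."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return min(candidates, key=lambda c: c.distortion + lam * c.rate)


if __name__ == "__main__":
    # Hypothetical estimates for one frame; real values would come from trial encodes.
    options = [
        Candidate("intra", 100.0, 8000.0),
        Candidate("cross-description inter", 120.0, 2000.0),
    ]
    print(choose_mode(options, lam=0.01).mode)
```

A larger λ favors the cheaper cross-description prediction, while a small λ lets the lower-distortion intra frame win; in a real encoder this decision would typically be made per macroblock rather than per frame.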

Figure 4. IHTMDC subjected to errors. (a) Error propagation in IHTMDC without feedback. IHTMDC with feedback
when using (b) intra refresh, and (c) cross-description reference picture selection.

Performing an intra refresh or CDRPS on the next corrupted frame limits error propagation in the
corrupted description. When k = ∞, the above two methods reduce to conventional single-description intra
refresh and RPS, respectively. However, when k is finite, the CDRPS method differs from
traditional RPS techniques.10 In traditional RPS schemes, the reference is always set to the last frame that was
confirmed as delivered. With a long feedback delay of 1-5 seconds, this means that such a reference would have
to be 25-100 frames back. With IHTMDC and one surviving description, on the other hand, such a reference can
always be found within the last k frames. This makes the scheme much more suitable for systems with delayed
feedback.
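The reference-distance argument can be checked with a small sketch. It assumes the alternating-run description layout with k frames per run, that every frame before the loss was received, and that feedback arrives a fixed number of frames after the loss. Traditional RPS is modeled as referencing the frame just before the loss, while CDRPS references the nearest earlier frame of the other description. All function names are hypothetical.

```python
from typing import Tuple


def description(n: int, k: int) -> int:
    """Description (0 or 1) of frame n when runs of k frames alternate (assumed layout)."""
    return (n // k) % 2


def next_frame_of(desc: int, start: int, k: int) -> int:
    """First frame index >= start that belongs to description desc."""
    n = start
    while description(n, k) != desc:
        n += 1
    return n


def cdrps_reference(n: int, k: int) -> int:
    """Nearest earlier frame from the other (assumed uncorrupted) description."""
    d = description(n, k)
    m = n - 1
    while m >= 0 and description(m, k) == d:
        m -= 1
    if m < 0:
        raise ValueError("no earlier frame from the other description")
    return m


def reference_distances(lost: int, delay_frames: int, k: int) -> Tuple[int, int]:
    """(CDRPS distance, traditional-RPS distance) for the first frame of the
    corrupted description coded after feedback arrives.

    Traditional RPS is modeled as referencing lost - 1, the last frame known
    to be intact when the loss is reported.
    """
    n = next_frame_of(description(lost, k), lost + delay_frames, k)
    return n - cdrps_reference(n, k), n - (lost - 1)


if __name__ == "__main__":
    print(reference_distances(lost=10, delay_frames=25, k=4))
```

For a loss at frame 10, a 25-frame delay (one second at 25 fps), and k = 4, CDRPS references a frame 4 positions back, while a reference confirmed before the loss is 26 positions back. The CDRPS distance never exceeds k, regardless of the delay.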

In practical implementations, the CDRPS and intra refresh techniques can be used in a complementary
fashion. For example, when the encoder knows that both descriptions have been lost since the last feedback,
it may insert a new IDR frame in one description and use a cross-description reference in the other
description to restart the encoding process.
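The complementary use of both tools can be written as a small decision rule. This sketch assumes the encoder tracks which of the two descriptions were corrupted since the last feedback and the index of each description's next frame. Giving the IDR frame to whichever corrupted description is coded first is our assumption, not something the paper specifies, and `plan_recovery` is a hypothetical name.

```python
from typing import Dict, Mapping, Set


def plan_recovery(corrupted: Set[int], next_frame: Mapping[int, int]) -> Dict[int, str]:
    """Map each corrupted description (0 or 1) to a recovery action.

    - No corrupted description: nothing to do.
    - One corrupted description: CDRPS (the encoder's RD decision between an
      intra frame and a reference from the intact description).
    - Both corrupted: an IDR frame in the description coded first (assumption),
      and a cross-description reference in the other one.
    """
    if not corrupted:
        return {}
    if len(corrupted) == 1:
        (d,) = tuple(corrupted)
        return {d: "CDRPS"}
    first = min(corrupted, key=lambda d: next_frame[d])
    return {d: ("IDR" if d == first else "cross-ref") for d in corrupted}


if __name__ == "__main__":
    print(plan_recovery({0, 1}, {0: 40, 1: 36}))
```

In the two-description case, the description that restarts with a cross-description reference can then predict from the freshly coded IDR frame instead of from corrupted data.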

2.4 Opportunistic error concealment

In our IHTMDC approach, when a packet is lost, successive frames belonging to the lost packet’s description
are corrupted due to error propagation as illustrated in Figure 5(a). Although half the descriptions are
uncorrupted, error propagation can sometimes cause flickering due to the display of alternating corrupted
and uncorrupted descriptions, which can lower the overall visual quality. To mitigate this problem, we present
an opportunistic error concealment method for our IHTMDC scheme.

Figure 5. Opportunistic error concealment method for the IHTMDC scheme. (a) After a packet is lost, successive frames
belonging to the lost packet’s description are corrupted. (b) The opportunistic error concealment method samples
and holds the uncorrupted frames over the entire length of the description. For example, frames ‘x’ and ‘y’ from the
uncorrupted description are repeated over the entire length of the description.

The decoder, on detecting a lost packet, conceals it using a conventional error concealment method, such
as frame copy. For the next uncorrupted description, the decoder samples the first frame, labeled ‘x’ in
Figure 5(a), and repeats it for the entire length of the description, as shown in Figure 5(b). For the next
description, it uses the last frame from the previous description, labeled ‘y’ in Figure 5(a), and repeats it
over the entire length of the description, as illustrated in Figure 5(b). This error concealment procedure
is repeated over a period equal to the RTCP feedback delay. Although our concealment lowers the frame
rate, the resulting video appears almost smooth, since the sampled frames are temporally close.
Our error concealment thus improves overall visual quality while having low computational complexity.
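A minimal sketch of the resulting display schedule follows. It reflects our reading of Figure 5, where each "description" in the text above is a run of k consecutive frames of the same description: runs of the uncorrupted description show their first frame ('x'), and runs of the corrupted description show the last frame of the preceding uncorrupted run ('y'). The two-description layout (n // k) % 2 and the name `opportunistic_schedule` are assumptions.

```python
from typing import List


def opportunistic_schedule(start: int, stop: int, k: int, bad_desc: int) -> List[int]:
    """Index of the frame displayed at each position start..stop-1 during concealment.

    Assumes two descriptions made of alternating runs of k frames ((n // k) % 2),
    with description bad_desc corrupted. start should be the first frame of the
    uncorrupted run that follows the loss.
    """
    if k < 1 or start < k:
        raise ValueError("need k >= 1 and start >= k so a preceding run exists")
    shown = []
    for n in range(start, stop):
        run_start = (n // k) * k
        if (n // k) % 2 != bad_desc:
            shown.append(run_start)      # hold 'x': first frame of this uncorrupted run
        else:
            shown.append(run_start - 1)  # hold 'y': last frame of the previous (uncorrupted) run
    return shown


if __name__ == "__main__":
    print(opportunistic_schedule(start=4, stop=16, k=4, bad_desc=0))
```

With k = 4 and description 0 corrupted, the displayed frame changes once every four frames (4, 7, 12, ...), which matches the reduced but nearly regular frame rate described above, and only frames of the intact description are ever shown.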

3. EXPERIMENTAL RESULTS

In this section, we describe our experimental setup and results.

3.1 Experiment setup

In our tests, we used standard CIF and high-definition test sequences,17 and looped them back and
forth to generate 1000 frames for each test. We used “Foreman”, “Soccer”, and “News” as CIF sequences
(352 × 288, 30 fps), and “Pedestrian” as the HD sequence (1080p, 25 fps). We generated IHTMDC
bitstreams using a modified version of the x264 encoder.18 In Ref. 19, the x264 encoder was compared with the
H.264 JM reference encoder and was shown to be 50 times faster while providing bitrates within 5% for the same
PSNR. A constant-QP rate control option and one reference frame were used in all our experiments. We used
the H.264 JM decoder with the frame-copy error concealment method enabled.
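The back-and-forth looping used to extend each test sequence can be expressed as an index mapping. This sketch assumes the turnaround frames are not duplicated (the paper does not say), and `pingpong_indices` is a hypothetical name.

```python
from typing import List


def pingpong_indices(n_src: int, n_out: int) -> List[int]:
    """Source-frame index for each of n_out output frames.

    The clip is played forward, then backward, repeatedly:
    0, 1, ..., n_src-1, n_src-2, ..., 1, 0, 1, ...
    Endpoint frames are not repeated at the turnarounds (an assumption).
    """
    if n_src < 1 or n_out < 0:
        raise ValueError("n_src must be >= 1 and n_out >= 0")
    if n_src == 1:
        return [0] * n_out
    period = 2 * (n_src - 1)  # one forward pass plus one backward pass
    out = []
    for i in range(n_out):
        p = i % period
        out.append(p if p < n_src else period - p)
    return out


if __name__ == "__main__":
    idx = pingpong_indices(n_src=300, n_out=1000)  # hypothetical 300-frame source clip
    print(idx[295:305])
```

Playing the clip backward at each end avoids the abrupt scene jump that simple wrap-around looping would introduce, which would otherwise act like a scene cut for the encoder.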

For CIF sequences, we set QP = 26, 28, 30, 32, and 34, and encode each frame as a single slice, so a lost
packet corresponds to a lost frame. For the 1080p sequence, we set QP = 30, 34, 38, and 42, and encode using 14
slices per frame, as this was necessary to keep the NAL unit size within 1400 bytes at our operating bitrates.
To understand the effectiveness of the proposed methods, we set up an experiment in which we simulated a
channel with no errors and channels with packet error rates (PER) of 10⁻² and 3 × 10⁻², which are typical for
conversational services over LTE. We also implemented RTCP notification with a one-second delay. We
tested IHTMDC with interleaving factors k = 1, 2, and 4, as well as the conventional H.264 single-description
coding scheme (k = ∞). The RPS technique (CDRPS in the case of IHTMDC) was used to correct errors upon
RTCP notification.
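The channel simulation can be sketched as follows. The paper does not state its loss model, so independent (Bernoulli) packet losses at the given PER are an assumption here, as is mapping a lost 1080p slice to its frame by integer division when packets are numbered frame by frame with 14 slices per frame. Feedback is modeled as arriving a fixed one second after the loss. All names are hypothetical.

```python
import random
from typing import List


def simulate_losses(n_packets: int, per: float, seed: int = 0) -> List[int]:
    """Indices of lost packets, assuming independent (Bernoulli) losses with probability per."""
    if not 0.0 <= per <= 1.0:
        raise ValueError("per must lie in [0, 1]")
    rng = random.Random(seed)  # seeded so runs are reproducible
    return [i for i in range(n_packets) if rng.random() < per]


def lost_frame_of(packet: int, slices_per_frame: int) -> int:
    """Frame containing a packet, assuming one slice per packet numbered frame by frame."""
    return packet // slices_per_frame


def feedback_frame(lost_frame: int, fps: int, delay_s: float = 1.0) -> int:
    """Frame index at which the sender learns about the loss (fixed notification delay)."""
    return lost_frame + round(delay_s * fps)


if __name__ == "__main__":
    lost = simulate_losses(1000, 1e-2, seed=1)  # CIF case: one packet per frame
    for f in lost[:5]:
        print(f"frame {f} lost, feedback at frame {feedback_frame(f, fps=30)}")
```

Real LTE losses are often bursty, so a two-state (Gilbert-Elliott) model could be substituted for the Bernoulli draw if burst behavior matters for a given study.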

3.2 Results

Table 1 illustrates the visual quality achievable with single-description coding (k = ∞) vs. IHTMDC with
k = 4, using the sequence “Pedestrian.yuv”. The error starts at frame number 166 for both schemes. As
expected, in the single-description scheme (k = ∞) the error propagates into frame 186, while in the case
of IHTMDC (k = 4) the error is not noticeable in frames 176 and 186.



[Figure 6 plots: PSNR (dB) vs. bitrate (kb/s) for k = 1, 2, 4, and ∞. Foreman.yuv: (a) no error, (b) PER = 10⁻², (c) PER = 3 × 10⁻². Soccer.yuv: (d) no error, (e) PER = 10⁻², (f) PER = 3 × 10⁻². News.yuv: (g) no error, (h) PER = 10⁻², (i) PER = 3 × 10⁻².]
Figure 6. Rate-distortion performance of IHTMDC for different interleaving factors k, packet error rates, and
sequences. Plots for “Foreman.yuv”: (a) no errors, (b) PER = 10⁻², and (c) PER = 3 × 10⁻². Plots for “Soccer.yuv”:
(d) no error, (e) PER = 10⁻², and (f) PER = 3 × 10⁻². Plots for “News.yuv”: (g) no error, (h) PER = 10⁻², and
(i) PER = 3 × 10⁻². The cases k = ∞ and k = 1 correspond to the single-description scheme and homogeneous
temporal MDC, respectively.

Figures 6(a)-(i) show the rate-distortion performance of IHTMDC with CDRPS for different packet error
rates and values of k for the CIF sequences. As expected, the single-description scheme (k = ∞) performs
best in the no-error case, as shown in Figures 6(a), (d), and (g). With packet loss, IHTMDC and HMDC
perform better than the single-description scheme for the “Foreman” and “Soccer” sequences, as shown in
Figures 6(b), (c), (e), and (f). For the “News” sequence at PER = 10⁻², single-description coding (k = ∞)
performs best at bitrates below 250 kb/s, and k = 4 performs best at higher bitrates, as shown in Figure 6(h).
This indicates that the choice of k also depends on the video content and operating bitrate. For “News” at
PER = 3 × 10⁻², k = 4 shows the best R-D performance. For “News” with packet loss, HMDC (k = 1) performs
worse than the single-description scheme, as shown in Figures 6(h) and (i).

For the HD “Pedestrian” sequence, we only test PER = 10⁻³ and 2 × 10⁻³, since our IHTMDC
scheme at k = 4 demonstrates good performance at such low packet error rates. Specifically, for PER =
10⁻³, k = 4 performs similarly to single-description coding (k = ∞) at bitrates below 1.4 Mb/s, and
performs best at higher bitrates, yielding up to 0.7 dB gain over single-description coding, as shown in
Figure 7(b). For PER = 2 × 10⁻³, k = 4 performs best, yielding up to 1.5 dB gain over single-description
coding, as shown in Figure 7(c).



          frame# 166   frame# 176   frame# 186
k = ∞     [image]      [image]      [image]
k = 4     [image]      [image]      [image]

Table 1. Illustration of error propagation in single-description coding vs. IHTMDC using CDRPS with a one-second
feedback delay, using the “Pedestrian.yuv” sequence. The error occurs at frame number 166. The red ellipses highlight
errors. For single-description coding (k = ∞), the error propagates all the way to frame number 186, while in the
IHTMDC case (k = 4), error propagation is not noticeable in frames 176 and 186.

[Figure 7 plots for Pedestrian.yuv: PSNR (dB) vs. bitrate (Mb/s) for k = 1, 2, 4, and ∞: (a) no error, (b) PER = 10⁻³, (c) PER = 2 × 10⁻³.]
Figure 7. Rate-distortion performance of IHTMDC for different interleaving factors k and packet error rates for the
“Pedestrian.yuv” sequence: (a) no error, (b) PER = 10⁻³, and (c) PER = 2 × 10⁻³.

4. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented an inhomogeneous temporal multiple description video coding technique that
provides strong error resilience and is suitable for systems with long feedback delay. Our scheme
effectively uses feedback to reset the coding structure, thereby limiting error propagation, and can be
used to derive different coding structures by varying the interleaving factor. We compared our approach with
single-description coding and homogeneous temporal multiple description coding with feedback at different
packet error rates, and found that our scheme provides better visual quality and rate-distortion performance
at higher error rates. In this work, we studied our approach using a fixed interleaving factor. In
future work, we plan to adapt the interleaving factor dynamically to the observed packet error rate.

REFERENCES

[1] T. Wiegand and G. J. Sullivan, “The picturephone is here. Really,” IEEE Spectrum, vol. 48, no. 9, pp.
50–54, Sept. 2011.

[2] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, “RFC 3550: RTP: A transport protocol for
real-time applications,” July 2003.

[3] J. Ott, S. Wenger, N. Sato, C. Burmeister, and J. Rey, “IETF RFC 4585: Extended RTP profile for
real-time transport control protocol (RTCP)-based feedback (RTP/AVPF),” 2006.



[4] Y. Wang, S. Wenger, J. Wen, and A. K. Katsaggelos, “Review of error resilient coding techniques for
real-time video communications,” IEEE Signal Proc. Magazine, vol. 17, pp. 61–82, 2000.

[5] Y. Wang and Q-F. Zhu, “Error control and concealment for video communication – a review,” in
Proceedings of the IEEE, 1998, pp. 974–997.

[6] S. Kumar, L. Xu, M. K. Mandal, and S. Panchanathan, “Error resiliency schemes in H.264/AVC
standard,” J. Visual Communication and Image Representation, vol. 17, no. 2, pp. 425–450, 2006.

[7] T. Stockhammer, “Error robust macroblock mode and reference frame selection,” in VCEG JVT-B102,
Jan 2002.

[8] R. Zhang, S. L. Regunathan, and K. Rose, “Video coding with optimal inter/intra-mode switching for
packet loss resilience,” IEEE Journal on Selected Areas in Communications, vol. 18, no. 6, pp. 966–976,
2000.

[9] T. Stockhammer, M. M. Hannuksela, and T. Wiegand, “H.264/AVC in wireless environments,” IEEE
Trans. Cir. and Sys. for Video Technol., vol. 13, pp. 657–673, 2003.

[10] B. Girod and N. Färber, “Feedback-based error control for mobile video transmission,” in Proceedings
of the IEEE, 1999, pp. 1707–1723.

[11] I. Rhee and S. R. Joshi, “Error recovery for interactive video transmission over the internet,” IEEE
Journal on Selected Areas in Communications, vol. 18, pp. 1033–1049, 2000.

[12] V. K. Goyal, “Multiple description coding: Compression meets the network,” IEEE Signal Processing
Magazine, vol. 18, no. 5, pp. 74–93, Sept. 2001.

[13] Y. Wang, A. R. Reibman, and S. Lin, “Multiple description coding for video delivery,” Proceedings of
the IEEE, vol. 93, no. 1, pp. 57–70, 2005.

[14] S. Fukunaga, T. Nakai, and H. Inoue, “Error resilient video coding by dynamic replacing of reference
pictures,” in IEEE GLOBECOM 1996, 1996, vol. 3, pp. 1503–1508.

[15] W. Tu and E. G. Steinbach, “Proxy-based reference picture selection for error resilient conversational
video in mobile networks,” IEEE Trans. Cir. and Sys. for Video Technol., vol. 19, no. 2, pp. 151–164,
Feb 2009.

[16] Z. Chen and D. Wu, “Rate-distortion optimized cross-layer rate control in wireless video communication,”
IEEE Trans. Cir. Sys. Video Tech., vol. 22, no. 3, pp. 352–365, March 2012.

[17] “Raw video sequences,” ftp.ldv.e-technik.tu-muenchen.de.

[18] “x264 encoder,” http://www.videolan.org/developers/x264.html.

[19] Loren Merritt and Rahul Vanam, “Improved rate control and motion estimation for H.264 encoder,”
in Proceedings of IEEE ICIP (5), 2007, pp. 309–312.