Fast transforms for Intrapredictionbased image
and video coding
Ankur Saxena, Felix C. Fernandes
Samsung Telecommunications America
1301 E. Lookout Dr, Richardson, TX 75082, USA
Email: {asaxena,fernandes}@sta.samsung.com
Yuriy A. Reznik
InterDigital Communications
9710 Scranton Rd, San Diego, CA 92121, USA
Email: yreznik@ieee.org
Abstract
In this paper, we provide an overview of the DCT/DST transform scheme for intra coding in the HEVC
standard. A unique feature of this scheme is the use of DSTVII transforms in addition to DCTII. We further
derive factorizations for fast joint computation of DCTII and DSTVII transforms of several sizes. Simulation
results for the DCT/DST scheme in the HM reference software for HEVC are also provided together with a
discussion on computational complexity.
I. INTRODUCTION
Most image and videocoding standards such as JPEG, MPEG2, MPEG4 AVC/H.264, employ block
based transform coding as a framework for efﬁcient image and video compression. The pixel domain data is
transformed to frequency domain using a transform process on a blockbyblock basis. For typical images,
most of the energy is concentrated in lowfrequency transform coefﬁcients. Following the transform, a
relatively large stepsize quantizer can be used for higherfrequency transform coefﬁcients in order to
compact energy more efﬁciently and to attain better compression. Hence it is required to devise the
optimal transform for each image block to fully decorrelate the transform coefﬁcients. The KarhunenLoeve
Transform (KLT) is known to be optimal under certain conditions for transform coefﬁcients. However,
practical use of KLT is limited due to its high computational complexity. It is well known that Discrete
Cosine Transform of Type II (DCTII) provides an attractive alternative to KLT in terms of decorrelation
and energy compaction [3] and performance close to KLT. But with the advent of intra prediction, this is
no longer the case and the optimal transform should adapt to the intra prediction mode.
Recently, Han, Saxena & Rose [7] analytically derived Discrete Sine Transform of TypeVII (DSTVII)
with frequency and phase components different from the conventional DCT 1to be the actual KLT along the
prediction direction for intra modes in H.264/AVC. They also showed that if prediction is not performed
along a particular direction, then DCT performs close to KLT. The idea was applied to the vertical
and horizontal modes in intraprediction in H.264/AVC and a combination of the DST and conventional
DCT was used for the vertical and horizontal modes. Saxena & Fernandes in [13], [15] showed that
a combination of DCTII and DSTVII is indeed the optimal transform for all the oblique modes in
uniﬁed intra prediction [9] in the HEVC standardization. The DCT/DST transform scheme described in
this paper was also adopted in the HEVC standardization for block size 4x4 for Intra Luma blocks [17].
Recently, a simpliﬁcation of the DCT/DST technology [18] was adopted in HEVC which removes the
modedependency of whether to select between DCT or DST as the horizontal (and respectively vertical)
transform, and DST was always used for the 4x4 Intra Luma blocks.
In the literature, until recently there were no wellknown algorithms for ﬁnding fast implementations
of the DSTVII similar to various fast implementations for DCT and DST Types I to IV. Recently,
Chivukula and Reznik have presented a technique for ﬁnding fast implementations for DSTVII and
DSTVI (transpose of DSTVII) in [5], where they show that the 4x4 DST transform matrix can be
1In the remainder of the paper, unless otherwise speciﬁed, DCT and DST are referred to as DCTII and DSTVII respectively.
2013 Data Compression Conference
10680314/13 $26.00 © 2013 IEEE
DOI 10.1109/DCC.2013.9
13
��� ��� ���
Fig. 1. Examples of Category 1 oblique modes are shown in (a) and (b) where prediction is performed from one direction only: either the
top row or left column. (c) shows example of Category 2 oblique mode where prediction is performed from both directions top row and left
column.
(a) (b)
Fig. 2. (a) Vertical Intra prediction mode, and (b) DCT and DST basis vectors. For DCT, the ﬁrst basis vector is ﬂat. For DST, the ﬁrst
basis vector is zero at one boundary, and maximum at the other boundary. This mimics the prediction error residue behavior for vertical
mode more efﬁciently as compared to the DCT basis vector.
implemented using 5 multiplications. Note that, by contrast the 4x4 DCT in HM, the Test Model for
HEVC standardization is implemented via 6 multiplications. In this paper, we extend study of factorization
techniques started in [5], and show that for certain sizes DCTII and DSTVII transforms allow joint
factorizations. Such embedded joint factorizations can be of interest for efﬁcient hardware implementations.
The rest of the paper is organized as follows: Sec. II outlines the details of uniﬁed intra prediction, and
the discussion about the DCT/DST transform scheme in the HEVC standardization along with experimental
results. Sec. III discusses fast factorization of DCTII and DSTVII transforms, followed by Conclusions
in Sec. IV.
II. INTRAPREDICTION CODING MODES AND TRANSFORMS IN HEVC STANDARD
A. Review of Uniﬁed Intra Prediction
The ongoing HEVC standardization uses uniﬁed intra prediction in which up to 35 different intra
prediction modes for the Luma component of the video at a particular block size can be deﬁned. In
general, these 35 intraprediction modes can be divided into 3 categories as follows:
1) Category 1 oblique modes (Fig. 1): Here prediction is performed from the decoded pixels from
either the top row or the left column. The vertical mode and the horizontal mode in [9] are special
cases of this oblique mode when prediction direction is vertical or horizontal respectively.
2) Category 2 oblique modes (Fig. 1): Here prediction is performed from both the top row and the left
column pixels.
3) DC and Planar modes: These are nondirectional modes, and can be used for ﬁtting ﬂat regions in
case of DC mode, or surface ﬁtting in case of Planar mode.
14
Tint
DSTVII
=
⎡
⎢⎢⎣
29 55 74 84
74 74 0 −74
84 −29 −74 55
55 −84 74 −29
⎤
⎥⎥⎦ ∼ 128
[
2√
2N + 1
sin
(
π(2i+ 1)(j + 1)
2N + 1
)]
i,j=0,...,N−1
N=4
Tint
DCTII
=
⎡
⎢⎢⎣
64 64 64 64
83 36 −36 −83
64 −64 −64 64
36 −83 83 −36
⎤
⎥⎥⎦ ∼ 128
[√
2
N
λ(i) cos
(
πi(2j + 1)
2N
)]
i,j=0,...,N−1
N=4
.
where : λ(i) =
[ 1√
2
, if i = 0
0, otherwise
Fig. 3. Integer transform matrices deﬁned in HEVC standard and their relationship to 4point DSTVII and DCTII transforms.
B. Transforms in HEVC for intra prediction modes
In the HEVC, standard, the DCT/DST transform scheme described was adopted in [17] for block size
4x4 for Intra Luma blocks. As an example, we brieﬂy describe why DST is the optimal transform rather
than DCT for some prediction modes. Speciﬁcally, consider the “Vertical” prediction mode in Fig. 2(a),
which is a special case of Category 1 Oblique modes. Here, after prediction from the top row pixel x0j (or
its reconstruction), the energy in the prediction residues xij (for j ∈ {1, N}, and for i ∈ {1, N}) increases
as we go down the column. Now consider, the basis vectors of DCTII and DSTVII in Fig. 2(b). For
DCT, the ﬁrst basis vector is ﬂat, while for DST, the ﬁrst basis vector increases as we go from left to
right, with zero at the left boundary, and maxima at the right boundary. Hence, the ﬁrst basis vector of
DST mimics the prediction error residue behavior more efﬁciently as compared to the DCT basis vector,
and DST would be a better transform to compress such a kind of residue. For a detailed mathematical
derivation of this, we refer the reader to [7] and [13]. We also show the exact integer based approximations
of DCTII and DSTVII transform matrices being used in the HEVC standardization in Fig. 3.
C. Performance of DCT/DST Intracoding scheme
Here, we present the results when the DCT/DST scheme is applied at size 4x4 in the HEVC Test Model.
Note that, additional gains of applying the DCT/DST scheme on 8x8 to 32x32 blocks are minimal given
the high implementation complexity of 8x8 and 32x32 DST in hardware, and can be found in [16]. In fact,
a different category of transforms: secondary transforms can provide further gains for higher order blocks,
and we refer the reader to [14] for more details about the secondary transforms. The DCT/DST scheme
is only applied to Intra Transform Units and Luma component. For Chroma and Inter Transform Units,
the conventional DCT is always used. For our simulations, we encoded full length sequences (which had
150 to 600 frames) at various resolutions varying from 416x240 to 2560x1600. These video sequences
constitute the designated test set for the HEVC standardization. Full details about the GOP size, Intra
period, coding structure of these video sequences etc. are available in [2]. Note that we present the results
for only the evaluation of the DCT/DST as an alternate transform here instead of DCT and retain all the
other testing settings as in [2]. We present the results in HM 7.0, the seventh version of the Test Model
for HEVC standardization for the following 4 settings: Intra High Efﬁciency10 bits, IntraMain, Random
Access High Efﬁciency10 bits, and Random Access Main as stipulated in the Common Test Conditions
[2].
Table I show the average BjøntegaardDelta rate (BDrate) [1] gains in Luma for the DCT/DST scheme
in HM 7.0. Classes A to E include the various test sequences that are the designated test set for the
HEVC standardization. More details about these sequences is in [2]. For Class E sequences, simulations
were not performed for Random Access conditions as stipulated in the common test conditions for the
15
TABLE I
BDRATE [1] GAINS WHEN THE DCT/DST SCHEME IS APPLIED TO 4X4 INTRA LUMA BLOCKS FOR VARIOUS VIDEO SEQUENCES FOR
THE INTRA HIGH EFFICIENCY (HE)10 BITS, INTRA MAIN, RANDOM ACCESS HE10 BITS AND RANDOM ACCESS MAIN SETTINGS IN
HM 7.0. NOTE THAT NEGATIVE BDRATE MEANS COMPRESSION GAIN.
Sequence Name Intra
HE
10
Intra
Main
RA
HE
10
RA
Main
Class A Trafﬁc −0.8 −0.9 −0.4 −0.6
2560x1600 PeopleOnStreet −1.2 −1.3 −0.4 −0.5
Nebuta −0.1 −0.1 0.0 0.0
StreamLocomotive −0.1 −0.1 0.1 0.3
Class B Kimono −0.3 −0.3 −0.1 −0.1
1920x1080 ParkScene −0.8 −0.9 −0.4 −0.5
Cactus −0.8 −0.9 −0.3 −0.4
BasketballDrive −0.5 −0.6 −0.3 −0.3
BQTerrace −0.6 −0.7 −0.1 −0.2
Class C BasketballDrill −1.1 −1.3 −0.5 −0.7
832x480 BQMall −0.9 −1.1 −0.4 −0.4
PartyScene −0.6 −0.6 −0.3 −0.3
RaceHorsesC −0.6 −0.8 −0.3 −0.3
Class D BasketBallPass −1.0 −1.0 −0.5 −0.4
416x240 BQSquare −0.8 −0.9 −0.2 −0.2
BlowingBubbles −0.8 −0.9 −0.3 −0.3
RaceHorsesD −0.9 −1.2 −0.5 −0.5
Class E FourPeople −1.4 −1.5 N/A N/A
1280x720 Johnny −1.0 −1.3 N/A N/A
KristenAndSara −1.7 −2.1 N/A N/A
Summary Average −0.8 −0.9 −0.3 −0.3
Enc. Time (%) 102 101 100 100
Dec. Time (%) 101 101 100 100
HEVC standardization. In these simulations, the comparison was made within (a) HM 7.0 with the mode
dependent DCT/DST scheme disabled, and (b) HM 7.0 with the modedependent DCT/DST scheme
enabled. From these set of test results, the average gain for the Luma component when the DCT/DST
scheme for the Intra High Efﬁciency10 bits and Intra Main settings is 0.8%, and 0.9% respectively.
For the Random Access High Efﬁciency10 bits, and Random Access Main settings, the average gains
are around 0.3 %. Note that unlike H.264/AVC days, most of the tools adopted in the HEVC standard
provide gains only in the range of 0.5% to 1.5%, and not in the range of 510%. The average encoding
and decoding times for the HM codec remain almost the same by using DST as an alternate transform
than DCT, since there is only one condition to be checked (intra prediction mode) and decide whether to
apply the DCT or DST.
Finally, note that in the HEVC standardization, recently a simpliﬁcation of whether to apply DCT or
DST as the horizontal (respectively vertical) transform for DC and Category 1 Oblique modes was adopted
[19]. More speciﬁcally, it was proposed to always use DST as the horizontal and vertical transform. It
was shown that such a simpliﬁcation brings around 0.1 % loss in average BDRates, but does not require
a switching logic to choose between DCT and DST in the codec depending on the intra prediction mode.
This may possibly improve the throughput in a hardware codec.
III. FAST FACTORIZATIONS OF DCT/DST TRANSFORMS
The purpose of this section is twofold. On one hand, we would like to show that Intraprediction
transforms adopted in HEVC standard allow fast factorizations, and show example factorizations for
16
transforms of length N = 4. On the other hand, we also would like to discuss relationships between
DCTII and DSTVII transforms of different sizes and show several possible joint factorizations of such
transforms. Such factorizations may be of interest for efﬁcient hardware implementations and future uses
in video coding.
For background information and prior results regarding factorizations of DCTII and DSTVII transforms
the reader is referred to [3], [5], [8], [10], [20], [21].
New results presented in this section include: a) decomposition of an 2N + 1point DCTII into an
N + 1point DCTVI and N point DSTVII, b) fast algorithms for computing DCTVI and DCTVII
transforms, c) computation of DCTII of composite sizes 2k N(2N + 1) utilizing N point DCTII and
DSTVII as building blocks.
A. Notation
Let N be the length of data sequence. The matrices of Discrete Fourier Transform (DFT) and Discrete
Cosine and Sine transforms of types II, III, VI, and VII will be denoted as follows:
DFT: [FN ]mn = e
−j 2πmn
N , m, n = 0, . . . , N − 1
DCTII:
[
CIIN
]
mn
= cos
(
m(2n+1)π
2N
)
, m, n = 0, . . . , N − 1;
DCTIII:
[
CIIIN
]
mn
= cos
(
(2m+1)nπ
2N
)
, m, n = 0, . . . , N − 1;
DCTVI:
[
CV IN+1
]
mn
= cos
(
m(2n+1)π
2N+1
)
, m, n = 0, . . . , N ;
DCTVII:
[
CV IIN+1
]
mn
= cos
(
(2m+1)nπ
2N+1
)
, m, n = 0, . . . , N ;
DSTVI:
[
SV IN
]
mn
= sin
(
(m+1)(2n+1)π
2N+1
)
, m, n = 0, . . . , N − 1;
DSTVII:
[
SV IIN
]
mn
= sin
(
(2m+1)(n+1)π
2N+1
)
, m, n = 0, . . . , N − 1;
In the above deﬁnitions, we have intentionally omitted normalization constants (such as
√
2/N and
λi = (i = 0)?1/
√
2 : 1 conventionally used in DCTII) as they don’t affect factorization structures of the
transforms. Subindices N or N + 1 indicate lengths of transforms. We follow Z.Wang and B.R.Hunt’s
convention coupling N point DSTVI/VII with N+1point DCTVI/VII [20]. As easily noticed, transforms
of types II and III, as well as VI and VII are closely related:(
CIIN
)T
= CIIIN ;
(
CV IN+1
)T
= CV IIN+1;
(
SV IN
)T
= SV IIN .
B. Relationship between 2N + 1point DCTII, N + 1point DCTVI, and N point DSTVII transforms
In this section we prove the following statement.
Theorem 1. The following holds:
CII2N+1 = P2N+1
⎛
⎝ CV IN+1
SV IIN
⎞
⎠
⎛
⎝ IN JN1
−JN IN
⎞
⎠ (1)
where P2N+1 is a matrix, such that when applied to a vector x, it produces the following sign alterations
and reordering:
x˜2i = xi i = 0, . . . , N,
x˜2i+1 = (−1)ixN+1+i i = 0, . . . , N − 1, (2)
and IN and JN are N ×N identity and orderreversal matrices respectively.
17
yN
yN2
yN1
y0
yN1
y1
yN2
y1
y0
Y1
Y0
Y2
xN
xN+1
xN2
x0
x2N1
x2N
x1
xN1
xN+2
YN2
Y1
YN
YN1
YN1
Y0
X0
X2
X4
X2N2
X2N
X1
X3
X2N1
X2N3
YN3 X2N5x2N2
Y2
X5
y2x2
YN2 X2N2
Npoint
DCTII
Npoint
DCTII
x0
Npoint
DCTII
Npoint
DCTII
Npoint
DCTII
2N+1point
DCTII
2N+1point
DCTII
2N+1point
DCTII
P
o
s
t

a
d
d
it
io
n
s
a
n
d
o
u
t
p
u
t
in
d
e
x
m
a
p
p
in
g
Npoint
DSTVII
Npoint
DSTVII
Npoint
DSTVII
x1
xN(2N+1)2
xN(2N+1)1
X0
X1
xN(2N+1)2
xN(2N+1)1
I
n
p
u
t
in
d
e
x
m
a
p
p
in
g
N+1 – point
DCTVI
N – point
DSTVII








X2
xN(2N+1)3
x2
xN(2N+1)3yN3
Fig. 4. Computing DCTII of composite sizes: (a) split of 2N + 1point DCTII into an N + 1point DCTVI and N point DSTVI; (b)
computation of DCTII of length N(2N + 1).
Proof: Let us consider a 2N + 1long input sequence x = x0, . . . , x2N , and apply DCTII over it:
XCIIk =
2N∑
n=0
xn cos
π(2n+ 1)k
2(2N + 1)
, k = 0, . . . , 2N .
We ﬁrst look at even output values (k = 2i, i = 0,. . . ,N):
XCII2i =
2N∑
n=0
xn cos
π(2n+ 1)2i
2(2N + 1)
=
2N∑
n=0
xn cos
π(2n+ 1)i
2N + 1
.
We can split this sum as follows:
XCII2i =
N∑
n=0
xn cos
π(2n+ 1)i
2N + 1
+
2N∑
n=N+1
xn cos
π(2n+ 1)i
2N + 1
=
N∑
n=0
xn cos
π(2n+ 1)i
2N + 1
+
2N∑
n=N+1
xn cos
π(2(2N + 1)− (2n+ 1))i
2N + 1
=
N∑
n=0
xn cos
π(2n+ 1)i
2N + 1
+
2N∑
n=N+1
xn cos
π(2(2N − n) + 1)i
2N + 1
=
N∑
n=0
xn cos
π(2n+ 1)i
2N + 1
+
N−1∑
n=0
x2N−n cos
π(2n+ 1)i
2N + 1
,
18
which implies that
XCII2i = X
CV I
i +X
′CV I
i
where XCV I is a DCTVI transform over the ﬁrst N + 1 elements of input sequence x, and X ′CV I is a
DCTVI transform over the following input:
x′n =
[
x2N−n, if n = 0, . . . , N − 1,
0, if n = N.
We now turn out attention to the odd output values (k = 2i+ 1, i = 0, . . . , N − 1):
XCII2i+1 =
2N∑
n=0
xn cos
π(2n+ 1)(2i+ 1)
2(2N + 1)
= (−1)i
2N∑
n=0
xn sin
π(2n+ 1 + 2N + 1)(2i+ 1)
2(2N + 1)
= (−1)i
2N∑
n=0
xn sin
π(n+N + 1)(2i+ 1)
2N + 1
= = (−1)i+1
2N∑
n=0
x2N−n sin
π(N − n)(2i+ 1)
2N + 1
We can split this sum as follows:
XCII2i+1 = (−1)i+1
[
N−1∑
n=0
x2N−n sin
π(N − n)(2i+ 1)
2N + 1
+
2N∑
n=N+1
x2N−n sin
π(N − n)(2i+ 1)
2N + 1
]
= (−1)i+1
[
N−1∑
n=0
xN+1+n sin
π(n+ 1)(2i+ 1)
2N + 1
−
N−1∑
n=0
xN−1−n sin
π(n+ 1)(2i+ 1)
2N + 1
]
,
which implies that
XCII2i+1 = (−1)i+1 X˜SV IIi + (−i)i XˆSV IIi
where X˜SV II is an N point DSTVII transform over a sequence
x˜n = xN+1+n, n = 0, . . . , N − 1,
and XˆSV II is an N point DSTVII transform over a sequence
xˆn = xN−1−n, n = 0, . . . , N − 1.
By combining all these mappings we arrive at expression (1).
We present a ﬂowgraph of the resulting mapping between DCTII, DCTVI, and DSTVII transforms
in Figure 4.a. Only permutations and sign changes are needed to convert output of DCTVI, and DSTVII
into DCTII.
19
��
��
��
��
��
��
�
�
��
��
��
��
��
��
��
�
�
��
���
���
( )������ π−
( )������ π−
( )������ π−
( )����� π−
( )������ π−
( )��
�� π
( )��
�� π
( )��
�� π
�������
���
��
��
��
�
��
��
��
��
�
��
��
��
��
��
��
��
�
�
���
���
( )����� π
( )����� π
( )����� π
( )���� π
( )����� π
( )��
�� π
( )��
�� π
( )��
�� π
�������
����
�
�
�
�
�
�
�������
����
�
�
�
� �
�
�
�
�������
��
������
Fig. 5. Flowgraph of Winograd’s factorization of DFT of length 9, and ﬂowgraphs of 9point DCTII, 5point DCTVI, and 4point
DSTVII implied by mappings (1,3).
C. Fast algorithms for computing DCTII, DCTVI, and DSTVII
From (1) we already know that fast computation of DCTVI and DSTVII can be reduced to computing
subsets of DCTII. In turn, thanks to Heideman[8] it is also known that computation of DCTII of odd
numbers is equivalent to computing samelength DFT. Considering length 2N + 1transforms, we can
summarize Heideman’s result as follows:
CII2N+1 = H1
(
[�(F2N+1)]rows 0,...,N
[�(F2N+1)]rows N+1,...,2N
)
H2, (3)
where � (F2N+1) and � (F2N+1) denote real and imaginary parts of the DFT transform matrix of size
2N + 1, and H1 and H2 are permutation and signinversion matrices [8].
In combination with (1) this formula shows that an N + 1point DCTVI, an N point DSTVII and an
2N + 1point DCTII can be computed by mapping to a 2N + 1point DFT.
Many algorithms for fast computing of DFT are readily available [4]. For example, in case of N = 4,
we can use Winograd DFT module of length 9 shown in Figure 5.a. By using mappings (3) and (1) we
adopt it to produce 9point DCTII, 5point DCTVI, and 4point DSTVII. This is shown in Figure 5.b.
We note that all the resulting algorithms are very efﬁcient in terms of multiplicative complexity.
Thus, obtained 9point DCTII requires only 8 nontrivial multiplications. In contrast, the least complex
algorithms for computing DCTII of size 8 (nearest dyadicsize) requires 11 multiplications[12]. The
obtained 4point DSTVII is also very efﬁcient: it uses only 5 multiplications. This DCTVII factorization
is immediately suitable for implementing transform in HEVC video codec.
We show 4point integer transforms adopted in HEVC standard in Figure 6. In same picture, we also
show ﬂowgraphs of the the corresponding DCTII and DSTVII transforms, and modiﬁed versions of these
ﬂowgraphs allowing computation of integer transforms deﬁned in HEVC standard.
D. Fast computation of transforms of lengths 2kN(2N + 1)
It is known that a transform of a composite length N = pq, where p and q are coprime, can be
decomposed into a cascade of p qpoint transforms and q ppoint transforms followed by pq − p− q − 1
additions. This class of techniques is called Prime Factor Algorithms (PFA) [11], [22].
In Figure 4.b, we show how to compute DCTII of length N(2N + 1). This factorization includes
2N + 1 N point DCTII subtransforms, and additionally N 2N + 1point DCTII transforms, which, in
20
(a) (b) (c)
0 0
1 1
2 2
3 3
~128 4
64 64 64 64
83 36 36 83
64 64 64 64
36 83 83 36
IIC
X x
X x
X x
X x
×
⎛ ⎞ ⎛ ⎞⎛ ⎞
⎜ ⎟ ⎜ ⎟⎜ ⎟
− −
⎜ ⎟ ⎜ ⎟⎜ ⎟
= ×
⎜ ⎟ ⎜ ⎟⎜ ⎟
− −
⎜ ⎟ ⎜ ⎟⎜ ⎟
− −⎝ ⎠⎝ ⎠ ⎝ ⎠
�����������


( )31 82 cos π
( )31 82 sin π−( )31 82 sin π
( )31 82 cos π
0x
1x
2x
3x
1
2
1
2
0X
1X
2X
3X
(d) (e) (f)
4
0 0
1 1
2 2
3 3
~128
29 55 74 84
74 74 0 74
84 29 74 55
55 84 74 29
VIIS
X x
X x
X x
X x
×
⎛ ⎞ ⎛ ⎞⎛ ⎞
⎜ ⎟ ⎜ ⎟⎜ ⎟
−
⎜ ⎟ ⎜ ⎟⎜ ⎟
= ×
⎜ ⎟ ⎜ ⎟⎜ ⎟
− −
⎜ ⎟ ⎜ ⎟⎜ ⎟
− −⎝ ⎠⎝ ⎠ ⎝ ⎠
�����������






84
74
74
29
55
0X
1X
2X
3X
0x
1x
2x
3x






0X
1X
2X
3X
0x
1x
2x
3x
( )2 299 sin π
( )2 239 sin π
( )2 199 sin π
( )2 499 sin π
( )2 239 sin π


83−83
36
0x
1x
2x
3x
0X
1X
2X
3X
36
64

64
Fig. 6. Factorizations of 4point HEVC transforms: (a) and (d) show integer transform matrices deﬁned by the standard, (c) and (f) show
factorizations of DSTVII and DCTII transforms, (b) and (e) show how same factorizations can be applied to compute integer transforms.
turn include N point DSTVII as part of their ﬂowgraph. Hence, a system that implements and uses N 
point DCTII and DSTVII, can easily compute an N(2N +1) transform by reusing them. Same principle
more generally applies to computing transforms of lengths 2kN(2N + 1).
We note, that in addition to ability to reuse N point DCTII and DSTVII, transforms of lengths
2kN(2N + 1) may be considerably faster than transforms of nearest dyadic sizes. We illustrate this
in Table 1. Here, we show complexities of factorizations of DCTII of lengths 36 and 72 and show that
they are less complex (in terms of multiplications) that fastest known (FeigWinograd [6]) algorithms for
computing DCTII of lengths 32 and 64 respectively.
TABLE II
COMPARISON OF SEVERAL EXISTING AND PROPOSED FACTORIZATIONS OF DCT AND DST TRANSFORMS.
Transform N Algorithm Features Complexity Matrixvector product
DCTII 4 Loefﬂer, et al. [12] 4 muls, 9 adds 16 muls, 12 adds
DSTVII 4 this paper, [5] 5 muls, 11 adds 16 muls, 12 adds
DCTVI 5 this paper 3 muls, 15 adds 25 muls, 20 adds
DCTII 8 Loefﬂer, et al. [12] 11 muls, 29 adds 64 muls, 56 adds
DCTII 9 this paper, Heideman [8] uses 4point DSTVII 8 muls, 34 adds 81 muls, 72 adds
DCTII 32 Feig,Winograd [6] 76 muls, 209 adds 1024 muls, 992 adds
DCTII 36 this paper uses 4point DSTVII 68 muls, 239 adds 1296 muls, 1260 adds
DCTII 64 Feig,Winograd [6] 184 muls, 513 adds 4096 muls, 4032 adds
DCTII 72 this paper uses 4point DSTVII 163 muls, 587 adds 5184 muls, 5112 adds
IV. CONCLUSIONS
In this paper, we have provided an overview of the DCT/DST transform scheme for intra coding in the
HEVC standard. We have shown that such transforms allow efﬁcient factorizations by utilizing mappings
between DSTVII, DCTII, and DFT. We have also produced several results on joint computation of DCTII
and DSTVII transforms of several sizes, which might be of interest for implementation of future video
coding algorithms.
REFERENCES
[1] G. Bjøntegaard, “Calculation of Average PSNR Differences between RD curves,” ITUT SG16/Q6, VCEGM33, April 2001.
21
[2] F. Bossen, “Common HM test conditions and software reference conﬁgurations,” ITUT and ISO/IEC JCTVCI1100, May 2012.
[3] V. Britanak, K. R. Rao, and P. Yip, Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer
Approximations. Oxford, UK: Academic Press  Elsevier, 2007.
[4] C. Burrus and T. Parks, DFT/FFT and Convolution Algorithms Theory and Implementation. New York: Wiley, 1985.
[5] R. K. Chivukula and Y. A. Reznik, “Fast computing of discrete cosine and sine transforms of types vi and vii,” in Applications of
Digital Image Processing XXXIV, ser. Proc. SPIE, A. G. Tescher, Ed., vol. 8135, 2011, pp. 813 505–1–14.
[6] E. Feig and S. Winograd, “On the multiplicative complexity of discrete cosine transforms (corresp.),” IEEE Trans. Info. Theory, vol.
IT38, pp. 1387 – 1391, 1992.
[7] J. Han, A. Saxena, and K. Rose, “Towards jointly optimal spatial prediction and adaptive transform in video/image coding,” in IEEE
ICASSP, March 2010, pp. 726–729.
[8] M. T. Heideman, “Computation of an oddlength DCT from a realvalued DFT of the same length,” IEEE Trans. Signal Processing,
vol. 40, no. 1, pp. 54–61, 1992.
[9] J. H. Min, S. Lee, IlKoo Kim, WooJin Han, J. Lainema and K. Ugur, “Uniﬁcation of the directional intra prediction methods in
TMuC,” ITUT and ISO/IEC JCTVCB100, July 2010.
[10] A. K. Jain, “A sinusoidal family of unitary transforms,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. PAMI1, no. 4,
pp. 356 –365, Oct. 1979.
[11] B. G. Lee, “Input and output index mapping for a primefactor decomposed computation of discrete cosine transform,” IEEE Trans.
Acoust., Speech, Signal Processing, vol. 37, no. 2, pp. 237 – 244, Feb. 1989.
[12] C. Loefﬂer, A. Ligtenberg, and G. Moschytz, “Practical fast 1D DCT algorithms with 11 multiplications,” in IEEE Int. Conf. Acoust.,
Speech, Signal Processing (ICASSP), vol. 2, May 1989, pp. 988–991.
[13] A. Saxena and F. Fernandes, “Mode Dependent DCT/DST for intra prediction in blockbased image/video coding,” in IEEE ICIP, Sept
2011, pp. 1721–1724.
[14] ——, “On secondary transforms for intra prediction residual,” in IEEE ICASSP, March 2012, pp. 1201–1204.
[15] ——, “Jointly optimal intra prediction and adaptive primary transform,” ITUT and ISO/IEC JCTVCC108, Oct 2010.
[16] ——, “Modedependent DCT/DST for intraprediction in video coding,” ITUT and ISO/IEC JCTVCD033, Jan 2011.
[17] ——, “Modedependent DCT/DST without 4*4 full matrix multiplication for intra prediction,” ITUT and ISO/IEC JCTVCE125,
March 2011.
[18] K. Ugur and O. Bici, “Performance evaluation of DST in intra prediction,” ITUT & ISO/IEC JCTVCI0582, May 2012.
[19] K. Ugur and A. Saxena, “Summary report of Core Experiment on intra transform mode dependency simpliﬁcations,” ITUT & ISO/IEC
JCTVCJ0021, July 2012.
[20] Z. Wang and B. R. Hunt, “The discrete W transform,” Applied Mathematics and Computation, vol. 16, no. 1, pp. 19 – 48, 1985.
[21] S. Winograd, “On computing the Discrete Fourier Transform,” Proc. National Academy of Sciences, USA, vol. 73, pp. 1006–1016,
1976.
[22] P. Yang and M. Narasimha, “Prime factor decomposition of the discrete cosine transform,” in IEEE Int. Conf. Acoust., Speech, Signal
Processing (ICASSP), March 1985, pp. pp. 772–775.
22
InterDigital develops mobile technologies that are at the core of devices, networks, and services worldwide. We solve many of the industry's most critical and complex technical challenges, inventing solutions for more efficient broadband networks and a richer multimedia experience years ahead of market deployment.