The Vault

Fast transforms for Intra-prediction-based image and video coding
Research Paper / Feb 2014

 

Fast transforms for Intra-prediction-based image

and video coding

 

Ankur Saxena, Felix C. Fernandes

Samsung Telecommunications America

 

1301 E. Lookout Dr, Richardson, TX 75082, USA

Email: {asaxena,fernandes}@sta.samsung.com

 

Yuriy A. Reznik

InterDigital Communications

 

9710 Scranton Rd, San Diego, CA 92121, USA

Email: yreznik@ieee.org

 

Abstract

 

In this paper, we provide an overview of the DCT/DST transform scheme for intra coding in the HEVC

standard. A unique feature of this scheme is the use of DST-VII transforms in addition to DCT-II. We further

derive factorizations for fast joint computation of DCT-II and DST-VII transforms of several sizes. Simulation

results for the DCT/DST scheme in the HM reference software for HEVC are also provided together with a

discussion on computational complexity.

 

I. INTRODUCTION

Most image and video-coding standards such as JPEG, MPEG-2, MPEG-4 AVC/H.264, employ block-

 

based transform coding as a framework for efficient image and video compression. The pixel domain data is

transformed to frequency domain using a transform process on a block-by-block basis. For typical images,

most of the energy is concentrated in low-frequency transform coefficients. Following the transform, a

relatively large step-size quantizer can be used for higher-frequency transform coefficients in order to

compact energy more efficiently and to attain better compression. Hence it is required to devise the

optimal transform for each image block to fully decorrelate the transform coefficients. The Karhunen-Loeve

Transform (KLT) is known to be optimal under certain conditions for transform coefficients. However,

practical use of KLT is limited due to its high computational complexity. It is well known that Discrete

Cosine Transform of Type II (DCT-II) provides an attractive alternative to KLT in terms of decorrelation

and energy compaction [3] and performance close to KLT. But with the advent of intra prediction, this is

no longer the case and the optimal transform should adapt to the intra prediction mode.

 

Recently, Han, Saxena & Rose [7] analytically derived Discrete Sine Transform of Type-VII (DST-VII)

with frequency and phase components different from the conventional DCT 1to be the actual KLT along the

prediction direction for intra modes in H.264/AVC. They also showed that if prediction is not performed

along a particular direction, then DCT performs close to KLT. The idea was applied to the vertical

and horizontal modes in intra-prediction in H.264/AVC and a combination of the DST and conventional

DCT was used for the vertical and horizontal modes. Saxena & Fernandes in [13], [15] showed that

a combination of DCT-II and DST-VII is indeed the optimal transform for all the oblique modes in

unified intra prediction [9] in the HEVC standardization. The DCT/DST transform scheme described in

this paper was also adopted in the HEVC standardization for block size 4x4 for Intra Luma blocks [17].

Recently, a simplification of the DCT/DST technology [18] was adopted in HEVC which removes the

mode-dependency of whether to select between DCT or DST as the horizontal (and respectively vertical)

transform, and DST was always used for the 4x4 Intra Luma blocks.

 

In the literature, until recently there were no well-known algorithms for finding fast implementations

of the DST-VII similar to various fast implementations for DCT and DST Types I to IV. Recently,

Chivukula and Reznik have presented a technique for finding fast implementations for DST-VII and

DST-VI (transpose of DST-VII) in [5], where they show that the 4x4 DST transform matrix can be

 

1In the remainder of the paper, unless otherwise specified, DCT and DST are referred to as DCT-II and DST-VII respectively.

 

2013 Data Compression Conference

 

1068-0314/13 $26.00 © 2013 IEEE

DOI 10.1109/DCC.2013.9

 

13

 

 

 

��� ��� ���

 

Fig. 1. Examples of Category 1 oblique modes are shown in (a) and (b) where prediction is performed from one direction only: either the

top row or left column. (c) shows example of Category 2 oblique mode where prediction is performed from both directions top row and left

column.

 

(a) (b)

 

Fig. 2. (a) Vertical Intra prediction mode, and (b) DCT and DST basis vectors. For DCT, the first basis vector is flat. For DST, the first

basis vector is zero at one boundary, and maximum at the other boundary. This mimics the prediction error residue behavior for vertical

mode more efficiently as compared to the DCT basis vector.

 

implemented using 5 multiplications. Note that, by contrast the 4x4 DCT in HM, the Test Model for

HEVC standardization is implemented via 6 multiplications. In this paper, we extend study of factorization

techniques started in [5], and show that for certain sizes DCT-II and DST-VII transforms allow joint

factorizations. Such embedded joint factorizations can be of interest for efficient hardware implementations.

 

The rest of the paper is organized as follows: Sec. II outlines the details of unified intra prediction, and

the discussion about the DCT/DST transform scheme in the HEVC standardization along with experimental

results. Sec. III discusses fast factorization of DCT-II and DST-VII transforms, followed by Conclusions

in Sec. IV.

 

II. INTRA-PREDICTION CODING MODES AND TRANSFORMS IN HEVC STANDARD

A. Review of Unified Intra Prediction

 

The ongoing HEVC standardization uses unified intra prediction in which up to 35 different intra-

prediction modes for the Luma component of the video at a particular block size can be defined. In

general, these 35 intra-prediction modes can be divided into 3 categories as follows:

 

1) Category 1 oblique modes (Fig. 1): Here prediction is performed from the decoded pixels from

either the top row or the left column. The vertical mode and the horizontal mode in [9] are special

cases of this oblique mode when prediction direction is vertical or horizontal respectively.

 

2) Category 2 oblique modes (Fig. 1): Here prediction is performed from both the top row and the left

column pixels.

 

3) DC and Planar modes: These are non-directional modes, and can be used for fitting flat regions in

case of DC mode, or surface fitting in case of Planar mode.

 

14

 

 

 

Tint

DST-VII

 

=

 

⎢⎢⎣

 

29 55 74 84

74 74 0 −74

84 −29 −74 55

55 −84 74 −29

 

⎥⎥⎦ ∼ 128

 

[

2√

 

2N + 1

sin

 

(

π(2i+ 1)(j + 1)

 

2N + 1

 

)]

i,j=0,...,N−1

N=4

 

Tint

DCT-II

 

=

 

⎢⎢⎣

 

64 64 64 64

83 36 −36 −83

64 −64 −64 64

36 −83 83 −36

 

⎥⎥⎦ ∼ 128

 

[√

2

 

N

λ(i) cos

 

(

πi(2j + 1)

 

2N

 

)]

i,j=0,...,N−1

N=4

 

.

 

where : λ(i) =

 

[ 1√

2

, if i = 0

 

0, otherwise

 

Fig. 3. Integer transform matrices defined in HEVC standard and their relationship to 4-point DST-VII and DCT-II transforms.

 

B. Transforms in HEVC for intra prediction modes

In the HEVC, standard, the DCT/DST transform scheme described was adopted in [17] for block size

 

4x4 for Intra Luma blocks. As an example, we briefly describe why DST is the optimal transform rather

than DCT for some prediction modes. Specifically, consider the “Vertical” prediction mode in Fig. 2(a),

which is a special case of Category 1 Oblique modes. Here, after prediction from the top row pixel x0j (or

its reconstruction), the energy in the prediction residues xij (for j ∈ {1, N}, and for i ∈ {1, N}) increases

as we go down the column. Now consider, the basis vectors of DCT-II and DST-VII in Fig. 2(b). For

DCT, the first basis vector is flat, while for DST, the first basis vector increases as we go from left to

right, with zero at the left boundary, and maxima at the right boundary. Hence, the first basis vector of

DST mimics the prediction error residue behavior more efficiently as compared to the DCT basis vector,

and DST would be a better transform to compress such a kind of residue. For a detailed mathematical

derivation of this, we refer the reader to [7] and [13]. We also show the exact integer based approximations

of DCT-II and DST-VII transform matrices being used in the HEVC standardization in Fig. 3.

 

C. Performance of DCT/DST Intra-coding scheme

Here, we present the results when the DCT/DST scheme is applied at size 4x4 in the HEVC Test Model.

 

Note that, additional gains of applying the DCT/DST scheme on 8x8 to 32x32 blocks are minimal given

the high implementation complexity of 8x8 and 32x32 DST in hardware, and can be found in [16]. In fact,

a different category of transforms: secondary transforms can provide further gains for higher order blocks,

and we refer the reader to [14] for more details about the secondary transforms. The DCT/DST scheme

is only applied to Intra Transform Units and Luma component. For Chroma and Inter Transform Units,

the conventional DCT is always used. For our simulations, we encoded full length sequences (which had

150 to 600 frames) at various resolutions varying from 416x240 to 2560x1600. These video sequences

constitute the designated test set for the HEVC standardization. Full details about the GOP size, Intra

period, coding structure of these video sequences etc. are available in [2]. Note that we present the results

for only the evaluation of the DCT/DST as an alternate transform here instead of DCT and retain all the

other testing settings as in [2]. We present the results in HM 7.0, the seventh version of the Test Model

for HEVC standardization for the following 4 settings: Intra High Efficiency-10 bits, Intra-Main, Random

Access High Efficiency-10 bits, and Random Access Main as stipulated in the Common Test Conditions

[2].

 

Table I show the average Bjøntegaard-Delta rate (BD-rate) [1] gains in Luma for the DCT/DST scheme

in HM 7.0. Classes A to E include the various test sequences that are the designated test set for the

HEVC standardization. More details about these sequences is in [2]. For Class E sequences, simulations

were not performed for Random Access conditions as stipulated in the common test conditions for the

 

15

 

 

 

TABLE I

BD-RATE [1] GAINS WHEN THE DCT/DST SCHEME IS APPLIED TO 4X4 INTRA LUMA BLOCKS FOR VARIOUS VIDEO SEQUENCES FOR

 

THE INTRA HIGH EFFICIENCY (HE)-10 BITS, INTRA MAIN, RANDOM ACCESS HE-10 BITS AND RANDOM ACCESS MAIN SETTINGS IN

HM 7.0. NOTE THAT NEGATIVE BD-RATE MEANS COMPRESSION GAIN.

 

Sequence Name Intra

HE-

10

 

Intra

Main

 

RA-

HE-

10

 

RA

Main

 

Class A Traffic −0.8 −0.9 −0.4 −0.6

2560x1600 PeopleOnStreet −1.2 −1.3 −0.4 −0.5

 

Nebuta −0.1 −0.1 0.0 0.0

StreamLocomotive −0.1 −0.1 0.1 0.3

 

Class B Kimono −0.3 −0.3 −0.1 −0.1

1920x1080 ParkScene −0.8 −0.9 −0.4 −0.5

 

Cactus −0.8 −0.9 −0.3 −0.4

BasketballDrive −0.5 −0.6 −0.3 −0.3

BQTerrace −0.6 −0.7 −0.1 −0.2

 

Class C BasketballDrill −1.1 −1.3 −0.5 −0.7

832x480 BQMall −0.9 −1.1 −0.4 −0.4

 

PartyScene −0.6 −0.6 −0.3 −0.3

RaceHorsesC −0.6 −0.8 −0.3 −0.3

 

Class D BasketBallPass −1.0 −1.0 −0.5 −0.4

416x240 BQSquare −0.8 −0.9 −0.2 −0.2

 

BlowingBubbles −0.8 −0.9 −0.3 −0.3

RaceHorsesD −0.9 −1.2 −0.5 −0.5

 

Class E FourPeople −1.4 −1.5 N/A N/A

1280x720 Johnny −1.0 −1.3 N/A N/A

 

KristenAndSara −1.7 −2.1 N/A N/A

Summary Average −0.8 −0.9 −0.3 −0.3

 

Enc. Time (%) 102 101 100 100

Dec. Time (%) 101 101 100 100

 

HEVC standardization. In these simulations, the comparison was made within (a) HM 7.0 with the mode-

dependent DCT/DST scheme disabled, and (b) HM 7.0 with the mode-dependent DCT/DST scheme

enabled. From these set of test results, the average gain for the Luma component when the DCT/DST

scheme for the Intra High Efficiency-10 bits and Intra Main settings is 0.8%, and 0.9% respectively.

For the Random Access High Efficiency-10 bits, and Random Access Main settings, the average gains

are around 0.3 %. Note that unlike H.264/AVC days, most of the tools adopted in the HEVC standard

provide gains only in the range of 0.5% to 1.5%, and not in the range of 5-10%. The average encoding

and decoding times for the HM codec remain almost the same by using DST as an alternate transform

than DCT, since there is only one condition to be checked (intra prediction mode) and decide whether to

apply the DCT or DST.

 

Finally, note that in the HEVC standardization, recently a simplification of whether to apply DCT or

DST as the horizontal (respectively vertical) transform for DC and Category 1 Oblique modes was adopted

[19]. More specifically, it was proposed to always use DST as the horizontal and vertical transform. It

was shown that such a simplification brings around 0.1 % loss in average BD-Rates, but does not require

a switching logic to choose between DCT and DST in the codec depending on the intra prediction mode.

This may possibly improve the throughput in a hardware codec.

 

III. FAST FACTORIZATIONS OF DCT/DST TRANSFORMS

The purpose of this section is two-fold. On one hand, we would like to show that Intra-prediction

 

transforms adopted in HEVC standard allow fast factorizations, and show example factorizations for

 

16

 

 

 

transforms of length N = 4. On the other hand, we also would like to discuss relationships between

DCT-II and DST-VII transforms of different sizes and show several possible joint factorizations of such

transforms. Such factorizations may be of interest for efficient hardware implementations and future uses

in video coding.

 

For background information and prior results regarding factorizations of DCT-II and DST-VII transforms

the reader is referred to [3], [5], [8], [10], [20], [21].

 

New results presented in this section include: a) decomposition of an 2N + 1-point DCT-II into an

N + 1-point DCT-VI and N -point DST-VII, b) fast algorithms for computing DCT-VI and DCT-VII

transforms, c) computation of DCT-II of composite sizes 2k N(2N + 1) utilizing N -point DCT-II and

DST-VII as building blocks.

 

A. Notation

Let N be the length of data sequence. The matrices of Discrete Fourier Transform (DFT) and Discrete

 

Cosine and Sine transforms of types II, III, VI, and VII will be denoted as follows:

 

DFT: [FN ]mn = e

−j 2πmn

 

N , m, n = 0, . . . , N − 1

DCT-II:

 

[

CIIN

 

]

mn

 

= cos

(

 

m(2n+1)π

2N

 

)

, m, n = 0, . . . , N − 1;

 

DCT-III:

[

CIIIN

 

]

mn

 

= cos

(

 

(2m+1)nπ

2N

 

)

, m, n = 0, . . . , N − 1;

 

DCT-VI:

[

CV IN+1

 

]

mn

 

= cos

(

 

m(2n+1)π

2N+1

 

)

, m, n = 0, . . . , N ;

 

DCT-VII:

[

CV IIN+1

 

]

mn

 

= cos

(

 

(2m+1)nπ

2N+1

 

)

, m, n = 0, . . . , N ;

 

DST-VI:

[

SV IN

 

]

mn

 

= sin

(

 

(m+1)(2n+1)π

2N+1

 

)

, m, n = 0, . . . , N − 1;

 

DST-VII:

[

SV IIN

 

]

mn

 

= sin

(

 

(2m+1)(n+1)π

2N+1

 

)

, m, n = 0, . . . , N − 1;

 

In the above definitions, we have intentionally omitted normalization constants (such as

 

2/N and

λi = (i = 0)?1/

 

2 : 1 conventionally used in DCT-II) as they don’t affect factorization structures of the

 

transforms. Sub-indices N or N + 1 indicate lengths of transforms. We follow Z.Wang and B.R.Hunt’s

convention coupling N -point DST-VI/VII with N+1-point DCT-VI/VII [20]. As easily noticed, transforms

of types II and III, as well as VI and VII are closely related:(

 

CIIN

)T

 

= CIIIN ;

(

CV IN+1

 

)T

= CV IIN+1;

 

(

SV IN

 

)T

= SV IIN .

 

B. Relationship between 2N + 1-point DCT-II, N + 1-point DCT-VI, and N -point DST-VII transforms

In this section we prove the following statement.

 

Theorem 1. The following holds:

 

CII2N+1 = P2N+1

 

⎝ CV IN+1

 

SV IIN

 

⎝ IN JN1

 

−JN IN

 

⎠ (1)

 

where P2N+1 is a matrix, such that when applied to a vector x, it produces the following sign alterations

and reordering:

 

x˜2i = xi i = 0, . . . , N,

x˜2i+1 = (−1)ixN+1+i i = 0, . . . , N − 1, (2)

 

and IN and JN are N ×N identity and order-reversal matrices respectively.

 

17

 

 

 

yN

 

yN-2

 

yN-1

 

y0

 

yN-1

 

y1

 

yN-2

 

y1

 

y0

 

Y1

 

Y0

 

Y2

 

xN

 

xN+1

 

xN-2

 

x0

 

x2N-1

 

x2N

 

x1

 

xN-1

 

xN+2

 

YN-2

 

Y1

 

YN

 

YN-1

 

YN-1

 

Y0

 

X0

 

X2

 

X4

 

X2N-2

 

X2N

 

X1

 

X3

 

X2N-1

 

X2N-3

 

YN-3 X2N-5x2N-2

 

Y2

X5

 

y2x2

 

YN-2 X2N-2

 

N-point 

 

DCT-II

 

N-point 

 

DCT-II

 

x0

 

N-point 

 

DCT-II

 

N-point 

 

DCT-II

 

N-point 

 

DCT-II

 

2N+1-point 

 

DCT-II

 

2N+1-point 

 

DCT-II

 

2N+1-point 

 

DCT-II

 

P

o

 

s

t

 

-

a

 

d

d

 

it

io

 

n

s

 

 

a

 

n

d

 

 

o

 

u

t

 

p

u

 

t

 

 

in

d

 

e

x

 

 

m

 

a

p

 

p

in

 

g

 

N-point 

 

DST-VII

 

N-point 

 

DST-VII

 

N-point 

 

DST-VII

 

x1

 

xN(2N+1)-2

 

xN(2N+1)-1

 

X0

 

X1

 

xN(2N+1)-2

 

xN(2N+1)-1

 

I

n

 

p

u

 

t

 

 

in

d

 

e

x

 

 

m

 

a

p

 

p

in

 

g

 

N+1 – point 

 

DCT-VI

 

N – point 

 

DST-VII

 

-

 

-

 

-

 

-

 

-

 

-

 

-

 

-

 

X2

 

xN(2N+1)-3

 

x2

 

xN(2N+1)-3yN-3

 

Fig. 4. Computing DCT-II of composite sizes: (a) split of 2N + 1-point DCT-II into an N + 1-point DCT-VI and N -point DST-VI; (b)

computation of DCT-II of length N(2N + 1).

 

Proof: Let us consider a 2N + 1-long input sequence x = x0, . . . , x2N , and apply DCT-II over it:

 

XCIIk =

2N∑

n=0

 

xn cos

π(2n+ 1)k

 

2(2N + 1)

, k = 0, . . . , 2N .

 

We first look at even output values (k = 2i, i = 0,. . . ,N):

 

XCII2i =

2N∑

n=0

 

xn cos

π(2n+ 1)2i

 

2(2N + 1)

=

 

2N∑

n=0

 

xn cos

π(2n+ 1)i

 

2N + 1

.

 

We can split this sum as follows:

 

XCII2i =

N∑

 

n=0

 

xn cos

π(2n+ 1)i

 

2N + 1

+

 

2N∑

n=N+1

 

xn cos

π(2n+ 1)i

 

2N + 1

 

=

N∑

 

n=0

 

xn cos

π(2n+ 1)i

 

2N + 1

+

 

2N∑

n=N+1

 

xn cos

π(2(2N + 1)− (2n+ 1))i

 

2N + 1

 

=

N∑

 

n=0

 

xn cos

π(2n+ 1)i

 

2N + 1

+

 

2N∑

n=N+1

 

xn cos

π(2(2N − n) + 1)i

 

2N + 1

 

=

N∑

 

n=0

 

xn cos

π(2n+ 1)i

 

2N + 1

+

 

N−1∑

n=0

 

x2N−n cos

π(2n+ 1)i

 

2N + 1

,

 

18

 

 

 

which implies that

 

XCII2i = X

CV I

i +X

 

′CV I

i

 

where XCV I is a DCT-VI transform over the first N + 1 elements of input sequence x, and X ′CV I is a

DCT-VI transform over the following input:

 

x′n =

[

x2N−n, if n = 0, . . . , N − 1,

0, if n = N.

 

We now turn out attention to the odd output values (k = 2i+ 1, i = 0, . . . , N − 1):

 

XCII2i+1 =

2N∑

n=0

 

xn cos

π(2n+ 1)(2i+ 1)

 

2(2N + 1)

 

= (−1)i

2N∑

n=0

 

xn sin

π(2n+ 1 + 2N + 1)(2i+ 1)

 

2(2N + 1)

 

= (−1)i

2N∑

n=0

 

xn sin

π(n+N + 1)(2i+ 1)

 

2N + 1

 

= = (−1)i+1

2N∑

n=0

 

x2N−n sin

π(N − n)(2i+ 1)

 

2N + 1

 

We can split this sum as follows:

 

XCII2i+1 = (−1)i+1

[

N−1∑

n=0

 

x2N−n sin

π(N − n)(2i+ 1)

 

2N + 1

+

 

2N∑

n=N+1

 

x2N−n sin

π(N − n)(2i+ 1)

 

2N + 1

 

]

 

= (−1)i+1

[

N−1∑

n=0

 

xN+1+n sin

π(n+ 1)(2i+ 1)

 

2N + 1

 

N−1∑

n=0

 

xN−1−n sin

π(n+ 1)(2i+ 1)

 

2N + 1

 

]

,

 

which implies that

 

XCII2i+1 = (−1)i+1 X˜SV IIi + (−i)i XˆSV IIi

where X˜SV II is an N -point DST-VII transform over a sequence

 

x˜n = xN+1+n, n = 0, . . . , N − 1,

and XˆSV II is an N -point DST-VII transform over a sequence

 

xˆn = xN−1−n, n = 0, . . . , N − 1.

By combining all these mappings we arrive at expression (1).

 

We present a flowgraph of the resulting mapping between DCT-II, DCT-VI, and DST-VII transforms

in Figure 4.a. Only permutations and sign changes are needed to convert output of DCT-VI, and DST-VII

into DCT-II.

 

19

 

 

 

��

 

��

 

��

 

��

 

��

 

��

 

 

 

 

��

 

��

 

��

 

��

 

��

 

 

 

 

 

 

 

��

 

��

 

 

 

 

��

 

 

 

 

 

 

 

 

 

 

���

 

 

���

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

( )������ π−

 

( )������ π−

( )������ π−

 

( )����� π−

 

( )������ π−

 

( )��

�� π

 

( )��

�� π

 

( )��

�� π

 

�������

���

 

 

��

 

��

 

��

 

 

��

 

��

 

��

 

��

 

 

 

��

 

��

 

��

 

��

 

��

 

��

 

��

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

���

 

 

���

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

( )����� π

 

( )����� π

( )����� π

 

( )���� π

( )����� π

 

( )��

�� π

 

( )��

�� π

 

( )��

�� π

 

 

�������

����

 

 

 

 

 

 

 

 

 

 

 

 

 

�������

����

 

 

 

 

 

 

 

� �

 

 

 

 

 

 

 

 

 

 

 

 

 

 

�������

��

 

������

 

Fig. 5. Flow-graph of Winograd’s factorization of DFT of length 9, and flow-graphs of 9-point DCT-II, 5-point DCT-VI, and 4-point

DST-VII implied by mappings (1,3).

 

C. Fast algorithms for computing DCT-II, DCT-VI, and DST-VII

From (1) we already know that fast computation of DCT-VI and DST-VII can be reduced to computing

 

subsets of DCT-II. In turn, thanks to Heideman[8] it is also known that computation of DCT-II of odd

numbers is equivalent to computing same-length DFT. Considering length 2N + 1-transforms, we can

summarize Heideman’s result as follows:

 

CII2N+1 = H1

 

(

[�(F2N+1)]rows 0,...,N

[�(F2N+1)]rows N+1,...,2N

 

)

H2, (3)

 

where � (F2N+1) and � (F2N+1) denote real and imaginary parts of the DFT transform matrix of size

2N + 1, and H1 and H2 are permutation and sign-inversion matrices [8].

 

In combination with (1) this formula shows that an N + 1-point DCT-VI, an N -point DST-VII and an

2N + 1-point DCT-II can be computed by mapping to a 2N + 1-point DFT.

 

Many algorithms for fast computing of DFT are readily available [4]. For example, in case of N = 4,

we can use Winograd DFT module of length 9 shown in Figure 5.a. By using mappings (3) and (1) we

adopt it to produce 9-point DCT-II, 5-point DCT-VI, and 4-point DST-VII. This is shown in Figure 5.b.

 

We note that all the resulting algorithms are very efficient in terms of multiplicative complexity.

Thus, obtained 9-point DCT-II requires only 8 non-trivial multiplications. In contrast, the least complex

algorithms for computing DCT-II of size 8 (nearest dyadic-size) requires 11 multiplications[12]. The

obtained 4-point DST-VII is also very efficient: it uses only 5 multiplications. This DCT-VII factorization

is immediately suitable for implementing transform in HEVC video codec.

 

We show 4-point integer transforms adopted in HEVC standard in Figure 6. In same picture, we also

show flowgraphs of the the corresponding DCT-II and DST-VII transforms, and modified versions of these

flowgraphs allowing computation of integer transforms defined in HEVC standard.

 

D. Fast computation of transforms of lengths 2kN(2N + 1)

It is known that a transform of a composite length N = pq, where p and q are co-prime, can be

 

decomposed into a cascade of p q-point transforms and q p-point transforms followed by pq − p− q − 1

additions. This class of techniques is called Prime Factor Algorithms (PFA) [11], [22].

 

In Figure 4.b, we show how to compute DCT-II of length N(2N + 1). This factorization includes

2N + 1 N -point DCT-II sub-transforms, and additionally N 2N + 1-point DCT-II transforms, which, in

 

20

 

 

 

(a)                                                                                (b)                                                                                            (c)

 

0 0

 

1 1

 

2 2

 

3 3

 

~128 4

 

64 64 64 64

83 36 36 83

64 64 64 64

36 83 83 36

 

IIC

 

X x

X x

X x

X x

 

×

 

⎛ ⎞ ⎛ ⎞⎛ ⎞

 

⎜ ⎟ ⎜ ⎟⎜ ⎟

− −

 

⎜ ⎟ ⎜ ⎟⎜ ⎟

= ×

 

⎜ ⎟ ⎜ ⎟⎜ ⎟

− −

 

⎜ ⎟ ⎜ ⎟⎜ ⎟

− −⎝ ⎠⎝ ⎠ ⎝ ⎠

 

�����������

 

-

 

-

 

( )31 82 cos π

( )31 82 sin π−( )31 82 sin π

 

( )31 82 cos π

 

0x

 

1x

 

2x

 

3x

 

1

2

 

1

2

 

0X

 

1X

 

2X

 

3X

 

(d)                                                                                (e)                                                                                            (f)

 

4

 

0 0

 

1 1

 

2 2

 

3 3

 

~128

 

29 55 74 84

74 74 0 74

84 29 74 55

55 84 74 29

 

VIIS

 

X x

X x

X x

X x

 

×

 

⎛ ⎞ ⎛ ⎞⎛ ⎞

 

⎜ ⎟ ⎜ ⎟⎜ ⎟

 

⎜ ⎟ ⎜ ⎟⎜ ⎟

= ×

 

⎜ ⎟ ⎜ ⎟⎜ ⎟

− −

 

⎜ ⎟ ⎜ ⎟⎜ ⎟

− −⎝ ⎠⎝ ⎠ ⎝ ⎠

 

�����������

 

-

 

-

 

-

 

-

 

-

 

-

 

84

 

74

 

74

 

29

 

55

 

0X

 

1X

 

2X

 

3X

 

0x

 

1x

 

2x

 

3x

 

-

 

-

 

-

 

-

 

-

 

-

 

0X

 

1X

 

2X

 

3X

 

0x

 

1x

 

2x

 

3x

( )2 299 sin π

 

( )2 239 sin π

( )2 199 sin π

 

( )2 499 sin π

 

( )2 239 sin π

 

-

 

-

 

83−83

 

36

 

0x

 

1x

 

2x

 

3x

 

0X

 

1X

 

2X

 

3X

 

36

 

64

 

-

 

64

 

Fig. 6. Factorizations of 4-point HEVC transforms: (a) and (d) show integer transform matrices defined by the standard, (c) and (f) show

factorizations of DST-VII and DCT-II transforms, (b) and (e) show how same factorizations can be applied to compute integer transforms.

 

turn include N -point DST-VII as part of their flowgraph. Hence, a system that implements and uses N -

point DCT-II and DST-VII, can easily compute an N(2N +1) transform by reusing them. Same principle

more generally applies to computing transforms of lengths 2kN(2N + 1).

 

We note, that in addition to ability to reuse N -point DCT-II and DST-VII, transforms of lengths

2kN(2N + 1) may be considerably faster than transforms of nearest dyadic sizes. We illustrate this

in Table 1. Here, we show complexities of factorizations of DCT-II of lengths 36 and 72 and show that

they are less complex (in terms of multiplications) that fastest known (Feig-Winograd [6]) algorithms for

computing DCT-II of lengths 32 and 64 respectively.

 

TABLE II

COMPARISON OF SEVERAL EXISTING AND PROPOSED FACTORIZATIONS OF DCT AND DST TRANSFORMS.

 

Transform N Algorithm Features Complexity Matrix-vector product

DCT-II 4 Loeffler, et al. [12] 4 muls, 9 adds 16 muls, 12 adds

DST-VII 4 this paper, [5] 5 muls, 11 adds 16 muls, 12 adds

DCT-VI 5 this paper 3 muls, 15 adds 25 muls, 20 adds

DCT-II 8 Loeffler, et al. [12] 11 muls, 29 adds 64 muls, 56 adds

DCT-II 9 this paper, Heideman [8] uses 4-point DST-VII 8 muls, 34 adds 81 muls, 72 adds

DCT-II 32 Feig,Winograd [6] 76 muls, 209 adds 1024 muls, 992 adds

DCT-II 36 this paper uses 4-point DST-VII 68 muls, 239 adds 1296 muls, 1260 adds

DCT-II 64 Feig,Winograd [6] 184 muls, 513 adds 4096 muls, 4032 adds

DCT-II 72 this paper uses 4-point DST-VII 163 muls, 587 adds 5184 muls, 5112 adds

 

IV. CONCLUSIONS

In this paper, we have provided an overview of the DCT/DST transform scheme for intra coding in the

 

HEVC standard. We have shown that such transforms allow efficient factorizations by utilizing mappings

between DST-VII, DCT-II, and DFT. We have also produced several results on joint computation of DCT-II

and DST-VII transforms of several sizes, which might be of interest for implementation of future video

coding algorithms.

 

REFERENCES

[1] G. Bjøntegaard, “Calculation of Average PSNR Differences between RD curves,” ITU-T SG16/Q6, VCEG-M33, April 2001.

 

21

 

 

 

[2] F. Bossen, “Common HM test conditions and software reference configurations,” ITU-T and ISO/IEC JCTVC-I1100, May 2012.

[3] V. Britanak, K. R. Rao, and P. Yip, Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer

 

Approximations. Oxford, UK: Academic Press - Elsevier, 2007.

[4] C. Burrus and T. Parks, DFT/FFT and Convolution Algorithms Theory and Implementation. New York: Wiley, 1985.

[5] R. K. Chivukula and Y. A. Reznik, “Fast computing of discrete cosine and sine transforms of types vi and vii,” in Applications of

 

Digital Image Processing XXXIV, ser. Proc. SPIE, A. G. Tescher, Ed., vol. 8135, 2011, pp. 813 505–1–14.

[6] E. Feig and S. Winograd, “On the multiplicative complexity of discrete cosine transforms (corresp.),” IEEE Trans. Info. Theory, vol.

 

IT-38, pp. 1387 – 1391, 1992.

[7] J. Han, A. Saxena, and K. Rose, “Towards jointly optimal spatial prediction and adaptive transform in video/image coding,” in IEEE

 

ICASSP, March 2010, pp. 726–729.

[8] M. T. Heideman, “Computation of an odd-length DCT from a real-valued DFT of the same length,” IEEE Trans. Signal Processing,

 

vol. 40, no. 1, pp. 54–61, 1992.

[9] J. H. Min, S. Lee, Il-Koo Kim, Woo-Jin Han, J. Lainema and K. Ugur, “Unification of the directional intra prediction methods in

 

TMuC,” ITU-T and ISO/IEC JCTVC-B100, July 2010.

[10] A. K. Jain, “A sinusoidal family of unitary transforms,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 4,

 

pp. 356 –365, Oct. 1979.

[11] B. G. Lee, “Input and output index mapping for a prime-factor decomposed computation of discrete cosine transform,” IEEE Trans.

 

Acoust., Speech, Signal Processing, vol. 37, no. 2, pp. 237 – 244, Feb. 1989.

[12] C. Loeffler, A. Ligtenberg, and G. Moschytz, “Practical fast 1-D DCT algorithms with 11 multiplications,” in IEEE Int. Conf. Acoust.,

 

Speech, Signal Processing (ICASSP), vol. 2, May 1989, pp. 988–991.

[13] A. Saxena and F. Fernandes, “Mode Dependent DCT/DST for intra prediction in block-based image/video coding,” in IEEE ICIP, Sept

 

2011, pp. 1721–1724.

[14] ——, “On secondary transforms for intra prediction residual,” in IEEE ICASSP, March 2012, pp. 1201–1204.

[15] ——, “Jointly optimal intra prediction and adaptive primary transform,” ITU-T and ISO/IEC JCTVC-C108, Oct 2010.

[16] ——, “Mode-dependent DCT/DST for intra-prediction in video coding,” ITU-T and ISO/IEC JCTVC-D033, Jan 2011.

[17] ——, “Mode-dependent DCT/DST without 4*4 full matrix multiplication for intra prediction,” ITU-T and ISO/IEC JCTVC-E125,

 

March 2011.

[18] K. Ugur and O. Bici, “Performance evaluation of DST in intra prediction,” ITU-T & ISO/IEC JCTVC-I0582, May 2012.

[19] K. Ugur and A. Saxena, “Summary report of Core Experiment on intra transform mode dependency simplifications,” ITU-T & ISO/IEC

 

JCTVC-J0021, July 2012.

[20] Z. Wang and B. R. Hunt, “The discrete W transform,” Applied Mathematics and Computation, vol. 16, no. 1, pp. 19 – 48, 1985.

[21] S. Winograd, “On computing the Discrete Fourier Transform,” Proc. National Academy of Sciences, USA, vol. 73, pp. 1006–1016,

 

1976.

[22] P. Yang and M. Narasimha, “Prime factor decomposition of the discrete cosine transform,” in IEEE Int. Conf. Acoust., Speech, Signal

 

Processing (ICASSP), March 1985, pp. pp. 772–775.

 

22