The Vault
High Order Integer Transform Design for Video Compression
Research Paper / Jan 2012
> TCSVT <
1
Abstract—Recent research shows that using a largesize block as the coding unit (CU) and increasing the motioncompensated
prediction (MCP) and transform sizes accordingly can significantly improve the ratedistortion (RD) performance especially for high
definition (HD) video coding. The trend of using large size transforms raises the interest in seeking high performance high order
integer transforms with low computation requirements. This paper presents a set of integer transforms with the sizes 4×4, 8×8, 16×16,
and 32×32. They have high energy compaction capability, almost equal L2 norms of the basis vectors, and are fully factorizable, which
guarantees the development of fast algorithms. Compared with the prior work, the proposed integer transforms have lower complexity
and more flexibility for implementation while providing comparable coding performance.
Index Terms—Integer transform, core transform, HEVC, video coding
I. INTRODUCTION
RANSFORM coding is a fundamental part in video compression systems. It decorrelates the samples to be quantized so that
the commonly used scalar quantization over individual values does not cause too much coding efficiency loss in comparison
with vector quantization. Transform coding also compacts the energy of the samples into a few transform coefficients to
facilitate efficient entropy coding.
The performance of a transform is usually evaluated by transform coding gain (TCG) [1], the maximum gain of transform
coding over Pulsecode Modulation (PCM) coding. TCG is determined by transform type, transform size, and the statistics of the
input signal, e.g., correlation. Regardless of input signal’s statistics, largersize transform provides higher TCG, and Karhunen
Loeve Transform (KLT) is the optimal transform, of which the basis vectors are derived based on the time and spacevariant
second order statistics of the input signal. Given the original video signal (or its residual by predictive coding) as the input,
where the correlation is relatively high, discrete cosine transform (DCT) [2] has basis vectors very close to KLT and is
considered as a fixed approximation of KLT; the ideal transform size should be the same as the frame size to fully exploit the
spatial correlation among pixels.
Usually, the gain of a real transform coder is smaller than the TCG of the transform it uses. TCG is achieved only if the
quantization and entropy coding are both optimal, which, however, is impossible for a real transform coder. In order to fully
demonstrate the performance that the transform can provide, we should not only pursue high TCG while determining the type and
size of the transform, but also consider the practical limitations of the transform coder, including the typical statistics of the input
signal, the design of quantization and entropy coding, the framework of the video coding system upon which the transform coder
is built, and the complexity restrictions due to mainstream hardware capability. How these limitations influence the transform
design is introduced below.
First, the prevailing video compression systems use blockbased hybrid framework, which means each frame is divided into
nonoverlapping blocks and each block is the unit where the hybrid predictive coding and transform coding apply. To enable
coding unit (CU) level ratedistortion (RD) optimization and avoid encoding/decoding delay, the transform size should not be
larger than the CU size. Second, larger transforms generally suffer more from nonoptimal entropy coding, as it is much more
difficult to accurately estimate the nonzero transform coefficients’ positions. However, larger transforms benefit the high and
ultrahigh definition videos, of which the spatial correlation is very high and homogeneous regions are dominant [3]. Larger size
transforms compact the highly correlated signals into fewer transform coefficients distributed in the low frequency domain. The
positions of these transform coefficients are easily modeled and efficiently entropy coded. Third, larger transform has higher
computational complexity. 2D order2N transform requires four times of the arithmetic operations of 2D orderN transform,
whereas the performance is not improved that much. Last but not the least, integer transforms are preferred to realvalued
transforms. Realvalued transform, such as DCT, is of high computational complexity and can cause the encoder and decoder
mismatch, even if implemented using high precision. Integer transform, such as Integer Cosine Transform (ICT) [4], provides
bitexact implementation and significant complexity reduction, and, if well designed, achieves almost the same TCG as DCT. In
Manuscript received Sept. 20, 2012.
The authors are with the CTO office, InterDigital Communications LLC, San Diego, CA (email: jie.dong@interdigital.com; yan.ye@interdigital.com).
High Order Integer Transform Design for
Video Compression
Jie Dong, Member, IEEE, and Yan Ye, Member, IEEE
T
> TCSVT <
2
recent years, integer transform superseded DCT, and has been adopted in modern video coding standards, such as H.264/AVC
[5], VC1 [6], and HEVC [7].
The core transform design for high efficiency video coding (HEVC) [7], the emerging video coding standard developed by the
Joint Collaborative Team on Video Coding (JCTVC) of ITUT/SG16/Q.6/VCEG and ISO/IEC/MPEG, takes all the above
factors into consideration. HEVC is focused on coding high definition (HD) and ultra HD video, in which the area representing
certain content becomes larger than that in lower resolution video. Therefore, compared to existing video coding standards that
primarily use blocks of 16×16 pixels as coding units, HEVC enlarges the CU size up to 2
K
×2
K
(K is a positive integer set in the
sequence parameter set) to save the overhead, such as CU type and motion vector (MV), and scales the prediction and transform
sizes accordingly. The core transform design adopts 2D order16 and order32 integer transforms mainly for smooth regions and
smaller order4 and order8 transforms mainly for detailrich regions.
The trend of using large size transforms raises the interest of seeking high performance high order integer transforms with low
computation requirement with the following desirable features. First, to provide the performance as good as the suboptimal
transform DCT, the basis vectors should resemble DCT’s basis vectors, including small DCT distortion [8] and orthogonality.
Second, the L
2
norms of the basis vectors are (almost) equal, in order to save the additional scaling process used to correct the
different norms of the basis vectors. Third, the transform matrix has an intrinsic symmetry structure, such that the multiplication
with the transform matrix can be fully factorized to enable fast algorithms.
Finding a high order integer transform satisfying all the conditions above is very challenging, and tremendous research effort
has been dedicated to this topic [3] [9][16]. Some work [3] [9][11] is more focused on good energy compaction capability, i.e.,
less DCT distortion, and derives the transform matrices by scaling the DCT matrix elements by a factor and then truncating the
real numbers to the nearby integers based on certain constraints, such as equal L
2
norms or orthogonality. This approach is
applicable for designing an arbitrary order integer transform, but cannot maintain the intrinsic symmetry structure in the DCT
matrix that guarantees the fast algorithm. Other work [3] [12][16] starts from the intrinsic symmetry structure to ensure full
factorization and fast algorithm; however the intrinsic symmetry structure can impose significant constraint when seeking the
basis vectors with high energy compaction capability. For example, with the proposed symmetry structures in [3], [12], and [13],
no integer transform solution with small DCT distortion can be found. The work in [16] is so far the closest to the ideal integer
transform that satisfies all the desired merits; however the cost is that the magnitudes of the transform matrix elements are very
large, i.e., represented by 14 bits, and the forward order32 transform cannot be realized even by 32bit integer arithmetic.
In this paper, we propose approaches to derive order4, order8, order16, and order32 integer transforms, which satisfy all
the desired merits except orthogonality. The symmetry structure of the transform matrix that guarantees the fast algorithm is first
developed. With this symmetry structure, the elements in the transform matrices are carefully selected, such that the basis vectors
have almost the same L
2
norms and very small DCT distortion, and the transform error introduced by nonorthogonality is
negligible. Four specific integer transforms with the sizes 4×4, 8×8, 16×16, and 32×32 are proposed as the core transform design
for HEVC. Compared with the core transform adopted in HEVC [11] and other contributions [15][16], the proposed integer
transforms have lower complexity and more flexibility for implementation, while providing comparable coding performance.
The remainder of the paper is organized as follows. Section II presents the general matrices of the proposed integer
transforms. In Section III, a specific set of integer transforms are designed based on these general matrices, and proposed as the
core transform for HEVC. Section IV reports the experimental results, followed by the conclusion in Section V.
II. GENERAL MATRICES OF THE PROPOSED INTEGER TRANSFORM
The DCT transform matrix can be fully factorized, which ensures the development of fast algorithm. Many factorization
algorithms have been proposed in the literature, which exploit different symmetry structures of the DCT matrix. Some
algorithms [19] are not every efficient in terms of arithmetic operations, but can be systematically generalized to any high order
DCT. Some algorithms require fewer computations, but are only applicable for certain order DCTs. For example, the fast
algorithm proposed in [18] is only for order8 and order16 DCTs. In this paper, the proposed integer transforms retain the
symmetric structures of DCT, such that they can be factorized in the same way. Like DCT, the proposed 2D integer transform is
separable, which can be accomplished by applying two 1D integer transforms sequentially. Therefore, in the following, we only
discuss the 1D integer transform design.
A. EvenOdd Part Decomposition
Denote an order2N DCT matrix as
( is a positive integer). The basis vectors of
have alternating even
and odd symmetries and thus can be decomposed to even and odd parts, where the even part is identical to
scaled by a
factor
√
, as shown in (1).
[
√
] [
] (1)
> TCSVT <
3
In (1),
, an N×N matrix, is the odd part of
, is the orderN identity matrix and is the opposite diagonal identity
matrix, obtained by flipping the columns of horizontally. This property is kept in the integer transform design, as shown in
(2),
[
] [
] (2)
where is the transform matrix of the order2N integer transform and its even and odd parts are denoted as and ,
respectively. To construct a larger integer transform in such a recursive process, , the initial 2×2 matrix, is defined as in (3).
Note that the scaling factor is used to ensure the L
2
norms of the basis vectors in are (almost) equal.
[
] (3)
Denote [ ] as the input signal to the order2N transform and [ ] as the output. The
vector comprising of the even points of , denoted as [ ], is determined only by the even part ;
likewise, the vector comprising of the odd points, [ ], is determined only by the odd part.
As the even part of the transform matrix is constructed in such a recursive way, the problem of integer transform design is
narrowed down to designing the odd part. In order4, order8, order16, and order32 DCT matrices, the odd parts are
represented by (4) to (7), respectively, in which the values of the matrices’ elements are shown in TABLE I.
[
] (4)
[
] (5)
[
]
(6)
[
]
(7)
TABLE I
VALUES OF THE ELEMENTS IN ORDER4, ORDER8, ORDER16, AND ORDER32 DCT MATRICES
[ ] [
√
√
]
[ ] [
]
[ ] [
√
√
√
√
……
√
√
]
[ ] [
……
]
> TCSVT <
4
In integer transform design, the general matrix of the odd part is borrowed from DCT with the matrix elements’ relative
magnitudes unchanged. A simple way to obtain the odd part of an integer transform is to scale the elements in the same order
DCT matrix by a factor, and truncate the scaled real numbers to the nearby integers based on certain constraints. However, this
approach cannot maintain the intrinsic symmetry structure in the DCT matrix that guarantees the fast algorithm. In Sections II.B,
II.C, and II.D, we propose approaches to deriving the odd parts of integer transforms from order4 up to order32, where DCT
matrix’s intrinsic symmetry structure is incorporated. Note that, the odd part of order4 integer transform does not need further
factorization, since its size is 2×2, as show in (4).
B. Odd Part of Order8 Integer Transform
The odd part of order8 DCT, as shown in (5), can be rewritten as in (8), only if the elements [ ] are defined as in
the second row of TABLE I [18], and (8) can be factorized to three sparser matrices. The equivalence of (5) and (8) does not
hold, when the realvalued numbers [ ] are replaced with integers. The reason is obvious because (5) becomes an
integer matrix, whereas the first and fourth rows of (8) still have realvalued elements. This means that the integer transform
matrix obtained by scaling [ ] by a factor and rounding cannot be factorized in the same way as DCT matrix.
[
√
√
√
√
√
√
√
√
]
(8)
In order to keep the factorizable structure in (8) and provide more freedom in searching good matrix elements that compose of
DCTlike basis vectors, we generalize (8), and propose the odd part of order8 integer transform and its factorized form in (9)
and (10), respectively,
[
] (9)
[
] [
] [
] (10)
where the elements [ ] are all integers. To make resemble
in terms of the relative magnitudes among
matrix elements, the following three constraints should be imposed.
C. Odd Part of Order16 Integer Transform
The derivation of the odd part of order16 integer transform is similar to that for the order8 one. Based on [18], the odd part
of order16 DCT shown in (6) is rewritten as in (11).
[
√
√
√
√
√
√
√
√
√
√
√
√
√
√
√
√
]
(11)
Extending (11) to a more general matrix that is also applicable for integer elements, we propose the odd part of order16 integer
transform and its factorized form in (12) and (13), respectively, where [ ] are all integers.
Relation 1
1. ⁄ ⁄ ⁄ and ⁄ ⁄ ⁄
2. ⁄ ⁄ ⁄
3. and
> TCSVT <
5
[
]
(12)
88
88
88
88
88
88
88
88
88
88
88
8
8
88
88
88
8
000000
000000
000000
000000
000000
000000
000000
000000
10010000
01100000
01100000
10010000
00001001
00000110
00000110
00001001
10000010
01000001
00110000
00110000
00001100
00001100
10000010
01000001
000000
000000
000000
0000000
0000000
000000
000000
000000
de
fc
bg
ha
ah
gb
cf
ed
ll
ji
ij
k
k
ji
ij
ll
P
(13)
The following two conditions need to be satisfied to ensure the relative magnitudes among the matrix elements remain similar to
those in a DCT matrix.
D. Odd Part of Order32 Integer Transform
The derivation of the odd part of order32 integer transform and its factorization is inspired by Chen’s DCT fast algorithm
[19], as the LLM fast algorithm [18] used in Sections II.B and II.C are only for order8 and order16 DCT. Chen’s algorithm is
not the most efficient in terms of arithmetic operations, but provides a systematic approach to developing the fast algorithm for
order2
K
DCT, where K can be arbitrarily large. In [19], the odd part of an orderN DCT is factorized into ( ) matrices
with the size 16×16 and each matrix can be calculated by a closeform expression. Hence, the odd part of order32 DCT is
factorized to seven matrices, as expressed by (14).
(14)
In practice, transforming a 1D 16point signal by (14) is realized by seven serial stages with each stage accomplishing the
function of one matrix multiplication, and therefore, the latency is very high.
To reduce the latency, we propose to factorize the odd part of an order32 integer transform into five integer matrices, as
shown in (15),
(15)
where is shown in (16). In (16), and are exactly the same as and defined in [19], since they
are binary matrices; and keep the same sign distribution and relative magnitudes among elements as in and ,
respectively, as long as the following two conditions are satisfied.
can be expressed as the product of three integer matrices, , , and , as shown in (17), and (18)(24) shows the definitions
of the three matrices.
(17)
[
] (18)
Relation 3
1. √
2. (
) (
)
Relation 2
1.
2. (
) (
)
√
> TCSVT <
6
[
] [
] (19)
[
] [
] (20)
[
] [
] (21)
[
] [
] (22)
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
2
1
2
1
1
2
1
2
0
0
0
0
2
1
2
1
1
2
1
2
0
0
1234
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
MMMM
(16)
> TCSVT <
7
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
Y
(23)
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
Z
(24)
and are both binary matrices with only one nonzero element in each row/column, meaning they are used for reordering
the 1D 16point data being processed in the serial stages (15). is the only effective matrix in (17), and multiplying with can
be realized by multiplying with four 4×4 integer matrices, which in total reduces the arithmetic operations to one fourth.
Actually, can be further factorized into three sparser matrices, such that is fully factorized to seven matrices as for
in
(14). By doing this, the computation requirement is further reduced, but the latency becomes unacceptable for practical
implementation. Therefore, factorizing by (15) is proposed to balance these two aspects. The intermediate numbers
[ ], as in (19) to (22), should satisfy the following two relations, in order to keep the relative
magnitudes of DCT unchanged.
1.
2. (
) (
) (
) (
)
Consequently, the actual integer elements [ ], [ ], [ ], and
[ ] in , , , and , respectively, have the following four relations.
Relation 4
1.
2.
3.
4.
> TCSVT <
8
III. PROPOSED CORE TRANSFORM DESIGN FOR HEVC
The proposed core transform design for HEVC includes a set of order4, order8, order16, and order32 integer transforms.
They use the factorizable general transform matrices as presented in Section II, and thus the fast algorithms can be easily
developed. This section introduces how to determine the values of the variables used to represent the general matrices.
To ensure that the energy compaction capability of the integer transform is comparable to DCT, the relative magnitudes
among the transform matrices’ elements should approximate those of DCTs, as specified by Relations 1 to 4 in Sections II.B to
II.D. We first scale the realvalued numbers on the right side of the Relations by certain factors, and then use the scaled values as
the starting points to search the integers in a certain neighborhood. The best set of the integers for the left side of the Relations
achieves the balance of three important aspects: the variance of the L
2
norms of the basis vectors, the deviation from DCT, and
the transform error introduced by nonorthogonality.
Given the transform matrix of an orderN integer transform, denoted as , the transformation process is shown in (25), where
and represent the input and output signals, respectively.
(25)
The L
2
norm of the ith basis vector of is calculated by (26), and the L
2
norms of all the basis vectors are denoted as a vector
[ ]
.
√∑
(26)
For energy conservation, should be normalized by elementwise dividing by a scaling matrix (
, which is an
additional process in comparison to DCT. However, when the L
2
norms are very close or even equal to each other, becomes a
constant matrix, and the elementwise matrix division can be approximated by a scalar matrix division, which can be combined
with the following scalar quantization. Hence, making the L
2
norms almost equal can save the normalization step and is a highly
desired feature in integer transform design.
Deviation from DCT is measured by DCT distortion [8], calculated by (27),
‖
‖
(27)
where denotes the main diagonal of matrix , and is the normalized transform matrix, obtained by (28).
(28)
This criterion reflects the signal energy that leaks out of DCT subbands. A smaller value of DCT distortion means less deviation
from DCT and thus implies less performance degradation compared with DCT.
Orthogonality is a strong condition in designing integer transforms. The magnitudes of the elements in the transform matrix
have to be very large especially for high order transforms to fulfill orthogonality, and can become even larger if other constrains
are imposed at the same time. If orthogonality is not fulfilled, the average energy of the reconstruction error in the spatial domain
is larger than that of the quantization error in the transform domain, and the difference is known as transform error, denoted as
. In this paper, orthogonality is not one of the design constraints, but the elements of the transform matrices are well selected,
such that the transform error
is negligible. According to [3], the transform error of an orderN integer transform can be
calculated by (29),
∑ ∑  
(29)
where
is the variance of the input source, and is the error matrix, defined by (30).
(30)
We conducted bruteforce search with all the aforementioned aspects taken into account, and the best solution, as shown in
TABLE II, was proposed as the core transform design for HEVC [17]. The specific odd parts, obtained by substituting the
parameter values in TABLE II into the general matrices introduced in Section II, are shown in (31) to (34). Fig. 1 shows the data
flow of the fast algorithm for the forward order32 integer transform. Since the orderN integer transform is used as the even part
of the order2N one, as introduced in Section II, its fast algorithm is also shown as part of the fast algorithm of order2N
transform in the same recursive manner.
[
] (31)
> TCSVT <
9
[
] (32)
[
]
(33)
[
]
(34)
IV. EXPERIMENTAL RESULTS
The proposed core transform design, as presented in Section III, was integrated into HEVC’s test model HM4.0 [20], using
16bit integer arithmetic, and tested under the common test conditions [21] and the special test conditions defined by the core
experiment group 10 (CE10) [22]. The common test conditions are mainly focused on HD and ultra HD video sequences; the
mandatory part includes five classes of video sequences in terms of resolutions, i.e., Class A 2560×1600, Class B 1080p, Class C
832×480, Class D 416×240, and Class E 720p. Each sequence is coded by three types of coding structures: all intra (AI), random
access (RA), and low delay (LD), and two configurations of coding tools: high efficiency (HE) and low complexity (LOCO).
With certain coding structure and coding tool configuration, a sequence is coded at four QPs (22, 27, 32, and 37) to generate an
operational RD curve over a wide range of bitrate, and the average bitrate reduction compared with HM4.0, known as BDrate
[23], is calculated and used to measure the RD performance of a proposed coding tool. The special test conditions defined by
CE10 require testing the RD performance at low QPs (1, 5, 9, and 13) and high QPs (36, 42, 47, and 51) to see if there is
unusual behavior at these uncommon test points.
TABLE II
PARAMETER VALUES FOR THE GENERAL TRANSFORM MATRICES PROPOSED AS THE CORE TRANSFORM DESIGN FOR HEVC
Order4 167 70
Order8 3 2 5 1 144 99 72 99 1 2
Order16
60 57 53 46 39 28 18 5
10 22 24 17
Order32
7 5
13 5 12
1 5 31 32 23 19 26 21
31 13 28 8 16 30 11 27
8 21 23 31 27 2 32 16
30 19 25 10 29 31 5 13
128 2 4 2
> TCSVT <
10
With the test conditions LD and HE, the BDrate of the proposed core transform design, together with the BDrates of other
CE10related contributions discussed in JCTVC, is shown in TABLE III. For the performance under other test conditions,
interested readers are referred to [11], [15], [16], and [17] for detailed data. As can be seen, the performance among all the core
transform designs for HEVC is less than 1% BD bitrate difference, no matter what the QP is. Similar results can be observed
under other test conditions. Therefore, the features that the existing core transform designs have and the number of arithmetic
operations they require, which are compared in TABLE IV and TABLE V respectively, become important factors in evaluation.
99
99
144
144
99
99
5
5
12
12
12
12
5
5
55
5
5
5
5
5
5
5
5
5
5
5
5
22
22
10
22
22
10
17
17
1760
18
53
39
53
39
57
57

23
5
5
3
1
2
167
167
18
60






















70
70

>>1
>>1
46
28
5
28
5
46








24
24
17
10
10
<<1
<<1
<<1
<<1
<<2
<<2
<<2
<<2
<<2
<<2
<<2
<<2
<<1
<<1
<<1
<<1
<<1
<<1
<<1
<<1
<<1
<<1
<<1
<<1
<<1
<<1
<<1
<<1








7
7
7
7
7
7
7
7
5
5
5








13
13
13
13
5
5
12
12
12
12
5








n0
n1
n2
n3
n4
n5
n6
n7
n8
n9
n10
n11
n12
n13
n14
n15
n31
n30
n29
n28
n27
n26
n25
n24
n23
n22
n21
n20
n19
n18
n17
n16
X1
X2
X3
X4
u0
u8
u16
u24
u4
u20
u12
u28
u10
u22
u14
u30
u2
u18
u26
u6
u5
u27
u21
u11
u13
u19
u29
u3
u1
u31
u17
u15
u9
u23
u25
u7
13
13
13
13
144
144
<<7
<<7
Fig. 1 Fast algorithm for the forward order32 integer transform with the fast algorithms for the forward order4, order8, and order16 integer transforms
embedded.
> TCSVT <
11
As shown in TABLE IV, all the core transform designs have small DCT distortion, and therefore their performance all
approaches that of the DCT’s, as evidenced in TABLE III. The core transform design adopted in HEVC cannot be fully
factorized, which means it does not have a fast algorithm, and thus the required arithmetic operations are much higher than other
designs especially for high order transforms (see TABLE V). In the other designs, Joshi’s [15] still has higher complexity,
because its basis vectors have quite different L
2
norms, which need to be corrected by an additional scaling process. Furthermore,
although the core transform design in [15] supports implementation by a fast algorithm, it does not have an equivalent realization
using matrix multiplication. This limits the flexibility of its implementation. Alshina’s design [16] has all the desired features.
However, the symmetry structure proposed in [16] is not as efficient as the structure proposed in this paper. As a result, it needs
many more multiplications for high order transforms and the magnitudes of the elements in the transform matrices are larger than
those in the proposed design. No matter what the transform size is, the elements are represented by 14 bits, and the forward
order32 transform cannot even be implemented using 32bit integer arithmetic. The core transform design proposed in this paper
achieves a good balance of RD performance and computation requirement, and allows more flexibility for implementation.
TABLE III
COMPARISONS OF THE EXISTING CORE TRANSFORM DESIGNS IN BDRATE REDUCTION (%)
Class Sequence
Normal QP Low QP High QP
Joshi’s Alshina’s Proposed Joshi’s Alshina’s Proposed Joshi’s Alshina’s Proposed
B
Kimono 0 0 0.1 0 0.2 0.2 0.3 0.3 0.1
ParkScene 0 0.1 0.1 0.1 0.1 0.1 0 0.4 0.1
Cactus 0 0 0 0.1 0.1 0.1 0.1 0.1 0
BasketballDrive 0 0 0.1 0.3 0 0.1 0 0.1 0
BQTerrace 0 0 0.1 0.2 0 0.1 0.2 0.3 0.5
C
BasketballDrill 0.1 0.3 0.2 0.2 0.1 0 0.3 0.1 0.2
BQMall 0 0.1 0 0.2 0 0.1 0.1 0.4 0.1
PartyScene 0 0 0.1 0.2 0 0.1 0.2 0 0.2
RaceHorses 0 0 0 0.1 0.1 0.1 0.1 0.1 0.3
D
BasketballPass 0 0 0.1 0.3 0 0 1.3 1.4 0.2
BQSquare 0.1 0 0 0.2 0 0 0.2 0.3 0.3
BlowingBubbles 0.1 0 0 0.2 0 0.1 0.1 0.4 0.6
RaceHorses 0 0 0.1 0.2 0.1 0 0.1 0.1 0.2
E
Vidyo1 0.2 0.1 0 0.4 0.1 0 1.9 0.6 0.1
Vidyo3 0.1 0.1 0.1 0.4 0 0.1 0.8 0.1 0.8
Vidyo4 0 0.2 0.3 0.4 0.1 0 0.4 0 0.6
Average 0 0.03 0.08 0.22 0.06 0.07 0.22 0.19 0.16
TABLE IV
COMPARISON OF THE FEATURES OFFERED BY EACH EXISTING CORE TRANSFORM DESIGN
Small DCT
distortion
Full
factorization
No scaling
process
Negligible
transform error
Flexible
implementation
HEVC core transform [7] x x x x
Joshi’s [15] x x x
Alshina’s [16] x x x x x
Proposed x x x x x
TABLE V
Numbers of Additions and Multiplications for NPoint 1D Transform ( N = 4, 8, 16, and 32 )
HEVC core
transform
Joshi’s Alshina’s Proposed
4point
Addition 8 9 9 9
Multiplication 6 3 3 3
8point
Addition 28 26 29 31
Multiplication 22 12 11 11
16point
Addition 100 72 81 93
Multiplication 86 36 31 21
32point
Addition 372 186 229 279
Multiplication 342 92 87 56
> TCSVT <
12
V. CONCLUSION
This paper presents a set of integer transforms with the sizes 4×4, 8×8, 16×16, and 32×32. They have high energy compaction
capability and almost equal L
2
norms of the basis vectors, and are factorizable, which guarantees the development of fast
algorithm. Compared with the prior work, the proposed integer transforms have lower complexity and more flexibility for
implementation while providing comparable coding efficiency.
REFERENCES
[1] N. S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video. Englewood Cliffs, NJ: PrenticeHall, 1984.
[2] N. Ahmed, T. Natarajan, K.R. Rao, “Discrete cosine transfom,” IEEE Transactions on Computers, vol. 23, no. 1, pp. 9093, Jan. 1974.
[3] J. Dong, K. N. Ngan, C. K. Fong, and W. K. Cham, “2D order16 integer transforms for HD video coding,” IEEE Trans. Circuits Syst. Video Technol., vol.
19, no. 10, pp. 1463–1474, Oct. 2009.
[4] W. K. Cham, “Development of integer cosine transforms by the principle of dyadic symmetry,” IEE Proc., Part I, vol. 136, no. 4, pp. 276–282, Aug. 1989.
[5] Advanced Video Coding for Generic Audiovisual Services, ITUT Rec. H.264 and ISO/IEC 1449610 AVC, Mar. 2005.
[6] S. Srinivasan et al., “Windows Media Video 9: overview and applications,” Signal Process.: Image Commun., pp. 851875, 2004.
[7] B. Bross, W.J. Han, G. J. Sullivan, J.R. Ohm, and T. Wiegand, “High efficiency video coding (HEVC) text specification draft 7”, document JCTVC
I1003, Joint Collaborative Team on Video Coding, May 2012.
[8] M. Wien and S. Sun, “ICT comparison for adaptive block transforms,” document VCEGL12, ITUT SG16/Q6, Jan. 2001.
[9] W. K. Cham and Y. T. Chan, “An order16 integer cosine transform,” IEEE Trans. Signal Process., vol. 39, no. 5, pp. 12051208, May 1991.
[10] P. Chen, Y. Ye, and M. Karczewicz, “Video coding using extended block sizes,” document T09SG16C0123, ITUT Q.6/SG16, Geneva, Switzerland,
Jan. 2009.
[11] A. Fuldseth, G. Bjøntegaard, and M. Budagavi, “CE10: core transform design for HEVC,” document JCTVCG495, Joint Collaborative Team on Video
Coding, Nov. 2011.
[12] S. Ma and C.C. Kuo, “Highdefinition video coding with supermacroblocks,” in Proc. SPIE Vis. Commun. Image Process., vol. 6508. Jan. 2007, pp.
6508161–65081612.
[13] C. K. Fong and W. K. Cham, “Simple order16 integer transform for video coding,” in Proc. IEEE Int. Conf. Image Process., pp. 161164, Sept. 2010.
[14] C. K. Fong and W. K. Cham, “LLM integer transform and its fast algorithm,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 6, pp. 844854, Jun.
2012.
[15] R. Joshi, J. Sole, and M. Karczewicz, “CE10: Scaled integer transforms supporting recursive factorization structure,” document JCTVCG579, Joint
Collaborative Team on Video Coding, Nov. 2011.
[16] E. Alshina et al., “CE10: full factorization core transforms for HEVC,” document JCTVCG737, Joint Collaborative Team on Video Coding, Nov. 2011.
[17] J. Dong and Y. Ye, “NonCE10: core transform design for HEVC”, document JCTVCG272, Joint Collaborative Team on Video Coding, Nov. 2011.
[18] C. Loeffler, A. Ligtenberg, and G. S. Moschytz, “Practical fast 1D DCT algorithms with 11 multiplications,” in Proc. IEEE Int. Conf. Acoust. Speech
Signal Process., vol. 2. May 1989, pp. 988–991.
[19] W. H. Chen, C. H. Smith, and S. C. Fralick, “A fast computational algorithm for the discrete cosine transform”, IEEE Trans. Commun., vol. 25, no. 9,
pp.10041009, Sept. 1977.
[20] https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM4.0/.
[21] F. Bossen, “Common test conditions and software reference configurations”, document JCTVCF900, Joint Collaborative Team on Video Coding, Jul.
2011.
[22] P. Topiwala, M. Budagavi, R. Joshi, A. Fuldseth, I. Kim, “CE10: Core Transform Design,” document JCTVCF910, Joint Collaborative Team on Video
Coding, Jul. 2011.
[23] G. Bjontegaard, “Calculation of average PSNR differences between RDcurves,” document VCEGM33, ITUT SG16/Q6, Apr. 2001.