A High Speed Architecture for Lifting-based 2-D Cohen-Daubechies-Feauveau (5,3) Discrete Wavelet Transform used in JPEG 2000

For real-time applications, an efficient VLSI implementation of the DWT is desired. This paper presents a DWT architecture based on retiming for pipelining and unfolding. The architecture is built on the lifting-based one-dimensional Cohen-Daubechies-Feauveau (CDF) (5,3) wavelet filter, which is easily extended to a 2-D implementation. It consists of low-complexity, easily repeatable components. The focus is on minimizing the critical path and optimizing throughput at the same time. The architecture has been implemented on a Xilinx Virtex-6 FPGA platform. The implementation results show that the critical path delay is reduced by a factor of four to five while throughput is doubled, making the overall architecture approximately ten times faster than the conventional lifting-based DWT architecture. Furthermore, the parallel implementation doubles the throughput without any increase in the number of row buffers, so the architecture is memory efficient as well. The even and odd rows of the image are scanned in parallel. The 2-D DWT of a 15-megapixel image takes 16.86 ms, which means that 59 images of that size can be processed in one second. The architecture can therefore be used for real-time video processing, even for high-resolution video.


I. INTRODUCTION
The discrete wavelet transform (DWT) has replaced the discrete cosine transform (DCT) in image coding standards such as JPEG2000 because it supports progressive image transmission, multi-resolution representation, easy manipulation of compressed images, region-of-interest coding, etc. Traditionally, the DWT was implemented using convolution. Such an implementation requires both a large number of computations and a large amount of storage, which are undesirable for any high-speed or low-power application. A new mathematical formulation that replaces the convolution-based wavelet transform [1]-[4] has been proposed by Sweldens [5], [6], namely the lifting-based wavelet transform. The main feature of the lifting-based scheme is that it breaks up the high-pass and low-pass wavelet filters into a sequence of smaller filters, which in turn can be converted into a sequence of upper and lower triangular matrices. The idea behind the lifting scheme is to use data correlation to remove redundancy. The advantages of this reformulation of the DWT include "in-place" computation of the DWT, the integer-to-integer wavelet transform (IWT), symmetric forward and inverse transforms, suitability for parallelism and many more [7]. The lifting scheme decomposes every DWT operation into a sequence of lifting steps. The basic steps involved are splitting, predicting and updating [8]. Further, the filters may be classified into 2M, consisting of one predict and one update step, and 4M, consisting of two predict and two update steps.

Manuscript received October 7, 2016; revised January 31, 2017.
M. Rafi is with the Department of Electronics and Communication Engineering, National Institute of Technology, Srinagar 190 006 India (phone: +91-9622-515357; e-mail: mohdrafi 11phd13@nitsri.net).
Najeeb-ud-Din is with the Department of Electronics and Communication Engineering, National Institute of Technology, Srinagar 190 006 India (e-mail: najeeb@nitsri.net).
The state-of-the-art compression standard, JPEG2000, has motivated the development of efficient wavelet transform architectures. At present, JPEG2000 uses the Cohen-Daubechies-Feauveau (5,3) and (9,7) wavelet filters for lossless and lossy compression respectively [9]-[11]. The CDF (5,3) wavelet filter is 2M based, while CDF (9,7) is 4M based. The notation (5,3) indicates that the low-pass and high-pass filters have 5 and 3 taps respectively. Since (5,3) and (9,7) are used in JPEG2000, a lot of work has been done on their efficient implementation. The parameters under consideration are speed, throughput, computational complexity and memory reduction. A few papers have addressed reduction of the critical path as well.
The advantages of the lifting scheme over the convolution-based approach have been presented in [12] for the (9,7) wavelet. A survey of different VLSI architectures for lifting-based DWT is given in [13]. Architectures that reduce memory accesses and hardware complexity have been proposed in [14]-[17]. Throughput optimization has always involved a trade-off with architectural area. Architectures for throughput optimization have been proposed in [16], [17]. Speed can also be increased by reducing the critical path delay of a design, but critical path reduction is always obtained at the cost of an increase in latency and in the number of registers. Critical path minimization has been addressed in [18], [19]. Most of these architectures are implemented on FPGAs using VHDL synthesis; however, MATLAB/Simulink/Xilinx System Generator can also be used for the same purpose [20], which helps portability and rapid time-to-market of the architecture. Many of the papers have aimed at either multiplier-less architectures or shift-based multipliers [14]. Other wavelet filters that are frequently considered for optimization are Daubechies-4 (D4), Daubechies-6 (D6), Daubechies-8 (D8), CDF (2,2), (6,10), etc. Universal embedded hardware implementations of a variety of wavelet kernels have been presented in [8], [21]. These implementations are based either on a parallel architecture for each kernel or on a processing element (PE). The parallel method implements multiple wavelet kernels in parallel, which increases speed at the cost of some extra hardware. In the processing element method, resources are shared between different wavelet kernels, so fewer resources are needed than in the parallel implementation.
The proposed work reduces the critical path of the design by applying algorithms for retiming for pipelining and unfolding. Compared with the conventional lifting architecture, the throughput of the proposed architecture is increased while the computation time is reduced. The rest of the paper is organized as follows. Section II provides a brief introduction to the CDF (5,3) filter and its implementation. Section III describes the proposed algorithm and architecture. The implementation results are provided in Section IV. Finally, concluding remarks are given in Section V.

A. Mathematical Formulation
The standard lifting scheme is divided into three stages, split, predict and update, as discussed in [5] and [7].
Split: The original input sequence and the filter coefficients are split into two branches of even and odd components. To make sure that each branch gets only its desired components and that the output samples remain the same as those obtained by the convolution method, down-sampling by 2 is required in each branch.
Predict: The prediction residual d[n] is generated as the error in predicting the odd samples from the even input samples using a predictor P.
Update: The coarse approximation c[n] is obtained by applying an update operator U to d[n] and adding the result to the even input samples.
The CDF (5,3) wavelet filter has the following coefficients: Lowpass: (-1/8, 2/8, 6/8, 2/8, -1/8); Highpass: (-1/2, 1, -1/2). The split stage is obtained by factorizing the polyphase matrix of the filter, and the predict and update steps are each described by one lifting equation. In these equations, x_2k and x_2k+1 are the even and odd input samples, and y_2k and y_2k+1 represent the low-pass and high-pass output coefficients respectively [22].
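For reference, the lifting steps of the CDF (5,3) filter can be written in their standard form as below. This is a hedged reconstruction from the filter coefficients listed above, assumed (not confirmed) to match the authors' own equations. Here y_{2k+1} denotes the high-pass (detail) coefficient and y_{2k} the low-pass (approximation) coefficient; the ordering of the polyphase factors and the sign convention of the delay z are assumptions.

```latex
% Predict: residual of each odd sample with respect to its even neighbours
y_{2k+1} = x_{2k+1} - \tfrac{1}{2}\left(x_{2k} + x_{2k+2}\right)

% Update: even sample corrected by the two adjacent residuals
y_{2k} = x_{2k} + \tfrac{1}{4}\left(y_{2k-1} + y_{2k+1}\right)

% Substituting the predict step into the update step recovers the low-pass taps
y_{2k} = -\tfrac{1}{8}x_{2k-2} + \tfrac{2}{8}x_{2k-1} + \tfrac{6}{8}x_{2k}
         + \tfrac{2}{8}x_{2k+1} - \tfrac{1}{8}x_{2k+2}

% One possible polyphase factorization (update factor times predict factor),
% acting on the column vector of even and odd polyphase components
P(z) =
\begin{bmatrix} 1 & \tfrac{1}{4}\left(1 + z^{-1}\right) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ -\tfrac{1}{2}\left(1 + z\right) & 1 \end{bmatrix}
```

The third line is a consistency check: the expanded update step reproduces the low-pass coefficients (-1/8, 2/8, 6/8, 2/8, -1/8), and the predict step alone gives the high-pass coefficients (-1/2, 1, -1/2).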

B. VLSI Architecture
The basic architecture of the (5,3) filter is implemented in [14], [16] and [21]. The conventional lifting-based design of the 2-D DWT is given in [14] and shown in Fig. 1. It consists of two stages of 1-D DWT, each stage having a different length of delay element R(n). In the first stage (the row processor), R(n) represents one delay element, while in the second stage (the column processor), R(n) is N delays, where N is the number of pixels in each row of the original image, as shown in Fig. 2. The input image (N × N) is fed to the architecture pixel by pixel using row-by-row scanning, with a single pixel fed in each clock cycle. In the row processor, the 1-D DWT of each row is computed to yield the low- and high-frequency components of that row. The column processor then computes the full set of 2-D DWT components: low-low (LL), low-high (LH), high-low (HL) and high-high (HH).
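The row-then-column structure described above can be sketched in software as follows. This is a minimal behavioural model, not the hardware of [14]: the function names are hypothetical, the input is zero-extended at the boundaries (an assumption about the padding), and the sub-band naming (first letter for the row filter, second for the column filter) is also an assumption.

```python
def lift53(x):
    """One level of CDF (5,3) lifting on an even-length sequence.

    Returns (low, high). The input is treated as zero outside its bounds.
    """
    even, odd = x[0::2], x[1::2]
    n = len(odd)
    # Predict: residual of each odd sample w.r.t. its two even neighbours
    high = [odd[k] - 0.5 * (even[k] + (even[k + 1] if k + 1 < n else 0.0))
            for k in range(n)]
    # Detail coefficient just left of the signal for a zero-extended input:
    # 0 - 0.5 * (0 + x[0])
    high_left = -0.5 * even[0]
    # Update: even sample plus a quarter of the two adjacent residuals
    low = [even[k] + 0.25 * ((high[k - 1] if k > 0 else high_left) + high[k])
           for k in range(n)]
    return low, high


def transpose(m):
    return [list(col) for col in zip(*m)]


def dwt53_2d(image):
    """Row processor followed by column processor; returns (LL, LH, HL, HH)."""
    rows = [lift53(row) for row in image]          # 1-D DWT of every row
    low_rows = [lo for lo, _ in rows]
    high_rows = [hi for _, hi in rows]

    def columns(band):
        pairs = [lift53(col) for col in transpose(band)]
        return (transpose([lo for lo, _ in pairs]),
                transpose([hi for _, hi in pairs]))

    ll, lh = columns(low_rows)                     # columns of the row-low band
    hl, hh = columns(high_rows)                    # columns of the row-high band
    return ll, lh, hl, hh
```

For an N × N input with even N, each of the four returned sub-bands has size N/2 × N/2.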
In the row processor, the input stream is split into odd and even streams, and the predict and update stages follow. The pixels that take part in the computation of the present output y[n] are x[n], x[n-1], x[n-2], x[n-3] and x[n-4]. Hence four delay elements are needed here. Similarly, in the column processor, five pixels from a single column have to be accessed for each computation, so four row buffers (R(n) = N) are used.
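The dependence of one output on five consecutive input samples, and hence the need for four delay elements, can be illustrated with a small streaming model. This is a direct-form sketch using the low-pass taps from Section II, not the lifting datapath of the figures; `stream_low` and the zero-initialised registers are assumptions made for illustration.

```python
from collections import deque

LOW_TAPS = (-1 / 8, 2 / 8, 6 / 8, 2 / 8, -1 / 8)


def stream_low(samples):
    """Feed one sample per clock; emit a low-pass output every second clock."""
    regs = deque([0.0] * 4, maxlen=4)      # four delay elements: x[n-1] .. x[n-4]
    out = []
    for n, x in enumerate(samples):
        window = [x, *regs]                # x[n], x[n-1], x[n-2], x[n-3], x[n-4]
        if n % 2 == 0:                     # down-sampling by 2
            out.append(sum(t * w for t, w in zip(LOW_TAPS, window)))
        regs.appendleft(x)                 # shift the delay line by one sample
    return out


# Two trailing zeros flush the delay line; the first output is a start-up value
# centred before the first sample and is discarded.
row = [3, 1, 4, 1, 5, 9, 2, 6]
low = stream_low(row + [0, 0])[1:]
```

Each output is centred two samples behind the newest input, which is why the window must reach back to x[n-4].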

III. PROPOSED ARCHITECTURE
Timing refers to the logic delays between sequential elements. When a design does not meet timing, the delay of the critical path, that is, the largest delay between flip-flops (composed of combinatorial delay, clk-to-out delay, routing delay, setup time, clock skew, and so on), is greater than the target clock period. In other words, the critical path delay sets an upper limit on the clock frequency of a design. The standard metrics for timing are clock period and frequency [23]. The latency of the conventional structure is 2N + 2, which means 1026 clock cycles for a 512 × 512 image or 2050 clock cycles for an image of size 1024 × 1024. An increase of 5 clock cycles therefore has a negligible effect on the overall latency, but it helps considerably in timing optimization. In order to increase the clock frequency, the critical path delay has to be minimized [23], [24]. The conventional implementation has a critical path delay of 8 adders and 4 multipliers, which needs to be reduced. Two algorithms for checking the feasibility of pipelining to obtain a reduced clock period are provided in [25]. Using the Floyd-Warshall algorithm, we find that reducing the clock period is feasible. The pipelined architecture is then obtained by the cutset technique, in which a set of edges is removed from the signal flow graph in such a way that two disconnected subgraphs are created, and a delay element is added on each of these edges. The resulting 2-D DWT for the reduced clock period is shown in Fig. 4, with its 1-D DWT processor shown in Fig. 3.
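The effect of placing registers on cutset edges can be sketched with a toy signal flow graph. The example below only evaluates the critical path for a given register placement; it does not implement the Floyd-Warshall feasibility check of [25]. The graph (a simple chain of 8 adders and 4 multipliers), the unit delays and the function name are all hypothetical and are not the signal flow graph of the figures.

```python
from functools import lru_cache


def critical_path(delay, edges):
    """Longest combinational delay between registers.

    delay: node -> computation time
    edges: iterable of (src, dst, registers_on_edge)
    Assumes the register-free part of the graph is acyclic.
    """
    succ = {u: [] for u in delay}
    for u, v, regs in edges:
        if regs == 0:                      # only register-free edges extend a path
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(u):
        return delay[u] + max((longest_from(v) for v in succ[u]), default=0.0)

    return max(longest_from(u) for u in delay)


T = {"A": 1.0, "M": 2.0, "S": 0.5}         # hypothetical adder/multiplier/shifter delays
kinds = ["A", "A", "M"] * 4                # 8 adders and 4 multipliers in series

# Conventional: no pipeline registers, so the whole chain is one path
conv_delay = {i: T[k] for i, k in enumerate(kinds)}
conv_edges = [(i, i + 1, 0) for i in range(11)]

# Pipelined: multipliers replaced by shifters, one register on every second edge
# (each edge of a feed-forward chain is a valid cutset)
pipe_delay = {i: T["S" if k == "M" else k] for i, k in enumerate(kinds)}
pipe_edges = [(i, i + 1, 1 if i % 2 == 1 else 0) for i in range(11)]

conventional = critical_path(conv_delay, conv_edges)   # 8*T_A + 4*T_M
pipelined = critical_path(pipe_delay, pipe_edges)      # max(2*T_A, T_A + T_S)
```

In this toy chain the five inserted registers add five cycles of latency, which is the kind of trade-off discussed above.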
As is evident from Fig. 3, delays have been inserted on the cutset edges so that the combinational critical path is divided evenly or nearly evenly. The critical path is now 2 adders, or 1 adder and 1 shifter. In other words, the critical path has been reduced from 8T_A + 4T_M to 2T_A or T_A + T_S, where T_A, T_M and T_S are the computation times of an adder, a multiplier and a shifter respectively. The implementation results show that the minimum critical path delay is reduced from 13.566 ns to 3.01 ns.
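As a quick check of these figures, the reported delays can be turned into a speed-up factor and approximate maximum clock frequencies. The sketch assumes the critical path delay equals the minimum clock period; the variable names are illustrative.

```python
t_conventional_ns = 13.566    # reported critical path delay, conventional design
t_pipelined_ns = 3.01         # reported critical path delay, retimed design

speedup = t_conventional_ns / t_pipelined_ns      # about 4.5
f_conventional_mhz = 1e3 / t_conventional_ns      # about 74 MHz
f_pipelined_mhz = 1e3 / t_pipelined_ns            # about 332 MHz
```

The ratio of about 4.5 is consistent with the "four to five times" reduction stated in the abstract.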

B. Unfolding retimed architecture
So far, the filter function and the clock speed of the architecture have been brought to their optimal levels. We now need to improve its throughput and latency, which can be accomplished by processing multiple adjacent rows simultaneously [26]. This requires multiple pixels to be input per clock cycle and exploits the fact that the windows for vertically adjacent outputs overlap significantly, as can be seen in Fig. 5 and Fig. 6. This amounts to partially unrolling the vertical scan loop through the image. Note that the number of row buffers is unchanged. For an unroll factor of k (processing k rows of pixels in parallel), the combined window size is W × (W + k − 1). Of this, k rows of data are streamed in from the input in parallel, so the remaining W − 1 rows must come from row buffers. These buffers are arranged with a pitch of k rather than simply being chained. The parallel implementation requires k copies of the filter function. However, with some filters, the overlap in windows can even enable some of the filter function logic to be shared [27], reducing the resource requirements further. Our case corresponds to k = 2, so two parallel inputs scan the image in an alternating manner, one input scanning the odd rows of the image pixel by pixel while the other scans the even rows.
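The bookkeeping for the unroll factor and the two-row scan order can be sketched as below. The helper names are hypothetical, and W = 5 is taken from the five column pixels needed by the column processor; the sketch models only the counting and the scan order, not the buffer hardware.

```python
def unrolled_window(W, k):
    """Row counts for a W-wide window processed with unroll factor k."""
    return {
        "window_rows": W + k - 1,   # height of the combined window
        "streamed_rows": k,         # rows arriving from the input in parallel
        "row_buffers": W - 1,       # rows that must come from row buffers
    }


def scan_pairs(image):
    """Yield one (upper-row pixel, lower-row pixel) pair per clock cycle (k = 2)."""
    for r in range(0, len(image), 2):
        for a, b in zip(image[r], image[r + 1]):
            yield a, b


image = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4 x 4 test image
cycles = len(list(scan_pairs(image)))                       # half the pixel count
```

With W = 5, both k = 1 and k = 2 need four row buffers, while the number of input cycles for k = 2 is half the number of pixels.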
Here two rows (even and odd) are processed in parallel, with resources shared between the two parallel branches. The actual implementation is shown in Fig. 7, with the column processor architecture shown in Fig. 8. It is evident from Fig. 8 that two transformed coefficients are output in each clock cycle, so the throughput is twice its initial value. The output latency is also reduced to N + 7.
1) Multiplier Design: Multiplication in binary can be represented as a series of addition and shift operations. Multiplication usually requires more logic and computation time than addition, so it is good practice to minimize the use of multipliers in hardware designs. Here, the multiplications involve the coefficients 0.5 and 0.25, which can easily be obtained using shift operations. Multiplication of the input by 0.5 (= 2^-1) and 0.25 (= 2^-2) requires the input to be shifted right by one and two bits respectively.
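The shift-based multiplication can be sketched on integers as follows. The function name and argument names are hypothetical. Python's `>>` on integers is an arithmetic (floor) shift, which is assumed here to model the signed right shift of the hardware; the truncated result differs from the real-valued product by less than one least significant bit.

```python
def lifting_step_shift(x_even, x_odd, x_even_next, d_prev):
    """One predict/update pair using right shifts instead of multipliers."""
    d = x_odd - ((x_even + x_even_next) >> 1)   # multiply by 0.5: shift right by 1
    c = x_even + ((d_prev + d) >> 2)            # multiply by 0.25: shift right by 2
    return c, d


c, d = lifting_step_shift(x_even=4, x_odd=1, x_even_next=5, d_prev=-3)
```

Because both coefficients are powers of two, no multiplier is needed anywhere in the predict or update stage.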

IV. IMPLEMENTATION RESULTS
We have proposed two 2-D DWT architectures for CDF (5,3) in this paper. The structures are synthesized on the 6VLX760FF1760-2 Xilinx FPGA platform. The word length is set to 16 bits, using signed arithmetic operations, when evaluating the design. The architecture is designed for an image size of 512 × 512 and can easily be extended to any even-sized image with little modification. The hardware can also easily be replicated for J levels. Zero padding is used for the computation of the whole image in order to take care of the filter effects at the image boundary. The architecture of [14] is implemented alongside our proposed architectures to reflect the optimizations made. The comparison of resource utilization and timing is given in Table I and Table II. It is evident from Table I that the hardware requirements of the proposed architecture have increased, but the cost of the extra hardware is small compared with the performance gain achieved by the proposed design, as shown in Table II.
It should be noted that the maximum clock frequency of the pipelined retimed architecture has improved considerably as a result of minimizing the critical path delay. This has cost 13 additional delay elements in total, with a maximum of five delays in the feed-forward path, i.e. five additional latency cycles. The register balancing option available in Xilinx ISE has removed four delay registers without affecting any other parameter. The unfolded retimed architecture parallelizes the pipelined retimed architecture to yield twice the throughput and half the output latency by doubling the filter function, without any addition to the row buffers. The unfolded retimed architecture therefore provides optimized throughput, reduced latency and an optimized clock frequency. In Table III, the performance of the proposed architecture is compared with other existing efficient architectures. It is evident that our architecture is the only one that provides optimized throughput along with a clock period minimized to this extent. We have tested the implementation of the proposed architecture on images of sizes 0.25 megapixels (512 × 512), 5 megapixels (2560 × 1920), 15 megapixels (5184 × 2920) and 20 megapixels (5184 × 3888). The computation times for the four image sizes are 0.397 ms, 5.84 ms, 16.86 ms and 22.50 ms respectively. The corresponding numbers of images that can be transformed per second are approximately 2500, 171, 59 and 44 respectively.
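The images-per-second figures follow directly from the reported computation times. A short check, with illustrative variable names:

```python
times_ms = {
    "512x512": 0.397,
    "2560x1920": 5.84,
    "5184x2920": 16.86,
    "5184x3888": 22.50,
}

# Whole images completed in one second (1000 ms) at each size
images_per_second = {size: int(1000.0 / t) for size, t in times_ms.items()}
```

The values for the three larger sizes match the reported 171, 59 and 44; for 512 × 512 the computed value is slightly above the rounded figure of 2500.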

V. CONCLUSION
In this paper, we have presented a high-speed, memory-efficient architecture that converts images into their corresponding 2-D DWT coefficients. The simulation results show that approximately 4 rows of the input image need to be stored in buffers for encoding an image. The proposed architecture can encode as many as 59 images of 15 megapixels each, or 44 images of 20 megapixels each, in approximately one second, making it a potential candidate for video processing applications. Even higher resolution images can be transformed without affecting the video quality. Furthermore, the proposed architecture is quite simple and can be applied to any image of even size.

TABLE III
HARDWARE AND TIMING PERFORMANCE COMPARISON AMONG DIFFERENT EXISTING 2D CDF 5/3 DWT ARCHITECTURES