# Multiplierless Unified Architecture for Mixed Radix-2/3/4 FFTs

Fahad Qureshi, Jarmo Takala Faculty of Computing and Electrical Engineering Tampere University of Technology Tampere, Finland Email: fahad@tut.fi, jarmo.takala@tut.fi Anastasia Volkova, Thibault Hilaire Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6, F-75005, Paris, France Email: anastasia.volkova@lip6.fr, thibault.hilaire@lip6.fr

Abstract—This paper presents a novel runtime-reconfigurable, mixed radix core for computation 2-, 3-, 4- point fast Fourier transforms (FFT). The proposed architecture is based on radix-3 Wingorad Fourier transform, however multiplication is performed by constant multiplication instead of general multiplier. The complexity is equal to multiplierless 3-point FFT in terms of adders/subtractors with the exception of a few additional multiplexers. The proposed architecture supports all the FFT sizes which can be factorized into 2, 3, 4 point systems. We also show that the proposed architecture has the same bound on the accuracy as the classical one.

Index Terms—Fast Fourier transform (FFT), Memory-based FFT, Mixed radix

### I. INTRODUCTION

Fast Fourier transform is an integral part of orthogonal frequency division multiplexing (OFDM) system, which is used in almost every communication application, e.g. 3GPP long term evolution (LTE), wireless personal area network (WPAN), wireless metropolitan area network (WMAN), and other mobile applications. In order to design FFT processors, various algorithms and architectures have been proposed over the last four last decades. These can be divided into two types, power-of-two and non-power-of-two. Certainly, the design of non-power-of-two FFT processors are more challenging because both data management and data processing is not regular [1]–[6].

Over time, various FFT algorithms and architectures have been proposed to design an efficient FFT processor. In order to meet the requirements of applications, mainly two principal FFT architectures are used: pipelined and memory-based [7]-[10]. In general, memory-based is preferred, as it has several advantages, such as low area or hardware resources occupied by the architecture and low power consumption with complex control logic. However, there are two important challenges during its design: one is the conflict-free memory access and other is the design of core for computation of butterflies (small point FFTs or radices). The design of the core for a single radix is relatively simple and a well-known subject. However, the design of multiradix architectures still proposes some challenges. Usually, multiradix architectures are used for hardware cost reduction purposes. This is achieved by reusing the adders and multipliers [1], [3], [11]. These cores are based on mapping the all radices onto computations on the largest radix. However, the radix-2/3/4 FFTs cannot be easily mapped on the radix-4 by the usual approach. This is due to the fact that radix-2/4 FFTs do not require any multiplication, while radix-3 has one non-trivial multiplication.

In this paper, we propose to replace the general multiplier of the radix-3 FFT by a constant multiplier and then to map the radix-2 and radix-4 on the multiplierless radix-3 architecture. As a result, we propose a core that is multiplierless and can be configured by multiplexers for multiradix purposes. This core can also be used for variable length FFTs, e.g. the 3GPP LTE requires various transform sizes: 128, 256, 512, 1024, 1536, 2048 and 12 - 1296 respectively [12]–[15]. Our design neither requires the separate hardware for each radix, nor requires the use of costly multiplier. Furthermore its accuracy is not worse than the accuracy of the classic implementations.

The paper is organized as follows. We briefly review the background on the individual architectures of 2, 3, and 4-point FFTs in Section II. Then, we explain the main idea behind the multiplier-less multiplication in fixed-point arithmetic in Section II-C. Then, Section III presents the proposed multiplierless architecture. In section Section IV we propose a computer arithmetic based approach on the error analysis of the obtained implementation. Finally Section V provides the main conclusions of paper.

# **II. TECHNICAL PRE-REQUISITES**

## A. Radix-2/3/4 FFT

The Cooley-Tukey FFT algorithm [16] is based on the decomposition of a large FFT into two small FFTs. This decomposition is applied recursively until it reaches the small point FFTs. The small FFTs are the basic processing elements of the FFT computation, they are often referred to as butterflies or radices.

The radix-2 FFT can be expressed as;

$$\begin{bmatrix} X(0) \\ X(1) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \cdot \begin{bmatrix} x(0) \\ x(1) \end{bmatrix},$$
 (1)

where  $x(\cdot)$  and  $X(\cdot)$  represent the vectors of the input and output sequences respectively.

The radix-3 FFT is the following computation:

$$\begin{bmatrix} X(0)\\ X(1)\\ X(2) \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1\\ 1 & -0.5 - \frac{\sqrt{3}}{2}i & -0.5 + \frac{\sqrt{3}}{2}i\\ 1 & -0.5 + \frac{\sqrt{3}}{2}i & -0.5 - \frac{\sqrt{3}}{2}i \end{bmatrix} \cdot \begin{bmatrix} x(0)\\ x(1)\\ x(2) \end{bmatrix}.$$
 (2)

Finally, radix-4 corresponds to

$$\begin{bmatrix} X(0) \\ X(1) \\ X(2) \\ X(3) \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & i & -1 & -i \\ 1 & -1 & 1 & -1 \\ 1 & -i & -1 & i \end{bmatrix} \cdot \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \end{bmatrix}.$$
 (3)

The Signal Flow Graphs (SFG) that represent the architectures of radix-2, 3, and 4 are shown in Fig. 1. The SFG of radix-2, and 4 are based on the Cooley-Tukey FFT algorithm, whereas the radix-3 SFG is based on Winograd Fourier transform algorithm (WFTA).

# B. Winograd Fourier Transform Algorithm

This algorithm has the minimum number of multiplications at the expense of introducing a few extra additions [17]. Although WFTA is very efficient for small prime size FFTs, for larger sizes the number of additions becomes too high for practical implementations. In the WFTA, the FFT computation equation is written as

$$[X(0)....X(N-1)]^T = O \cdot M \cdot I \cdot [x(0)....x(N-1)]^T,$$
(4)

where I is a matrix corresponding to additions between inputs, M is a diagonal matrix with the multiplications, and O is a matrix corresponding to additions after the multiplications. The advantages of such architecture is that the multiplication is either real or imaginary.

Denote  $c = r + j \cdot i$  to be a complex constant only with real part  $r = \cos \alpha$ , i = 0, and  $x = x_{re} + j \cdot x_{im}$  be a complex variable. Then, the complex product  $Y = c \cdot x$  has real and imaginary parts that are computed with

$$\begin{bmatrix} Y_{re} \\ Y_{im} \end{bmatrix} = \begin{bmatrix} \cos \alpha & 0 \\ 0 & \cos \alpha \end{bmatrix} \begin{bmatrix} x_{re} \\ x_{im} \end{bmatrix}, \quad (5)$$

where imaginary part is zero. Conversely, multiplication of x by a constant c with only imaginary part  $r = 0, i = \sin \alpha$  is described as

$$\begin{bmatrix} Y_{re} \\ Y_{im} \end{bmatrix} = \begin{bmatrix} 0 & -\sin\alpha \\ \sin\alpha & 0 \end{bmatrix} \begin{bmatrix} x_{re} \\ x_{im} \end{bmatrix}.$$
 (6)

These considerations are in the basis of the WFTA and permit to transform (2) into (12). Therefore, WFTA requires only a semi-complex multiplier for multiplication with complex data and provides two times smaller hardware cost.

# C. Fixed-Point Arithmetic

In exact arithmetic computations such a semi-complex multiplication would be easy to perform. However, in modern digital systems, the memory registers with finite capacity are used. Therefore, finite-precision arithmetic is used to represent values and perform computations.

Consider 2's complement Fixed-Point Arithmetic [18], [19]. In this arithmetic, real numbers are represented as integers



Fig. 1. Signal flow graph of radix-2/3/4.

scaled by an implicit quantization factor. Such arithmetic benefits from fast bit operations. However, some accuracy concerns are related to such passage to finite precision.

Let r and i be the real and imaginary part of the complex constant c. In our case we suppose that the input sequences are scaled such that r < 1 and i < 1, i.e. we have only fractional part to be represented. Therefore, if implemented with b bits, the real and imaginary parts are stored in machine as integers  $R, I \in z \in [2^{b-1}, -2^{b-1}-1] \cap \mathbb{Z}$ , which are obtained by

$$R = \lfloor 2^{b-1} \cdot \cos \alpha \rceil, \quad I = \lfloor 2^{b-1} \cdot \sin \alpha \rceil, \tag{7}$$

where  $\lfloor . \rceil$  represents the rounding operation (round to nearest). Therefore, the Fixed-Point counterparts of r and i are  $R \cdot 2^{-b+1}$ and  $I \cdot 2^{-b+1}$  respectively. The quantization error  $\epsilon$  depends on the chosen rounding mode. In case of truncation, we obtain that

$$-2^{-b+1} \le \epsilon < 0. \tag{8}$$

Another issue that inevitably follows the coefficient quantization is that the additions and multiplications are performed with finite number of bits as well, which may lead to computational errors. After being propagated, these errors may be significantly amplified. This issue will be addressed in Section IV.

TABLE I NUMBER OF OPERATIONS OF SMALL POINT FFTS.

| Size    | adds/subs | trivial mults | non-trivial mults |
|---------|-----------|---------------|-------------------|
| Radix-2 | 2         | 0             | 0                 |
| Radix-3 | 6         | 0             | 2                 |
| Radix-4 | 8         | 1             | 0                 |

### **III. PROPOSED ARCHITECTURE**

The architecture of radix is obtained by direct implementation of the SFG from Fig. 1, which requires a number of hardware components equal to operations. Table I present the number of operations of each radix respectively. Here, under a trivial multiplication we mean multiplication by -j that can be simply performed by swapping the real and imaginary part of data. As result, it does not influence the overall hardware cost. It can be seen from the Table I that there is no single radix on which we could to map the rest of the radices: radix-3 is the only one to have a multiplier but it lacks two adders for the radix-4 architecture.

We, on the other side, propose to replace the classic WFTA implementation by the multiplierless version, which introduces additional adders instead of the multiplier, and enables us to map the radix-2/4/FFTs on it.

# A. Multiplierless radix-3 WFTA architecture

A single multiplication by a constant is used in radix-3 WFTA architecture. It is possible to use shift-and-add circuit to efficiently implement the multiplication instead of general multiplier.

Generally, the canonical signed digit representation is used to reduce the number of non-zero digits with respect to the simple binary representation and, therefore the number of adders [20]. Further simplification is achieved by single constant multiplication techniques [21], [22]. They exploit the redundancy in the multiplication by single constant. Additionally, improvement in accuracy and complexity can be achieved by addition-aware coefficient quantization method [23].

Using these techniques proposed in [20]–[23], we obtain a number of candidate coefficients whose fractional bits range are tabulated in Table II.

The selection is based on specifications like adder cost and coefficient quantization error  $\epsilon$ . It can be observed that three different techniques are required for different coefficient wordlengths in Table II:

• using 2 adders

$$C \cdot A = -1j \cdot ((A \cdot 2^3 - A) \cdot 2^4 - A)/2^7, \qquad (9)$$

• using 3 adders:

$$C \cdot A = -1j \cdot ((A \cdot 2^3 - x) \cdot 2^7 - (A \cdot 2^3 + x))/2^{10}, \quad (10)$$

• using 4 adders

$$C \cdot A = -1j \cdot ((((A \cdot 2^{3} - A) \cdot 2^{2} - (A \cdot 2^{3} - A)) \cdot 2^{3} - ((A \cdot 2^{3} - A) \cdot 2^{2} - (A \cdot 2^{3} - A))) \quad (11) - (A \cdot 2^{3} - A) \cdot 2^{11})/2^{14}.$$

TABLE II DIFFERENT QUANTIZATIONS OF THE NON-TRIVIAL COEFFICIENT  $\frac{\sqrt{3}}{2}$  and number of adders to realize the multiplierless structure.

| Fractional Bits | Coefficient          | Adders | $\epsilon$            |
|-----------------|----------------------|--------|-----------------------|
| 7               | $111 \times 2^{-7}$  | 2      | $1.12\times 10^{-3}$  |
| 8               | $222\times2^{-8}$    | 2      | $1.12\times 10^{-3}$  |
| 9               | $443\times2^{-9}$    | 3      | $7.91\times 10^{-4}$  |
| 10              | $887 \times 2^{-10}$ | 3      | $1.85\times 10^{-4}$  |
| 11              | $1774\times2^{-11}$  | 3      | $1.85\times 10^{-4}$  |
| 12              | $3547\times2^{-12}$  | 4      | $5.86 \times 10^{-5}$ |
| 13              | $7094\times2^{-13}$  | 4      | $5.86\times10^{-5}$   |
| 14              | $14189\times2^{-14}$ | 4      | $2.42\times 10^{-6}$  |
| 15              | $28378\times2^{-15}$ | 4      | $2.42\times 10^{-6}$  |



Fig. 2. Multiplierless architecture for radix-3 (Multiplication  $\left[-\frac{\sqrt{3}}{2}j\right]$ , 3 adders).

We selected the 10 - bit coefficient with 3 adders, which has the coefficient quantization error bounded by  $1.85 \times 10^{-4}$ . Usually, the shifts are free of cost as they are hardwired. The final multiplierless architecture is illustrated on the Fig. 2

### B. Multiplierless unified radix-2/3/4 FFT

The multiplierless unified architecture is based on mapping the 2/3/4-radix into a single processing core. First, we design a multiplierless radix-3 FFT architecture. It contains enough computational resources (adders and substractors) to map the radix-2/4, the main challenge is to reduce the number of multiplexers. To obtain this, it is important to find common parts in the signal flow graphs that can be mapped without multiplexers. Further, multiplexer can be avoided by setting the unused inputs of the circuit to zero, which removes the unnecessary connections of the circuit. Using these techniques, a solution with only seven two-to-one multiplexers controlled with two control signals, has been designed. The resulting architecture is shown in Fig. 3. The input and output relations are tabulated in Table III, where the dashes denote "don't care" conditions and 0 denotes that the input should be zeroed for proper operations. Finally, signals controlling the multiplexers are shown in Table IV, where dash denotes "don't care" condition.

#### **IV. ERROR ANALYSIS**

Any practical finite precision implementation of the above discussed classical and new FFTs will suffer from finiteprecision effects. Usually, for the finite-precision error analysis in signal processing domain a stochastic approach is



Fig. 3. Multiplierless unified radix-2/3/4 FFT.

TABLE III INPUTS AND OUTPUTS OF PROPOSED ARCHITECTURE.

| Input configurations |          | Output configurations |           |          |      |      |      |
|----------------------|----------|-----------------------|-----------|----------|------|------|------|
| Index IN             | FFT size |                       | Index OUT | FFT size |      |      |      |
|                      | 2        | 3                     | 4         | muex 001 | 2    | 3    | 4    |
| 0                    | x(0)     | x(0)                  | x(0)      | 0        | X(0) | X(0) | X(2) |
| 1                    | x(1)     | 0                     | x(2)      | 1        | X(1) | X(1) | X(1) |
| 2                    | 0        | x(1)                  | x(1)      | 2        | 0    | X(2) | X(3) |
| 3                    | 0        | x(2)                  | x(3)      | 3        | 0    | 0    | X(0) |

used: the quantization and computational errors are seen as a white uniformly distributed noise added to the signal. Such a statistical modeling, described in details by Widrow [24] or Constantinides [25], does not give precise values that the errors can take but provides the average and variance that describe the statistical dispersion of the error.

We, on the other hand, propose to perform accurate erroranalysis while exploiting the properties of Fixed-Point Arithmetic. We will model the errors as intervals (tighter than we usually deduce with statistical approach) and investigate their propagation. In order to analyze the accuracy of any implementation, we need to fully specify the way the arithmetic operations are implemented in hardware. In our implementations we ensure the following:

• Multiplication: our Fixed-Point multiplier, denoted  $\otimes$ , guarantees that the product of two real numbers x and y, that are represented with  $b_x$  and  $b_y$  bits respectively, is computed such that the output represented with  $b = \min\{b_x, b_y\}$  bits has its error is bounded by

$$|x \cdot y - \texttt{fxp\_mul}(x, y)| < 2^{-b+1}$$

 Addition: we suppose that the input data is scaled such that no over- or under-flow occur in the computations; and since the operands will always be in the same format, no

TABLE IV Control signals to obtain different FFT sizes.

| FFT size | $s_0$ | $s_1$ | $s_2$ |
|----------|-------|-------|-------|
| Radix-2  | 0     | 0     | 0     |
| Radix-3  | 1     | 1     | 0     |
| Radix-4  | 0     | 0     | 1     |

computational errors are introduced for the Fixed-Point addition.

# A. Classical WFTA radix-3 FFT

The classic WFTA radix-3 point FFT with a multiplier as shown in Fig. 2(c) can be represented as

$$[X(0)....X(N-1)]^{T} = B_{4} \cdot B_{3} \cdot W \cdot B_{2} \cdot B_{1}[x(0)....x(N-1)]^{T}$$
(12)

where matrices  $B_i(3 \times 3)$  contain only trivial terms (0, 1, -1)and  $W(3 \times 3)$  contains the non-trivial coefficient. Such computation can also be represented as an implicit system of linear equations:

$$\begin{cases} Jt = Nx \\ y = Lt \end{cases}$$
(13)

where

$$J = \begin{pmatrix} -1 & 0 & 0 & 0 \\ B_2 & -1 & 0 & 0 \\ 0 & W & -1 & 0 \\ 0 & 0 & B_3 & -1 \end{pmatrix}, N = \begin{pmatrix} B_1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \quad (14)$$
$$L = \begin{pmatrix} 0 & 0 & 0 & B_4 \end{pmatrix}$$

and  $J \in \mathbb{C}^{12 \times 12}$ ,  $N \in \mathbb{R}^{12 \times 3}$ ,  $L \in \mathbb{R}^{3 \times 12}$ ; and  $\mathbb{1}, \mathbb{0} \in \mathbb{R}^{3 \times 3}$ are identity and zero matrices respectively.

**Quantization errors:** using the bound (8), we know that instead of J we actually implement the quantized to b bits matrix  $J_q := J - \Delta J$ , where

$$|\Delta J_{i,j}| \leq \begin{cases} 2^{-b+1}, & \text{if } J_{i,j} \text{ is the non-trivial coefficient} \\ 0, & \text{otherwise} \end{cases}$$
(15)

Therefore, instead of system (13) we compute

$$\begin{cases} J_q t = Nx \\ y = Lt \end{cases}$$
(16)

**Multiplication errors:** consider the above assumptions concerning Fixed-Point multiplication and that the wordlength of non-trivial factor  $\frac{\sqrt{3}}{2}$  is at least equal to the the wordlength of the input. Then, instead of the exact vector t, we actually compute  $\hat{t} := t - \Delta t$ , where

$$|\Delta t_i| \leq \begin{cases} 2^{-b+1}, & \text{if } J_{i,:} \text{ contains non-trivial factor} \\ 0, & \text{otherwise} \end{cases}$$
(17)

Therefore, the actually implemented output  $\hat{y}$  is computed with

$$\begin{cases} J_q \hat{t} = Nx + \Delta t \\ \hat{y} = L \hat{t} \end{cases}$$
(18)

Then, for a given input vector  $x \in \mathbb{R}^3$ , the implementation error of the radix-3 WFTA is bounded by:

$$||y - \hat{y}|| \le ||L(J^{-1} - J_q^{-1})Nx - LJ_q\Delta t||,$$
 (19)

where  $||\cdot||_{\infty}$  is some norm. Thanks to the sparse nature of  $\Delta J$  and  $\Delta t$ , the error bound (19) can be simplified:

$$||y - \hat{y}|| \le ||Q \cdot x + p||,$$
 (20)

with

$$Q = \begin{pmatrix} 0 & 0 & 0\\ 0 & -2^{-b+1} & 2^{-b+1}\\ 0 & 2^{-b+1} & -2^{-b+1} \end{pmatrix}, \quad p = \begin{pmatrix} 0\\ -2^{-b+1}\\ 2^{-b+1} \end{pmatrix}$$
(21)

Matrix Q represents the impact of the coefficient quantization errors and vector p holds the impact of multiplication errors to the final output.

**Remark:** Consider the case, when the coefficient  $\frac{\sqrt{3}}{b}$  is quantized to  $b_c$  bits, and  $b_c$  is less than the wordlength b of the input/output of the FFT algorithm. Then, the bound p will change to

$$p = \left(\begin{array}{c} 0\\ -2^{-b_c+1}\\ 2^{-b_c+1} \end{array}\right)$$

#### B. Multiplierless radix-3 point FFT

Similarly to the case of the WFTA radix-3 FFT, our multiplierless design can be represented as a system of linear equations (13) but with slightly different matrix J. The difference between two designs is the use of shifts and adds technique instead of the multiplication. In order to show that the classic and proposed designs have equivalent behavior in terms of errors, consider following reasoning.

**Quantization errors:** the multiplierless architectures that we proposed in Section III-A are based on coefficients that have already been quantized (error bounds are given in Table II). Therefore, we obtain that modeling (15) holds.

**Multiplication errors:** if the additions in multiplierless architecture are performed with enough guard bits (it can be shown, that only 2 bits are required), then the error bound (17) is satisfied as well.

Therefore, our modeling of the error propagation stays the same and the new multiplierless design has an error bound equivalent to the implementation with a multiplier (or better if additions in the multiplierless multiplication are done with care).

#### V. CONCLUSION

In this paper, we have presented a multiplierless architecture for the radix-3 WFTA and provided the shift-andadders specifications for the non-trivial constant quantized to the 6 - 14 bits, which covers the majority of real-life applications. This radix-3 multiplierless architecture permitted us to map the radix-2 and 4 classical FFTs upon it. The proposed universal architecture requires only a few additional multiplexers controlled by two control signals. Finally, we proposed a new modeling for the Fixed-Point error analysis of the classical WFTA and showed that the new multiplierless design respects the same error bounds.

# ACKNOWLEDGMENT

This work is partially supported by Tekes - the Finnish Funding Agency for Innovation under funding decision 40142/14 (StreamPro) in the FiDiPro program and by the ANR (French National Research Agency) grant ANR-13-INSE-0007-02 MetaLibm.

### REFERENCES

- F. Qureshi, M. Garrido, and O. Gustafsson, "Unified architecture for 2, 3, 4, 5, and 7-point DFTs based on winograd fourier transform algorithm," *Electron. Lett.*, vol. 49, no. 5, pp. 348–349, February 2013.
- [2] T. Patyk, F. Qureshi, and J. Takala, "Hardware-efficient twiddle factor generator for mixed radix-2/3/4/5 FFTs," in *Proc. IEEE Workshop Signal Process. Syst.*, Oct 2016, pp. 201–206.
- [3] J. Chen, J. Hu, S. Lee, and G. E. Sobelman, "Hardware efficient mixed radix-25/16/9 FFT for LTE systems," *IEEE Trans. VLSI Syst.*, vol. 23, no. 2, pp. 221–229, Feb 2015.
- [4] M. Garrido, "A new representation of FFT algorithms using triangular matrices," *IEEE Trans. Circuits Syst. I*, vol. 63, no. 10, pp. 1737–1745, Oct 2016.
- [5] S. J. Huang and S. G. Chen, "A high-throughput radix-16 FFT processor with parallel and normal input/output ordering for IEEE 802.15.3c systems," *IEEE Trans. Circuits Syst. I*, vol. 59, no. 8, pp. 1752–1765, Aug 2012.
- [6] F. Qureshi and O. Gustafsson, "Generation of all radix-2 fast Fourier transform algorithms using binary trees," in *Proc. Europ. Conf. Circuit Theory Design*, Aug 2011, pp. 677–680.
- [7] M. Garrido, "Efficient hardware architectures for the computation of the FFT and other related signal processing algorithms in real time," Ph.D. dissertation, Universidad Politécnica de Madrid, 2009.
- [8] M. Garrido, R. Andersson, F. Qureshi, and O. Gustafsson, "Multiplierless unity-gain SDF FFTs," *IEEE Trans. VLSI Syst.*, vol. 24, no. 9, pp. 3003–3007, Sept 2016.
- [9] M. Garrido, J. Grajal, M. A. Sanchez, and O. Gustafsson, "Pipelined radix-2<sup>k</sup> feedforward FFT architectures," *IEEE Trans. VLSI Syst.*, vol. 21, no. 1, pp. 23–32, Jan 2013.
- [10] M. Garrido, M. Snchez, M. L. Lpez-Vallejo, and J. Grajal, "A 4096point radix-4 memory-based FFT using DSP slices," *IEEE Trans. VLSI Syst.*, vol. PP, no. 99, pp. 1–5, 2016.
- [11] A. Wang, J. Bachrach, and B. Nikoli, "A generator of memory-based, runtime-reconfigurable 2<sup>n</sup>3<sup>m</sup>5<sup>k</sup> FFT engines," in *Proc. IEEE Int. Conf. Acoust. Speech Signal Process.*, March 2016, pp. 1016–1020.
- [12] K. Xia, B. Wu, X. Zhou, and T. Xiong, "An efficient prime factor memory-based FFT processor for LTE systems," in *Proc. IEEE Int. Symp. Circuits Syst.*, May 2016, pp. 1546–1549.
- [13] I. Cho, T. Patyk, D. Guevorkian, J. Takala, and S. Bhattacharyya, "Pipelined FFT for wireless communications supporting 128 – 2048/1536 -point transforms," in *Proc. IEEE Global Conf. on Signal* and Information Process., Dec 2013, pp. 1242–1245.
- [14] Altera, "1536-point FFT for 3GPP long term evolution," Tech. Rep., Oct 2007, application Note 480.
- [15] Sheng-Yeng, Kai-Ting, Chao-Ming, and Yuan-Hao, "Energy-efficient 128-2048/1536-point FFT processor with resource block mapping for 3GPP-LTE system," in *Proc. IEEE Int. Conf. Green Circuits Syst.*, June 2010, pp. 14–17.
- [16] J. Cooley and J. Tukey, "An algorithm for the machine calculation of complex Fourier series'," *Math. Comput.*, vol. 19, pp. 297–301, 1965.
- [17] S. Winograd, "On computing the discrete Fourier transform," Math. Comput., vol. 32, pp. 175–199, 1978.
- [18] J. V. Newmann, "First draft of a report on the edvac," Tech. Rep., 1945.
- [19] T. Finley, "Two's complement," Cornell University lecture notes, 2000.
- [20] H. Liu and H. Lee, "A high performance four-parallel 128/64-point radix-2<sup>4</sup> FFT/IFFT processor for MIMO-OFDM systems," in *Proc. IEEE Asia-Pacific Conf. Circuits Syst.*, Nov 2008, pp. 834–837.
- [21] O. Gustafsson, A. G. Dempster, K. Johansson, M. D. Macleod, and L. Wanhammar, "Simplified design of constant coefficient multipliers," *Circuits, Systems and Signal Processing*, vol. 25, no. 2, pp. 225–251, 2006.
- [22] J. Thong and N. Nicolici, "Time-efficient single constant multiplication based on overlapping digit patterns," *IEEE Trans. VLSI Syst.*, vol. 17, no. 9, pp. 1353–1357, Sept 2009.
- [23] M. Garrido, F. Qureshi, and O. Gustafsson, "Low-complexity multiplierless constant rotators based on combined coefficient selection and shift-and-add implementation (CCSSI)," *IEEE Trans. Circuits Syst. I*, vol. 61, no. 7, pp. 2002–2012, July 2014.
- [24] B. Widrow and I. Kollár, Quantization Noise: Roundoff Error in Digital Computation, Signal Processing, Control, and Communications. Cambridge, UK: Cambridge University Press, 2008.
- [25] W. L. G. A. Constantinides, Peter Y. K. Cheung, Synthesis and optimization of DSP algorithms. Kluwer, 2004.