# Butterfly-Unit Based Programmable Computation Element Using Merged Module of Multiplication, Division and Square Root

Leo Karnan<sup>\*</sup>, Naoto Miyamoto<sup>\*</sup>, Kazuyuki Maruo<sup>0</sup>, Koji Kotani<sup>\*</sup> and Tadahiro Ohmi<sup>1</sup>

\* Graduate School of Engineering, Tohoku University, 05 Aoba, Aramaki, Aoba, Sendai, Japan
<sup>0</sup>Advantest Laboratories Ltd., 48-2 Matsubara, Kamiayashi, Aoba-ku, Sendai, Japan
<sup>1</sup>New Industry Creation Hatchery Center, Tohoku University, 10 Aoba-ku, Aramaki, Aoba, Sendai, Japan
Phone: +81-22-217-3977 FAX: +81-22-217-3986 e-mail : karnan@fff.niche.tohoku.ac.jp

### 1. Introduction

Implementing complicated application that performs many complex-computation operations in general purpose processor requires a lot of processing time. But an implementation of such algorithm with separated circuits on application-specific LSI consumes a large chip area. A reconfigurable computation element that can execute multiple operations is very important to design a small-area high-performance processor.

In this paper we introduce an FFT-processor's butterfly unit (BU) based programmable computation element (B<sup>2</sup>-PCE) suitable for signal processing applications. By modifying the standard configuration of the BU and provide some interconnect programmability, B<sup>2</sup>-PCE that can implement various operations frequently used in signal processing application has been developed. Small-area processing modules, which merges multiplication, division and square root functions that is used in B<sup>2</sup>-PCE as a key programmable element are also presented. With flexibility and functional improvements provided by B<sup>2</sup>-PCE, implementing functions with only a single computation element is considered much more efficient than executing them individually on separated circuits.

# 2. Butterfly Unit Based Programmable Computation Element (B<sup>2</sup>-PCE)

The newly developed FFT-processor is using BU as a computation element. Detailed schematic of a BU shown in Fig. 1 suggests that it has 4 complex-number inputs (2 of them are wheel factor data) and 2 complex-number outputs. It is composed of 4 multipliers, 3 adders and 3 subtractors connected with fixed interconnect, that made the FFT-processor completely unprogrammable, and has no function other than performing Fourier transforms.

In order to extend the processor function to cover wider target application especially in the signal processing purposes, we modified its BU's structure. At first, to make the most frequently used arithmetic operations in signal processing algorithms such as addition, subtraction, multiplication, division and square root functions implementable by BU, we have replaced the regular multiplication, division, and square root. We have implemented multiplication, division, and square root as a merged module rather than put it as separate modules, since our experimental result shows that a merged module consumes less area as described later (see section 3). Additional operations other than basic arithmetic, such as MAC (Multiply-Accumulate), correlation and butterfly-operation are also implementable in the  $B^2$ -PCE.

The detailed schematic of newly designed B<sup>2</sup>-PCE is shown in Fig. 2. The proposed  $B^2$ -PCE's structure is very similar to normal BU shown in Fig. 1 with some additional programmable interconnect resources. In case of 8 bit B<sup>2</sup>-PCE, 18 AND-gates, 48 2:1 multiplexers and 16 3:1 multiplexers are needed as additional interconnect resources. Total additional resources needed are about 250 gates, which is 33% smaller compared to 8 bits normal multiplier's total gates. It implies that only a few interconnect resources are needed to transform normal BU into a  $B^2$ -PCE that is capable of implementing many complicated functions used in many signal processing algorithms. Summary of implementable functions and 12 bits of configuration data needed to configure the  $B^2$ -PCE is shown in Table I. Of course by inserting a simple decoder circuit, configuration data can be reduced from 12 to 3 bits.

## 3. Merged Module

An individual multiplication, division, or square root circuit can be constructed using an array of full-adders [1] (see Fig. 3). Based on the similarity, we have designed 2 types of merged modules, multiplication/division/square root merged-module and multiplication/division merged module, are used in proposed  $B^2$ -PCE (schematic of 4-bits merged-module is shown in Fig. 4). Both constructed using an array of full-adders, so that common full-adders can be shared in the module effectively. In case of modules with 8 bits data input, the total gates used to construct the merged-modules are reduced up to 35% as compared with individual implementation (separated modules). We also evaluated module's worst-case delay. The increase of merged-module's delay is small, although it depends on the utilized function, as shown in Table II.

### 4. Result and Conclusions

In this paper, we proposed a small-area PCE based on FFT-processor's BU for signal processing application. Using the proposed  $B^2$ -PCE, not only simple arithmetic operations, but also very complicated signal processing algorithm such as pattern matching algorithm called Phase-Only Correlation (POC) is implementable. As shown in Fig. 5, the POC performs complex-computation operations such as FFT, MAC, square root, division, correlation and IDFT, which is hard to implement by general processor with reasonable cost. Since the proposed

 $B^2$ -PCE can implement all of the POC required functions consecutively, high-performance chip-area-effective POC processor can be designed. Designing POC processor using the proposed  $B^2$ -PCE significantly reduces the chip area for more than 40% as compared to the implementation using separated circuits.

#### Acknowledgements

The circuits and chip in this study was designed at the chip fabrication program of the VLSI Design Center (VDEC), the University of Tokyo in the collaboration with Synopsys and Cadence CAD tools.

#### References

[1] K.K Parhi, "A Systematic Approach for Design of Digit-Serial Signal Processing Architectures", IEEE Trans. Circuit Syst, vol. 38, no. 4, pp.358-375, Apr. 1991



Fig. 1 Diagram of standard butterfly unit



Fig. 2 Diagram of programmable computation element

|                           | Configuration Signals<br>(S1~S12) |                        | Configuration Signals<br>(S1~S12) |  |  |  |  |  |  |
|---------------------------|-----------------------------------|------------------------|-----------------------------------|--|--|--|--|--|--|
| Butterfly                 | 10*00000010                       | Square-root            | 01100001110*                      |  |  |  |  |  |  |
| Correlation               | 0100000001*                       | Complex Addition       | 10**00000101                      |  |  |  |  |  |  |
| Division                  | 01100000100*                      | Complex Subtraction    | 10*00000111                       |  |  |  |  |  |  |
| MAC                       | 01110100000*                      | Complex Multiplication | 11001110000*                      |  |  |  |  |  |  |
| 8 functions is utilizable |                                   |                        |                                   |  |  |  |  |  |  |











merged-module

Table II Summary of module's gates and delay

|              |                    | Gates                                                         | Delay [nsec] |         |       |       |  |  |
|--------------|--------------------|---------------------------------------------------------------|--------------|---------|-------|-------|--|--|
|              |                    |                                                               | MULTIPLIER   | DIVIDER | SQRT  |       |  |  |
|              | Separated Module A |                                                               | 1098.3       | 9.45    | 21.93 | 13.45 |  |  |
|              | Merged Module A    |                                                               | 715.0        | 11.54   | 26.12 | 17.65 |  |  |
|              | Separated Module B |                                                               | 775.7        | 9.45    | 21.93 | -     |  |  |
|              | Merged Module B    |                                                               | 603.0        | 11.54   | 24.02 | -     |  |  |
| Modules Type |                    |                                                               |              |         |       |       |  |  |
|              | А Туре             | A Type 8-bits Multiplier, 8-bits Divider, 16-bits Square Root |              |         |       |       |  |  |
|              | В Туре             | 8-bits Multiplier, 8-bits Divider                             |              |         |       |       |  |  |