# Low-Power Image-Segmentation VLSI Design Based on a Pixel-Block Scanning Architecture

K. Okazaki, K. Awane, N. Nagaoka, T. Sugahara, T. Koide, and H. J. Mattausch

Research Institute for Nanodevice and Bio Systems, Hiroshima University, Kagamiyama, Higashi-Hiroshima, Japan

Phone: +81-82-424-6265 Fax: +81-82-424-3499 E-mail: keita-okazaki@hiroshima-u.ac.jp

## 1. Introduction

Image-processing tasks that operate on objects segmented from video images are being actively researched [1, 2]. In applications such as object tracking or real-time recognition, only the objects of particular interest should be extracted, and this has to happen in real time. It is therefore important to realize an image-segmentation preprocessing step that extracts the objects of interest at low power consumption and nevertheless with short processing time.

We have previously proposed a digital image-segmentation algorithm based on region growing [3] and implemented it in specialized hardware that can execute high-speed segmentation. The algorithm extracts both moving and still objects, can be applied to color images, and strongly suppresses noise. These features make it suitable for the image-segmentation step in object-tracking and recognition systems. We have also proposed a pixel-block scanning architecture [4], which further reduces the implementation area and the power consumption while maintaining real-time processing capability. In addition, processing time and implementation area can be tailored to the needs of a given application by optimizing the shape and size of the pixel block.

In this paper, we propose a new technique for substantial further improvement of the segmentation efficiency, and verify its effectiveness by MATLAB simulation and a complete ASIC design.

## 2. Pixel-Block Scanning Image Segmentation

Fig. 1 shows the concept of the pixel-block scanning method for image segmentation. The input image is divided into small blocks, each of which is processed in parallel by Image Segmentation Elements (ISEs), and the image is scanned by selecting and processing the blocks one by one. Pixel status data for all blocks are stored in embedded memories with large access bandwidth. Complex address computation for the block data is avoided by matching the block number to the memory address and by providing a very wide word length.

Fig. 1. Conceptual diagram of the algorithm for pixel-block scanning image segmentation.

Fig. 3. (a) Flowchart for continued block-internal region growing and (b) implementation circuit for the block size of the CMOS test chip.

Fig. 2 shows a block diagram of the complete VLSI implementation architecture. Embedded memories are used during the segmentation process to store connection weights between pixels (connection-weight memory), the intermediate status of each pixel (leader-cell, segmentation, and excitation flag memories), and the final segment number of the pixels (label-number memory). The image-segmentation processing is done in parallel for each pixel block by the image segmentation unit, and the label numbers are generated by the label generator. A controller generates the state signals of the segmentation algorithm and the memory addresses, and selects the pixel blocks for processing.
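The block-number-as-address scheme can be illustrated with a minimal Python sketch. It assumes 1-bit status flags, an 80 × 60 image with 80 × 2 blocks (the sizes used for the test chip in Section 4), and a plain list standing in for the embedded memory. The helper names `pack_block` and `unpack_block` are hypothetical and not taken from the design.

```python
from typing import List

WIDTH, HEIGHT = 80, 60          # image size (assumed: test-chip values)
BLOCK_ROWS = 2                  # rows per pixel block (80 x 2 pixels)
NUM_BLOCKS = HEIGHT // BLOCK_ROWS
WORD_BITS = WIDTH * BLOCK_ROWS  # one wide word holds a whole block


def pack_block(flags: List[List[int]], block: int) -> int:
    """Pack the 1-bit flags of one block into a single wide memory word."""
    word = 0
    for r in range(BLOCK_ROWS):
        for x in range(WIDTH):
            bit = flags[block * BLOCK_ROWS + r][x] & 1
            word |= bit << (r * WIDTH + x)
    return word


def unpack_block(word: int) -> List[List[int]]:
    """Recover the per-pixel flags of one block from its memory word."""
    return [[(word >> (r * WIDTH + x)) & 1 for x in range(WIDTH)]
            for r in range(BLOCK_ROWS)]


# Flag memory model: the list index (address) is simply the block number,
# so selecting a block needs no address arithmetic beyond the block counter.
flags = [[0] * WIDTH for _ in range(HEIGHT)]
flags[5][3] = 1                 # pixel in image row 5, i.e. block 2, row 1
memory = [pack_block(flags, b) for b in range(NUM_BLOCKS)]
```

With these assumed sizes, the memory has 30 words of 160 bits each, and one read returns the flags of all pixels that the ISEs process in parallel.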

## 3. Continued Block-Internal Region Growing

In the previously reported architecture [3], which already enables real-time segmentation at a low power dissipation of a few hundred mW, the region-growing process moved to the next block immediately after one growth step. However, moving to the next block after each growth step requires a memory-write operation to save the status data of the present block and a memory-read operation to load the status data of the next block. As the word length of both memory accesses is quite large, avoiding these accesses after each growth step is expected to reduce both the power dissipation and the total number of clock cycles required for segmentation. We have therefore developed a technique that keeps processing the same block until region growth within this block has ended, and added it to the image-segmentation architecture.

Fig. 3 shows a flowchart of the continued block-internal growing technique and a circuit diagram for its implementation with a block size of  $2 \times 80$  pixels, which involves only the Excitation-Flag Memory. If at least one excitable pixel remains in the present block, a region-growing-continuance signal (keep grow) is turned on, and the data held in additional registers are used again for the next region-growth cycle processed by the ISEs. In Fig. 3b, row1 and row2 denote the data of the processed block, while row0 and row3 denote the rows neighboring this block. The data of the neighboring rows, which are needed for region growing within the block, are not modified and are therefore not written back when the processed block changes. If the next processed block directly follows the present block, a continuation signal (seq num) is turned on, and the clock cycles used for reading and writing the memory are hidden within the pipelined execution of the region growing.
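The effect of staying in a block until growth stops can be sketched with a small behavioral model in Python. This is a simplified illustration under stated assumptions, not the circuit of Fig. 3: connection weights are reduced to a binary `mask` of pixels that may join the region, growth uses 4-neighborhoods, and every block visit stands for one wide memory read plus one write. The names `grow_step` and `segment` are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[int]]


def grow_step(mask: Grid, exc: Grid, top: int, rows: int) -> bool:
    """One parallel growth step for the block covering rows [top, top+rows).

    Rows outside the block are only read (like row0/row3), never modified.
    Returns True if at least one pixel was newly excited.
    """
    h, w = len(mask), len(mask[0])
    new = []
    for y in range(top, min(top + rows, h)):
        for x in range(w):
            if mask[y][x] and not exc[y][x]:
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and exc[ny][nx]:
                        new.append((y, x))
                        break
    for y, x in new:          # apply all updates at once (parallel ISEs)
        exc[y][x] = 1
    return bool(new)


def segment(mask: Grid, seed: Tuple[int, int], rows: int,
            continued: bool) -> Tuple[Grid, int]:
    """Grow one region from `seed`; return excitation flags and block visits."""
    exc = [[0] * len(mask[0]) for _ in mask]
    exc[seed[0]][seed[1]] = 1
    visits = 0
    changed = True
    while changed:            # repeat scans until a full scan grows nothing
        changed = False
        for top in range(0, len(mask), rows):
            visits += 1       # one block read + write-back in this model
            keep_grow = grow_step(mask, exc, top, rows)
            changed |= keep_grow
            while continued and keep_grow:   # stay in the block while it grows
                keep_grow = grow_step(mask, exc, top, rows)
    return exc, visits


mask = [[1] * 8 for _ in range(6)]           # toy 8 x 6 image, one region
full_base, base_visits = segment(mask, (0, 0), rows=2, continued=False)
full_cont, cont_visits = segment(mask, (0, 0), rows=2, continued=True)
```

Both schedules reach the same final region, but in this toy case the continued variant needs only two scans of the three blocks, whereas the one-step-per-block variant has to revisit every block many times.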

The boundary-scan-only technique [4] is also implemented. It improves the processing efficiency by memorizing the top and the bottom block of the segment part that is actually growing. Only the blocks between the top and the bottom block are processed, which minimizes useless processing of blocks.
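A rough Python sketch of the scan-range bookkeeping is given below. The source states only that the top and bottom blocks of the growing part are memorized; the function name `blocks_to_scan`, the representation of the growing part as a set of row indices, and the one-block `margin` that lets growth cross a block border are all assumptions made for illustration.

```python
from typing import Iterable


def blocks_to_scan(excited_rows: Iterable[int], block_rows: int,
                   num_blocks: int, margin: int = 1) -> range:
    """Return the block numbers between the top and bottom growing block.

    `margin` (an assumption of this sketch) adds neighboring blocks so that
    the region can grow across a block border in the next scan.
    """
    rows = list(excited_rows)
    if not rows:
        return range(0)                      # nothing is growing
    top = max(min(rows) // block_rows - margin, 0)
    bottom = min(max(rows) // block_rows + margin, num_blocks - 1)
    return range(top, bottom + 1)


# Growing part currently spans image rows 10..14 of a 60-row image
# with 2-row blocks (30 blocks in total).
scan = blocks_to_scan({10, 11, 14}, block_rows=2, num_blocks=30)
```

In this example only 5 of the 30 blocks are visited in the next scan, which is the kind of saving the technique aims at.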

## 4. ASIC Design and Performance Evaluation

We designed a test chip for pixel-block scanning image segmentation in 180 nm CMOS technology, which implements the method of continued block-internal region growing and the boundary-scan concept for the grown region. An input image size of  $80 \times 60$  pixels and a block size of  $80 \times 2$  pixels were chosen for the test chip. The main change required for a larger input image would be an increased storage capacity of the embedded memories.

Fig. 4 shows the test-chip layout and Table I summarizes the design data. The present design area of  $4.66 \text{ mm} \times 4.66 \text{ mm}$  with a cell/core ratio of 39.6 % still leaves room for further compaction of the design.

Table I. Design and performance data of the test chip for pixel-block scanning image segmentation.

| Parameter                                           | Value                           |
|-----------------------------------------------------|---------------------------------|
| Technology                                          | 180 nm CMOS with 5 metal layers |
| Supply voltage                                      | 1.8 V                           |
| Input image size                                    | $80 \times 60$ pixels           |
| Number of processing elements                       | $80 \times 2$ ISEs              |
| Design area                                         | 4.66 mm × 4.66 mm               |
| Cell/core ratio                                     | 39.6 %                          |
| Operating frequency                                 | 8.6 MHz (simulation)            |
| Average processing cycles for image segmentation    | 8,441 cycles                    |
| Average power dissipation during image segmentation | 44.66 mW (simulation)           |
| Average image-segmentation performance              | 1,000 fps (simulation)          |

Fig. 4. Layout of the pixel-block scanning LSI for image segmentation in 180 nm CMOS technology.

Fig. 5. Image segmentation example for one of the 40 test images.

A benchmark set of 40 natural images (see the example in Fig. 5) was used to estimate the number of clock cycles and the power consumption of the test chip during image segmentation, resulting in an average of 8,441 segmentation cycles and an average power dissipation of 44.66 mW at 8.6 MHz. These results mean that even at the low clock frequency of 8.6 MHz a segmentation performance of 1,000 fps is realized, which is far more than the standard video frame rate of 30 fps required for real-time processing. Consequently, a 30 times lower clock frequency of 287 kHz, with a power dissipation of about 1.48 mW, is already sufficient for real-time video segmentation.
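The frame-rate and frequency-scaling figures follow from simple arithmetic, reproduced here as a short Python check. The sketch assumes that power scales linearly with clock frequency at a fixed supply voltage (dynamic power dominant), which is what the 1.48 mW estimate implies; the variable names are illustrative.

```python
F_CLK_HZ = 8.6e6        # simulated clock frequency
CYCLES_PER_IMAGE = 8441 # average segmentation cycles per image
POWER_MW = 44.66        # average power dissipation at 8.6 MHz
SCALE = 30              # clock reduction factor discussed in the text

# Frames per second at full clock: one image every CYCLES_PER_IMAGE cycles.
fps = F_CLK_HZ / CYCLES_PER_IMAGE

# Scaled operating point (assumption: power is proportional to frequency).
f_scaled_hz = F_CLK_HZ / SCALE
power_scaled_mw = POWER_MW / SCALE
fps_scaled = f_scaled_hz / CYCLES_PER_IMAGE

print(f"{fps:.0f} fps at {F_CLK_HZ / 1e6:.1f} MHz")
print(f"{fps_scaled:.1f} fps at {f_scaled_hz / 1e3:.0f} kHz, "
      f"{power_scaled_mw:.2f} mW")
```

The full-speed rate comes out slightly above 1,000 fps, and the scaled operating point of roughly 287 kHz still exceeds the 30 fps video rate.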

The specific improvement due to continued block-internal region growing amounts to a reduction in power consumption from 62.2 mW to 44.7 mW and a reduction of the average number of clock cycles for image segmentation from 12,771 to 8,441. Since the energy per image is the product of power and processing time, the energy consumed for one image segmentation is thus reduced by 52.6 %.
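The energy figure can be checked from the power and cycle numbers, using energy per image = power × cycles / clock frequency. The Python sketch below assumes that both designs are compared at the same 8.6 MHz clock, which the text does not state explicitly; the function name is hypothetical.

```python
F_CLK_HZ = 8.6e6  # assumed identical for both designs


def energy_per_image_uj(power_mw: float, cycles: int,
                        f_clk_hz: float = F_CLK_HZ) -> float:
    """Energy for one segmented image in microjoules."""
    seconds = cycles / f_clk_hz
    return power_mw * 1e-3 * seconds * 1e6


e_before = energy_per_image_uj(62.2, 12771)  # one growth step per block visit
e_after = energy_per_image_uj(44.7, 8441)    # continued block-internal growing
reduction = 1 - e_after / e_before

print(f"{e_before:.1f} uJ -> {e_after:.1f} uJ "
      f"({reduction * 100:.1f} % reduction)")
```

With the rounded values quoted in the text this gives a reduction of about 52.5 %, in line with the reported 52.6 %. The small difference presumably comes from rounding of the published power and cycle numbers. The ratio itself does not depend on the assumed clock frequency as long as it is the same for both designs.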

## 5. Conclusion

In this paper, we proposed and evaluated a continued block-internal growing technique for our previously reported image-segmentation architecture based on pixel-block scanning [4]. Evaluation by a test-chip design in 180 nm CMOS shows an average power consumption of 44.7 mW at 8.6 MHz and a segmentation performance of 1,000 fps. This means that the test chip can realize real-time segmentation with about 1.48 mW power dissipation at a clock frequency of 287 kHz. Due to the new technique of continued block-internal segment growing, the energy consumption per segmented image is reduced by 52.6 % in comparison to our previously proposed architecture.

## Acknowledgements

Part of this work is supported by the program "Interdisciplinary Research on Integration of Semiconductor and Biotechnology" for "Creation of Innovation Centers for Advanced Interdisciplinary Research Areas", the Ministry of Education, Culture, Sports, Science and Technology of Japan.

## References

- [1] Y. Watanabe, et al., Proc. of SASIMI, pp. 63-68 (2007).
- [2] S. W. Seol, et al., Proc. of ITC-CSCC, pp. 260-263 (2003).
- [3] T. Morimoto, et al., *IEICE Trans. on Info. & Systs.*, vol. E87-D, no. 2, pp. 500-503 (2004).
- [4] T. Morimoto, et al., Ext. Abst. of SSDM, pp. 590-591 (2006).