Reconfigurable Block-based Normalization Circuit for On-chip Object Detection

Aiwen Luo\(^1\), Fengwei An\(^2\), Xiangyu Zhang\(^2\), Lei Chen and Hans Jürgen Mattausch\(^{1,3}\)

Hiroshima University
\(^1\)Grad. School of Adv. Sciences of Matter, \(^2\)Grad. School of Engineering, \(^3\)Research Inst. for Nanodevice and Bio Systems
1-3-1 Higashi-Hiroshima, Hiroshima 739-8530, Japan
Phone: +81-82-424-5730 E-mail: {aiwen-luo, anfengwei, zhangxiangyu, chen, hjm}@hiroshima-u.ac.jp

Abstract

Normalization is critical for feature-vector optimization in vision-based detection systems. The presented block-based L1-norm-circuit architecture is configurable for different image-cell sizes, cell-based feature descriptors and image resolutions. The applied data-storage scheme includes flexible regulation according to customization parameters from the input. Pedestrian detection accuracy comparable to the L2-norm is obtained with greatly reduced computing complexity. An object-detection prototype system for performance evaluation in 65 nm CMOS implements the developed L1-norm circuit together with a cell-based HOG descriptor.

1. Introduction

The illumination intensity of the light source, the foreground-background contrast, the automatic gain control of the camera and similar factors limit the performance of vision-based detection systems. To avoid degraded recognition performance due to these issues, an effective local contrast normalization method is essential.

We present a general-purpose normalization-circuit architecture, implemented as a reconfigurable ASIC-based solution in 65 nm CMOS technology. Our hardware design applies the simpler L1-norm instead of the L2-norm or the L2-Hys-norm, and this choice is verified to achieve a favorable trade-off between computing complexity and detection accuracy. The developed L1-norm-circuit architecture is reconfigurable for any application using image-cell-based feature vectors (FVs), such as histograms of oriented gradients (HOG) [1] or Haar-like [2] descriptors. Additionally, on-chip memory requirements are strongly reduced, and the circuit remains applicable, with optimized performance, to multiple FV types.

2. L1-norm-circuit architecture and its optimization

Equation (1) defines the local-contrast Lp-norm operation for a block of \(2 \times 2\) image cells. In our design, blocks slide with a fixed half-block stride (i.e., a stride of one cell). The rectangular cell size (CS) is limited only by the on-chip memory capacity.

\[
d'_i = \frac{d_i}{\sqrt[p]{\sum_{j=0}^{n-1}\left(d_{(j,0,0)}^{\,p} + d_{(j,0,1)}^{\,p} + d_{(j,1,0)}^{\,p} + d_{(j,1,1)}^{\,p}\right)}} \quad (1)
\]

Here, \(d_i\) (\(i \in [0, n-1]\)) refers to one component of the n-dimensional cell-FV (e.g. \(n=9\) in [1] and \(n=4\) in [2], with CS=8), and the second and third subscript indices select the cell position within the \(2 \times 2\) block. Because the blocks overlap, each cell response contributes to several components of the final normalized block-descriptor vector.
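The L1 case (\(p=1\)) of the block normalization can be illustrated in software. The following is a minimal floating-point sketch, not the circuit itself; the function name `l1_normalize_block` and the small guard value `eps` are assumptions introduced for illustration and do not appear in the paper.

```python
from typing import List, Sequence


def l1_normalize_block(cells: Sequence[Sequence[float]],
                       eps: float = 1e-9) -> List[float]:
    """L1-normalize the concatenated FVs of the 2x2 cells of one block.

    `cells` holds four n-dimensional cell-FVs. `eps` is an assumed guard
    against division by zero and is not taken from the paper.
    """
    if len(cells) != 4:
        raise ValueError("a block consists of 2 x 2 = 4 cells")
    # For p = 1 the block norm is the sum of absolute values of all
    # components of all four cells.
    block_sum = sum(abs(d) for cell in cells for d in cell)
    # Every component of every cell is divided by the same block sum.
    return [d / (block_sum + eps) for cell in cells for d in cell]
```

With \(n=9\) the returned block-FV has 36 components that sum to approximately 1. Compared with the L2-norm, no squaring or root extraction is needed, only additions and one division per component.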

The developed reconfigurable circuit architecture for block-based L1-normalization consists of three main parts, as illustrated in Fig.1. The upper ‘Block part’ caches and updates the intermediate block-summation results for the currently processed row of image blocks, whose number depends on the image resolution and CS assigned by the application. The necessary storage space for one row of blocks and the corresponding storage locations, i.e. the read and write block addresses (BAs) for the block memory, are regulated in accordance with the cell number in the horizontal (CHN) and vertical (CVN) direction of the input image. Fig.2 shows further details of the reconfigurable ‘block address decoder’, which generates the BAs related to a given image cell according to eq. (2).

\[
BA = \begin{cases} 
B_A^1 = k - \lfloor k / c \rfloor \\
B_A^2 = k - \lfloor k / c \rfloor - 1 \\
B_A^3 = k - \lfloor k / c \rfloor - c \\
B_A^4 = k - \lfloor k / c \rfloor - c + 1 
\end{cases} \quad (2)
\]

Here \(k\) and \(c\) represent the order number of the current cell in the whole image and the CHN value of the image, respectively. Cells marked identically in Fig.3 are overlapped by the same number of blocks (1, 2 and 4 blocks for cells marked in black, gray and white, respectively), but contribute to different block types (i.e. different \(B_A\) in eq. (2)).
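The address generation of eq. (2) can be sketched in software as follows. The sketch assumes that \(k\) counts cells from 0 in raster order, that \(k/c\) is an integer division, and that each image row holds \(c-1\) blocks; the validity conditions for border cells and the parameter `rows` (playing the role of CVN) are likewise assumptions for illustration, not a description of the decoder in Fig.2.

```python
from typing import List


def block_addresses(k: int, c: int, rows: int) -> List[int]:
    """Return the block addresses that cell k contributes to.

    Assumptions (not from the paper): k is 0-based in raster order,
    c is the number of cells per row, rows is the number of cell rows.
    """
    r, x = divmod(k, c)      # cell row and column
    base = k - r             # k - floor(k / c)
    candidates = [
        (base,         r < rows - 1 and x < c - 1),  # cell is top-left
        (base - 1,     r < rows - 1 and x > 0),      # cell is top-right
        (base - c,     r > 0 and x > 0),             # cell is bottom-right
        (base - c + 1, r > 0 and x < c - 1),         # cell is bottom-left
    ]
    # Border cells lack some neighbouring blocks, so drop invalid entries.
    return [ba for ba, valid in candidates if valid]
```

Under these assumptions a corner cell yields one address, an edge cell two and an interior cell four, which matches the overlap counts of 1, 2 and 4 blocks stated for Fig.3.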

Caching and extension of one row of cells are handled in the lower ‘Cell part’ of the circuit architecture in Fig.1. The four cells belonging to the same block are output successively. To ensure fixed-point computation in the middle ‘Pipelined L1-norm processing’ part of Fig.1, each cell-FV dimension is multiplied by a factor of \(2^m\) (\(m=12\) in the practical design). The final descriptor-FV of a detection window is constructed by combining the normalized block-FVs (i.e., \(d'_1, d'_2, \ldots, d'_n\)) of all related blocks. Importantly, the block and cell memories are reused once the processing of the current block row and cell row, respectively, is complete. Consequently, on-chip memory requirements are strongly reduced.
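The effect of the \(2^m\) scaling can be illustrated with integer arithmetic. This is a minimal sketch assuming non-negative integer FV components and truncating division; the rounding behaviour, the handling of an all-zero block and the function name are assumptions, not details taken from the circuit.

```python
from typing import List

M = 12  # scaling exponent m stated for the practical design


def l1_normalize_fixed_point(block: List[int], m: int = M) -> List[int]:
    """Fixed-point L1-normalization of one block-FV.

    Each component is scaled by 2**m before the division by the block
    sum, so the result stays an integer in the range [0, 2**m].
    Assumes non-negative integer components (e.g. histogram bins).
    """
    total = sum(block)
    if total == 0:
        # Assumed convention for an empty block: all-zero output.
        return [0] * len(block)
    # Left shift by m multiplies by 2**m; // is truncating division.
    return [(d << m) // total for d in block]
```

For example, the block-FV `[1, 1, 2, 4]` has a block sum of 8 and maps to `[512, 512, 1024, 2048]`, i.e. the fractions 1/8, 1/8, 1/4 and 1/2 expressed in units of \(2^{-12}\).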

3. Performance evaluation

An object-detection system with the developed general-purpose normalization circuit is prototyped in 65 nm CMOS and has a core area of 2.395 mm \(\times\) 1.195 mm, as shown in Fig. 4. The layout of the L1-norm circuit in Fig. 5 verifies its logical density. Table 1 summarizes the properties of our general-purpose L1-norm circuit in comparison to [3]. Figures 6 and 7 show that a detection system with the L1-norm achieves accuracy comparable to that with the L2-norm and much better accuracy than without normalization (no-norm) for different feature descriptors and classifiers. A pedestrian-detection accuracy of 98.23% with the L1-norm is verified for a pre-trained linear support vector machine (SVM) classifier, which allocates a trained threshold and a fixed weight for each dimension of a 3780-dimensional HOG descriptor, evaluated on 5656 scan-windows of 64×128-pixel size.
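The descriptor size of 3780 follows from the window geometry. A minimal sketch of the arithmetic, assuming CS=8, \(n=9\), blocks of \(2 \times 2\) cells sliding by one cell, and concatenation of all four cell-FVs per block; the function name is illustrative only.

```python
def descriptor_dimension(win_w: int, win_h: int,
                         cell_size: int = 8, n: int = 9) -> int:
    """Number of components in the descriptor-FV of one scan window."""
    cells_x = win_w // cell_size      # cells per window row
    cells_y = win_h // cell_size      # cells per window column
    # 2x2-cell blocks with a one-cell stride: one block fewer than
    # cells in each direction.
    blocks = (cells_x - 1) * (cells_y - 1)
    # Each block contributes 4 cells with n components each.
    return blocks * 4 * n
```

For a 64×128-pixel window this gives 8×16 cells, 7×15 = 105 blocks and 105 × 36 = 3780 components.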

4. Conclusions
The reported reconfigurable circuit architecture for L1-normalization is compatible with various cell-based feature descriptors, flexibly adapts data-storage resources to input parameters for customization, and enables a favorable trade-off between computing complexity and detection accuracy.

Acknowledgements
The VLSI-chip was fabricated through the chip fabrication program of VDEC, the University of Tokyo in collaboration with Renesas Electronics, Synopsys, and Cadence.

References

Fig.1 Reconfigurable circuit architecture for block-based L1-normalization of cell-based feature vectors.

Fig.3 The corresponding BAs for all cells according to their positions.

Fig.5 Layout of the developed circuit for L1-normalization.

Fig.6 Pedestrian-detection accuracy as a function of the cluster (reference) number. A cell-based Haar-like descriptor [2] and a nearest neighbor search (NNS) classifier are used.

Fig.7 Pedestrian-detection accuracy with a cell-based HOG feature descriptor [1] and a pre-trained SVM classifier, based on 5656 scan-windows of 64×128-pixel size.

Table 1 Comparison with previous work

<table>
<thead>
<tr>
<th></th>
<th>IEICE [3]</th>
<th>This work</th>
</tr>
</thead>
<tbody>
<tr>
<td>Technology</td>
<td>65 nm CMOS</td>
<td>65 nm CMOS</td>
</tr>
<tr>
<td>Memory on chip</td>
<td>610 kb (one core)</td>
<td>328 kb (36 kb for norm)</td>
</tr>
<tr>
<td>Normalization</td>
<td>L2-Hys-norm</td>
<td>L1-norm</td>
</tr>
<tr>
<td>Feature descriptor</td>
<td>Cell-based HOG</td>
<td>Cell-based HOG [1]</td>
</tr>
<tr>
<td>Resolution</td>
<td>1980×1080 pixels</td>
<td>128 cells × ∞ (up to 8×8-pixel cells)</td>
</tr>
<tr>
<td>Frame rate</td>
<td>30 fps (HD)</td>
<td>31 fps (XGA)</td>
</tr>
</tbody>
</table>

Fig.2 More detailed structure of the ‘block address decoder’ control circuit (see Fig.1) for coordinating the reconfigurable normalization with cell position and image size, designated by the configuration parameters CHN (up to 128) and CVN (unlimited).