



# Energy and Area Efficient Signal Processing Circuits Using Planar Embedded DRAM

Hongbin Sun\*, Kalyana Sundaram Venkataraman<sup>†</sup>, Yiran Li<sup>†</sup>, Ningde Xie<sup>‡</sup>, Nanning Zheng<sup>\*</sup> and Tong Zhang<sup>†</sup>

\* Xi'an JiaoTong University, Xi'an

E-mail: sunsir,nnzheng@mail.xjtu.edu.cn

<sup>†</sup> Rensselaer Polytechnic Institute, Troy, NY, USA

E-mail: tzhang@ecse.rpi.edu

<sup>‡</sup> Storage Technology Group, Intel Corporation, Hillsboro, OR, USA

E-mail: ningdexie@gmail.com

Abstract-This paper studies the feasibility and potential of using planar embedded DRAM (eDRAM), which is completely compatible with the CMOS logic process, to improve the circuit implementation efficiency of signal processing algorithms. In spite of its apparent cell area efficiency advantage over SRAM, planar eDRAM is not widely used in practice, mainly due to its very short retention time. In this work, we contend that short retention time may not necessarily be a fundamental issue for implementing signal processing algorithms because they typically handle streaming data, exhibit regular and predictable data access patterns, and have a large algorithm/architecture design space. This paper elaborates on the rationale and application of using planar eDRAM in signal processing circuit implementations. For the purpose of demonstration, we use low-density parity-check (LDPC) code decoding as the test vehicle. Beyond straightforward SRAM replacement, we propose an interleaved read/write page-mode DRAM operation to reduce planar eDRAM energy consumption by leveraging the LDPC code decoding data access pattern. We carried out detailed planar eDRAM SPICE simulations at the 45nm node to obtain its characteristics, based on which we quantitatively evaluate the effectiveness of using planar eDRAM in this case study.

## I. INTRODUCTION

The non-stop advance of digital communication technologies over the past two decades, together with the ever increasing demand for multimedia data access anytime anywhere, has made multimedia communication an emerging killer application and a major driving force for the global communication and semiconductor industry. Because of the memory intensive nature of both baseband communication and multimedia signal processing, their silicon implementations must integrate a large amount of random access memory (RAM) with very high logic-memory interconnect bandwidth. In current design practice, RAM can be realized as either static RAM (SRAM) or dynamic RAM (DRAM). Although SRAM can be readily integrated with logic circuits on the same die, it has a relatively low storage density because each memory cell consumes 6 or 8 transistors. On the other hand, DRAM has a much higher storage density (at least by a factor of $8 \sim 10$), but cannot be directly integrated with logic circuits on the same die because explicit fabrication of capacitors for DRAM cells is not readily compatible with the logic process. As a compromise, embedded DRAM (eDRAM) [8], [14] can solve the process incompatibility issue at extra fabrication cost and storage density penalty.

Motivated by the attractive storage density advantages of DRAM over SRAM, there have been some studies on eDRAM without explicitly fabricated capacitors, e.g., thyristor RAM (T-RAM) [3], Z-RAM [11], and planar eDRAM [5], [12], [13], to obviate the process compatibility issue. Compared with the other capacitor-less eDRAM technologies that are still under development, planar eDRAM simply relies on the parasitic capacitance of the MOS structure to realize charge storage and hence does not involve any new material/device issues. Nevertheless, as a penalty for its simplicity and complete process compatibility, planar eDRAM has a very short retention time (e.g., a few $\mu$s or even a few hundred ns) due to the small parasitic storage capacitance. Therefore, the use of planar eDRAM in practice clearly faces a critical problem: short retention time demands a very high eDRAM refresh frequency, which makes it more likely that a refresh stalls a user data access request and severely reduces memory system availability. This could result in significant overall memory system performance degradation and induce a large degree of data access latency uncertainty, which tends to make designers readily rule out the use of planar eDRAM in practice.

In this work, we contend that, although planar eDRAM may not be readily applicable to general-purpose computing (e.g., being used as cache in microprocessors), it holds promising potential in signal processing application-specific integrated circuits (ASICs). This is due to several characteristics shared by most signal processing functions: (i) the data streaming nature of signal processing, with its very short data lifetime, can largely relax the memory retention time constraints; (ii) in sharp contrast to general-purpose computing, the data access pattern in most signal processing functions is very regular and predictable, which can be leveraged to hide the memory refresh from normal memory data access; (iii) there is typically a large algorithm and architecture design space for signal processing functions, hence we may optimize the signal processing algorithm/architecture design for the use of planar eDRAM. In this paper, we formally propose, for the first time, the use of planar eDRAM in signal processing ICs and discuss the potential design issues and opportunities. Moreover, we use low-density parity-check (LDPC) code decoding as a case study to demonstrate this proposed design strategy. We developed specific algorithm and architecture design solutions to explicitly exploit planar eDRAM and improve its effectiveness in the case study. To facilitate the case study, we developed a tool to model planar eDRAM systems at the 45nm node based on detailed SPICE simulations and the popular memory modeling tool CACTI [1]. Detailed modeling and simulation results show that planar eDRAM indeed holds promising potential to replace conventional SRAM to improve energy efficiency and reduce silicon cost of memory-intensive signal processing ICs.

## II. RATIONALE AND APPLICATION OF USING PLANAR EDRAM IN SIGNAL PROCESSING ICS

In this work, we advocate the use of planar eDRAM instead of SRAM in memory-hungry signal processing circuits to reduce silicon cost and energy consumption. Although planar eDRAM can achieve much higher storage density than SRAM without incurring any fabrication process compatibility issues, it suffers from a very short retention time (e.g., a few $\mu$s or even a few hundred ns)<sup>1</sup>. This demands a very high memory refresh frequency, which may noticeably degrade the memory system availability and hence result in a severe performance penalty. Intuitively, such a refresh-induced memory system performance penalty can be more easily obviated if the data access pattern is more regular and predictable. In the context of general-purpose computing, the memory data access pattern tends to be irregular and unpredictable. Therefore, it tends to be difficult to efficiently address this refresh-induced problem for general-purpose computing. For example, Leung et al. [5] proposed a heavily multi-banked planar eDRAM architecture with an integrated SRAM buffer that can hide the internal memory refresh operations and behave like a normal SRAM. This design solution has been commercialized by MoSys [10]. Nevertheless, the use of a multi-bank architecture and an SRAM buffer inevitably degrades the effective storage density (e.g., MoSys claims only a 2x increase of storage density of its planar eDRAM system over SRAM). As a result, planar eDRAM is not widely used in computing systems.

On the other hand, short memory retention time may not be a critical issue in the context of signal processing ASIC implementation. For most signal processing algorithms, their signal flows are largely deterministic and memory data access tends to be very regular and completely predictable. Therefore, we can readily leverage the inherent data access regularity and predictability to enable concurrent data access and internal eDRAM refresh without any conflicts, i.e., the refresh operation does not induce any memory system availability degradation. In addition, since many signal processing functions handle streaming data, e.g., baseband signal processing and multimedia signal processing, the eDRAM refresh operation may even be completely eliminated if the data stream throughput is high enough.

Beyond simply using planar eDRAM as a drop-in replacement for SRAM to reduce silicon cost, there is potential for developing signal processing algorithm and architecture design solutions that further exploit eDRAM characteristics to improve various system performance metrics. Due to the destructive read of DRAM, each memory read and write access incurs an operation on all the memory cells along the same memory wordline, leading to a large energy consumption overhead. In current design practice, designers can use page-mode read and write commands to reduce such energy consumption overhead. For the signal processing functions of interest, the data access regularity can be readily leveraged to enable an aggressive use of page-mode operations to reduce energy consumption. Moreover, we may even modify the memory internal architecture and operation control so that it adapts much better to the regular memory data access inherent in signal processing functions. We will use LDPC code decoding as an example to quantitatively demonstrate this point.

## III. PLANAR EDRAM CHARACTERIZATION AND MODELING

To facilitate the case study on evaluating the use of planar eDRAM in an LDPC decoder, we first carry out planar eDRAM characterization at the 45nm node through SPICE simulations and develop a planar eDRAM system modeling tool based upon the well-known memory modeling tool CACTI [1]. Fig. 1 shows the two possible planar eDRAM cell structures considered in this work. Because of the regular data access pattern and streaming nature of most signal processing algorithms, planar eDRAM memory retention time tends to be less of an issue (e.g., a few $\mu$s or even a few hundred ns of retention time could be sufficient). Hence, our primary goal is to maximize the planar eDRAM effective storage density by leveraging the relaxed memory retention time constraint.



Fig. 1. Two possible planar eDRAM cell structures.

First, we carry out SPICE simulations to compare these two NMOS-NMOS memory cell structures and select the one that can ensure a longer retention time at the minimal cell size (i.e., the normalized width of the storage NMOS transistor is 1). We set the simulation temperature as $77^{\circ}C$, and assume that each individual memory cell array consists of 32 wordlines and 128 bitlines. We specify the memory cell retention time as the duration during which a sensing margin of 25mV can be established. We set the power supply as 1.1V. For the cell structure shown in Fig. 1(a), the voltage $V_{storage}$ on the gate of the storage NMOS transistor is an important parameter that can significantly impact the storage capacitance and hence the achievable retention time. In addition, for both the cell structures shown in Fig. 1, the precharge bias voltage at the sense amplifier is also an important parameter that can influence the retention time. Fig. 2 shows the SPICE simulation results that reveal the impact of these two parameters on the memory cell retention time of these two cell structures. It shows that the cell structure shown in Fig. 1(a) tends to be a better choice.

<sup>1</sup>In comparison, eDRAM with explicitly fabricated capacitors at extra fabrication cost can achieve much longer retention time, e.g., the eDRAM being used in IBM server processors has $40\mu$s retention time [2].



Fig. 2. (a) Simulation results to show the impact of  $V_{storage}$  and precharge bias voltage  $V_{precharge}$  on the retention time of the cell in Fig. 1(a) with normalized width of 1 at 45nm node; (b) Simulation results to show the impact of  $V_{precharge}$  on the retention time of the cell in Fig. 1(b) with normalized width of 1 at 45nm node.

In addition, it is well known that wordline underdrive (i.e., driving the wordline to a slightly negative voltage instead of 0V) can noticeably reduce the memory cell leakage current and hence improve the memory cell retention time. Hence, we carry out further simulations to evaluate the effectiveness of wordline underdrive, as shown in Fig. 3. Nevertheless, we note that wordline underdrive complicates the memory peripheral circuit implementation, which is the price paid for the longer memory cell retention time.



Fig. 3. Variation in retention time with wordline underdrive for the cells in Fig. 1(a) and (b) at the 45nm node.

Based upon the extensive SPICE simulations, Table I lists the memory cell retention time under different storage NMOS transistor width for the two cell structures in Fig. 1 with the following configurations: precharge bias  $V_{precharge}$  of 0.75V, wordline underdrive of 0V (i.e., without using wordline underdrive), and storage voltage  $V_{storage}$  of 1.1V for the cell structure in Fig. 1(a).

TABLE I RETENTION TIME WITH RESPECT TO STORAGE NMOS TRANSISTOR WIDTH.

| Cell structure (Fig. 1) | Cell size ($F^2$) | Normalized width | Retention time ($\mu$s) |
|-------------------------|-------------------|------------------|-------------------------|
| 45nm: (a)               | 24                | 1                | 3.00                    |
| 45nm: (a)               | 30                | 2                | 25.10                   |
| 45nm: (a)               | 36                | 3                | 50.30                   |
| 45nm: (b)               | 28                | 1                | < 0.02                  |
| 45nm: (b)               | 36                | 2                | < 0.02                  |
| 45nm: (b)               | 44                | 3                | 1.5                     |
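The trade-off in Table I between cell size and retention time can be read as a simple selection problem: given how long the data must survive, pick the smallest cell that retains it. The following is a minimal sketch of that reading, not the tool used in this work. The function name `smallest_cell` and the example lifetimes are hypothetical, and the two "< 0.02" entries are pessimistically treated as 0.

```python
# Table I entries: (cell structure in Fig. 1, cell size in F^2,
# normalized storage transistor width, retention time in microseconds).
# The two "< 0.02" entries are stored as 0.0, a pessimistic assumption,
# so that they never satisfy a positive retention requirement.
TABLE_I = [
    ("a", 24, 1, 3.00),
    ("a", 30, 2, 25.10),
    ("a", 36, 3, 50.30),
    ("b", 28, 1, 0.0),
    ("b", 36, 2, 0.0),
    ("b", 44, 3, 1.5),
]


def smallest_cell(required_retention_us):
    """Return the smallest-area entry whose retention time meets the
    requirement, or None if no entry in Table I qualifies."""
    candidates = [row for row in TABLE_I if row[3] >= required_retention_us]
    if not candidates:
        return None
    return min(candidates, key=lambda row: row[1])


# Hypothetical data lifetimes, in microseconds, chosen only for illustration.
for lifetime_us in (1.0, 10.0, 100.0):
    print(lifetime_us, smallest_cell(lifetime_us))
```

Under these assumptions, any data lifetime up to $3\mu$s is served by the minimum-size cell of structure (a), which is why a relaxed retention requirement translates directly into storage density.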

Based upon the above SPICE simulation results, we decide to use the cell structure as shown in Fig. 1(a) in our case study on LDPC decoding. Moreover, we modify the existing CACTI DRAM modeling tool to support the use of planar eDRAM cells. This developed CACTI planar eDRAM modeling tool can optimize the planar eDRAM structural organization and estimate memory system performance metrics such as area, access latency, and energy consumption. We use the hierarchical sense amplification strategy presented in [2], as illustrated in Fig. 4, to reduce the sense amplifier circuitry overhead and increase speed. The CACTI modeling tool uses the cell characteristics obtained from SPICE simulations with the following configurations: precharge bias  $V_{precharge}$  of 0.75V, without using wordline underdrive, and storage voltage  $V_{storage}$  of 1.1V. Using this developed modeling tool, we estimate the performance metrics of 2MB and 4MB planar eDRAM as shown in Table II. For the purpose of comparison, we also list the performance metrics of SRAM obtained from CACTI modeling.



Fig. 4. Illustration of the hierarchical sense amplification circuitry being used in the memory modeling tool.

TABLE II CACTI ESTIMATION RESULTS OF SRAM AND PLANAR EDRAM.

| Metric              | 2MB SRAM | 2MB eDRAM | 4MB SRAM | 4MB eDRAM |
|---------------------|----------|-----------|----------|-----------|
| Area ($mm^2$)       | 14.099   | 3.749     | 30.075   | 8.308     |
| Access Latency (ns) | 2.055    | 1.824     | 2.353    | 2.016     |
| Access Energy (nJ)  | 0.892    | 0.852     | 1.217    | 1.206     |
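To make the comparison in Table II easier to read, the short sketch below converts each SRAM/eDRAM pair into a relative reduction. It only post-processes the numbers reported in Table II; the dictionary layout and the helper name `relative_reduction` are illustrative choices, not part of the modeling tool.

```python
# Table II values (CACTI estimates at the 45nm node), stored as
# (SRAM, planar eDRAM) pairs for each capacity and metric.
TABLE_II = {
    "2MB": {
        "area_mm2": (14.099, 3.749),
        "latency_ns": (2.055, 1.824),
        "energy_nJ": (0.892, 0.852),
    },
    "4MB": {
        "area_mm2": (30.075, 8.308),
        "latency_ns": (2.353, 2.016),
        "energy_nJ": (1.217, 1.206),
    },
}


def relative_reduction(sram, edram):
    """Fractional reduction of a metric when SRAM is replaced by eDRAM."""
    return 1.0 - edram / sram


for capacity, metrics in TABLE_II.items():
    for name, (sram, edram) in metrics.items():
        saving = 100.0 * relative_reduction(sram, edram)
        print(f"{capacity} {name}: {saving:.1f}% lower with planar eDRAM")
```

With these numbers the area reduction is a little over 70% at both capacities, while the access energy of the two memories differs by less than 5%, which motivates looking beyond drop-in replacement for energy savings.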

## IV. CASE STUDY: USING PLANAR EDRAM IN LDPC CODE DECODER

LDPC codes, invented by Gallager [4] in 1962 and "rediscovered" [7], [16] in 1996, have attracted much interest over the past decade because of their superior error correction capability and highly parallelizable decoding algorithms. Today, LDPC codes have been widely used in real-life digital communication and magnetic data storage systems. Due to its block-based and soft-decision decoding nature, LDPC code decoder demands a large amount of embedded memory. This naturally motivates us to investigate the potential of using planar eDRAM to improve the silicon implementation efficiency of LDPC code decoder.

## A. Basics and Straightforward SRAM Replacement

An LDPC code is defined as the null space of an  $M \times N$ sparse parity check matrix. It can be represented by a bipartite graph, between M check (or constraint) nodes in one set and N variable (or message) nodes in the other set. An LDPC code can be decoded by iterative message-passing decoding algorithms that are directly matched to the code bipartite graph. Being naturally friendly to efficient VLSI implementation, quasi-cyclic (QC) LDPC codes have been predominantly used in practice and their high-speed decoder VLSI implementations have been well studied (e.g., see [6], [9], [15], [17]–[19]). The parity check matrix of a QC-LDPC code can be written as

$$\mathbf{H} = \begin{bmatrix} \mathbf{H}_{1,1} & \mathbf{H}_{1,2} & \cdots & \mathbf{H}_{1,n} \\ \mathbf{H}_{2,1} & \mathbf{H}_{2,2} & \cdots & \mathbf{H}_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{H}_{m,1} & \mathbf{H}_{m,2} & \cdots & \mathbf{H}_{m,n} \end{bmatrix}$$

where each sub-matrix  $\mathbf{H}_{i,j}$  is a  $p \times p$  circulant matrix. A circulant matrix is a matrix in which every row is a cyclically shifted version of the previous row. The cyclic structure greatly simplifies the decoder hardware implementation. Given such an  $(m \cdot p) \times (n \cdot p)$  code parity check matrix, we can straightforwardly obtain a partially parallel decoder [19] with a folding factor of p as illustrated in Fig. 5. The computation of p consecutive rows (or check nodes) and columns (or variable nodes) are handled by one check node unit (CNU) and variable node unit (VNU), respectively, in a time-division multiplexed manner. All the input and decoding messages are stored in an array of memory blocks in a correspondingly distributed manner, which directly matches to the partially parallel computation.
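As a small illustration of why the cyclic structure simplifies hardware, the sketch below builds one circulant sub-matrix and shows that the positions of the ones in any row follow from the first-row offsets by modular addition, so a decoder only needs the offsets and a counter. The size `p = 8` and the offsets `{1, 4}` are hypothetical values chosen for readability.

```python
def circulant(p, first_row_ones):
    """Build a p x p binary circulant matrix.

    Row 0 has ones at the column indices in first_row_ones; every later
    row is the previous row cyclically shifted right by one position.
    """
    first = [1 if j in first_row_ones else 0 for j in range(p)]
    return [[first[(j - i) % p] for j in range(p)] for i in range(p)]


def is_circulant(mat):
    """Check that each row equals the previous row rotated right by one."""
    return all(mat[i] == mat[i - 1][-1:] + mat[i - 1][:-1]
               for i in range(1, len(mat)))


def ones_in_row(i, p, first_row_ones):
    """Column indices of the ones in row i, computed from the offsets alone."""
    return sorted((c + i) % p for c in first_row_ones)


p = 8             # hypothetical small size for illustration
offsets = {1, 4}  # hypothetical offsets giving row and column weight 2
H_ij = circulant(p, offsets)

print(is_circulant(H_ij))
print(ones_in_row(3, p, offsets))
for row in H_ij:
    print("".join(str(bit) for bit in row))
```

In a partially parallel decoder with folding factor p, the row index plays the role of the time step, so the same modular addition serves as the memory address generator for each memory block.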



Fig. 5. Illustration of a partially parallel QC-LDPC code decoder.

Based upon this partially parallel decoder architecture, we design an LDPC code decoder with the following configurations. The QC-LDPC code rate is 15/16 and each codeword contains 4k-byte user data. The code parity check matrix is $(2 \cdot 1024) \times (32 \cdot 1024)$ and each $1024 \times 1024$ circulant matrix has a column (and row) weight of 2. Each input and decoding message is represented using 6 bits, hence the decoder contains about 1Mb memory in total. We set the decoding iteration number as 16. Clearly, if we use planar eDRAM instead of conventional SRAM, we can expect a noticeable silicon area reduction. To ensure a fair comparison, we use CACTI to model both the SRAM-based and planar eDRAM-based memory blocks at the 45nm node. We set the target decoding throughput as 1.6Gbps at 16 decoding iterations, hence each decoding iteration takes $1.45\mu$s. For the planar eDRAM, we use the NMOS-NMOS cell structure shown in Fig. 1(a), and choose the width of the storage NMOS transistor as 1. This leads to a $3\mu$s retention time, which is longer than the duration of one decoding iteration. Since all the decoding messages are updated every decoding iteration, we can clearly eliminate the refresh operations for the decoding message storage eDRAM. Table III summarizes the implementation results of the memory sub-system in this high-speed LDPC code decoder, when using either SRAM or planar eDRAM.
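The memory size and the refresh argument above can be checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope estimate under two assumptions that are not spelled out in the text: one decoding message is stored per edge of the bipartite graph, and one input message is stored per variable node. All variable names are illustrative.

```python
# Code parameters taken from the text.
m, n, p = 2, 32, 1024     # block rows, block columns, circulant size
circulant_weight = 2      # row and column weight of each circulant
bits_per_message = 6      # precision of input and decoding messages

# Assumption: one input message per variable node.
num_variable_nodes = n * p
# Assumption: one decoding message per edge (nonzero entry of H).
num_edges = m * n * p * circulant_weight

total_bits = (num_variable_nodes + num_edges) * bits_per_message
print(f"estimated memory: {total_bits} bits = {total_bits / 2**20:.2f} Mb")

# Refresh check for the decoding message memory: every decoding message
# is rewritten once per iteration, so it must survive one iteration.
retention_us = 3.0        # Table I, cell (a), normalized width 1
iteration_us = 1.45       # duration of one decoding iteration (from the text)
refresh_free = iteration_us < retention_us
print(f"decoding messages need no refresh: {refresh_free}")
```

Under these assumptions the estimate comes to roughly 0.94 Mb, consistent with the figure of about 1Mb quoted above. The refresh check applies only to the decoding messages; it makes no claim about the input messages, whose lifetime spans all decoding iterations.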

TABLE III METRICS OF MEMORY SUB-SYSTEM IN LDPC CODE DECODER.

|                        | SRAM   | eDRAM  |
|------------------------|--------|--------|
| Area ($mm^2$)          | 6.8257 | 1.7711 |
| Power consumption (mW) | 112    | 96     |

The results in Table III clearly suggest an attractive silicon area reduction potential if we simply replace the on-chip SRAM with planar eDRAM. This leads to about 74% saving of the memory sub-system silicon area. In addition, the memory access power consumption is also modestly reduced. Beyond the straightforward drop-in replacement, we can further exploit the memory data access characteristics in LDPC code decoding to optimize the planar eDRAM implementation. In the next subsection, we will present a simple application-specific eDRAM optimization scheme for reducing eDRAM energy consumption.
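For reference, the savings follow directly from the entries of Table III:

$$1 - \frac{1.7711}{6.8257} \approx 0.74, \qquad 1 - \frac{96}{112} \approx 0.14$$

That is, the memory sub-system area shrinks by about 74%, while the power consumption drops by about 14%.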

## B. Application-Specific eDRAM Optimization

We propose a simple yet effective approach, referred to as page-mode read/write interleaving, that exploits the memory data access characteristics in LDPC code decoding to further reduce planar eDRAM energy consumption. In iterative LDPC code decoding, the decoding messages stored on each memory wordline are consecutively read, recalculated by CNUs or VNUs, and written back. Once all the decoding messages on one wordline have been updated once, we move to the next wordline. Suppose each time a group of s-bit decoding messages is read, recalculated, and written back, and each wordline contains t groups. In a straightforward manner, as illustrated in Fig. 6(a), memory data access associated with each wordline is accomplished by issuing t successive pairs of read and write commands. Since both DRAM read and write commands incur the activation and write-back of the entire wordline, such a straightforward data access strategy results in a total of 2t wordline activations and 2t wordline write-backs for each wordline during each decoding iteration. This tends to lead to relatively high energy consumption overhead.

Fig. 6. Flow diagram of (a) standard read/write operations, and (b) proposed page-mode interleaved read/write.

Intuitively, the memory data access locality and regularity in LDPC code decoding can be leveraged to reduce the number of wordline activation and write-back operations and hence reduce the memory energy consumption. The most straightforward solution is to employ the page-mode DRAM data access as follows:

- 1) We first read an entire wordline through the page-mode read access and temporarily store all the *t* groups of decoding messages in a set of register files;
- 2) The decoding computation units only need to read the decoding messages from the register files and write the recalculated decoding messages back to the register files during *t* cycles;
- 3) Finally, we write the updated *t* groups of decoding messages back to the wordline through page-mode write access.

Clearly, such a page-mode data access strategy can reduce the number of wordline activations and write-back operations by a factor of 2t. Nevertheless, the extra register files incur silicon area overhead. In order to eliminate this silicon area overhead, we further propose to modify the normal eDRAM control policy so that it can support *interleaved* page-mode read/write operations. We note that each eDRAM sub-array contains an array of sense amplifiers that hold the data of an entire wordline after activation and before write-back. Intuitively, we can utilize this existing array of sense amplifiers to emulate the register files in the above solution. To support such operation, the page-mode read and write must be interleaved, and the corresponding memory data access can be described as follows and illustrated in Fig. 6(b):

- 1) We first read an entire wordline through page-mode read access and simply let the sense amplifiers hold all the data;
- 2) Every cycle, the decoding computation units read one group of s-bit decoding messages from s sense amplifiers, recalculate them, and write them back to the same s sense amplifiers;
- 3) After t cycles, the data stored in all the sense amplifiers have been updated and are written back once to the wordline.
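The difference between the access flows in Fig. 6(a) and Fig. 6(b) can be sketched with a small behavioral model that counts wordline activations and write-backs. This is an illustrative sketch only: the `update` function is a hypothetical stand-in for the CNU/VNU recalculation, each group is modeled as one integer, and `t = 42` assumes one decoding message per group.

```python
def update(group):
    """Hypothetical stand-in for the CNU/VNU recalculation of one group."""
    return (group + 1) % 64


def standard_access(wordline):
    """Fig. 6(a): t read/write command pairs, each command activating
    and writing back the entire wordline."""
    cells = list(wordline)
    activations = write_backs = 0
    for g in range(len(cells)):
        # Read command: activate the wordline, sense one group, restore.
        activations += 1
        value = cells[g]
        write_backs += 1
        # Write command: activate again, overwrite one group, write back.
        activations += 1
        cells[g] = update(value)
        write_backs += 1
    return cells, activations, write_backs


def interleaved_page_mode(wordline):
    """Fig. 6(b): one activation, t in-place updates in the sense
    amplifiers, then a single write-back."""
    sense_amps = list(wordline)   # step 1: activation latches the whole row
    activations = 1
    for g in range(len(sense_amps)):
        # Step 2: read one group from the sense amplifiers, recalculate,
        # and write it back to the same sense amplifiers.
        sense_amps[g] = update(sense_amps[g])
    cells = list(sense_amps)      # step 3: one write-back to the wordline
    write_backs = 1
    return cells, activations, write_backs


t = 42                            # groups per wordline (assumed one message each)
row = list(range(t))
std_cells, std_act, std_wb = standard_access(row)
pm_cells, pm_act, pm_wb = interleaved_page_mode(row)

print("same result:", std_cells == pm_cells)
print("standard   :", std_act, "activations,", std_wb, "write-backs")
print("interleaved:", pm_act, "activation,", pm_wb, "write-back")
```

Both flows leave the wordline in the same state, while the interleaved flow reduces the counts from 2t to 1. The model counts operations only and says nothing about their energy, which depends on the sub-array and data routing.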

To evaluate the energy saving potential, we further modified the CACTI-based eDRAM modeling tool to support this simple page-mode read/write interleaving design strategy. Each eDRAM wordline stores 42 decoding messages. The results show that the energy consumed by updating all the decoding messages stored in one wordline is 4.13nJ. In comparison, if we use the conventional design practice, the energy consumption is 19.92nJ. We note that the gain in energy consumption is not linearly proportional to the decrease from 42 individual reads to a single page mode read. This is because this approach can only reduce the energy consumed by the individual memory sub-array, but data routing in memory still consumes the same amount of energy.
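Expressed as ratios, the two reported energy figures give

$$\frac{19.92\,\text{nJ}}{4.13\,\text{nJ}} \approx 4.8, \qquad 1 - \frac{4.13}{19.92} \approx 0.79$$

so the energy per wordline update drops by about 79%, a factor of roughly 4.8 rather than 42.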

## V. CONCLUSION

In this paper, we advocate the use of high-density planar eDRAM in memory-intensive signal processing circuits. Because of their data streaming nature and regular and predictable data access, many signal processing algorithms can readily embrace the very short retention time of planar eDRAM. In addition, the large algorithm/architecture design space inherent in most signal processing applications can be leveraged to maximize the potential benefits of using planar eDRAM. We use LDPC code decoding as the test vehicle to demonstrate the potentials of using planar eDRAM instead of conventional SRAM in memory-intensive signal processing circuits. To facilitate the case study, we carry out SPICE simulations to characterize planar eDRAM memory cells at 45nm and further develop a CACTI-based planar eDRAM modeling tool. Based upon this memory modeling tool, we show that straightforwardly replacing SRAM with planar eDRAM can largely reduce the silicon cost. Beyond such straightforward drop-in replacement, we further develop an interleaved read/write page-mode DRAM operation to reduce planar eDRAM energy consumption for LDPC code decoding.

### ACKNOWLEDGMENT

This research was funded in part by grants from the Important National Science & Technology Specific Projects of China (No. 2010ZX01032-001-001-5) and the National Science Foundation (No. 0823971) of USA.

#### REFERENCES

- [1] R. Balasubramonian, N. Muralimanohar and N. Jouppi, "Cacti: A tool to model large caches," http://www.hpl.hp.com/techreports/2009/HPL-2009-85.html, 2009.
- [2] J. Barth, W. Reohr *et al.*, "A 500MHz random cycle, 1.5 ns latency, SOI embedded DRAM macro featuring a three-transistor micro sense amplifier," *IEEE Journal of Solid State Circuits*, vol. 43, no. 1, pp. 86-95, 2008.
- [3] H.J. Cho, F. Nemati *et al.*, "A novel capacitor-less DRAM cell using thin capacitively-coupled thyristor (TCCT)," in *Proc. of IEEE International Electron Devices Meeting (IEDM)*, pp. 311-314, 2005.
- [4] R.G. Gallager, "Low-density parity-check codes," *IRE Transactions on Information Theory*, pp. 21-28, 1962.
- [5] W. Leung, F. Hsu and M.E. Jones, "The ideal SoC memory: 1T-SRAM," in *Proc. of IEEE International ASIC/SOC Conference*, pp. 32-36, 2000.
- [6] Z. Li, L. Chen *et al.*, "Efficient encoding of quasi-cyclic low-density parity-check codes," *IEEE Trans. on Communications*, vol. 54, no. 1, pp. 71-81, 2006.
- [7] D.J.C. MacKay and R.M. Neal, "Near Shannon limit performance of low density parity check codes," *Electronics Letters*, vol. 32, pp. 1645-1646, 1996.
- [8] R. Matick and S. Schuster, "Logic-based eDRAM: Origins and rationale for use," *IBM J. Res. & Dev.*, vol. 49, pp. 145-165, 2005.
- [9] L. Miles, J. Gambles and G. Maki, "An 860-Mb/s (8158,7136) low-density parity-check encoder," *IEEE Journal of Solid-State Circuits*, vol. 41, no. 8, pp. 1686-1691, 2006.
- [10] MoSys Inc. http://www.mosys.com/.
- [11] S. Okhonin, M. Nagoga *et al.*, "New generation of Z-RAM," in *Proc. of IEEE International Electron Devices Meeting (IEDM)*, pp. 925-928, 2007.
- [12] D. Somasekhar, S.L. Lu *et al.*, "Planar 1T-cell DRAM with MOS storage capacitors in a 130nm logic technology for high density microprocessor caches," in *Proc. of IEEE Solid State Circuits Conf., ESSCIRC*, 2002.
- [13] D. Somasekhar, Y. Yibin *et al.*, "2GHz 2MB 2T gain cell memory macro with 128 GBytes/sec bandwidth in a 65 nm logic process technology," *IEEE Journal of Solid State Circuits*, vol. 44, no. 1, pp. 174-185, 2009.
- [14] G. Wang, K.C.H. Ho *et al.*, "A 0.127μm<sup>2</sup> high performance 65nm SOI based embedded DRAM for on-processor applications," in *Proc. of International Electron Devices Meeting (IEDM)*, pp. 1-4, 2006.
- [15] Z. Wang and Z. Cui, "A memory efficient partially parallel decoder architecture for quasi-cyclic LDPC codes," *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, vol. 15, no. 4, pp. 483-488, 2007.
- [16] N. Wiberg, "Codes and decoding on general graphs," *Ph.D. Dissertation, Linkoping University, Sweden*, 1996.
- [17] B. Xiang, R. Shen *et al.*, "An area-efficient and low-power multirate decoder for quasi-cyclic low-density parity-check codes," *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, vol. 18, no. 10, pp. 1447-1460, 2010.
- [18] K. Zhang, X. Huang and Z. Wang, "High-throughput layered decoder implementation for quasi-cyclic LDPC codes," *IEEE Journal on Selected Areas in Communications*, vol. 27, no. 6, pp. 985-994, 2009.
- [19] H. Zhong, T. Zhang and E.F. Haratsch, "Quasi-cyclic LDPC codes for the magnetic recording channel: code design and VLSI implementation," *IEEE Trans. on Magnetics*, vol. 43, no. 3, pp. 1118-1123, 2007.