

## **Kent Academic Repository**

Pham, Thinh Hung, Fahmy, Suhaib A and McLoughlin, Ian Vince (2016) Efficient Integer Frequency Offset Estimation Architecture for Enhanced OFDM Synchronization. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24 (4). pp. 1412-1420. ISSN 1063-8210.

## **Downloaded from**

https://kar.kent.ac.uk/55022/ The University of Kent's Academic Repository KAR

## The version of record is available from

https://doi.org/10.1109/TVLSI.2015.2453207

### This document version

Author's Accepted Manuscript


## Licence for this version

CC BY-NC-ND (Attribution-NonCommercial-NoDerivatives)


# Efficient Integer Frequency Offset Estimation Architecture for Enhanced OFDM Synchronization

Thinh Hung Pham, Student Member, IEEE, Suhaib A. Fahmy, Senior Member, IEEE and Ian Vince McLoughlin, Senior Member, IEEE

Abstract—An integer frequency offset (IFO), in orthogonal frequency-division multiplexing (OFDM) systems, causes a circular shift of the sub-carrier indices in the frequency domain. IFO can be mitigated through strict RF front-end design but this is challenging and expensive. Therefore, IFO is estimated and removed at baseband, allowing the RF front-end specification to be relaxed, thus reducing system cost. For applications susceptible to Doppler shift, and multi-standard radios requiring wide frequency range access, careful RF design may be insufficient without IFO estimation. This paper proposes a novel approach for IFO estimation with reduced power consumption and computational cost. A four-fold resource sharing architecture reduces computational cost, while a multiplierless technique and carefully optimised wordlengths yield further power reduction while maintaining good accuracy. The novel method is shown to achieve excellent performance, similar to the theoretically achievable bound. In fact, performance is significantly better than conventional techniques, while being much more efficient. When implemented for IEEE 802.16-2009, the proposed method saves 78% power over the conventional technique on low-power FPGA devices. The method is applicable to IEEE 802.11 and IEEE 802.22.

Index Terms-OFDM, digital signal processing, field programmable gate arrays.

#### I. INTRODUCTION AND RELATED WORK

OFDM is an effective modulation technique that has been widely adopted for both wired and wireless communication systems. However, OFDM performance is sensitive to receiver synchronisation. Carrier frequency offset (CFO), where the frequencies of the receiver and transmitter are mismatched, causes inter-carrier interference (ICI). CFO can increase as a result of the Doppler effect and/or local oscillator instability [1]. A large CFO is split into fractional (FFO) and integer (IFO) frequency offsets for estimation [2], [3], [4], [5], [6]. IFO causes a circular shift of the subcarriers in the frequency domain, while FFO results in ICI because of lost orthogonality between subcarriers. In many published works on OFDM synchronisation, CFO estimation is performed by coarse and fine frequency offset estimation. The fine frequency estimation estimates FFO whereas the coarse frequency estimation determines IFO [7]. Fig. 1 illustrates the FFO and IFO estimation in a typical OFDM system.

Nogami's method [8] estimates IFO by searching for a maximum correlation value between the known preamble symbol and a cyclically shifted version of the received preamble symbol in the frequency domain. The technique is simple, but its performance degrades significantly in frequency selective channels. Maximum Likelihood Estimation (MLE) has been used to estimate IFO [5], [6], based upon the observation of pilots from two consecutive OFDM symbols. A reduced complexity alternative is to only compute over one preamble OFDM symbol, using differential encoding among adjacent subcarriers and correlation estimation [2], [3]. Li et al. [9] proposed a method using a cross ambiguity function. This is based on an energy-detection metric, and can be expressed in terms of time domain signals. The method provides high accuracy and gives a full range estimation of the IFO in the presence of frequency selective fading, but the estimation requires an exhaustive search. Although there has been significant research on IFO estimation methods, this has primarily been restricted to studies in simulation.

T. H. Pham and S. A. Fahmy are with the School of Computer Engineering, Nanyang Technological University, Singapore (email: hung3@e.ntu.edu.sg). I. V. McLoughlin is with the School of Information Science and Technology, University of Science and Technology of China (USTC).

Field programmable gate arrays (FPGAs) have been used to implement software radio systems for over two decades, and represent an ideal platform that provides higher performance and lower power compared to processor-based software radio platforms [10]. Several authors have presented hardware implementations of OFDM-based systems in the literature [11], [12], [13], [14], but these implementations only estimate FFO (no IFO estimation is present). Avoiding the need for IFO estimation by restricting CFO to the range tolerated by FFO estimation results in a set of very strict constraints on the design of the RF front-end. Moreover, in state of the art cognitive radio platforms, a large CFO can result from the requirement to access multiple frequency bands [15], [16] and/or to operate under severe Doppler effects [4]. A CFO that exceeds the FFO estimation range (i.e. two sub-carrier spacings for 802.16) gives rise to an IFO. Many recently published works [17], [18], [19] address IFO estimation in theory, based on cross-correlation. However, because state of the art IFO estimation methods consume very large hardware resources, implementations of IFO estimation are lacking in published OFDM-based hardware platforms.



Fig. 1. Baseband processing block diagram.

In this paper, we propose an efficient method for hardware implementation of IFO estimation. We show that this novel method and its implementation allow accurate IFO estimation to be achieved on low-power FPGAs with minimal hardware cost, resulting in relaxed front end design constraints, leading to potential cost savings in the RF electronics, and higher robustness in high mobility applications.

This paper is organized as follows: Section II presents IFO estimation and the signal model. In Section III, the proposed method is introduced and justified in comparison to previous work, while Section IV details simulations used to evaluate the method against conventional approaches. In Section V, the proposed IFO estimator method is implemented on FPGA; resource requirements and power consumption issues are discussed and analysed. Section VI concludes the paper.

#### II. INTEGER FREQUENCY OFFSET ESTIMATION

We consider a signal x(n) of an OFDM system with inverse fast Fourier transform (IFFT) length equal to N. Assuming the signal is transmitted over a frequency selective channel that has channel impulse response (CIR) h with length,  $L_h$ , and that it is also corrupted by additive white Gaussian noise (AWGN), the received signal, with frequency offset and timing offset, can be expressed in the time domain as

$$y(n) = \sum_{l=0}^{L_h - 1} h(l)x(n - \tau - l)e^{i(2\pi\xi \frac{n - \tau}{N} + \phi_0)} + w(n)$$
 (1)

where w(n) denotes AWGN in the time domain,  $\tau$  and  $\phi_0$  are the residual timing offset (RTO) and error phase, respectively, and  $\xi$  is the normalised CFO that can be divided into a fractional (FFO) part  $\lambda$  and an integer (IFO) part  $\epsilon$  as  $\xi = \lambda + \epsilon$ .
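To make the role of the integer part  $\epsilon$  concrete, the following minimal sketch checks numerically that a purely integer offset in the time domain moves each frequency bin by  $\epsilon$  positions. It assumes an ideal channel, no noise,  $\tau = 0$  and  $\phi_0 = 0$; the 16-point test symbol and the helper names `dft` and `apply_cfo` are hypothetical and chosen only for illustration.

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT: X(k) = sum_n x(n) exp(-i 2 pi k n / N)."""
    n_fft = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_fft)
                for n in range(n_fft))
            for k in range(n_fft)]

def apply_cfo(x, xi):
    """Rotate sample n by exp(i 2 pi xi n / N); xi is in sub-carrier spacings."""
    n_fft = len(x)
    return [x[n] * cmath.exp(2j * cmath.pi * xi * n / n_fft)
            for n in range(n_fft)]

# Hypothetical 16-point test symbol (arbitrary values, not a standard preamble).
N_FFT = 16
x = [complex((3 * n) % 5 - 2, (7 * n) % 3 - 1) for n in range(N_FFT)]
EPS = 3  # integer offset, in sub-carrier spacings

X = dft(x)
Y = dft(apply_cfo(x, EPS))

# With a purely integer offset, bin k of Y holds bin (k - EPS) of X.
max_err = max(abs(Y[k] - X[(k - EPS) % N_FFT]) for k in range(N_FFT))
```

Here `max_err` stays at the level of floating-point rounding, which is the circular shift described in the introduction.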

This paper focusses solely on IFO estimation, as FFO is assumed to be compensated by earlier stages of synchronisation, as has been investigated in detail by other authors [20], [4]. The received preamble symbol at FFT output is

$$Y(k) = e^{i(\phi_0 - 2\pi \frac{\tau k}{N})} H(k - \epsilon) X(k - \epsilon) + W(k)$$
 (2)

where W(k) and H(k) are the frequency domain representations of AWGN and CIR, respectively. As mentioned previously, IFO results in a cyclic shift in the frequency domain. By contrast, RTO causes a linear phase rotation on samples in the frequency domain. Based on a differential demodulation of the FFT output, the IFO can be conventionally determined with robustness to frequency selective channel and RTO using the correlation function [2] expressed by:

$$\hat{\epsilon} = \underset{\tilde{\epsilon}}{\operatorname{argmax}} \left| \sum_{k=1}^{N} Y^*(k-1)Y(k)X^*(k-\tilde{\epsilon})X(k-1-\tilde{\epsilon}) \right|$$
(3)

where  $(.)^*$  denotes complex conjugation,  $\hat{\epsilon}$ ,  $\tilde{\epsilon}$  are estimated and trial values of  $\epsilon$ , respectively, Y(k) and X(k) denote the  $k^{th}$  frequency symbol index of the received symbol and the known transmitted preamble, respectively, and the symbol size N is equal to the FFT size.
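As an illustration of (3), the sketch below evaluates the differential correlation metric over a set of trial shifts and returns the maximiser. It is a simplified model rather than the implementation evaluated in this paper: the preamble is a hypothetical random QPSK sequence, the channel is flat and noiseless, indices are taken modulo N, and the values of `EPS`, `TAU` and `PHI0` are arbitrary.

```python
import cmath
import random

def estimate_ifo(Y, X, trials):
    """Return the trial e maximising |sum_k Y*(k-1) Y(k) X*(k-e) X(k-1-e)|."""
    n = len(Y)

    def metric(e):
        acc = 0j
        for k in range(n):
            acc += (Y[k - 1].conjugate() * Y[k]
                    * X[(k - e) % n].conjugate() * X[(k - 1 - e) % n])
        return abs(acc)

    return max(trials, key=metric)

random.seed(1)
N_FFT = 64
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
X = [random.choice(QPSK) for _ in range(N_FFT)]  # hypothetical preamble

EPS, TAU, PHI0 = 8, 5, 0.3  # true IFO, residual timing offset, phase error
# Received symbol following (2) with H(k) = 1 and no noise.
Y = [cmath.exp(1j * (PHI0 - 2 * cmath.pi * TAU * k / N_FFT))
     * X[(k - EPS) % N_FFT] for k in range(N_FFT)]

eps_hat = estimate_ifo(Y, X, trials=range(-12, 17, 4))
```

Because the product  $Y^*(k-1)Y(k)$  turns the linear phase caused by RTO into a constant phase, taking the magnitude of the sum leaves the peak at the true shift in this toy case.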

The estimated IFO can achieve high precision using cross-correlation in the frequency domain. However, implementing cross-correlation clearly involves a significant hardware overhead, with a multiplier needed for each element in the cross-correlation. Sign-bit cross-correlation [21] is a widely adopted approach to reducing correlation complexity using only the most significant bit (MSB) of signed numbers in the correlation computation. In this way, complexity is reduced at the cost of performance degradation. Despite the adoption of such methods, cross-correlation remains computationally expensive, especially when dealing with a large FFT size. It should be noted here that several IFO estimation methods have been published which claim robustness to frequency selective channels and RTO. However, published FPGA implementations of these methods are lacking to date, possibly because the hardware costs are considerable, even when adopting a sign-bit cross-correlation approach.

#### III. PROPOSED IFO ESTIMATION METHOD

Our proposed method saves hardware resources and reduces power consumption by exploiting redundancy in the conventional approach to computing IFO. This enables an efficient resource sharing folded architecture to be adopted. Adjusting the precision of individual correlation computations within this novel architecture leads to a fine degree of control over the trade-off between performance and power consumption. Thanks to the significant hardware cost reduction achieved, this IFO estimator can feasibly be implemented on a low-power, limited-resource FPGA, while synchronisation performance is maintained by trading off accuracy against hardware cost.

In [23], the authors investigated multiplierless correlation based on a conventional transpose form structure and demonstrated a trade-off between cost and accuracy when applying it to OFDM timing synchronisation. In [20], the authors presented a method for OFDM timing synchronisation and fractional frequency estimation. That timing synchronisation method performs well even with a CFO large enough to give rise to an IFO. However, the IFO is not estimated at that stage. The method operates on time domain samples before the FFT. To implement an OFDM baseband system that can tolerate a large CFO, IFO estimation needs to be implemented after the FFT, where it operates on the data symbol sequence in the frequency domain. In this paper, we present an algorithm for an efficient IFO estimation architecture. The proposed method allows an IFO estimator circuit to be implemented with low hardware consumption, enhancing the robustness of the OFDM baseband system to large CFO.

For our case study, we consider the long preamble of IEEE 802.16-2009 [22] for estimating IFO values. The preambles defined in other recent standards (such as 802.11, 802.22) have a similar structure and so the proposed method can also be applied to those standards.

#### A. Proposed Algorithm

Firstly, we assume that the RF front end can provide CFO stability in a range,  $R_{CFO}$ , from -14 to +18 sub-carrier spacings, which is greatly relaxed compared to the strict RF front end constraints in 802.16 that would typically otherwise lead to increased RF hardware costs. Recall that in many practical implementations, IFO estimation is avoided by restricting the CFO range to within that tolerated by the FFO estimator; a typically much tighter bound of -2 to +2 sub-carrier spacings, as we will show below. A larger IFO estimation range allows the baseband system to tolerate more CFO. Wider CFO tolerance, in turn, leads to more relaxed RF front-end specifications, reducing system cost.

Calculating correlation for all IFO values in the assumed range is not necessary. Instead, we take advantage of the FFO estimation and correction that are performed prior to IFO estimation, as shown in Fig. 1, resulting in a reduced set of possible IFO values, as shown in the following. The metric M in (4) is widely employed for FFO estimation in many recent standard systems. The metric is computed on the short preamble, which consists of periodic durations with length D [20]:

$$\begin{aligned} M(n) &= \sum_{m=0}^{D-1} s^*(n+m)\,s(n+m+D) \\ &= e^{i2\pi\xi \frac{D}{N}} \sum_{m=0}^{D-1} |x(n+m)|^2, \end{aligned} \tag{4}$$

where s(n) denotes the received signal with carrier frequency offset  $\xi$  with respect to x(n).  $\xi$  consists of two parts that are estimated based on the angle of M ( $\angle M$ ):

$$\hat{\xi} = \hat{\lambda} + \hat{\epsilon} = \frac{\angle M + 2\pi z}{2\pi \frac{D}{N}},\tag{5}$$

where z is an integer. The FFO is estimated as  $\hat{\lambda} = \frac{N \angle M}{2\pi D}$ . The remaining part after FFO is corrected is the IFO, denoted as  $\hat{\epsilon} = \frac{zN}{D}$ .  $\angle M$  is within the range  $-\pi$  to  $\pi$  and for many standards (such as 802.11, 802.16, and 802.22),  $\frac{N}{D} = 4$ . Hence, FFO is estimated in the range -2 to 2 sub-carrier spacings and the IFO can be expressed as  $\hat{\epsilon} = 4z \in R_{CFO}$ . Hence, there are 8 possible values for the IFO after correcting FFO, given the assumed CFO range. The possible IFO values are denoted  $S_{IFO} = \{-12, -8, -4, 0, 4, 8, 12, 16\}$ .
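The split in (5) can be checked numerically. The sketch below is an idealised illustration only: it builds a hypothetical periodic short preamble (two identical periods of a unit-magnitude sequence), applies an arbitrary CFO of 5.3 sub-carrier spacings with no noise or channel, and evaluates M at n = 0. The names `ffo_from_metric`, `period` and `XI` are assumptions made for this example.

```python
import cmath

def ffo_from_metric(s, n_fft, d):
    """Estimate the FFO from M = sum_m s*(m) s(m + D), evaluated at n = 0."""
    m_metric = sum(s[m].conjugate() * s[m + d] for m in range(d))
    return n_fft * cmath.phase(m_metric) / (2 * cmath.pi * d)

N_FFT, D = 256, 64   # N / D = 4, as for the standards named above
XI = 5.3             # hypothetical true CFO, in sub-carrier spacings

# Hypothetical short preamble: two identical periods of length D.
period = [cmath.exp(2j * cmath.pi * (m * m) / D) for m in range(D)]
x = period + period
s = [x[n] * cmath.exp(2j * cmath.pi * XI * n / N_FFT) for n in range(2 * D)]

lam_hat = ffo_from_metric(s, N_FFT, D)  # part of the CFO visible in the angle of M
eps_left = XI - lam_hat                 # remainder left for the IFO stage
```

In this toy case the angle of M only resolves 1.3 of the 5.3 spacings, and the remainder of 4 is a multiple of N/D, as expected from  $\hat{\epsilon} = 4z$ .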

If the IFO value is positive, the data symbols are rotated left by a number of symbols. Otherwise, if the IFO value is negative, the starting data symbol is circularly shifted right and positioned among the last symbols of the OFDM symbol, as shown in Fig. 2(a). To compensate a positive IFO value, only a small buffer, shown in Fig. 2(c), is required to store the first few data symbols and correct the order, whereas compensating a negative IFO value requires storing almost all the data symbols in the OFDM symbol, as shown in Fig. 2(b).

Samples are pre-offset by 12 sub-carrier spacings prior to calculation to ensure that all possible values of IFO are positive:  $S'_{IFO} = \{0:4:28\}$ . This means that received symbols will only ever need to be shifted right to compensate IFO, thus reducing buffer memory requirements.
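The candidate sets can be enumerated directly from the assumed CFO range. The short sketch below only restates the arithmetic above; the constant names are chosen for this example.

```python
import math

RATIO = 4           # N / D
R_CFO = (-14, 18)   # assumed CFO range, in sub-carrier spacings
PRE_OFFSET = 12     # pre-offset applied before estimation

z_min = math.ceil(R_CFO[0] / RATIO)
z_max = math.floor(R_CFO[1] / RATIO)

s_ifo = [RATIO * z for z in range(z_min, z_max + 1)]   # S_IFO
s_ifo_shifted = [e + PRE_OFFSET for e in s_ifo]        # S'_IFO
```

This yields the eight values from -12 to 16 before the pre-offset and the values 0 to 28 in steps of 4 after it.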

Fig. 2. IFO Correction. (a) Rotated data symbols caused by IFO.

Secondly, a resource sharing folded architecture is designed to significantly reduce hardware cost. Conventionally, to obtain high accuracy, IFO estimation is computed across all pilots in the preamble. This results in considerable hardware overhead, especially with a large number of pilots, as is the case for IEEE 802.16-2009. The 802.16 long preamble includes  $N_P = 100$  pilots, as illustrated in Fig. 3. These pilots are distributed, 50 pilots per side, at even sub-carrier spacings from 2 to 100 and from 156 to 254. The remaining sub-carriers are null. Evidently, as the number of pilots used increases, the correlation result shows greater robustness to noise. However, using more pilots represents an increase in hardware cost.

Detailed simulations can help us explore the impact of using a reduced pilot set. We demonstrate in Fig. 4 that the beneficial effect of calculating across additional pilots plateaus early.

We therefore propose making use of only a subset of pilots, while maintaining estimation accuracy.

Then, by spreading the chosen subset of pilots carefully in the data symbol sequence in the frequency domain, it becomes possible to share resources when computing the cross-correlation, hence reducing area and power consumption.

When the proposed method is applied to IEEE 802.16-2009 offset estimation, the pilots used for the IFO computation are selected at sub-carrier indices that are multiples of 4, leading to a natural four-fold resource sharing architecture. Hence, the IFO estimation can be expressed as:

$$\hat{\epsilon} = \underset{\tilde{\epsilon} \in S'_{IFO}}{\operatorname{argmax}} |V_{\tilde{\epsilon}}|,$$

$$V_{\tilde{\epsilon}} = \sum_{k=1}^{N_P/4} P(4k)A1_{\tilde{\epsilon}}(k) + P(L+4k)A2_{\tilde{\epsilon}}(k), \quad (6)$$

where  $V_{\tilde{\epsilon}}$  is the cross-correlation between received pilots and pre-rotated known pilots and  $P(4k) = Y^*(4k-2)Y(4k)$  denotes the correlation of two consecutive received pilots. Since the pilots of the long preamble are distributed on two sides of the OFDM symbol in the frequency domain, at even sub-carrier spacings, L denotes the index of the first pilot in the second half.  $A1_{\tilde{\epsilon}}(k) = X^*(4k-2-\tilde{\epsilon})X(4k-\tilde{\epsilon})$  and  $A2_{\tilde{\epsilon}}(k) = X^*(L+4k-2-\tilde{\epsilon})X(L+4k-\tilde{\epsilon})$  denote the correlation of two consecutive pre-rotated known pilots of the first side and second side, respectively, of the preamble symbol corresponding to one IFO value  $(\tilde{\epsilon})$ . Let  $A_{\tilde{\epsilon}}$  be a known coefficient set,  $A_{\tilde{\epsilon}} = \{A1_{\tilde{\epsilon}}, A2_{\tilde{\epsilon}}\}$ , and let Si denote the set of used pilot indices for the proposed method (i.e.  $Si = \{(4:4:\frac{Np}{2}), (L:4:N)\}$ ). Algorithm 1 computes the cross-correlation operations concurrently.

Fig. 3. Pilots in the long preamble of IEEE 802.16-2009.

Fig. 4. Fail rate of IFO estimation for different numbers of used pilots in AWGN channel.

#### **Algorithm 1** IFO correlation computation algorithm.

```
Init: k = 0; n = 0
repeat every 4 cycles
    if 4k \in Si then
        Calculate P_{4k}
        for each \tilde{\epsilon} \in S1'_{IFO} = \{0, 4, 8, 12\} do
            V_{\tilde{\epsilon}} += P_{4k} A_{\tilde{\epsilon}}(n)
        end for
        for each \tilde{\epsilon} \in S2'_{IFO} = \{16, 20, 24, 28\} do
            V_{\tilde{\epsilon}} += P_{4k} A_{\tilde{\epsilon}}(n)
        end for
        n += 1
    end if
    k += 1
until k > \frac{N_p}{2}
```
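A software model can help in following the schedule of Algorithm 1. The sketch below is a functional illustration under several assumptions, not the hardware design itself: the preamble is a hypothetical random QPSK sequence placed on the pilot positions described above (not the actual 802.16 sequence), the channel is flat and noiseless, N = 256 is assumed, the loop visits every fourth index instead of testing membership of Si (null positions contribute zero), and the known coefficient is conjugated before accumulation, following the sign pattern of (7) and (8).

```python
import random

N_FFT = 256
S1 = [0, 4, 8, 12]     # trials served by the first MAC
S2 = [16, 20, 24, 28]  # trials served by the second MAC

# Hypothetical preamble: random QPSK on even positions 2..100 and 156..254,
# nulls elsewhere. This is NOT the real 802.16 long preamble sequence.
random.seed(7)
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
PILOTS = list(range(2, 101, 2)) + list(range(156, 255, 2))
X = [0j] * N_FFT
for idx in PILOTS:
    X[idx] = random.choice(QPSK)

def known_coeff(idx, trial):
    """Known-pilot product at received index idx: X*(idx-2-trial) X(idx-trial)."""
    return X[(idx - 2 - trial) % N_FFT].conjugate() * X[(idx - trial) % N_FFT]

def folded_ifo(Y):
    """Accumulate all eight V values with two time-shared accumulators."""
    V = {trial: 0j for trial in S1 + S2}
    for idx in range(0, N_FFT, 4):           # one used pilot every 4 cycles
        p = Y[idx - 2].conjugate() * Y[idx]  # P_{4k}
        for cycle in range(4):               # the 4 cycles before the next pilot
            t1, t2 = S1[cycle], S2[cycle]
            V[t1] += known_coeff(idx, t1).conjugate() * p  # MAC1
            V[t2] += known_coeff(idx, t2).conjugate() * p  # MAC2
    return max(V, key=lambda t: abs(V[t]))

TRUE_SHIFT = 20  # pre-offset IFO, one of the eight candidates
Y = [X[(k - TRUE_SHIFT) % N_FFT] for k in range(N_FFT)]
shift_hat = folded_ifo(Y)
```

Each pass of the inner loop corresponds to one clock cycle in which the two accumulators each update one of the eight correlations, so all eight are refreshed before the next used pilot arrives.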

The received pilots whose indices are in Si are employed to compute  $P_{4k}$ .

Assuming that the data symbols are received from the FFT in sequence, sample by sample, the duration between these employed pilots is four clock cycles. For each value  $\tilde{\epsilon}$  in  $S1'_{IFO}$  and  $S2'_{IFO}$ , the corresponding  $V_{\tilde{\epsilon}}$  can be computed separately at every clock cycle using two multiply-accumulate blocks with the corresponding known coefficients. Hence, 8 cross-correlations can be computed in the duration between two employed pilots. The comparison with the conventional methods in terms of time-area complexity is shown in Table I.

TABLE I
THE COMPUTATION TIME COMPARISON.

| Method               | Number of MACs  | Number of Cycles         |
|----------------------|-----------------|--------------------------|
| Dedicated Processor  | 1               | $N_I \times N_P$         |
| Accelerated Hardware | $N_I$           | $N_P$                    |
| Proposed Algorithm   | $\frac{N_I}{D}$ | $D \times \frac{N_P}{2}$ |

where  $N_I$  is the number of possible IFO values and D denotes the duration between two employed pilots. For the proposed algorithm,  $N_I$  and D equal 8 and 4, respectively. The cross-correlation is based on the multiply-accumulate (MAC) operator, and we assume that one MAC can be performed in one computation cycle. A dedicated processor with a single MAC requires  $N_I \times N_P$  cycles to compute the IFO estimate. Accelerated hardware using  $N_I$  MACs reduces the computation time to  $N_P$  cycles, but at the cost of a large amount of hardware. In the proposed architecture, by contrast, only  $\frac{N_I}{D}$  MACs are shared to compute the  $N_I$  possible values of IFO in parallel, thanks to the spreading of the computed pilots over D clock cycles. As a result, the proposed method requires  $D \times \frac{N_P}{2}$  cycles for IFO estimation. Moreover, it takes N cycles to stream the data symbol sequence. Since the proposed algorithm operates on the data symbol stream and its computation time of  $D \times \frac{N_P}{2}$  cycles is smaller than this duration of N cycles, the IFO estimation of the proposed method adds no latency to the data symbol stream.
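For the parameter values quoted above, the entries of Table I evaluate as follows. The sketch simply substitutes  $N_I = 8$ ,  $N_P = 100$  and D = 4; the FFT size of 256 is an assumption consistent with the pilot indices given earlier.

```python
N_I, N_P, D = 8, 100, 4
N_FFT = 256  # assumed symbol length in samples

table = {
    "dedicated processor":  {"macs": 1,        "cycles": N_I * N_P},
    "accelerated hardware": {"macs": N_I,      "cycles": N_P},
    "proposed algorithm":   {"macs": N_I // D, "cycles": D * N_P // 2},
}

# The proposed schedule finishes within one streamed symbol if this holds.
fits_in_symbol = table["proposed algorithm"]["cycles"] < N_FFT
```

With these numbers the proposed schedule uses 2 MACs for 200 cycles, compared with 1 MAC for 800 cycles or 8 MACs for 100 cycles.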

Our contribution is to determine a way to efficiently share resources when computing  $V_{\tilde{\epsilon}}$ , as discussed in the next subsection.

Thirdly, although sign-bit cross-correlation is often used in conventional implementations to reduce computational complexity [4], [21], it also leads to reduced precision and hence reduced estimation performance, especially in the case of frequency selective channels. For this reason, we instead apply multiplierless correlation to enhance the accuracy of estimation compared to the sign-bit approach. In [23], the authors demonstrated a trade-off between cost and accuracy for multiplierless correlation in the case of OFDM timing synchronisation based on a conventional transpose form structure. We apply a similar technique to this new IFO estimation architecture by reducing the wordlength used to represent  $P_{4k}$  and investigating the impact on IFO estimation.



Fig. 5. Architectures of IFO estimators: (a) conventional approach, and (b) proposed method.

#### B. Proposed Architecture

Fig. 5(a) shows a conventional architecture for computing the IFO estimation, while Fig. 5(b) shows our proposed resource sharing architecture. The cross-correlators compute the values of  $V_{\tilde{\epsilon}}$ . The ArgMax module finds the maximum of the  $V_{\tilde{\epsilon}}$  values in order to identify the corresponding IFO estimate. The conventional approach in Fig. 5(a) employs separate cross-correlators to calculate each  $V_{\tilde{\epsilon}}$  result. This requires significant resources, including a large number of multipliers that may not be available on a limited-resource FPGA. The implementation in [4] replaces the multipliers by using sign-bit cross-correlations for the conventional architecture, significantly reducing computational complexity, but also decreasing estimation performance, especially in the case of frequency selective channels.

The novelty of our architecture is in achieving a significant hardware reduction by sharing resources across the IFO estimating correlations, thanks to the proposed algorithm presented in the previous subsection, while also adopting multiplierless correlation [23] to further save resources.

The novel architecture comprises two parts:

Sharing stored pilot memory: There are 8 sets of  $A_{\tilde{\epsilon}}$  corresponding to the 8 possible IFOs. These sets of  $A_{\tilde{\epsilon}}$  can conventionally be pre-computed and stored separately. The memory employed to store  $A_{\tilde{\epsilon}}$  is a dual-port register file. Thanks to the spreading of the computed pilots, the  $A_{\tilde{\epsilon}}$  sets have many identical pilots. This naturally allows sharing between pre-rotated pilot sets. Therefore, the *PilReg* block requires only 64 shared memory locations for the 8 sets instead of 400; an 84% reduction. Fig. 6 illustrates the  $A_{\tilde{\epsilon}}$  sets and circuitry for combining all  $A_{\tilde{\epsilon}}$  sets in the *PilReg* block.

Fig. 6. Circuit of known pilots shift register.

Fig. 7. Resource sharing approach for computing  $V_{\tilde{\epsilon}}.$
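One way to see where the sharing comes from is to count distinct coefficients. By its definition, the coefficient for trial  $\tilde{\epsilon}$  at received index 4k depends only on the difference  $4k - \tilde{\epsilon}$ , so the eight sets are shifted copies of one another. The sketch below counts the distinct differences per side of the preamble. It assumes used pilot indices 4 to 100 and 156 to 252 in steps of 4 and ignores wrap-around between the two sides; whether the *PilReg* block is organised exactly this way is not stated here.

```python
TRIALS = range(0, 29, 4)     # S'_IFO
SIDE_1 = range(4, 101, 4)    # used pilot indices, first side (assumed)
SIDE_2 = range(156, 253, 4)  # used pilot indices, second side (assumed)

# Storing every set separately: one location per (trial, used pilot) pair.
separate = len(TRIALS) * (len(SIDE_1) + len(SIDE_2))

# Shared storage: one location per distinct value of (idx - trial) on each side.
shared = sum(len({idx - t for idx in side for t in TRIALS})
             for side in (SIDE_1, SIDE_2))

saving = 1 - shared / separate
```

Under these assumptions the count reproduces the 400 separate locations, the 64 shared locations and the 84% reduction quoted above.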

Sharing correlation resources: The proposed method divides IFO estimation into multiple repeated computations with resource sharing based upon the four-sample timing between selected spread pilots. The pilots that are used to compute the correlation arrive every four cycles so there are three spare cycles between two consecutive computed pilots, allowing one multiply accumulate block to be scheduled to sequentially compute 4 separate correlations. Multiply accumulate blocks are shared among four sequential  $V_{\tilde{\epsilon}}$  computations over four successive clock periods. Fig. 7 demonstrates how this is done.  $P_k$  is received in every clock cycle.  $P_{4k}$  is the subsampling of  $P_k$ , taking a subset of the most significant bits from  $P_k$  every four cycles. The cross-correlation is performed with the values of  $P_{4k}$ . Two multiply accumulate blocks, MAC1 and MAC2, are used to compute the values of 8 cross-correlations  $V_{\tilde{\epsilon}}$  in parallel. Each multiplier performs multiplications sequentially between  $P_{4k}$  and the corresponding transmitted pilots in 4 sets of  $A_{\tilde{\epsilon}}$ . The products are accumulated to the values of  $V_{\tilde{\epsilon}}$ . The values of the correlation operations are stored separately in the corresponding buffers in the COR1 and COR2 blocks. When the correlation computation is complete, the maximum operation, argmax|V|, is performed on the 8  $V_{\tilde{\epsilon}}$  values to estimate the IFO. To obtain further resource savings, MAC1 and MAC2 are implemented using a multiplierless technique. The  $V_{\tilde{\epsilon}}$  formula in (7) is mathematically manipulated into what is effectively a multiply-accumulate form.
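The time multiplexing and the wordlength reduction described above can be modelled in a few lines. In the sketch below, the assignment of trials to cycle slots and the helper names `schedule` and `keep_msbs` are hypothetical; the text only states that each MAC serves four correlations over four cycles and that a subset of the most significant bits of  $P_k$  is kept.

```python
S1 = [0, 4, 8, 12]     # trials handled by MAC1
S2 = [16, 20, 24, 28]  # trials handled by MAC2

def schedule(cycle):
    """Trials served by (MAC1, MAC2) in a given clock cycle (assumed order)."""
    slot = cycle % 4
    return S1[slot], S2[slot]

def keep_msbs(value, total_bits, kept_bits):
    """Keep the kept_bits most significant bits of a signed integer."""
    return value >> (total_bits - kept_bits)  # arithmetic shift keeps the sign

# Trials touched during the four cycles between two used pilots.
served = [trial for cycle in range(4) for trial in schedule(cycle)]
```

Over any four consecutive cycles every one of the eight trials is served exactly once, and for example `keep_msbs(109, 8, 4)` reduces an 8-bit value to its top 4 bits.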

When one received sample is taken,  $V_{\tilde{\epsilon}}$  can be expressed in the form of accumulation:

$$\begin{aligned} V_{\tilde{\epsilon}} &= A_{\tilde{\epsilon}} P_{4k} + V_{\tilde{\epsilon}} \\ &= (\Re\{U\} - i\Im\{U\})(\Re\{P_{4k}\} + i\Im\{P_{4k}\}) + (\Re\{V_{\tilde{\epsilon}}\} + i\Im\{V_{\tilde{\epsilon}}\}), \end{aligned} \tag{7}$$

where  $\Re\{.\}$  and  $\Im\{.\}$  denote the real and imaginary parts, respectively.  $A_{\tilde{\epsilon}}$  are normalized to values U whose real and imaginary parts have values in  $\{-1,0,1\}$ , and the wordlength of  $P_{4k}$  and  $V_{\tilde{\epsilon}}$  in fixed point format can be adjusted to increase estimation accuracy at the cost of increased hardware resource consumption. The real and imaginary parts of  $V_{\tilde{\epsilon}}$  can then be computed:

$$\begin{aligned} \Re\{V_{\tilde{\epsilon}}\} &= \Re\{U\}\Re\{P_{4k}\} + \Im\{U\}\Im\{P_{4k}\} + \Re\{V_{\tilde{\epsilon}}\}, \\ \Im\{V_{\tilde{\epsilon}}\} &= \Re\{U\}\Im\{P_{4k}\} - \Im\{U\}\Re\{P_{4k}\} + \Im\{V_{\tilde{\epsilon}}\} \end{aligned} \tag{8}$$
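Because the real and imaginary parts of U take values in  $\{-1,0,1\}$ , each product in (8) reduces to passing, negating or zeroing an operand. The following sketch illustrates this with integer arithmetic and checks it against ordinary complex multiplication; the function names and test values are hypothetical and the code is a behavioural model, not the FPGA circuit.

```python
def select(u, p):
    """Return u * p for u in {-1, 0, 1} without using a multiplier."""
    if u == 1:
        return p
    if u == -1:
        return -p
    return 0

def mac_multiplierless(v_re, v_im, u_re, u_im, p_re, p_im):
    """One accumulation step of (8) using only selection, add and subtract."""
    new_re = select(u_re, p_re) + select(u_im, p_im) + v_re
    new_im = select(u_re, p_im) - select(u_im, p_re) + v_im
    return new_re, new_im

# Compare with conj(U) * P + V, the complex form in (7), for every allowed U.
P_RE, P_IM = 37, -12  # arbitrary fixed-point sample of P_{4k}
V_RE, V_IM = 5, 9     # arbitrary running accumulator value
mismatches = 0
for u_re in (-1, 0, 1):
    for u_im in (-1, 0, 1):
        ref = complex(u_re, u_im).conjugate() * complex(P_RE, P_IM) + complex(V_RE, V_IM)
        got = mac_multiplierless(V_RE, V_IM, u_re, u_im, P_RE, P_IM)
        mismatches += (complex(*got) != ref)
```

The count `mismatches` stays at zero, confirming that the add/subtract form matches the complex product for all nine coefficient values.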

We are able to demonstrate that the algorithm and structure optimisations mentioned above retain competitive estimation accuracy compared to conventional approaches, while also offering significant reductions in hardware resource usage. This makes it possible to implement a high-performance OFDM receiver on a low-power FPGA that has a limited number of DSP blocks. In the following sections, we test the modified algorithm and our proposed implementation in simulation for different channels, then discuss the hardware implementation in more detail.

#### IV. SIMULATION

Many variants of the proposed method were simulated in MATLAB using different channel models and the parameter set of the IEEE 802.16-2009 downlink preamble. Performance of the implementation was compared to the theoretical performance of some state of the art methods. This was assessed primarily in terms of the probability of failed estimation (POFE) with respect to channel SNR. POFE, which is widely used to evaluate the performance of IFO estimation [2], [3], [5], measures the number of failed estimations divided by the total number of IFO estimations. Overall, 100,000 IFO estimations were simulated in AWGN and Stanford University Interim (SUI) [24] frequency selective channels. IFO estimation is performed with non-ideal FFO compensation, and FFO is determined and compensated using the method of Kim and Park [4]. The simulation also verifies the performance of the proposed method under the effect of residual timing offset (RTO) caused by imperfect STO estimation (assuming that STO estimation is still within the CP and does not cause ISI). In addition, a randomly generated amount of STO is added in the range from 0 to  $N_{CP} - L_h - 1$  (i.e. the RTO is in range from 0 to  $N_{CP} - L_h - 1$ ).
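The POFE measure can be illustrated with a small Monte Carlo loop. The sketch below is not the MATLAB simulation reported here: it uses the conventional metric of (3) on a hypothetical random QPSK preamble of length 64, an AWGN channel only, no RTO or FFO, and far fewer runs. All function names and default values are assumptions made for the example.

```python
import math
import random

def estimate_ifo(Y, X, trials):
    """Conventional estimator following (3), restricted to the given trials."""
    n = len(Y)

    def metric(e):
        return abs(sum(Y[k - 1].conjugate() * Y[k]
                       * X[(k - e) % n].conjugate() * X[(k - 1 - e) % n]
                       for k in range(n)))

    return max(trials, key=metric)

def pofe(snr_db, runs, rng, n_fft=64, trials=tuple(range(-12, 17, 4))):
    """Fraction of runs in which the estimated IFO differs from the true one."""
    qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
    X = [rng.choice(qpsk) / math.sqrt(2) for _ in range(n_fft)]  # unit power
    sigma = math.sqrt(10 ** (-snr_db / 10) / 2)  # noise std per real dimension
    failures = 0
    for _ in range(runs):
        eps = rng.choice(trials)  # true IFO drawn from the candidate set
        Y = [X[(k - eps) % n_fft]
             + complex(rng.gauss(0, sigma), rng.gauss(0, sigma))
             for k in range(n_fft)]
        failures += (estimate_ifo(Y, X, trials) != eps)
    return failures / runs

rate_high_snr = pofe(snr_db=30, runs=20, rng=random.Random(0))
```

Sweeping `snr_db` over a range and plotting the returned fraction gives a curve of the same kind as the fail-rate figures, although the numbers from this toy setup are not comparable with the reported results.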

We first investigate the performance degradation compared to theoretical performance as a result of reducing the number of pilots as proposed. Next, we investigate the effect of wordlength optimisation. In both cases, comparisons are made with established methods in the literature that can be simulated but are otherwise infeasible for hardware implementation, namely the conventional method in [2] (*PCH*) that is applied to one training block with 100 pilots. In addition, two state of the art methods are also simulated for comparison. Firstly, metric *SY* from [3] as defined by,

$$\mu_{SY}(\tilde{\epsilon}) = \Re \left\{ \sum_{k=1}^{\frac{N}{2}} Y_{(2k-2)}^* Y_{(2k)} X_{(2k-\tilde{\epsilon})}^* X_{(2k-2-\tilde{\epsilon})} \right\}$$
(9)

where  $\hat{\epsilon} = \underset{\tilde{\epsilon} \in S_{IFO}}{\operatorname{argmax}} \{ \mu(\tilde{\epsilon}) \}$ . Secondly, metric *MM* from [5],

$$\mu_{MM}(\tilde{\epsilon}) = \Re \left\{ e^{i\frac{\pi}{4}} \sum_{k=1}^{\frac{N}{2}} Y_{(2k-2)}^* Y_{(2k)} X_{(2k-\tilde{\epsilon})}^* X_{(2k-2-\tilde{\epsilon})} \right\}$$
(10)

where  $\Re\{.\}$  denotes the real part. The authors are unaware of any published circuits for these methods, because of their complex computation. The very large hardware requirement of the respective metrics does not lend these methods to feasible implementation on a low cost, low power FPGA (unlike our proposed method).
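A simple noise-free calculation shows how the choice between a magnitude and a real part interacts with RTO. From (2), with a flat channel and the correct trial value, the sums in (9) and (10) equal a positive real number multiplied by  $e^{-i 2\pi \cdot 2\tau / N}$ , since the correlated pilots are two sub-carriers apart. The sketch below evaluates the resulting peak scaling for the magnitude used in (3) and the real parts used in (9) and (10). It is an idealised reading offered for intuition, not an analysis taken from the cited works, and the RTO value of 32 samples is hypothetical.

```python
import cmath
import math

def peak_scaling(tau, n_fft, spacing=2):
    """Noise-free peak of the |.|, Re{.} and Re{e^{i pi/4} .} metrics, normalised."""
    rot = cmath.exp(-2j * math.pi * spacing * tau / n_fft)  # phase left by RTO
    return {
        "abs": abs(rot),
        "real": rot.real,
        "real_pi4": (cmath.exp(1j * math.pi / 4) * rot).real,
    }

no_rto = peak_scaling(tau=0, n_fft=256)
with_rto = peak_scaling(tau=32, n_fft=256)
```

With N = 256 and an RTO of 32 samples the rotation is a quarter turn, so the plain real part loses its peak entirely while the magnitude is unchanged and the rotated real part retains about 0.71 of it.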

#### A. Performance Comparison

The performance of the proposed method, denoted *Prop*, is evaluated in comparison to the theoretical performance of state of the art methods by Park et al. [2], Shim and You [3] and Morelli and Moretti [5], denoted *PCH*, *SY* and *MM*, respectively, as introduced above. The theoretical performance is computed with full precision using a full pilot set (100 pilots). However, it must be noted that implementing these directly in hardware would be prohibitive due to the large number of multiplication operations needed. Rather, hardware implementation would conventionally use sign-bit correlation instead of full precision correlation, as mentioned previously. Thus the full multiplication results shown here are undoubtedly better than those achievable in practice, and thus can be considered as upper performance bounds. For more realistic data, we also provide results from sign-bit correlation versions of the above, denoted *PCH\_sb*, *SY\_sb*, and *MM\_sb*, respectively. The method in [4] employs sign-bit correlation to implement a joint STO and IFO estimator by performing a long cross-correlation in the time domain for STO estimation and an exhaustive search over the large assumed CFO range. This results in larger hardware usage compared to the methods in [5], [2], [3] that are performed in the frequency domain, similarly to the proposed method. Therefore, the proposed method is evaluated in comparison to the state of the art methods in [5], [2], [3].

The proposed method, evaluated against these, uses 50 spread pilots with indices that are multiples of 4. For the sake of comparison, an additional implementation of *PCH* is reported which uses the same number of pilots as the proposed method (50), but taken continuously rather than spread. This is denoted *PCH\_50*. Fig. 8 plots performance results for all methods in an AWGN channel with RTO and reveals that the proposed method generally performs well, especially at higher SNRs. Considering more realistic channel models, Figs. 9 and 10 show the performance of all methods in SUI1 and SUI2 channels respectively, and similarly show that the proposed method performs well, especially at higher SNRs.

Fig. 8. Fail rate of IFO estimation methods in AWGN channel with RTO.

Fig. 9. Fail rate of IFO estimation methods in SUI1 channel.

Under these experimental conditions, *PCH* and *SY* achieve equivalent performance in AWGN without RTO and in SUI1 channels. However, *SY* degrades more drastically than *PCH* in the SUI2 channel at SNRs above 1 dB. *SY* also deteriorates in the case of AWGN with RTO. This method appears to be very sensitive to large RTO, while *MM* and *PCH* exhibit better robustness to RTO. The accuracy of *MM* is slightly lower than that of *PCH* at SNRs below 0 dB, while performance is very similar at larger SNRs. Also note the performance of the conventional approach implementations, *PCH\_sb*, *SY\_sb* and *MM\_sb*, which degrade significantly with SNR, especially in the SUI1 and SUI2 channels. Recall that these represent the typical practically implemented approaches where expensive multipliers are replaced with sign-bit correlation.

Fig. 10. Fail rate of IFO estimation methods in SUI2 channel.

Except at very low SNRs, the proposed method, *Prop*, achieves almost identical performance to the simulated upper bound *PCH*, even in the presence of RTO. It should be noted that *Prop* achieves this while allowing the use of resource sharing through sparse pilot computation. This means that in addition to robustness, significant hardware savings are also achievable. The results also show that *Prop*, with its spread pilots, is more accurate than *PCH\_50*, which uses the same number of pilots taken continuously.

#### B. Wordlength Optimisation

Since sign-bit correlation degrades IFO estimation performance, especially in frequency selective channels, we instead explore the savings possible with multiplierless correlation. The hardware complexity of this approach is dependent on the wordlength chosen for the correlation computation. We now investigate how wordlength affects the performance of the proposed implementation, again compared to the theoretical bound, *PCH*, as well as to the performance of a conventional sign-bit implementation, *PCH\_sb*. We denote wordlength using the notation Q1.f, meaning a single integer bit and f fractional bits. Evaluations are performed for f of 1, 2, 7, and 15 bits. These results are plotted with the labels *Prop\_1b*, *Prop\_2b*, *Prop\_7b*, and *Prop\_15b*, respectively. Fig. 11 plots the performance in AWGN with RTO, with all tested wordlengths in the proposed method performing comparably to *PCH* (and being much better than *PCH\_sb* at SNRs exceeding about 2 dB). Figs. 12 and 13 show the results when using the more realistic SUI1 and SUI2 channel models.
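As an illustration of the Q1.f notation, the sketch below quantises samples to f fractional bits. It assumes that the single integer bit acts as the sign bit of a two's complement value (so the representable range is $[-1, 1 - 2^{-f}]$), and that values are rounded to nearest and saturated. The rounding mode and the function names are assumptions made for this example, not details taken from the implementation.

```python
import numpy as np

def quantize_q1f(x, f):
    """Round to f fractional bits and saturate to [-1, 1 - 2**-f] (assumed Q1.f range)."""
    scale = 2 ** f
    q = np.round(np.asarray(x, dtype=float) * scale)
    q = np.clip(q, -scale, scale - 1)
    return q / scale

def quantize_complex_q1f(z, f):
    """Quantise the real and imaginary parts independently."""
    return quantize_q1f(np.real(z), f) + 1j * quantize_q1f(np.imag(z), f)

samples = np.array([0.3, -0.6, 0.99, -1.0])
q1 = quantize_q1f(samples, 1)   # step 0.5  -> [0.5, -0.5, 0.5, -1.0]
q2 = quantize_q1f(samples, 2)   # step 0.25 -> [0.25, -0.5, 0.75, -1.0]
```

Each additional fractional bit halves the quantisation step, so the marginal benefit of extra bits shrinks once the step falls below the noise level.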



Fig. 11. Fail rate for different wordlengths in AWGN channel with RTO.



Fig. 12. Fail rate for different wordlengths in SUI1 channel.

It can be seen from the plots that each of the tested wordlengths achieves much better performance and exhibits greater robustness to frequency selective channels than the sign-bit realisation of the conventional approach, *PCH\_sb*. Additionally, these realisations of the proposed method do not suffer as much degradation in the presence of RTO. Moreover, it is possible to improve low SNR performance by adopting a longer wordlength with the proposed method, at a cost of increased hardware complexity. Increasing wordlength does improve results slightly, with the step from 1 to 2 bits being the most significant gain. By contrast, increasing from 2 to 7 or from 7 to 15 bits has little impact. In general, *Prop\_2b* achieves an estimation accuracy close to that of the theoretical performance bound, *PCH*, at intermediate and higher SNRs, even though it involves computation with fewer bits, and can hence be implemented more efficiently.

Fig. 13. Fail rate for different wordlengths in SUI2 channel.

#### V. FPGA IMPLEMENTATION

The analysis in Section IV suggests that the proposed method offers comparable estimation performance to existing methods in the literature. As a result of the simplifications inherent in the proposed approach, this should be achievable at a reduced hardware cost. This section now quantifies this hardware cost, for an FPGA-based implementation. It is important to note that these hardware savings are accessible for a number of target implementation devices, although we are interested primarily in FPGA implementation as part of our work on leveraging FPGA reconfigurability in cognitive radios.

#### A. Conventional Approach

To obtain the theoretical performance previously discussed in Section IV and denoted as *PCH*, the computation of the estimation metric in [2], using 100 pilots, would require about 200 complex multipliers, resulting in the use of over 600 DSP blocks. This exceeds the available resources on small devices, and would leave insufficient resources for other tasks on larger devices. As the number of multiplications required for a full implementation of the approach is prohibitive, the conventional approach for implementation, as we have discussed, uses sign-bit correlation [4]. This conventional implementation uses all 100 pilots in the long preamble to perform sign-bit correlation, and multiply-adds are eliminated at taps where the pilots of the long preamble are not used. This implementation mirrors *PCH\_sb* in Section IV, and allows us to quantify the benefits of our proposed approach against a known reference benchmark.
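The ratio of roughly three DSP blocks per complex multiplier is consistent with a common way of computing a complex product using three real multiplications. The sketch below shows that well-known rearrangement purely as an illustration. It is an assumption that such a structure underlies the DSP count quoted above, and the function name is hypothetical.

```python
def complex_mult_3(a_re, a_im, b_re, b_im):
    """Compute (a_re + j*a_im) * (b_re + j*b_im) with three real multiplications."""
    k1 = b_re * (a_re + a_im)
    k2 = a_re * (b_im - b_re)
    k3 = a_im * (b_re + b_im)
    return k1 - k3, k1 + k2   # (real part, imaginary part)

# (3 + 4j) * (5 - 2j) = 23 + 14j
re, im = complex_mult_3(3, 4, 5, -2)

# Assumed mapping: one DSP block per real multiplication.
complex_multipliers = 200
dsp_blocks = 3 * complex_multipliers
```

The three-multiplication form trades one multiplier for extra additions, which is usually worthwhile when multipliers are the scarce resource.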



Fig. 14. DSP block based 3-input adder for correlation.

#### B. Proposed Approach

The proposed IFO estimator architecture is implemented with several different wordlengths of  $P_{4k}$  and  $V_{\tilde{\epsilon}}$  in (7), and the implementations are compared to allow us to explore the associated hardware costs. Four fixed-point formats for  $P_{4k}$  are investigated: Q1.1, Q1.2, Q1.7, and Q1.15.  $V_{\tilde{\epsilon}}$  is represented accordingly in Q7.1, Q7.2, Q7.7, and Q7.15 formats to avoid overflow.

In order to obtain a comprehensive optimised implementation, these circuits are each implemented using two different structures. The first uses only logic elements (LE) for computation, while the second uses Xilinx DSP48A1 [25] primitives. Considering (8), the normalisation allows  $\Re\{V_{\tilde{\epsilon}}\}$  and  $\Im\{V_{\tilde{\epsilon}}\}$  to be computed using two DSP blocks operating as 3-input adders, instead of the 4 DSP blocks with multipliers that would usually be needed. Fig. 14 illustrates how this is done for  $\Re\{V_{\tilde{\epsilon}}\}$ , and similarly for  $\Im\{V_{\tilde{\epsilon}}\}$ . Note that the solution presented in Fig. 14 is optimised for QPSK modulated pilots (since their amplitudes are identical) as specified in IEEE 802.16, as well as in most OFDM-based standards. These implementations correspond to *Prop\_1b*, *Prop\_2b*, *Prop\_7b*, and *Prop\_15b*, which were investigated for estimation accuracy in Section IV.
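The reason QPSK pilots permit adder-only correlation can be sketched in a few lines. The example assumes pilots normalised to $s_{re} + j s_{im}$ with $s_{re}, s_{im} \in \{+1, -1\}$, so that each product with a conjugated pilot reduces to signed additions. It illustrates the general identity rather than the exact datapath of Fig. 14 or the form of (8), and all names are hypothetical.

```python
import numpy as np

def correlate_qpsk_adds_only(P, pilot_signs):
    """Accumulate sum(P * conj(X)) for X = s_re + j*s_im, s_re, s_im in {+1, -1},
    using only additions and subtractions (no multipliers)."""
    acc_re, acc_im = 0.0, 0.0
    for p, (s_re, s_im) in zip(P, pilot_signs):
        # Re{p * conj(X)} = s_re * p.real + s_im * p.imag
        t1 = p.real if s_re > 0 else -p.real
        t2 = p.imag if s_im > 0 else -p.imag
        acc_re = acc_re + t1 + t2      # one 3-input addition
        # Im{p * conj(X)} = s_re * p.imag - s_im * p.real
        t3 = p.imag if s_re > 0 else -p.imag
        t4 = -p.real if s_im > 0 else p.real
        acc_im = acc_im + t3 + t4      # one 3-input addition
    return complex(acc_re, acc_im)

rng = np.random.default_rng(2)
signs = rng.choice([-1, 1], size=(50, 2))
X = signs[:, 0] + 1j * signs[:, 1]
P = rng.uniform(-1, 1, 50) + 1j * rng.uniform(-1, 1, 50)
V_adds = correlate_qpsk_adds_only(P, signs)
V_mult = np.sum(P * np.conj(X))        # reference using complex multiplications
```

Each pilot contributes one 3-input addition to the real accumulator and one to the imaginary accumulator, which is the operation a DSP block configured as an adder can absorb.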

#### C. Implementation Results

The circuits were synthesised and fully implemented using Xilinx ISE 13.2, targeting the low-power Xilinx Spartan-6 XC6SLX75T FPGA. The results are reported in terms of the number of flip-flops (FFs), look-up tables (LUTs), and DSP blocks, along with dynamic power consumption, as summarised in Table II.

*conv\_100p\_sb* refers to the conventional approach, implemented using sign-bit correlation over 100 pilots. *Prop\_fb\_LE* and *Prop\_fb\_DSP*, in which f = 1, 2, 7 and 15 (corresponding to received sample format Q1.f), denote the circuits of corresponding wordlengths, implemented using logic elements and DSP blocks, respectively. Referring to the table, the proposed implementation demonstrates a significant improvement in resource usage and dynamic power consumption.

The hardware resources used by *Prop\_fb\_LE* and *Prop\_fb\_DSP* increase gradually, in terms of FFs and LUTs, as the wordlength increases. The number of FFs used in *Prop\_fb\_DSP* and *Prop\_fb\_LE* is equal, while *Prop\_fb\_DSP* uses fewer LUTs since the DSP blocks are used for the 3-input additions.

TABLE II
RESOURCE UTILISATION AND DYNAMIC POWER CONSUMPTION OF IFO ESTIMATORS.

| Implementation   | FFs       | LUTs      | DSPs | Freq. (MHz) | Dynamic power |
|------------------|-----------|-----------|------|-------------|---------------|
| *conv\_100p\_sb* | 3270 (3%) | 1837 (3%) | 3    | 142         | 42 mW         |
| *Prop\_1b\_LE*   | 328 (1%)  | 370 (1%)  | 3    | 136         | 9 mW          |
| *Prop\_2b\_LE*   | 350 (1%)  | 390 (1%)  | 3    | 136         | 10 mW         |
| *Prop\_7b\_LE*   | 460 (1%)  | 471 (1%)  | 3    | 136         | 12 mW         |
| *Prop\_15b\_LE*  | 735 (1%)  | 696 (1%)  | 3    | 134         | 17 mW         |
| *Prop\_1b\_DSP*  | 328 (1%)  | 306 (1%)  | 7    | 78          | 11 mW         |
| *Prop\_2b\_DSP*  | 350 (1%)  | 319 (1%)  | 7    | 78          | 12 mW         |
| *Prop\_7b\_DSP*  | 460 (1%)  | 379 (1%)  | 7    | 77          | 14 mW         |
| *Prop\_15b\_DSP* | 735 (1%)  | 591 (1%)  | 7    | 77          | 18 mW         |

The *Prop\_fb\_LE* implementations use 3 DSP blocks to compute  $P_{4k}$ , while *Prop\_fb\_DSP* requires an additional 4 DSP blocks to perform the correlation. Both *Prop\_fb\_LE* and *Prop\_fb\_DSP* consume far fewer LUTs and FFs than the conventional *conv\_100p\_sb* sign-bit implementation.

For *Prop\_2b\_LE*, the number of FFs and LUTs is reduced by 90% and 79% respectively compared to the conventional *conv\_100p\_sb* approach.
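The quoted savings follow directly from the FF and LUT counts in Table II. The short check below reproduces the arithmetic, taking the 2-bit logic-element implementation to be the row with 350 FFs and 390 LUTs. The variable names are illustrative.

```python
conv = {"FFs": 3270, "LUTs": 1837}       # conv_100p_sb counts from Table II
prop_2b_le = {"FFs": 350, "LUTs": 390}   # Prop_2b_LE counts from Table II

# Percentage reduction relative to the conventional implementation.
reduction = {key: 100.0 * (1.0 - prop_2b_le[key] / conv[key]) for key in conv}

for key, value in reduction.items():
    print(f"{key}: {value:.1f}% fewer than conv_100p_sb")
```

This gives approximately 89% for FFs and 79% for LUTs, in line with the rounded figures above.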

The maximum frequencies of the circuits, reported after place and route, are around 142 MHz, 136 MHz and 78 MHz for the conventional sign-bit-based circuit, the proposed LE-based circuits and the proposed DSP-based circuits, respectively. These comfortably exceed the timing requirements of most OFDM-based systems, particularly 802.16 synchronisation, whose sampling frequency is below 25 MHz.

A post-place-and-route simulation was used to estimate the power consumption of the system at a clock rate of 50 MHz using the Xilinx XPower tool; the results are also shown in Table II.

*Prop\_fb\_LE* implementations consume less power than the equivalent *Prop\_fb\_DSP* implementations. All implementations of the proposed method consume significantly less power than the conventional implementation. *Prop\_2b\_LE* consumes just 22% of the power consumed by *conv\_100p\_sb*.

In Section IV, we established that *Prop\_2b\_LE* easily outperforms the conventional approach in terms of estimation accuracy. Here we have shown that it does so with a significant hardware resource saving, and with significantly reduced power consumption. In fact, the estimation performance of *Prop\_2b\_LE*, in AWGN and SUI channels (except at very low SNR), is close to the theoretical bound of *PCH*, which would demand a significant amount of the FPGA's resources if it were implemented conventionally. Meanwhile, *Prop\_2b\_LE* is extremely efficient, consuming less than 1% of the resources available on a low-power Spartan-6 XC6SLX75T FPGA.

#### VI. CONCLUSION

This paper has investigated IFO estimation in OFDM-based systems such as IEEE 802.16. A technique is proposed for efficient implementation of IFO estimation, which aims in particular for low power and low resource utilisation. Since IFO estimation contributes significantly to the complexity of a robust OFDM synchroniser design, this work is important for multi-standard radios, or applications where significant frequency variation is expected. Robust IFO estimation can also allow for relaxed analogue RF constraints, leading to reduced cost.

A modified timing metric was derived which allows for resource sharing to reduce both resource requirements and power consumption. The proposed implementation makes use of a four-fold resource sharing architecture to significantly reduce hardware cost, while multiplierless correlation with optimised wordlengths is used to improve estimation accuracy in comparison to a conventional implementation using sign-bit correlation. The system has been evaluated theoretically, in simulation (to determine system-level IFO estimation performance), and through synthesis and post place-and-route implementation (to determine detailed resource utilisation and power consumption figures). The proposed method is shown to perform as well as current state-of-the-art methods that employ multiplier-based correlation, and yet with significantly improved power and resource requirements. Dynamic power consumption is reduced by 78% over even a sign-bit version of the conventional approach, yet it offers better estimation performance in both AWGN and frequency selective channels.

Beyond IEEE 802.16-2009, the folded resource sharing method, which leverages sub-sample spaced OFDM pilots, can be adopted for use in other OFDM standards (including IEEE 802.11 and IEEE 802.22). Since frequency offset estimation is required in many communication systems (including many that do not employ OFDM), the methods developed in this paper have potential for application elsewhere, particularly in expanding the potential design space for exploration during implementation.

#### REFERENCES

- [1] M. Morelli, A. D'Andrea, and U. Mengali, "Frequency ambiguity resolution in OFDM systems," *IEEE Communications Letters*, vol. 4, no. 4, pp. 134–136, Apr. 2000.
- [2] M. Park, N. Cho, J. Cho, and D. Hong, "Robust integer frequency offset estimator with ambiguity of symbol timing offset for OFDM systems," in *Proc. Vehic. Techn. Conf. (VTC)*, 2002, pp. 2116–2120.
- [3] E.-S. Shim and Y.-H. You, "OFDM integer frequency offset estimator in rapidly time-varying channels," in *Asia-Pacific Conf. on Commun.*, 2006, pp. 1–4.
- [4] T.-H. Kim and I.-C. Park, "Low-power and high-accurate synchronization for IEEE 802.16d systems," *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, vol. 16, no. 12, pp. 1620–1630, Dec. 2008.
- [5] M. Morelli and M. Moretti, "Integer frequency offset recovery in OFDM transmissions over selective channels," *IEEE Transactions on Wireless Communications*, vol. 7, no. 12, pp. 5220–5226, Dec. 2008.
- [6] D. Toumpakaris, J. Lee, and H.-L. Lou, "Estimation of integer carrier frequency offset in OFDM systems based on the maximum likelihood principle," *IEEE Trans. on Broadcasting*, vol. 55, no. 1, pp. 95–108, Mar. 2009.
- [7] C. Shahriar, M. La Pan, M. Lichtman, T. Clancy, R. McGwier, R. Tandon, S. Sodagari, and J. Reed, "PHY-Layer Resiliency in OFDM Communications: A Tutorial," *IEEE Communications Surveys Tutorials*, vol. 17, no. 1, pp. 292–314, 2015.
- [8] H. Nogami and T. Nagashima, "A frequency and timing period acquisition technique for OFDM systems," in *Sixth IEEE Inter. Symp. on Personal, Indoor and Mobile Radio Communications (PIMRC)*, vol. 3, Sep. 1995.

- [9] D. Li, Y. Li, H. Zhang, L. Cimini, and Y. Fang, "Integer frequency offset estimation for OFDM systems with residual timing offset over frequency selective fading channels," *IEEE Trans. on Vehicular Technology*, vol. 61, no. 6, pp. 2848–2853, Jul. 2012.
- [10] M. Cummings and S. Haruyama, "FPGA in the software radio," *IEEE Communications Magazine*, vol. 37, no. 2, pp. 108–112, Feb. 1999.
- [11] J. Guffey, A. Wyglinski, and G. Minden, "Agile radio implementation of OFDM physical layer for dynamic spectrum access research," in *IEEE Global Telecommunications Conf (GLOBECOM)*, Nov. 2007, pp. 4051–4055.
- [12] A. Troya, K. Maharatna, M. Krstic, E. Grass, U. Jagdhold, and R. Kraemer, "Low-Power VLSI Implementation of the Inner Receiver for OFDM-Based WLAN Systems," *IEEE Trans. on Circuits and Systems I: Regular Papers*, vol. 55, no. 2, pp. 672–686, Mar. 2008.
- [13] S. J. Hwang, Y. Han, S. W. Kim, J. Park, and B. G. Min, "Resource efficient implementation of low power MB-OFDM PHY baseband modem with highly parallel architecture," *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, vol. 20, pp. 1248–1261, Jul. 2012.
- [14] W. Fan and C.-S. Choy, "Robust, low-complexity, and energy efficient downlink baseband receiver design for MB-OFDM UWB system," *IEEE Trans. on Circuits and Systems I: Regular Papers*, vol. 59, no. 2, pp. 399–408, Feb. 2012.
- [15] T. Pham, S. Fahmy, and I. McLoughlin, "Efficient multi-standard cognitive radios on FPGAs," in *International Conference on Field Programmable Logic and Applications (FPL)*, Sept. 2014.
- [16] Y. Zhang, J. Mueller, B. Mohr, and S. Heinen, "A Low-Power Low-Complexity Multi-Standard Digital Receiver for Joint Clock Recovery and Carrier Frequency Offset Calibration," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 61, no. 12, pp. 3478–3486, Dec. 2014.
- [17] M. Morelli, L. Marchetti, and M. Moretti, "Maximum Likelihood Frequency Estimation and Preamble Identification in OFDMA-based WiMAX Systems," *IEEE Transactions on Wireless Communications*, vol. 13, no. 3, pp. 1582–1592, March 2014.
- [18] —, "Integer frequency offset estimation and preamble identification in WiMAX systems," in *IEEE International Conference on Communications (ICC)*, June 2014, pp. 5610–5615.
- [19] Z. Zhang, L. Ge, F. Tian, F. Zeng, and G. Xuan, "Carrier frequency offset estimation of OFDM systems based on complementary sequence," in *IEEE International Congress on Image and Signal Processing (CISP)*, Oct. 2014, pp. 1012–1016.
- [20] T. H. Pham, I. V. McLoughlin, and S. A. Fahmy, "Robust and Efficient OFDM Synchronisation for FPGA-Based Radios," *Circuits, Systems, and Signal Processing*, vol. 33, no. 8, pp. 2475–2493, Aug. 2014, Springer.
- [21] L. Schwoerer, "VLSI suitable synchronization algorithms and architecture for IEEE 802.11a physical layer," in *IEEE Inter. Symp. on Circuits and Systems (ISCAS)*, vol. 5, 2002, pp. 721–724.
- [22] IEEE Standard for Local and Metropolitan Area Networks Part 16: Air Interface for Fixed Broadband Wireless Access Systems, IEEE Std. 802.16-2009.
- [23] T. H. Pham, S. A. Fahmy, and I. V. McLoughlin, "Low-Power Correlation for IEEE 802.16 OFDM Synchronization on FPGA," *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, vol. 21, no. 8, pp. 1549–1553, Aug. 2013.
- [24] V. Erceg, K. V. S. Hari, and M. S. Smith, "Channel models for fixed wireless applications," *Tech. Rep. IEEE 802.16a-03/01*, Jul. 2003.
- [25] UG389: Spartan-6 FPGA DSP48A1 Slice, Xilinx Inc., August 2009.



**Pham Hung Thinh** (S'13) received his B.S. degree in Electrical and Electronic Engineering at Ho Chi Minh City University of Technology, Vietnam, and MSc. degree in Embedded Systems Engineering from the University of Leeds, U.K., in 2007 and 2010, respectively.

He is currently a Ph.D. candidate in the joint Nanyang Technological University-Technische Universität München PhD Program and under the TUM CREATE Centre for Electromobility, Singapore.



**Suhaib A. Fahmy** (M'01, SM'13) received the M.Eng. degree in information systems engineering and the Ph.D. degree in electrical and electronic engineering from Imperial College London, UK, in 2003 and 2007, respectively.

From 2007 to 2009, he was a Research Fellow at Trinity College Dublin, and a Visiting Research Engineer with Xilinx Research Labs, Dublin. Since 2009, he has been an Assistant Professor with the School of Computer Engineering at Nanyang Technological University, Singapore. His research interests include reconfigurable computing, high-level system design, and computational acceleration of complex algorithms.

Dr. Fahmy was a recipient of the Best Paper Award at the IEEE Conference on Field Programmable Technology in 2012, the IBM Faculty Award in 2013, and is also a senior member of the ACM.



**Ian Vince McLoughlin** split his 26 year career equally between the electronics R&D industry and academia, based in five countries on three continents. He became a Chartered Engineer (U.K.) in 1998, a Senior Member of IEEE in 2004, a D'Ingenieur Europeen (EU) in 2008, and a Fellow of the IET in 2013. He is currently a professor in the School of Information Science and Technology at the University of Science and Technology of China (USTC), and has held positions in Nanyang Technological University, Singapore, Tait Electronics Ltd., New Zealand and Her Majesty's Government Communications Centre, UK. His Ph.D. in electronic and electrical engineering was granted by the University of Birmingham, U.K., in 1997. He has published 200 papers, has 13 patents, several book chapters, and books with Cambridge University Press and McGraw-Hill.