Artigo Acesso aberto Revisado por pares

Low-Power and Low-Hardware Bit-Parallel Polynomial Basis Systolic Multiplier over GF(2 m ) for Irreducible Polynomials

2017; Electronics and Telecommunications Research Institute; Volume: 39; Issue: 4 Linguagem: Inglês

10.4218/etrij.17.0116.0770

ISSN

2233-7326

Autores

Sudha Ellison Mathe, Lakshmi Boppana,

Tópico(s)

Polynomial and algebraic computation

Resumo

ETRI JournalVolume 39, Issue 4 p. 570-581 ArticleFree Access Low-Power and Low-Hardware Bit-Parallel Polynomial Basis Systolic Multiplier over GF(2m) for Irreducible Polynomials Sudha Ellison Mathe, Corresponding Author Sudha Ellison Mathe ellison@nitw.ac.in Search for more papers by this authorLakshmi Boppana, Lakshmi Boppana lakshmi@nitw.ac.in Search for more papers by this author Sudha Ellison Mathe, Corresponding Author Sudha Ellison Mathe ellison@nitw.ac.in Search for more papers by this authorLakshmi Boppana, Lakshmi Boppana lakshmi@nitw.ac.in Search for more papers by this author First published: 11 August 2017 https://doi.org/10.4218/etrij.17.0116.0770Citations: 5 Sudha Ellison Mathe (corresponding author, ellison@nitw.ac.in) and Lakshmi Boppana (lakshmi@nitw.ac.in) are with the Department of Electronics and Communication Engineering, National Institute of Technology-Warangal, India. AboutSectionsPDF ToolsRequest permissionExport citationAdd to favoritesTrack citation ShareShare Give accessShare full text accessShare full-text accessPlease review our Terms and Conditions of Use and check box below to share full-text version of article.I have read and accept the Wiley Online Library Terms and Conditions of UseShareable LinkUse the link below to share a full-text version of this article with your friends and colleagues. Learn more.Copy URL Share a linkShare onFacebookTwitterLinkedInRedditWechat Abstract Multiplication in finite fields is used in many applications, especially in cryptography. It is a basic and the most computationally intensive operation from among all such operations. Several systolic multipliers are proposed in the literature that offer low hardware complexity or high speed. In this paper, a bit-parallel polynomial basis systolic multiplier for generic irreducible polynomials is proposed based on a modified interleaved multiplication method. The hardware complexity and delay of the proposed multiplier are estimated, and a comparison with the corresponding multipliers available in the literature is presented. Of the corresponding multipliers, the proposed multiplier achieves a reduction in the hardware complexity of up to 20% when compared to the best multiplier for m = 163. The synthesis results of application-specific integrated circuit and field-programmable gate array implementations of the proposed multiplier are also presented. From the synthesis results, it is inferred that the proposed multiplier achieves low power consumption and low area complexitywhen compared to the best of the corresponding multipliers. 1 Introduction Cryptography is the art of efficiently hiding data and transmitting it over any insecure channel to prevent unauthorized access. Modern cryptography mainly depends on the foundation principles of mathematical theory and computer science. It mainly deals with the development of cryptographic algorithms that are designed around assumptions regarding computational hardness. Cryptography can be broadly categorized as symmetric-key cryptography and asymmetric-key cryptography 1. Symmetric-key cryptography performs the encryption and decryption processes using the same secret key. The techniques developed using this principle are the data encryption standard 2 and the advanced encryption standard (AES) 3. Asymmetric-key cryptography performs the encryption and decryption processes using different keys. Elliptic curve cryptography (ECC) 4, 5 and Rivest and others 6 are some techniques that have been developed using this principle. Many cryptographic algorithms utilize the multiplication operation in finite fields. Basic operations in a finite field, such as addition and subtraction, are realized using the logical exclusive-OR (XOR), whereas complex operations such as division, exponentiation, and inversion are realized using recursive multiplications 7. Therefore, the basic unit for all of these arithmetic operations is the multiplication operation. Multiplication in a finite field can be performed over three basis representations, namely Normal Basis (NB), Polynomial Basis (PB), and Dual Basis (DB) 8, each of which has its distinct advantages. Hardware implementations of NB multipliers typically consume less power compared to other bases, and they are mainly used to realize squaring and exponentiation operations. Multiplications in PB are less complex, but the hardware implementations consume more power compared to NB multipliers. DB multipliers require a low area compared to the other two bases. However, PB multipliers are extensively used because they are less complex and can be matched to any system, whereas NB and DB multipliers require additional hardware for basis conversion. The finite field GF(2m) is characterized by an irreducible polynomial. An irreducible polynomial (T) is said to be irreducible over GF(2m) if T cannot be factored into two product polynomials, where the product polynomials should belong to the above said field. Various types of irreducible polynomials characterize finite-field multipliers. These are general/generic polynomials, trinomials, pentanomials, all-one polynomials (AOP), and equally spaced polynomials (ESP), which are recommended for cryptographic applications. ESPs are polynomials whose terms are equally spaced (in degree) by r. Trinomials, pentanomials, and AOPs are special cases of ESPs, and they are distinguished based on the spacing of the terms. Generic polynomials may have a random number of terms, with random spacing between the terms. The multipliers based on trinomials and pentanomials have relatively lower hardware complexity when compared to other classes of irreducible polynomials, and this is due to a smaller number of terms, and consequently, a lower computational complexity. On the other hand, these multipliers cannot be utilized in all applications owing to the limitation of a fixed number of terms. Multipliers that are based on AOPs require additional hardware complexity compared to multipliers that are based on trinomials or pentanomials, and this is because of a larger number of terms, and consequently, a higher computational complexity. On the contrary, the ease of representation of AOP facilitates efficient implementations and simpler structures in hardware. ESPs are well structured and have higher hardware complexity compared to AOP multipliers. The hardware complexity of multipliers that are based on generic polynomials cannot be determined beforehand owing to the randomness in the number of terms of the polynomials. Therefore, multipliers for generic polynomials are highly generic and operate on any type of polynomial. Various types of architectures have been proposed in the literature for PB multiplication, such as bit-serial 9, bit-parallel 10, digit-serial 11, systolic 12, semi-systolic 13, sequential 14, and pipelined 15. Each type of architecture has its advantages and shortcomings. Systolic architectures require several replicas of the same structure to perform fast computations with high hardware overhead, thereby consuming more power. Hence, low-hardware and low- power systolic multipliers for finite fields are necessary to implement the encryption methods efficiently in portable power-constrained devices. Several systolic multipliers are proposed in the literature for PB multiplication for irreducible polynomials over GF(2m). Yeh and others 16 proposed a systolic PB multiplier in 1984. An algorithm was derived and a systolic multiplier was realized to perform product-sum computations using a parallel-in/parallel-out two-dimensional (2D) structure. Wang and Lin 17 reconsidered an existing algorithm and proposed a parallel-in/parallel-out systolic array in 1991. When compared to 16, this multiplier provided better cascadability, fault tolerance, and the potential for wafer-scale integration, while providing comparable hardware complexity, throughput, and latency. In 1995, Wu and Chang 18 proposed a multiplication algorithm and realized a systolic multiplier from it. When compared with 16, the multiplier achieved low delay, high hardware complexity, and allowed well-ordered inputs into the system, while the latter did not. Jain and Parhi 19 developed a pipelined parallel-in/parallel-out structure in 1995. Their multiplier offered low hardware complexity and low delay when compared to 16 and 17. Jain and others 20 introduced a pipelined semi-systolic structure using the least significant bit (LSB)-first algorithm in 1998. This multiplier achieved low hardware complexity, low delay, and also achieved substructure sharing among multiplier operations. Koc and Acar 21 proposed a Montgomery multiplication algorithm in 1998. Kwon and others 22 presented a bit-parallel systolic structure using an LSB-first algorithm in 2003. When compared to 16 and 17, this multiplier had the same hardware complexity and delay, but achieved a better critical path delay and had unidirectional data flow. Lee and others 23 proposed a multiplexer-based bit-parallel multiplier that was based on the modified Booth's algorithm in 2006. When compared to 17, this multiplier achieved low hardware complexity and low latency. In 2006, Kwon and others 24 proposed a systolic multiplier using an LSB-first algorithm. When compared with 16 and 17, this multiplier had the same hardware complexity and the same delay, but had a lower critical path and had unidirectional data flows. In 2006, Chiou and others 25 proposed a time-independent Montgomery multiplication algorithm and realized a systolic array Montgomery multiplier. The hardware complexity and delay of the multiplication algorithm in 21 was analyzed and compared with the multiplier in 25. It was observed that the multiplier in 25 achieved low hardware complexity and low delay when compared to 21. In 2006, Lee and others 26 presented two algorithms using an interleaved multiplication and a folded technique. Two bit-parallel systolic multipliers were realized from these algorithms, where the two multipliers achieved a lower hardware complexity than 16 and 17. Moreover, both of the multipliers achieved a low delay when compared to 17 and same delay when compared to 16. A 2D systolic array multiplier using a linear feedback shift register was proposed by Chiou and others 27 in 2007. When compared to 16 and 17, this multiplier achieved low hardware complexity and low delay. Lee 28 presented a time-dependent and time-independent algorithm in 2008, and two bit-parallel systolic multipliers were realized by employing these algorithms. The two multipliers achieved low hardware complexity when compared to 16 and 17. The multiplier in 28a achieved the same delay as 16, and achieved low delay when compared to 17. The multiplier in 28b achieved high and low delay when compared to 16 and 17, respectively. In 2008, Lee 29 proposed a multiplexer-based systolic multiplier from a multiplication algorithm that used the cut-set systolization method and modified Booth's recoding. This multiplier achieved a comparable hardware complexity when compared to 16 and 17. It also achieved high and low delays when compared to 16 and 17, respectively. Fournaris and Koufopavlou 30 proposed a Montgomery multiplication element from an optimized Montgomery multiplication algorithm in 2008. When compared to 17 and 20, this multiplier achieved low hardware complexity. However, it achieved low and high delays when compared to 17 and 20, respectively. In 2009, Kwon and others 31 proposed a systolic multiplier from an LSB-first algorithm. This multiplier achieved high hardware complexity when compared to 16, 17, and 31b, but achieved low delay when compared to 17 and 31b. In 2014, Kim and Jeon 32 proposed a cellular semi-systolic Montgomery structure from an optimized algorithm to achieve low time complexity. In 2016, a systolic multiplier was proposed by Mathe and Boppana 33 over GF(28), and achieved better hardware complexity and power consumption when compared to other GF(28) multipliers. In this paper, to reduce hardware complexity, a modified interleaving multiplication method is proposed based on a conventional interleaving multiplication method. The proposed method performs multiplication with interleaved modular reduction of two arbitrary elements for a generic, field-defining, irreducible polynomial. Subsequently, a scalable bit-parallel systolic polynomial basis multiplier over GF(2m) is realized based on the multiplier proposed in 33. The proposed multiplier achieves low hardware complexity and low power consumption when compared to corresponding multipliers available in the literature. The multiplier achieves a latency of m clock cycles, which is comparable to corresponding multipliers. The organization of this paper is as follows: the algorithm formulation is presented in Section II, and the proposed bit-parallel polynomial basis systolic multiplier is explained in Section III. In Section IV, the results and discussion are presented, while the conclusions are made in Section V. 2 Algorithm Formulation The conventional polynomial basis (PB) representation and multiplication of the field elements over GF(2m) are presented in this section 34. Definition 1: Let T(x) be the irreducible polynomial of degree m over the field GF(2m). Then, (1)where t0, t1, … , tm−1 ∈ GF(2). Definition 2: Let γ ∈ GF(2m) be a root of T(x). Then, the following set constitutes the polynomial basis in GF(2m) (2) Definition 3: In Ω, the elements of GF(2m) are polynomials with a degree of at most m – 1. Then, the set of all polynomials in GF(2m) is (3)where gi ∈ GF(2); for i = 0, 1, 2, … , m – 1. Definition 4: Let g(x) = gm−1.xm−1 + ··· + g1.x + g0 be a polynomial in GF(2m); then, the binary representation of this polynomial is (4)where gi ∈ GF(2), and gm−1 is the most significant bit, while g0 is the LSB; for i = 0, 1, 2, … , m – 1. Let the polynomials A(x) and B(x) be the two field elements, T(x) be the field-defining irreducible polynomial, and C(x) be the final product polynomial. Then, (5) The product of A(x) and B(x), each with degree of at most m – 1, results in an intermediate polynomial given by (6) The polynomial D(x) with a degree of at most 2m – 2 is modular reduced by a degree m irreducible polynomial T(x) to obtain C(x) with a degree of at most m – 1, which is the final result of the multiplication operation. (7) Many multiplication methods are available in the literature to perform efficient multiplication operations over the polynomial basis for a finite field GF(2m). In this study, a conventional interleaving multiplication method 35 is modified in an efficient manner to obtain a modified interleaving multiplication method, which is presented below. Let the two arbitrary elements A(x) and B(x) in GF(2m) be expressed as (8) Let D(x) ∈ GF(2m) be the product polynomial of the two elements A(x) and B(x). (9) Here, D(x) can be reduced to a summation of A(x)xi, as can be observed from (9) if bi = 1, for all i = 0, 1, … , m – 1. The XOR operation is considered for the summation of A(x)xi because the addition is simply an XOR operation in polynomial representation. This summation runs from b0A(x) to bm−1A(x)xm−1 in m iterations. The modular reduction in the conventional interleaving multiplication method 35 is performed by (10) Equation (10) is evaluated for each i as follows For i = 0: (11) A degree m polynomial cannot modulo divide a degree m – 1 polynomial. Hence, this step can be skipped. For i = 1: (12) From (12), it can be observed that if am−1 = 1, the modular reduction step is reduced to the summation of the irreducible polynomial, (tm−1, … , t1, t0), and the left-shifted value of a, (am−2, … , a1, a0, 0), as depicted in the following equation (13) If am−1 = 0, the result of the modular reduction step is simply the shifted value of a, that is, (am−2, … , a1, a0, 0), as depicted in the following equation (14) The above interpretations hold true in a similar manner for all i = 2, … , m – 1. For each i, a is left-shifted. If am−1 = 0, a is forwarded to the output. If am−1 = 1, a is XORed with t and forwarded to the output. The summation is computed using the XOR operation, as discussed earlier. Therefore, the multiplication operation in a finite field can be performed in an interleaved manner using (9) and (12). 3 Proposed Bit-Parallel Polynomial Basis Systolic Multiplier In this section, a bit-parallel systolic multiplier for irreducible polynomials over GF(2m) is proposed based on the modified interleaving multiplication method presented in the previous section. The operations in (9) and (12) can be performed recursively over m iterations to obtain the multiplication operation over GF(2m). A signal flow graph (SFG) is realized from (9) and (12), as shown in Fig. 1(a). The SFG consists of m addition nodes X(j), m decision nodes Y(j), and m – 1 modular reduction nodes Z(j). The logic of these nodes is shown in Figs. 1(b)–(d). Here, A0 is the binary representation of A(x), Aj+1 is the result of the modular reduction of Aj for the jth iteration, Pj+1 is the result of the decision node for the jth iteration, Xj is the result of the addition node X(j) for the jth iteration, T is the binary representation of the irreducible polynomial T(x), bi is the ith coefficient of B(x), and C is the binary representation of the final product C(x). Node X(j) performs a bit-addition operation according to (9) and (12). It performs addition on the partial results Pj and Aj using the XOR operation. Node Y(j) performs the decision (or selection) operation according to (9) and (12). It selects between the partial results Pj and Xj using the selection input bi. Node Z(j) performs the modular reduction of the degree by one according to (12). It reduces the degree of Aj by one, and Aj+1 gives the result. Here, i denotes the index of the coefficient of the polynomial under consideration, and j denotes the iteration count. Figure 1Open in figure viewerPowerPoint SFG derived from the proposed multiplication method: (a) SFG, (b) logic of the Y(j) node, (c) logic of the X(j) node, and (d) logic of the Z(j) node. Figure 2(a) shows the proposed cut-set retiming of the SFG, which is performed to obtain a pipelined structure with a reduced critical path. The proposed cut-set retiming allows one addition node, one decision node, and one modular reduction node in each iteration of the cut-set, thus enabling a reduced critical path. It also eliminates the data dependency between the addition node and the modular reduction node by performing them in a single iteration. The critical path after the proposed cut-set retiming amounts to max{TXN, TYN, TZN}, where TXN, TYN, and TZN are the computation times of the addition node, decision node, and modular reduction node, respectively. Other variations of the cut-set retiming have been considered, and the cut-set retiming proposed in Fig. 2(a) is found to provide a good trade-off between the critical path and the latency. Further, the achieved trade-off is found to be comparable to, or better than similar structures that are reported in the literature. Each iteration of the proposed cut-set is wrapped into a single entity called a processing element (PE). Figure 2Open in figure viewerPowerPoint Cut-set retiming of the SFG: (a) proposed cut-set retiming and (b) realization of PE. Figure 2(b) shows the grouping of the nodes of the retimed SFG into PEs based on the proposed cut-set. It can be observed that the PEs obtained from such a grouping of the nodes enables the realization of a regular and modular design consisting of one addition node, one decision node, and one modular reduction node in each PE. The addition node is realized using one XOR operation, the decision node is realized using one 2:1 multiplexing (MUX) operation, and the modular reduction node is realized using one XOR operation and one 2:1 MUX operation. Figure 3(a) shows the systolic design consisting of m PEs realized from the proposed cut-set retimed SFG given in Fig. 2(b). The m – 1 regular PEs (that is, PE[0] to PE[m – 2]) perform the addition, decision, and modular reduction operations concurrently, whereas the mth PE (that is PE[m – 1]) performs only the addition and decision operations concurrently, which is in accordance with the proposed cut-set retimed SFG. The functions of these two types of PEs are shown in Figs. 3(b)–(c). Each regular PE receives Aj, Pj, bi, and T as inputs, and computes the new Aj+1 and Pj+1 values for the next iteration. The mth PE receives Am–1, Pm–1, and bm–1 from the (m – 1)th regular PE, and produces the final result of the finite-field multiplication, C. The regular PE and PE[m – 1] are further decomposed into 2m U-cells and m U-cells, respectively, to derive a regular, scalable structure, and is much simpler for implementation and optimization. Figure 3Open in figure viewerPowerPoint Proposed systolic multiplier realized from the SFG using PEs: (a) systolic multiplier, (b) logic of the regular PE (PE[0] to PE[m – 2]), and (c) logic of PE[m – 1]. The decomposed systolic structure realized using the U-cells is shown in Fig. 4. The first set of m U-cells (corresponding to the addition and decision nodes) of a regular PE are represented column-wise in the upper block, and the second set of m U-cells (corresponding to the modular reduction nodes) of the same PE are represented column-wise in the lower block. The m U-cells (corresponding to the addition and decision nodes) of the mth PE are represented in the upper block in the rightmost column. The first set of m2 U-cells perform the polynomial multiplication operation corresponding to (9), and the second set of (m2 – m) U-cells perform the modular reduction operation corresponding to (12). The inputs to each cell are pi,j, ai,j, bi, and ai,j, ti, am−2,j for the upper and lower blocks, respectively. The values pi,j, ai,j and ti are the ith bit values of the m-bit Pj, Aj and T, respectively, and am−2,j is the (m – 2)th bit value of Aj. Here, i denotes the index of the coefficient of the polynomial under consideration, and j denotes the iteration count. It may be noted that the ai,j in the modular reduction block is right shifted by one bit according to (12). Figure 4Open in figure viewerPowerPoint Proposed systolic multiplier design using U-cells for irreducible polynomials over GF(2m). The details of the circuit and the function of a U-cell are shown in Fig. 5. Each U-cell consists of one XOR gate and one 2:1 MUX. According to (9), the XOR and MUX in the U-cell for the upper block are derived from the addition node and the decision node, respectively. According to (12), the XOR and MUX in the U-cell for the lower block are derived from the modular reduction node. It can be observed that the pipelining registers applied for the systolic structure in Fig. 4 enable concurrent operations such that the critical path is minimized to TX + TM, where TX and TM are the delays of an XOR gate and a 2:1 MUX, respectively. The gate count of the structure is (2m2 – m) XOR gates, (2m2 – m) MUX gates, and m2 latches. The multiplier produces the first output with an initial latency of m clock cycles followed by one output for every clock cycle. Hence, the throughput is one output per clock cycle, with an initial latency of m clock cycles. Figure 5Open in figure viewerPowerPoint Logic diagram of the U-cell. 4 Results and Discussion In this section, the analysis of the hardware complexity and delay of the proposed bit-parallel systolic multiplier are presented, and a comparison is made between the corresponding systolic multipliers for irreducible polynomials that are available in the literature. The comparisons of the application specific integrated circuit (ASIC) and field programmable gate array (FPGA) synthesis results of the proposed multiplier with the corresponding multipliers available in the literature are also presented. 4.1 Comparison of Hardware Complexity and Delay As discussed in the previous section, the proposed systolic multiplier requires (2m2 – m) XOR gates, (2m2 – m) MUXs, and m2 latches. The critical path and latency of the proposed systolic multiplier are obtained as TX + TM and m clock cycles, respectively. The proposed structure gives one output in every clock cycle, with an initial latency of m clock cycles. Table 1 shows a comparison of the hardware complexity, latency, and critical path of the proposed systolic multiplier with corresponding systolic multipliers 16-32 that are reported in the literature for generic irreducible polynomials. Here, TA, TX, TM, T3X, T4M denote the delays of a 2-input AND gate, 2-input XOR gate, 2:1 MUX, 3-input XOR gate, and 4:1 MUX, respectively. The National Institute of Standards and Technology has proposed the use of five irreducible polynomials for cryptographic applications, that is, 163, 233, 283, 409, 571. The polynomial f(x) = x163 + x7 + x6 + x3 + 1 is taken as an example to compare the hardware complexity and delay of various systolic multipliers available in the literature. Traditional CMOS logic is used to estimate the hardware complexity, wherein the transistor counts are six transistors for a 2-input XOR gate, 2-input AND gate, and a 1-bit 2:1 MUX, 16 transistors for a 1-bit 4:1 MUX, and eight transistors for a 1-bit latch. To estimate the delay, real-time circuits from STMicroelectronics are considered, where the typical propagation delays of gates used in the various designs are: 2-input XOR gate (M74HC86) tPD = 12 ns, 2-input AND gate (M74HC08) tPD = 7 ns, 2:1 MUX (M74HC257) tPD = 11 ns, and 4:1 MUX (M74HC153) tPD = 16 ns. A 3-input XOR gate is realized using two 2-input XOR gates. Therefore, the hardware complexity and propagation delay of a 3-input XOR gate are estimated as twelve transistors and tPD = 24 ns, respectively. Table 2 gives the hardware complexity (total transistor count), latency (number of clock cycles), critical path delay (CP), total delay, and percentage reduction in the hardware complexity of all the multipliers considered. Here, the total delay is obtained as the product of the latency and critical path delay. From Table 2, it is observed that the proposed multiplier achieves low hardware complexity when compared to corresponding systolic structures that are available in the literature. Specifically, it achieves hardware reductions of up to 60%, 60%, 63%, 33%, 33%, 42%, 60%, 48%, 60%, 34%, 27%, 44%, 33%, 27%, 38%, 59%, 20%, 65%, and 33% in hardware for m = 163 when compared to corresponding multipliers 16-32, respectively. The latency, critical path, and total delay of the proposed systolic multiplier are comparable to corresponding multipliers. Table 1. Comparison of hardware complexity and delay of the proposed systolic multiplier with corresponding multipliers for irreducible polynomials over GF(2m) Design #AND #XOR #MUX #latches #clock cycles Critical path delay 16 2m2 2m2 0 7m2 3m TA + TX 17 2m2 2m2 0 7m2 3m T A + T3X 18 2m2 – m 2m2 0 8m2 – 7m 2m – 1 T A + TX 19 2m2 2m2 0 3m2 m + 1 T A + TX 20 2m2 2m2 0 3m2 m + 1 T A + TX 21 2m2 2m2 0 4m2 2m T A + TX 22 2m2 2m2 0 7m2 3m T A + TX 23 m m2 + 2m (m2/2)a 6m2 + 8m 3m/2 T4M + TX 24 2m2 2m2 0 7m2 3m T A + TX 25 2m2 + 3m (m2 + m)b 0 3m2 + 4m m + 1 T A + T3X 26 a m 2 m2 + 2m 0 4m2 + 3m 3m T A + TX 26 b m 2 m 2 0 5m2 3m T A + TX 27 2m2 2m2 0 3m2 m T A + TX 28 a m 2 m2 + 2m 0 4m2 + 3m 3m T A + TX 28 b m 2 m 2 0 5m2 4m T A + TX 29 m 2m + (m2/2)b (m2 + m/2)a 7m2 3m/2 T4M + T3X 30 m2 – m + 1 m2 – 1 2m2 + m – 3 2m2 – m 2m T A + TX 31 2m2 2m2 2m2 + m – 3 7m2 3m T A + TX 32 2m2 + 2m 2m2 + 3m 0 3m2 + 4m m/2 + 1 T A + TX Fig. 4 0 2m2 – m 2m2 – m m 2 m TM + TX a: 4:1 MUX b: 3-input XOR gate Table 2. Comparison of total transistor count, number of clock cycles, critical path, total delay, and % reduction in hardware complexity of the proposed systolic multiplier with corresponding multipliers for m = 163 Design #transistors #clock cycles CP (ns) Total delay (ns) % reduction in hardware complexity 16 2,125,520 489 19 9,291 60 17 2,125,520 489 31 15,159 60 18 2,326,988 325 19 6,175 63 19 1,275,312 164 19 3,116 33 20 1,275,312 164 19 3,116 33 21 1,487,864 326 19 6,194 42 22 2,125,520 489 19 9,291 60 23 1,660,644 245 28 6,860 48 24 2,125,520 489 19 9,291 60 25 1,285,418 164 31 5,084 34 26 a 1,175,882 489 19 9,291 27 26 b 1,541,002 489 19 9,291 44 27 1,275,312 163 19 3,097 33 28 a 1,175,882 489 19 9,291 27 28 b 1,382,566 652 19 12,388 38 29 2,076,620 245 40 9,800 59 30 1,061,438 326 19 6,194 20 31 2,445,308 489 19 9,291 65 32 1,284,114 82 19 1,558 33 Fig. 4 848,252 163 23 3,749 N/A Figure 6 shows the comparison of the hardware complexity of the proposed systolic multiplier with corresponding systolic multipliers over a range of m (specifically, m = 2 to 600). From Fig. 6, it is observed that the proposed multiplier achieves low hardware complexity compared to the best of the corresponding multipliers, that is, the multiplier in 30, and the difference in their hardware complexities increases as the order of the finite field increases. Hence, the proposed systolic multiplier achieves better performance in terms of hardware complexity for higher-order finite fields, which is suitable for asymmetric-key cryptographic schemes. The proposed multiplier and the multiplier in 30 were modeled in Verilog, and their functionalities were simulated using the Xilinx Vivado 2014.2 verification tool. Figure 6Open in figure viewerPowerPoint Comparison of hardwar

Referência(s)