Efficient digit‐serial modular multiplication algorithm on FPGA

Artigo Acesso aberto Revisado por pares

Efficient digit‐serial modular multiplication algorithm on FPGA

2018; Institution of Engineering and Technology; Volume: 12; Issue: 5 Linguagem: Inglês

10.1049/iet-cds.2017.0300

ISSN

1751-8598

Autores

Jeng‐Shyang Pan, Pengfei Song, Chunsheng Yang,

Tópico(s)

Numerical Methods and Algorithms

Resumo

IET Circuits, Devices & SystemsVolume 12, Issue 5 p. 662-668 Research ArticleFree Access Efficient digit-serial modular multiplication algorithm on FPGA Jeng-Shyang Pan, Jeng-Shyang Pan Fujian Provincial Key Lab of Big Data Mining and Applications, Fujian University of Technology, People's Republic of China Innovative Information Industry Research Center, Harbin Institute of Technology, Harbin, People's Republic of China Department of Information Management, Chaoyang University of Technology, Taichung City, TaiwanSearch for more papers by this authorPengfei Song, Corresponding Author Pengfei Song sapodillas@163.com Innovative Information Industry Research Center, Harbin Institute of Technology, Harbin, People's Republic of ChinaSearch for more papers by this authorChun-Sheng Yang, Chun-Sheng Yang Innovative Information Industry Research Center, Harbin Institute of Technology, Harbin, People's Republic of ChinaSearch for more papers by this author Jeng-Shyang Pan, Jeng-Shyang Pan Fujian Provincial Key Lab of Big Data Mining and Applications, Fujian University of Technology, People's Republic of China Innovative Information Industry Research Center, Harbin Institute of Technology, Harbin, People's Republic of China Department of Information Management, Chaoyang University of Technology, Taichung City, TaiwanSearch for more papers by this authorPengfei Song, Corresponding Author Pengfei Song sapodillas@163.com Innovative Information Industry Research Center, Harbin Institute of Technology, Harbin, People's Republic of ChinaSearch for more papers by this authorChun-Sheng Yang, Chun-Sheng Yang Innovative Information Industry Research Center, Harbin Institute of Technology, Harbin, People's Republic of ChinaSearch for more papers by this author First published: 17 May 2018 https://doi.org/10.1049/iet-cds.2017.0300Citations: 7AboutSectionsPDF ToolsRequest permissionExport citationAdd to favoritesTrack citation ShareShare Give accessShare full text accessShare full-text accessPlease review our Terms and Conditions of Use and check box below to share full-text version of article.I have read and accept the Wiley Online Library Terms and Conditions of UseShareable LinkUse the link below to share a full-text version of this article with your friends and colleagues. Learn more.Copy URL Share a linkShare onFacebookTwitterLinkedInRedditWechat Abstract For cryptographic applications, such as DSA, RSA and ECC systems, the crypto-processors are required to perform modular multiplication (MM) on large integers over Galois field. A new digit-serial MM method is presented by using a variable size lookup table. The proposed modular multiplier can be designed for any digit-size d and modulus M which only requires simple operations such as addition and shifting. Based on theoretical analysis, the efficient digit-serial MM architecture requires the latency of clock cycles. As a result, the developed architecture can achieve less area–delay product on hardware when compared with previous designs. 1 Introduction Modular multiplication (MM) with large integers is one of the essential operations in public-key cryptosystems. Many cryptographic protocols, such as the RSA scheme [1], elliptic curve digital signature algorithm [2] and Diffie–Hellman key exchange [3], require time-consuming operation of modular exponentiation, which consists of MM operations. For high-performance applications, efficient implementation on hardware is necessary. In order to speed up MM, two significant methods are performed. One adopts the most-significant bit first (MSB-first) methods which are improved on the classical MM in [4-6]. Other uses the least-significant bit first methods which are based on the Montgomery multiplication in [7-12]. The Montgomery modular multiplier (MMM) can be optimised by replacing costly division with multiplication operation and precomputation [13]. The most hardware architecture of MMM realised with the bit-serial structure in [14-17]. The bit-serial multipliers have a low space complexity without store the precomputation lookup table, but they need a large amount of clock cycles to calculate MM when processing one bit a time. To have tradeoffs between space and time complexities, a radix-r Montgomery architecture using a lookup tables is proposed in [12, 18]. However, with the increasing of r, the partial products needs not but no more than clock cycle where the n is the bit length of the modular multiplier [19]. Therefore it is very difficult to extend these structure to support larger radix r. In this paper, a new digit-serial modular reduction (MR) scheme has been proposed. There are two advantages over the other MR method. One is that it allows for an arbitrary bit length of the input value when compared with Barrett reduction [20] and Montgomery reduction [21]. Other is that the modulus can be calculated for any digit-size d and modulus M. Based on the MR scheme, the digit-serial MM can be performed efficiently on hardware which is only requires simple operations such as addition and shifting. The design of digit-serial modular multiplier with a variable size lookup table achieves a latency of clock cycles. This paper is organised as follows: Section 2 briefly reviews the classical MM and Montgomery multiplication. Section 3 develops the digit-serial MR. Based on the reduction scheme, a digit-serial MM is performed by steps. An efficient digit-serial hardware architecture is presented for MM in Section 4. The complexity analysis and Field Programmable Gate Array CLA Carry-Lookahead Adder (FPGA) implementation results are presented in Section 5. Finally, Section 6 discusses conclusions. 2 Preliminaries 2.1 Classical MM Let be the modulus, where the radix is and . Given two integers and in the residue class of M, then the general MM is be defined as The most obvious way of calculating the remainder relies on the division operation in steps 4 and 5 which the detail of classical MM is illustrated in Algorithm 1. In addition, the Barrett MR is similar, but the algorithm uses a precomputed parameter to perform the division which is realised as . Algorithm 1.Classical MMInput: modulus , where and , integers , , where Output: 1: 2: f or down to 0 do 3: 4: 5: 6: end for 7: return Z 2.2 Montgomery multiplication Let be an odd modulus, where and . Then the Montgomery multiplication of the integers X and Y is defined as follows: The Montgomery multiplication Z can be computed instead of division operation with a constant of . Each element of the residue class , which is mapped to Montgomery domain, is represented as , where . The Montgomery multiplication satisfies the following relations: (1) From (1), the Montgomery transform of is equivalent to computation of MM . Let , where , then the Montgomery MR of can be estimated without division as follows: where is depended on the precomputed constant . Assume that the operands X, Y are represented by bits, where and , then Montgomery multiplication of X and Y is (2) Z needs a final subtraction of the modulus M as follows: where Nevertheless, the condition of is required there required no more than one subtraction to reduction. The radix-r Montgomery multiplication is described in Algorithm 2 (see Fig. 1). Fig. 1Open in figure viewerPowerPoint Algorithm 2: Montgomery multiplication 3 Proposed MM scheme 3.1 Modular reduction Given a modulus and an integer , where , and , then the ordinary MR is defined as (3) Obviously, it is equivalent to , where . Lemma 1.Let , where Z, X, M, , then the modulus of X shifted by is .Let be a vector obtain by X, where . Then the concatenation of V is . X can be padded zeros in front of the MSB until .As mentioned above, the modulus can be represented as (4) According to (4), each element in Z can be represented as (5) where . It can speed up the performance of modulo by using precomputed value in (5).Nevertheless, left shifting by ni bits can be rewritten as (6) Lemma 2.Let and be the contiguous elements of , there exists the following relations: where .Then the modulus (4) can be represented as (7) Let be an element of , where and , then the shifting can be performed as follows: (8) where . Each iteration performs the function of . Assume that , where an overflow occurs, then the iteration remainder satisfies: (9) where , , and . Practically, the maximum of E can be obtained as (10) Thus, the bounds of , as follows: (11) According to (4), the solution of the recursive relation can be obtained as (12) where . There needs a final subtraction of the modulus M in (12).Therefore, the digit-serial MR can be executed in clock cycles. Considering the computation (9), a look up table can be generated by to enhance the performance, where Algorithm 3 (see Fig. 2) shows the pseudocode for the radix-2 digit-serial MR. Fig. 2Open in figure viewerPowerPoint Algorithm 3: proposed digit-serial MR 3.2 Modular multiplication Let be a modulus, where and . Given two integers , , where , then MM can be represented as (13) According to (7), Z can be rewritten as (14) Let be the element of Y, where , , and . Then the concatenation of is . Y can be padded zeros in front of the MSB until . Then the result of Z is reconstructed as (15) As mentioned above, the remainder can be iterated by , which can be performed by using Algorithm 3 (Fig. 2). Lemma 3.Let , where , and , then the upper bound satisfies the inequality .Based on (15), the maximum of iteration is . Thus Z can be estimated as (16) where , and variable . Then the bounds and the maximum length of E is . Assume that where an overflow occurs, then the iteration remainder satisfies: (17) where . However, there needs a final subtraction of the modulus if .Based on (12) and (17), the proposed digit-serial multiplication scheme is stated as Algorithm 4 (see Fig. 3). Fig. 3Open in figure viewerPowerPoint Algorithm 4: proposed digit-serial MM 4 Hardware architecture In this section, the architecture of Algorithm 3 (Fig. 2), denoted as MR-New module, is realised on hardware. Based on the MR-new architecture and Algorithm 4 (Fig. 3), a new digit-serial MM architecture can be implemented efficiently. 4.1 Fast adder The performance of proposed MR can be essentially improved by an efficient adder. The traditional CLA is composed of XOR, AND and OR gates, the Block Modified Carry Lookahead Adder [22] logic circuit uses NAND gates to perform CLA. A hierarchical structure can be used for long bit length. The metamorphosis of partial full adder is shown in Fig. 4. The bit Modified Carry Lookahead Adder (MCLA) can be implemented by the basic Partial Full Adder (PFA) unit, where is the block of MCLA circuit and K is the number of bit in each block. When the length of operand increases, it is not practical to use the standard CLA because of the fan-out for carry becomes very large. However, a hierarchical can perform better, and each block uses an Ripple Carry Adder and a Carry Save Adder (CSA), this method could reduce area delay product for fast adder. Fig. 4Open in figure viewerPowerPoint Partial full adder Let i be the hierarchy of stage. PFA has three inputs , and , and three outputs , and . Then the carry of next stage can be expressed as (18) The higher-bit standard CLA can be implemented by the same logic circuit easily. NAND and NOT gate can replace the AND gate of output and simplify the logic circuit. It reduces a gate delay in each stage. For multi-operand adders, there are different tree architectures like Kogge–Stone adder and Brent-Kung adder to optimise system performance. 4.2 MR architecture The hardware architecture of Algorithm 3 (Fig. 2), denoted as MR-new module, is shown in Fig. 5, which consists of two 2-to-1 multiplexers, one fast adder, three registers and one precomputed lookup table. Shift Modular is developed to shift temporary variable in iteration. In addition, the lookup table C can be computed by Tab generator unit in Fig. 6a the select signal of Dual Multiplexer is the carry. As shown in Fig. 6b, the array C can be generated in the ith iteration by the following expression: (19) where . The initial value can still be generated by this module. Fig. 5Open in figure viewerPowerPoint MR architecture using the lookup table Fig. 6Open in figure viewerPowerPoint Precomputed parameter generator(a) Logical circuits, (b) Dual multiplexer unit At the beginning of MR, splitting into a vector V, where each elements has the same width as M, then it can be processed in sequence from MSB. The register stored high bit data after each addition. When performing the for loop, the remainder of Z is after left shifting by d bit. However, if the MSB of becomes high, which means an overflow occurs, the result should accumulate G after clearing the overflow bit which needs one addition operation. Once finished this loop after stages, then it should add Z to the next element in V without shifting. However, it is possible for another overflow, drops it, then adds G to the result. When the iteration reached the last element in V, one subtraction at most would be required to finish the reduction which is equivalent to add G. Finally, the output of Z is equal to . 4.3 MM architecture On the basis of MR-new architecture, a digit-serial MM can be implemented efficiently. As shown in Algorithm 4 (Fig. 3), arrays C and S are first performed. can be generated by in the ith iteration and the corresponding initial values of should be set to X. The logical circuit of array S generator is shown in Fig. 7. The output satisfies the following relation: Fig. 7Open in figure viewerPowerPoint Array parameter generator In order to optimise the architecture of MM-new, a multi-input CSA should be performed. Efficiently implementing Carry Save compressor trees, in terms of area and speed are made possible by using the block redundant adder. In Algorithm 4 (Fig. 3), the operand Y is scanned by digit-size d. There are required d registers to store the array S and each element is where . Then operand should be added in which the length of output is bits and n bits. The block redundant adder is shown in Fig. 8 in which the digit-size is 2 bits and CSA word size is 4. The parallel comparator can also speed up the performance of the MM-new multiplier while estimating the multi-operand adder. Fig. 8Open in figure viewerPowerPoint Multi-operand CSA in redundant form As mentioned above, the MM-new multiplier which is mainly consists of precomputed lookup table unit, MR unit, multi-operand CSA unit and fast adder unit. It is similar to the process of MR. After the array S is initialised in clock cycles, assume that the operand Y is divided into d pieces, then there need clock cycles to calculate the remainder stored in register Z. In each iteration, if the MSB of becomes high, which means an overflow occurs, then Z should be added G in the next iteration. Repeat these operations until reach the end piece of Y A. There may be needed an addition to G if . As a result, the register Z is the modulus of . 5 Evaluation For the purpose of analysis, the TSMC 90-nm cell library information in Table 1 is used to roughly estimate the area and critical path delay. For simplicity, the area requirement and propagation delay of each standard cell are normalised with respect to the full adder. Table 1. Normalised area and delay of the standard cells Cell FA REG Two-input NAND Two-input NOR Three-input NAND Three-input NOR Two-input AND Two-input XOR Three-input XOR 2-to-1 MUX 3-to-1 MUX 4-to-1 MUX area ratio 1.00 0.88 0.16 0.16 0.20 0.20 0.20 0.32 0.68 0.36 0.72 0.96 delay ratio 1.00 — 0.12 0.16 0.20 0.32 0.34 0.34 0.93 0.45 0.63 0.71 The area complexity of MM-new architecture can be approximated by referring to Fig. 9. It is mainly comprised of one multi-operand CSA module, one fast adder module and one MR module for digit-size d. According to Table 1, the area consumption of fast adder is roughly as , then the area complexity of the multiplier can be estimated as , which be approximated to . As shown in Figs. 8 and 9, the maximum delay for calculating temporary value Z is which is about and it needed cycles. When compared with the previous modular multiplier listed in Table 2 where is the required number of pipeline stage of the fast adders and d is the selected digit size, the area of MM-new is larger than the classical MMM in [15], but smaller than that of digit-serial MMM in [23]. In addition, the clock cycles is reduced to the multiplier for small digit size. Fig. 9Open in figure viewerPowerPoint MM-new multiplier Table 2. Complexity comparison of different designs Multiplier Digit-size d Area Cycles Critical path delay Period ADP [15] 1 2.65 [23] 2 4.94 4 8.94 8 12.94 proposed 2 5.13 4 7.13 8 11.13 is the required number of pipeline stage of the fast adders, d is the selected digit size. Table 3 provides the synthesised results of MMM, digit-serial MMM and MM-new, respectively. The platform is Xilinz Vivado 2015.4 tool for Virtex-7, device xc7vx330t, package ffg1157 with speed grade-3. The implementation process used the default Vivado optimisation strategy. It gives the numbers of LUT, flip-flops and the data path delay from TCL command. Compared with MMM and digit-serial MMM, MM-new only used simply operation addition, shifting and comparison. As seen, the MMM has the shortest clock period. However, the proposed multiplier has the lowest area-time-product (ATP) with digit-size . With the increasing of d, the total latency time decreases, but the ATP becomes larger. Table 3. Performances of FPGA Implementation Multiplier Operand size n Digit-size d LUT FF Cycles Data path delay, ns Period, µs Arealatency [15] 512 1 4649 6863 516 1.68 0.87 10.02 1024 1 9277 13,570 1028 2.04 2.10 47.98 2048 1 22,627 26,937 2052 2.30 4.72 233.94 [23] 512 2 4608 3739 266 2.19 0.58 4.84 4 11,541 3903 134 3.08 0.41 6.33 8 20,795 3931 67 6.27 0.42 10.38 1024 2 9304 7492 530 2.23 1.18 19.82 4 22,625 7533 264 3.83 1.01 30.46 8 34,904 7427 132 6.88 0.91 38.52 2048 2 18,636 14,551 1042 3.18 3.31 109.85 4 36,238 15,066 522 4.57 2.39 122.62 8 79,937 14,700 262 7.69 2.01 190.22 Proposed 512 2 4522 1547 260 2.35 0.61 3.70 4 9045 2637 134 3.62 0.49 5.72 8 15,317 4677 74 6.81 0.50 9.99 1024 2 9015 3088 516 2.40 1.24 15.01 4 16,832 5165 262 3.96 1.04 22.88 8 30,339 9279 138 7.03 0.97 38.43 2048 2 18,067 6204 1028 3.92 4.03 97.81 4 33,734 10,315 518 4.81 2.49 109.68 8 62,023 18,489 266 7.85 2.08 167.46 Table 4 provides the implementation results of MM-new on elliptic curve over prime fields. The hardware board is Digilent Nexys 4 DDR which a single 100 MHz crystal oscillator is included. The modulus selected some of the NIST's recommended parameters with digit-size . The hardware implementation uses three-stage state machines and asynchronous reset to eliminate output burrs. In the process of functional verification, the simulation platform uses Intel ModelSim 10.5b Starter Edition and the output data of Testbench are checked by the results of the GNU Multiple Precision Arithmetic Library. The interface with Personal Computer uses the Universal Asynchronous Receiver Transmitter and the baud rate is 57,600 which can communicate with the module under a custom communication protocol. The system clock is set to 10 ns, and other timing constraints are default. Table 4. Implementation results of the multiplier on hardware Modulus M Operand bits LUT FF Slice LUT-FF pairs BUFG Cycles Power 191 1450 988 548 1458 1 100 0.407 223 2655 1160 811 2655 1 116 0.418 255 3064 1317 902 3065 1 132 0.433 383 4131 1966 1172 4132 1 196 0.451 520 4652 2362 1429 2653 1 265 0.466 In summary, the most important improvement of proposed design is the flexible digit size which can decrease the cycles. In addition, the proposed MR algorithm is not limited to the length of the input operand. The proposed multiplication algorithm is essentially the same as the Montgomery multiplication with different carry position. Moreover, the large modulo can be calculated efficiently by basic operation with low hardware complexity. 6 Conclusions In this paper, a new efficient digit-serial MM algorithm is presented to perform the modulo operation in finite fields. The design of digit-serial modular multiplier on hardware with a variable size lookup table achieves a latency of clock cycles. As a result, the developed architecture has area-time efficient in comparison of the related work in the literature. Therefore, the developed algorithm is very suitable for cryptosystem which needed high-speed decryption and encryption. 7 References 1Rivest, R.L., Shamir, A., Adleman, L.: 'A method for obtaining digital signatures and public-key cryptosystems', Commun. ACM, 1978, 21, pp. 120– 126 2IEEE Std 1363-2000: ' IEEE standard specifications for public-key cryptography', 2000 3Diffie, W., Hellman, M.: 'New directions in cryptography', IEEE Trans. Inf. Theory, 1976, 22, pp. 644– 654 4Blakely, G.R.: 'A computer algorithm for calculating the product AB modulo M', IEEE Trans. Comput., 1978, 32, pp. 497– 500 5Takagi, N.: 'A radix-4 modular multiplication hardware algorithm for modular exponentiation', IEEE Trans. Comput., 1992, 41, pp. 949– 956 6Kornerup, P.: ' High-radix modular multiplication for cryptosystems'. Proc. 11th IEEE Symp. Computer Arithmetic (ARITH-11), Windsor, Ont., Canada, 1993, pp. 277– 283 7Walter, C.D.: 'Systolic modular multiplication', IEEE Trans. Comput., 1993, 42, pp. 376– 378 8Koç, Ç.K., Acar, T., Kaliski, B.S.: 'Analyzing and comparing Montgomery multiplication algorithms', IEEE Micro, 1992, 16, pp. 26– 33 9Blum, T., Paar, C.: 'High-radix Montgomery modular exponentiation on reconfigurable hardware', IEEE Trans. Comput., 2001, 50, pp. 759– 764 10Hong, J.H., Wu, C.W.: 'Cellular-array modular multiplier for fast RSA public-key cryptosystem based on modified Booth's algorithm', IEEE Trans. Very Large Scale Integr. Syst., 2003, 11, pp. 474– 484 11Miyamoto, A., Homma, N., Aoki, T., et al.: 'Systematic design of RSA processors based on high-radix Montgomery multipliers', IEEE Trans. Very Large Scale Integr. Syst., 2011, 19, pp. 1136– 1146 12Huang, M., Gaj, K., El-Ghazawi, T.: 'New hardware architectures for Montgomery modular multiplication algorithm', IEEE Trans. Comput., 2011, 60, pp. 923– 936 13Shieh, M.D., Chen, J.H., Lin, W.C., et al.: 'A new algorithm for high-speed modular multiplication design', IEEE Trans. Circuits Syst. I. Regul. Pap., 2009, 56, pp. 2009– 2019 14Tenca, A.F., Koç, Ç.K.: 'A scalable architecture for modular multiplication based on Montgomery's algorithm', IEEE Trans. Comput., 1993, 52, pp. 1215– 1221 15Shieh, M.D., Chen, J.H., Wu, H.H., et al.: 'A new modular exponentiation architecture for efficient design of RSA cryptosystem', IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2008, 16, pp. 1151– 1161 16Kuang, S.R., Wang, J.P., Chang, K.C., et al.: 'Energy-efficient high-throughput Montgomery modular multipliers for RSA cryptosystems', IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2013, 21, pp. 1999– 2009 17Kuang, S.R., Wu, K.Y., Lu, R.Y.: 'Low-cost high-performance VLSI architecture for Montgomery modular multiplication', IEEE Trans. Very Large Scale Integr. Syst., 2016, 24, pp. 434– 443 18Shieh, M.D., Lin, W.C.: 'Word-based Montgomery modular multiplication algorithm for low-latency scalable architectures', IEEE Trans. Comput., 2010, 59, pp. 1145– 1151 19Kaihara, M.E., Takagi, N.: 'A hardware algorithm for modular multiplication/division', IEEE Trans. Comput., 2005, 54, pp. 12– 21 20Barrett, P.D.: ' Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor'. Proc. Int. Cryptology Conf. Advances in Cryptology (CRYPTO '86), 1987, pp. 311– 323 21Montgomery, P.L.: 'Modular multiplication without trial division', Math. Comput., 1985, 44, pp. 519– 521 22Pai, Y.T., Chen, Y.K.: 'The fastest carry lookahead adder'. Proc. Second IEEE Int. Workshop on Electronic Design, Test and Applications (DELTA'04), 2004, pp. 434– 436 23Erdem, S.S., Yank, T., Çelebi, A.: 'A general digit-serial architecture for Montgomery modular multiplication', IEEE Trans. Very Large Scale Integr. Syst., 2017, 25, pp. 1658– 1668 Citing Literature Volume12, Issue5September 2018Pages 662-668 FiguresReferencesRelatedInformation

Ver no editor

Altmetric

PlumX

Entrar

Lembrar minha senha

Receber meu e-mail de confirmação

Efficient digit‐serial modular multiplication algorithm on FPGA