WO2002033885A1

WO2002033885A1 - Modular multiplication for RSA and other asymmetric encryption/decryption

Info

Publication number: WO2002033885A1
Application number: PCT/SE2001/002270
Authority: WO
Inventors: Rolf Sundblad
Original assignee: Novacatus Invest Ab
Priority date: 2000-10-17
Filing date: 2001-10-17
Publication date: 2002-04-25
Also published as: AU2002211130A1; SE0003757D0; SE0003757L; SE517045C2

Abstract

Method and apparatus for modulus multiplication, wherein the multiplication is performed with serial arithmetic, where two parallel computations are performed simultaneously in a first and a second computational chain. The method and the arrangement are preferably used for asymmetric encryption/decryption.

Description

MODULAR MULTIPLICATION FOR RSA AND OTHER ASYMMETRIC ENCRYPTION/DECRYPTION

Field of invention

This invention relates to a method and an apparatus for performing modular multiplication applied to asymmetric encryption/decryption, for example according to the so-called RSA (Rivest-Shamir-Adleman) system.

Background

Since data and applications are increasingly distributed over public networks, the need for protection through, for example, encryption is increasing. One encryption method that is now publicly accepted utilizes public keys for so-called asymmetric encryption. This type of encryption system is used not only for encryption but also for electronic signatures. Perhaps the most well-known and well-established method for encryption and signatures with public keys is the so-called RSA system, named after its inventors Rivest, Shamir and Adleman, and described in: Rivest et al., U.S. Pat. No. 4,405,829, Sep. 20, 1983, "Cryptographic Communications System and Method".

In the RSA system two randomly chosen prime numbers, p and q, and their product m are used. A number e is chosen in such a way that e and (p-1)*(q-1) do not have any common factor larger than 1. The number e is often chosen as 2^16+1 = 65537 as a compromise between encryption security and complexity of calculations. A number d is calculated to satisfy (e*d) modulo ((p-1)*(q-1)) = 1, where the notation "r = x modulo m" means the remainder after an integer division of x by m, such that x = r+k*m, 0<=r<m, and k is an integer 0<=k<m.
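As an illustration of the key relations above, the following minimal sketch derives a key pair from two very small primes. The function name `make_toy_rsa_keys` and the numeric values are assumptions chosen for the example; they do not come from the patent, and primes this small offer no security.

```python
from math import gcd

def make_toy_rsa_keys(p, q, e):
    """Return ((e, m), (d, m)) for small primes p and q (illustration only)."""
    m = p * q
    phi = (p - 1) * (q - 1)
    # e and (p-1)*(q-1) must not share any factor larger than 1.
    if gcd(e, phi) != 1:
        raise ValueError("e shares a factor with (p-1)*(q-1)")
    # d satisfies (e*d) modulo ((p-1)*(q-1)) = 1 (modular inverse, Python 3.8+).
    d = pow(e, -1, phi)
    return (e, m), (d, m)

public_key, secret_key = make_toy_rsa_keys(61, 53, 17)
```

Here `public_key` corresponds to the pair (e,m) and `secret_key` to the pair (d,m).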

The two pairs of numbers, (e,m) and (d,m), are the public and secret keys respectively for the system. The security of the RSA algorithm is based on the assumption that the work required to break the system is equivalent to the work required to factor the modulus m. Since p and q are very large numbers, m will also be a large number. Today, there is no known efficient algorithm for factoring such large numbers in a reasonable time.

When RSA encryption is performed in practice, the information to be encrypted is partitioned into blocks represented by numbers smaller than the modulus m. For each of those numbers M, M^e modulo m is computed to produce an encrypted number C. Decryption is performed by computing C^d modulo m; that is, C = M^e modulo m and M = C^d modulo m. The computations performed in RSA encryption are called modulus exponentiation. Modulus exponentiation means raising one large positive integer to the power of another large positive integer and computing the result modulo a third large positive integer. In practice this should be performed through a series of modular multiplications, i.e. a modulus reduction is done after each step of the computation. This technique limits the size of the operands to the size of the modulus.
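The idea of reducing after each multiplication can be sketched as follows. This is a generic square-and-multiply loop, not the circuit of the invention; the name `mod_exp` and the toy key values (m = 3233, e = 17, d = 2753) are assumptions made for the example.

```python
def mod_exp(base, exponent, m):
    """Compute base**exponent modulo m as a series of modular multiplications."""
    result = 1
    base %= m
    while exponent > 0:
        if exponent & 1:
            # Reduce after every multiplication so operands stay below m.
            result = (result * base) % m
        base = (base * base) % m
        exponent >>= 1
    return result

m, e, d = 3233, 17, 2753      # toy key, far too small for real use
M = 65                        # message block, smaller than m
C = mod_exp(M, e, m)          # encryption: C = M^e modulo m
recovered = mod_exp(C, d, m)  # decryption: M = C^d modulo m
```

No intermediate value in the loop exceeds m*m, which is the point of interleaving the reductions.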

An important problem with modular exponentiation is that it often takes a long time to perform such operations with traditional methods, such as using an ordinary computer. Modulo multiplication is in general hard to perform in a time-efficient manner, especially when the numbers are long, typically >512 bits. In practical implementations modulo multipliers are therefore often realised in some kind of hardware.

Multiplication in general, as well as modulo multiplication, is an operation that can be performed in many different ways, although intuitively one might believe otherwise. Consider the following example taken from Knuth, D.E., The Art of Computer Programming, Seminumerical Algorithms, Ch. 4.3: the multiplication of two numbers (A1 A0)*(B1 B0), each divided into two halves, where A1 and B1 are the most significant halves. As a rule the product of two sums is expanded into four partial products:

(A1*2^n + A0)*(B1*2^n + B0) = A0*B0 + A1*B0*2^n + A0*B1*2^n + A1*B1*2^(2n)

i.e. four multiplications are used. The same product can however also be obtained using only three multiplications:

(A1*2^n + A0)*(B1*2^n + B0) = A0*B0*(1 + 2^n) + A1*B1*(2^n + 2^(2n)) - (A1 - A0)*(B1 - B0)*2^n

This example shows that there are different possibilities to consider with regard to additions and multiplications when designing a hardware multiplier.
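The three-multiplication identity can be checked numerically. The sketch below is only an illustration of the identity quoted from Knuth; the function name and the choice of splitting at bit position n are assumptions made for the example.

```python
def product_with_three_multiplications(a, b, n):
    """Multiply a and b, each split into halves at bit n, using three products."""
    mask = (1 << n) - 1
    a1, a0 = a >> n, a & mask   # a = a1*2^n + a0
    b1, b0 = b >> n, b & mask   # b = b1*2^n + b0
    low = a0 * b0                   # multiplication 1
    high = a1 * b1                  # multiplication 2
    cross = (a1 - a0) * (b1 - b0)   # multiplication 3 (may be negative)
    # Multiplying by a power of two is only a shift in hardware.
    return (low * (1 + (1 << n))
            + high * ((1 << n) + (1 << (2 * n)))
            - cross * (1 << n))

example = product_with_three_multiplications(0xBEEF, 0x1234, 8)
```

The saving of one multiplication is paid for with extra additions and subtractions, which is the trade-off the text refers to.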

It is well known in the field of digital design that a multiplication is more demanding than an addition regarding the number of components used, the consumed silicon area, the time efficiency and the price. The choice between multiplication operations and addition operations may therefore significantly affect one or more of the parameters given above when a hardware multiplier is designed.

When a multiplier is implemented a so-called FPGA can be used as an implementation platform. An FPGA (Field-Programmable Gate Array) is an integrated circuit whose function can be programmed by the user. This technique has recently become a popular replacement for so called ASIC (Application Specific Integrated Circuit) which is an integrated circuit that is designed to perform an application specific function. The reason for this is the high costs for design and production of ASICs.

An FPGA contains, among other things, programmable elements called LUTs (Look-Up Tables). A designer using FPGAs wants to minimise the number of LUTs and other resources used when implementing a certain function with a certain specification regarding speed and throughput of the circuit. Conversely, the designer may request minimal computation time using a specified amount of LUTs and other resources. There are in principle no problems in implementing RSA in hardware. There are however problems in fulfilling demands on computation time, number of LUTs used and other resources, especially for the modulo multiplications.

Description of related art

The previously mentioned RSA algorithm according to Rivest-Shamir-Adleman (RSA) is today the most widely used asymmetric algorithm for encryption/decryption of data. It is described in, for example, US-A-4 405 829.

US4852037, Hiromichi Aoki, "Arithmetic Unit for Carrying Out Both Multiplication and Addition in an Interval for the Multiplication" describes a multiplier for multiplication of integers that is suitable for use together with a microprocessor core. The multiplier computes A*B+C. The disadvantage with this multiplier when used for multiplication of large numbers, such as those used in public key systems, is that the multiplier will become unnecessarily large in terms of used silicon area and that the signal paths will be long, which limits the clock frequency.

US4939687, Richard I. Hartley and Sharbel E. Noujaim, "Serial-Parallel Multipliers Using Serial as well as Parallel Addition of Partial Products", Jul. 3, 1990, describes a multiplier for integers which utilises a bit-serial multiplication technique. This kind of technique, applied in signal processing, was previously published in Wanhammar, Lars, "An Approach to LSI Implementation of Wave Digital Filters", dissertation, 24 April 1981, ISBN 91-7372-440-8. The multiplier in US4939687 is used in signal processing. The bit-serial approach gives a smaller multiplier than can be realised with other techniques. The maximum clock frequency is however limited as a function of the length of the operands, due to the long signal path in AIN; see Fig. 1 of US4939687.

US5121431, Michael J. Wiener, "Processor Method of Multiplying Large Numbers" describes an implementation of the RSA system in software on a DSP (Digital Signal Processor).

US5349551, John Petro, "Device for and Method of Performing an N-bit Modular Multiplication in Approximately N/2 Steps" describes a modular multiplier where reduction modulo m is performed by iterative subtraction of multiples of the modulus (1*m, 2*m or 4*m). These operations are performed in iterations and integrated with the multiplication of the input operands. This method is area consuming, since two parallel subtractions of the modulus have to be done very fast. Furthermore, two units are needed according to US5349551, Fig. 2, units 50 and 52. It can also be noted that these units will limit the maximum clock frequency if the multiplier is to be used for large numbers.

A very popular method for modulus multiplication is Montgomery's method, described in: Montgomery, P. L., "Modular multiplication without trial division", Mathematics of Computation, vol. 44, pp. 519-521, 1985. The method has the advantage that the tricky reduction modulo m is rewritten as reduction modulo 2^g (for some integer g), which is easy to implement in hardware. The method will however require a complicated control machine, which gives an unnecessary increase in the chip area used. The signals in the control machine are also dependent on the treated data, which causes problems at high clock frequencies. Since the machine has to be designed for a specific length of the modulus, it will be very inflexible when used with a variable modulus length. Examples of patents where the Montgomery method for modulus multiplication is used and the issues given above are discussed are as follows:

US5513133 Compact Microelectronic Device for Performing Modular Multiplication and Exponentiation Over Large Numbers;

US5742530 Compact Microelectronic Device for Performing Modular Multiplication and Exponentiation Over Large Numbers;

US5745398 Method for the Implementation of Modular Multiplication According to the Montgomery Method;

US5764554 Method for the Implementation of Modular Reduction According to the Montgomery Method;

US5948051 Device Improving the Processing Speed of a Modular Arithmetic Coprocessor;

US5954788 Apparatus for Performing Modular Multiplication;

US5961578 Data Processor and Microcomputer;

US5982900 Circuit and System for Modulo Exponentiation Arithmetic and Arithmetic Method of Performing Modulo Exponentiation Arithmetic;

US5987489 Modular Arithmetic Coprocessor Enabling the Performance of Non-Modular Operations at High Speed;

US5999953 Method for the Production of a Parameter J.sub.o Associated with the Implementation of a Modular Operation According to the Montgomery Method;

US6026421 Apparatus for Multiprecision Integer Arithmetic;

US6088453 Scheme for Computing Montgomery Division and Montgomery Inverse Realizing Fast Implementation;

US6163790 Modular Arithmetic Coprocessor Comprising an Integer Division Circuit;

US6185596 Apparatus & Method for Modular Multiplication & Exponentiation Based on Montgomery Multiplication;

US6209016 Co-Processor for Performing Modular Multiplication;

US6230178 Method for the Production of an Error Correction Parameter Associated with the Implementation of a Modular Operation According to the Montgomery Method;

US6240436 High Speed Montgomery Value Calculation;

EP0502782 Microcircuit for the Implementation of RSA Algorithm and Ordinary and Modular Arithmetic, in Particular Exponentiation, with Large Operands;

PCT WO 00/42484(A2) Acceleration and Security Enhancements for Elliptic Curve and RSA Coprocessors.

To sum up, the prior art consists of multiplication units that are not directly usable for modulus multiplication; modulus multiplication units based on parallel subtraction and addition, which require a large silicon area and are not usable at high clock frequencies; or multiplication according to the Montgomery method, which is used when the length of the modulus can be determined in the design phase.

Object of the invention

The general purpose of the invention is to solve the problem of inefficient modulus multiplication, i.e. to provide a fast and resource-efficient modulus multiplier.

One aspect of the problem is to provide a modulus multiplier that, when implemented, permits the silicon area of the circuit to be minimised, or at least gives a smaller area than existing comparable multipliers. Other aspects of the problem are as follows:

- to provide a modulus multiplier that can be run at the highest clock frequency permitted by the underlying technology;

- to provide a modulus multiplier that permits parallel computation of several numbers, which gives a throughput time that is significantly shorter than the response time;

- to enable the use of the RSA system in new demanding applications, through a modulus multiplier that, lacking a complex control machine, can be used at significantly higher clock frequencies than is possible today;

- to provide a multiplier that is flexible in such a way that, if several multipliers are connected together, they can process in parallel numbers belonging to different moduli with different lengths.

Summary of the invention

The invention is based on the inventor's understanding and application of a way of performing modulus multiplication that is well suited to the purpose. Modulus multiplication implies, as described in the background, that the equation

A*B mod(m) = rest(A*B/m) (4)

should be computed. According to the invention, Equation 4 is reformulated to

A*B mod(m) = A*B - (int(1/m * A * B)) * m (5)

where 1/m and m are considered to be constants.

In many applications that include modulus multiplication, for example asymmetric encryption/decryption, the same constants are often used in several operations. In such a case 1/m only has to be computed once, initially, and no division has to be performed when computing Equation (5). The fact that 1/m is only computed initially and that 1/m and m are considered to be constants provides a fast, effective and resource-efficient method of performing modulus multiplication.

According to a first aspect of the invention a modulus multiplier is realised that applies this relation in hardware based on serial arithmetic. The efficient mathematical algorithm provides a basis for a very area-efficient realisation of a modulus multiplier, which enables the area cost to be minimised. According to a second aspect of the invention, the efficient mathematical algorithm combined with the hardware architecture according to the invention gives a time-efficient modulus multiplier. In practice, a modulus multiplier is thereby realised that can be clocked at the maximum clock frequency permitted by the underlying technology. The simplicity of the control machine, which is independent of the treated data, makes it possible to clock the modulus multiplier at higher clock frequencies, thus enabling the use of the RSA system in new demanding applications.
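Equation (5) with a precomputed 1/m can be sketched in integer arithmetic as follows. The sketch assumes that 1/m is held as a fixed-point value scaled by 2^(2k), where k is the bit length of m, and that a final correction step compensates for the truncation of that value; both the scaling and the correction loop are assumptions made for the example and are not stated in the text above. The function names are likewise hypothetical.

```python
def precompute_reciprocal(m, k):
    """Fixed-point stand-in for 1/m: floor(2^(2k) / m), computed once per modulus."""
    return (1 << (2 * k)) // m

def mod_mul(a, b, m, reciprocal, k):
    """Compute a*b mod m without division, assuming a, b < m < 2^k."""
    p1 = a * b                           # A*B
    p2 = (p1 * reciprocal) >> (2 * k)    # approximates int(1/m * A * B)
    p3 = p2 * m                          # (int(1/m * A * B)) * m
    r = p1 - p3                          # A*B - (int(1/m * A * B)) * m
    while r >= m:                        # correction for the truncated reciprocal
        r -= m
    return r

m = 3233
k = m.bit_length()                # 12 bits
recip = precompute_reciprocal(m, k)
result = mod_mul(1234, 3000, m, recip, k)
```

Only `precompute_reciprocal` contains a division, and it runs once per modulus; every subsequent `mod_mul` uses multiplications, a shift and subtractions.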

The hardware structure according to the invention consists of three sequential multiplications, preferably divided into two computational chains. This structure provides a modulus multiplier that permits simultaneous computation of several incoming numbers, which gives a throughput time that is considerably shorter than the response time. These concepts are defined below.

By using a modular design of the multiplier elements and the computational chains, a multiplier is realised that is scalable and flexible through the linking of several multipliers. These multipliers can compute in parallel numbers with moduli of different lengths. The length of the modulus determines which part of the multiplier is used, such that numbers with a shorter modulus are computed faster. In addition, this gives the architecture the possibility to compute numbers with a short modulus without limiting the possibility to compute numbers with a long modulus.

Different aspects and embodiments of the invention are achieved by means of the following features.

A method for bit-serial modulus multiplication comprising the following steps:

- feeding as a first input value a multiplication factor serially to a first computational chain;

- feeding as a second input value a parallel multiplication factor to the first computational chain;

- multiplying in the first computational chain all bits in the second input value with all bits in the first input value, resulting in a first partial result;

- completing the multiplication by setting the first input value to zero and further computing a second partial result, where the complete result is the second partial result followed by the first partial result.
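The steps above can be modelled at a functional level as follows. This is a behavioural sketch only: it uses a single integer accumulator instead of individual gates, feeds the serial factor least significant bit first, and the function name and the choice of flush length are assumptions made for the example.

```python
def serial_parallel_multiply(a, b, n):
    """Multiply a (fed serially, LSB first, n bits) by b (applied in parallel).

    Returns (first_partial, second_partial); the complete product is the
    second partial result followed by the first, i.e. (second << n) | first.
    """
    acc = 0
    first = 0
    for k in range(n):
        a_bit = (a >> k) & 1
        acc += a_bit * b            # one serial bit times all bits of b
        first |= (acc & 1) << k     # one result bit leaves per clock cycle
        acc >>= 1
    # Complete the multiplication by feeding zeros on the serial input.
    second = 0
    for k in range(max(b.bit_length(), 1)):
        second |= (acc & 1) << k
        acc >>= 1
    return first, second

first, second = serial_parallel_multiply(13, 11, 4)
product = (second << 4) | first
```

The first loop produces the n least significant bits of the product while the serial factor is being fed in; the second loop drains what is left in the accumulator.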

A further elaboration comprises the steps of: moving a carry length received in the first computational chain to a second computational chain; optionally starting a new computation with a second input value to the first computational chain while the previous computation is completed in the second computational chain; and serially outputting a first intermediate result from the first computational chain simultaneously with the output of a second intermediate result from the second computational chain.

Yet another elaboration comprises three consecutive multiplier sequences of the kind described above, where:

- the first multiplier sequence has a first and a second input value;

- the second multiplier sequence has a constant as its first input value and the output from the first multiplier sequence as its second input value;

- the third multiplier sequence has a second constant as its first input value and the result from the second multiplier sequence as its second input value;

- a result of the modulus multiplication is formed as the difference between the results from the first and the third multiplier sequence.

In a preferred embodiment the first input value to the second multiplier sequence is input in parallel form and the second input value in serial form; likewise, the first input value to the third multiplier sequence is input in parallel form and the second input value in serial form.

Further comprised are the steps of:

- computing a first partial result in a first multiplication sequence depending on a first and a second input value;

- intermediately storing the result from the first multiplication sequence in a memory;

- inputting to the second multiplication sequence a first input value that is a first constant and a second input value that is the result from the first multiplication sequence, taken from said memory;

- intermediately storing the result from the second multiplication sequence in a memory;

- inputting to the third multiplication sequence a first input value that is a second constant and a second input value that is the result from the second multiplication sequence, taken from said memory;

- intermediately storing the result from the third multiplication sequence in a memory;

- creating a result of the modulus multiplication as the difference between the results from the first multiplication sequence and the third multiplication sequence, taken from said memory.

Preferably the modulus multiplication is expressed as A*B mod(m) = A*B - (int(1/m * A * B)) * m and computed with the following steps:

- compute the product of a first (A) and a second (B) input value, wherein a first partial result (P₁) is stored;

- compute the integer part of the product of the first partial result (P₁) and a third input value (1/m), wherein a second partial result (P₂) is stored;

- compute the product of the second partial result (P₂) and a fourth input value (m), wherein a third partial result (P₃) is stored;

- compute the difference between the first partial result (P₁) and the third partial result (P₃), wherein a final result is obtained.

The memory elements included in the first computing chain are put in a reset state depending on a reset signal. The length of the signal paths in the serial arithmetic computing unit is limited with a delay element. Preferably, the length of the first constant is one or more bits longer than the second constant.

Furthermore comprising the steps of:

- storing a first and a second constant in a memory;

- storing a third input value in a memory;

- storing an exponent for encryption in an asymmetric crypto-system in a memory;

- setting an intermediate result to said third input value;

- updating said intermediate result by performing a modulus multiplication with the intermediate result as a first input value and either the intermediate result or said third input value as the second input value, depending on the binary sequence of the encryption exponent;

- repeating the preceding step while successively choosing bits in the binary sequence of the encryption exponent, thereby obtaining modulus exponentiation;

- saving the final intermediate result as a final result.

The selection of the second input value is performed in such a way that the sequence of calculations is independent of the treated data, by storing said intermediate result before said modulus multiplication and choosing the result as either the stored intermediate result or the result obtained from the modulus multiplication.
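One possible reading of the exponentiation steps and of the data-independent selection is sketched below. The interpretation that every exponent bit triggers one squaring and one multiplication, with the multiplication result either kept or discarded, is an assumption; so are the function names. `mod_mul` is a plain software stand-in for the modulus multiplier, and ordinary Python integers give no timing guarantees.

```python
def mod_mul(a, b, m):
    """Software stand-in for the hardware modulus multiplier."""
    return (a * b) % m

def mod_exp_fixed_sequence(x, e, m):
    """Compute x^e mod m with the same sequence of multiplications for every bit."""
    if e < 1:
        raise ValueError("exponent must be at least 1")
    r = x % m                                   # intermediate result set to the input value
    for i in range(e.bit_length() - 2, -1, -1):
        r = mod_mul(r, r, m)                    # second input = intermediate result
        stored = r                              # store before the next multiplication
        candidate = mod_mul(r, x, m)            # second input = original input value
        # Choose between the stored value and the new product.
        r = candidate if (e >> i) & 1 else stored
    return r

cipher = mod_exp_fixed_sequence(65, 17, 3233)
plain = mod_exp_fixed_sequence(cipher, 2753, 3233)
```

Because both multiplications are carried out for every bit, the number and order of multiplier operations depend only on the length of the exponent, not on its bit values.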

The method for modulus multiplication mentioned above is usefully applied in asymmetric encryption.

The invention is usefully implemented as an apparatus in the form of a bit-serial modulus multiplier, comprising: a serial input for a multiplication factor as a first input value to a first computing chain; a parallel input for a multiplication factor that is a second input value to the first computing chain; a serial arithmetic computing unit, realised as said first computing chain, arranged to multiply the current bit of the first input value on the serial input with all bits of the second input value on the parallel input; and a serial output for a first intermediate result from said first computing chain.

Preferably each of the process elements of said first computing chain comprises a multiplier in the form of an AND-gate, two delay elements and one three-input full adder. The delay elements are arranged in such a way that they are set to zero with a reset control signal.

A first computing element in the first computing chain preferably consists of only an AND-gate and a delay element. This delay element is usually arranged to be set to zero with a reset control signal. One delay element is arranged at the end of the first computing chain. Furthermore comprising: a second computing chain coupled to the first computing chain; a signal path to move a carry length acquired in the first computing chain to the second computing chain; a serial output for a second intermediate result from the second computing chain.
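A chain of such elements can be simulated bit by bit. The sketch below models a conventional carry-save serial/parallel multiplier and is not taken from the patent's figures: it assumes that the two delay elements of each element hold a sum bit and a carry bit, that the full adder adds the AND-gate output, the delayed sum of the neighbouring element and the element's own delayed carry, and that the element handling the most significant parallel bit is the one consisting of only an AND-gate and a delay element.

```python
def cell_chain_multiply(a, b, n, w):
    """Simulate w process elements multiplying serial a (n bits) by parallel b (w bits)."""
    b_bits = [(b >> i) & 1 for i in range(w)]
    s = [0] * w    # sum delay elements, all reset to zero
    c = [0] * w    # carry delay elements (not used by the last element)
    out = 0
    for k in range(n + w):                       # n data cycles, then w flush cycles
        a_bit = (a >> k) & 1 if k < n else 0     # serial input is zero during flush
        new_s = [0] * w
        new_c = [0] * w
        for i in range(w - 1):
            # AND-gate plus three-input full adder.
            total = (a_bit & b_bits[i]) + s[i + 1] + c[i]
            new_s[i] = total & 1
            new_c[i] = total >> 1
        # Element with only an AND-gate and a delay element.
        new_s[w - 1] = a_bit & b_bits[w - 1]
        s, c = new_s, new_c
        out |= s[0] << k                         # serial output, LSB first
    return out

value = cell_chain_multiply(13, 11, 4, 4)
```

Every signal in this model travels at most from one element to its neighbour within a clock cycle, which is why the critical path does not grow with the operand length.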

The second computing chain comprises process elements each having two delay elements and two multiplexers. The two multiplexers are controlled by a common control signal. Furthermore, a reset control signal is arranged for the delay elements in the first computing chain, where this reset control signal is connected to, and constitutes the control signal for, the two multiplexers.

A delay stage is arranged to limit the length of the signal paths in the serial arithmetic computational unit. This is arranged such that other signal paths, such as signals for computed numbers and control signals, are also delayed, so that computed data remain synchronized after the delay element.

In an embodiment there is further comprised:

- three consecutive modulus multipliers of the previously mentioned type;

- a memory for storage of constant values;

- where the first modulus multiplier has a first and a second input;

- where the second modulus multiplier has a first input that is communicatively coupled to said memory and a second input that is communicatively coupled to an output from the first modulus multiplier;

- where the third modulus multiplier has a first input that is communicatively coupled to said memory and a second input that is communicatively coupled to an output from the second modulus multiplier;

- a difference-forming element communicatively coupled to an output from the first modulus multiplier and an output from the third modulus multiplier;

- an output from the difference-forming element.

Preferably, the first input of the second modulus multiplier is the parallel input and the second input is the serial input, and the first input of the third modulus multiplier is the parallel input and the second input is the serial input.

The modulus multiplier can be configured such that: the serial input to the modulus multiplier is connected to a multiplexer; the parallel input to the modulus multiplier is connected to a parallel to serial converter; the output from the first computing chain and the output from the second computing chain are connected to a parallel to serial converter; and said parallel to serial converter is connected by a parallel port to a memory unit.

In a special application of the modulus multiplier, such as encryption, the multiplier is arranged to: store a first and a second constant in a memory; store a third input value in a memory; store an encryption exponent for an asymmetric crypto system in a memory; set an intermediate result to said third input value; update said intermediate result by performing a modulus multiplication with the intermediate result as a first input value and either the intermediate result or said third input value as the second input value, depending on the binary number sequence of the encryption exponent; repeat the previous step while successively choosing bits in the binary sequence of the encryption exponent, thus obtaining modulus exponentiation; and save the final intermediate result as the final result. It is then preferably arranged such that the choice between the two alternatives for the second input value is always made such that the computing sequence is independent of the treated data.

Different embodiments of the apparatus can be applied in different versions for asymmetric encryption/decryption, comprising a modulus multiplier according to any of the previously mentioned aspects.

Put more simply, the multiplication is performed with serial arithmetic in such a way that two parallel computations are performed simultaneously in a first and a second computational chain. A carry length received in the first computational chain is moved to the second computational chain, and thereafter a new computation is started in the first computational chain.
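The overlap between the two chains can be illustrated with a behavioural model. The sketch below treats what is moved between the chains as the remaining accumulator content of the first chain, which is an assumption about the meaning of the moved state; it also assumes that all operands are n bits wide. The function name is hypothetical.

```python
def run_two_chains(operations, n):
    """Process a queue of (a, b) pairs, a and b < 2^n, on two overlapping chains.

    Returns (products, cycles).
    """
    chain1 = 0        # state of the first chain
    chain2 = 0        # state handed over to the second chain
    lows, highs = [], []
    cycles = 0
    queue = list(operations) + [None]    # final pass only drains the second chain
    for index, op in enumerate(queue):
        low = high = 0
        for k in range(n):
            if op is not None:
                a, b = op
                chain1 += ((a >> k) & 1) * b
                low |= (chain1 & 1) << k     # low half leaves the first chain
                chain1 >>= 1
            high |= (chain2 & 1) << k        # high half of the previous product
            chain2 >>= 1
            cycles += 1
        if op is not None:
            lows.append(low)
        if index > 0:
            highs.append(high)
        chain2, chain1 = chain1, 0           # hand over, then reset the first chain
    products = [(h << n) | l for l, h in zip(lows, highs)]
    return products, cycles

products, cycles = run_two_chains([(13, 11), (7, 9), (15, 15)], 4)
```

In this model three 4-bit multiplications take 16 cycles instead of the 24 that three isolated multiplications of 2n cycles each would need, which illustrates why the average time per operation in a queue can be shorter than the time for a single operation.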

An arrangement for modulus multiplication comprising a serial arithmetic unit (1) arranged to perform the multiplication comprises a number of circuits (20) in the serial arithmetic unit (1), including a first and a second computational chain, which are arranged to perform at least two parallel computations simultaneously. Preferably the first computational chain is arranged to move a carry length received during the computation to a second computational chain and then start a new computation.

In one embodiment there is comprised a memory (2) coupled to the arithmetic unit (1), where input data, output data and computed intermediate results are stored and at least one unit (3) for conversion of numbers from a first format to a second format connected to the memory (2) and the arithmetic unit (1).

Definitions

Throughput time means the average processing time in the modulus multiplier according to the invention when it is processing a queue of operations.

Response time, or latency time, means the time from the point when a modulus multiplier according to the invention starts a computation until the modulus multiplier has finished the computation.

Used area means the number of LUTs used to realise a modulus multiplier according to the invention. It should be noted in this connection that FPGAs from different manufacturers have different internal designs, and thus the number of LUTs cannot always be compared directly between manufacturers. For an ASIC implementation the area can be computed directly in square millimetres, or be given as the number of transistors used.

According to prior art, a multiplier for modulus multiplication can comprise one or more computational units and a control machine, which controls the computational units. When estimating the area used and the possible processing time, usually expressed as the maximum allowed clock frequency, it should be noted that it is usually the control machine that limits the maximum clock frequency. The control machine should also be included when the area of the modulus multiplier is estimated, since some methods for modulus multiplication require a complex control machine.

Brief description of the drawings The invention will in the following be described with reference to the attached drawings, in which:

Fig 1 describes a block diagram of an embodiment for modulus multiplication according to the invention;

Fig 2 shows a detail of an embodiment; Fig 3 shows a flow diagram for an embodiment of the method according to the invention;

Fig 4 shows a modulus multiplier according to an embodiment of the invention implemented as an exponentiation unit;

Fig 5 shows an embodiment of a modulus multiplier based on Barrett's method;

Fig 6 shows a partial detail view of an embodiment of a modulus multiplier;

Fig 7 shows a multiplier comprising a chain of addition units;

Fig 8 shows a pre-serialisation unit to a multiplier;

Fig 9 shows a pre-parallelisation unit to a multiplier;

Fig 10 shows a delay element to a multiplier;

Fig 11 shows a first embodiment of an addition unit in a multiplier;

Fig 12 shows a multiplication unit in the form of a chain of addition units according to Fig 11;

Fig 13 shows a second embodiment of an addition unit;

Fig 14 shows a multiplication unit in the shape of a chain of addition units according to Fig 13;

Fig 15 shows an overview of an arrangement with three parallel modulus multipliers according to the invention; and

Fig 16 shows a serial subtraction unit included in different embodiments of the invention.

Detailed descriptions of embodiments of the invention

The invention can be used in different applications, but as an example the invention is explained below with the background of the specific problems met in conjunction with RSA-encryption. According to the RSA algorithm encryption and decryption of a message M is done with the following equations:

C=M^e mod(m) (1)

M=C^d mod(m) (2)

The equation 1 performs the encryption while the equation 2 performs the decryption. C is the encrypted message, e and m are the public keys and d is a secret or private key. M, C, e, m and d are numbers with the length n.
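As an illustration of Equations (1) and (2), the following minimal sketch runs a toy RSA round trip. The primes and the exponent are assumptions chosen only to keep the numbers small; they are not taken from this description, and real keys are far longer.

```python
# Toy RSA round trip following Equations (1) and (2).
# p, q and e are illustrative assumptions, not values from the description.
p, q = 61, 53
m = p * q                      # public modulus m = p*q
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, no common factor with phi
d = pow(e, -1, phi)            # private exponent, (e*d) mod phi == 1 (Python 3.8+)


def encrypt(M: int) -> int:
    """Equation (1): C = M^e mod(m)."""
    return pow(M, e, m)


def decrypt(C: int) -> int:
    """Equation (2): M = C^d mod(m)."""
    return pow(C, d, m)
```

Decrypting an encrypted message returns the original message for every M in the range 0<=M<m.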

In both cases the operation X^Y mod(m) is performed with n-bit numbers, which nowadays can typically be more than a thousand bits long. This operation is computed according to the following:

X^Y mod(m) = rest(X^Y / m) (3)

The computation above is performed by doing a number of modulus multiplications. A modulus multiplication implies that the following should be computed:

A*B mod(m) = rest(A*B / m) (4)

The algorithm according to the invention and the modulus multiplier utilises the fact that multiplication of large numbers is performed more efficiently with serial arithmetic. Equation 4 has been reformulated as follows:

A*B mod(m) = A*B - (int(1/m * A * B)) * m (5)

where 1/m and m are considered as constants. In, for example, encryption/decryption the same constants are used in several operations. In such cases 1/m only needs to be computed once, initially, and consequently no divisions are performed when computing Equation (5).
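Equation (5) can be checked numerically with a short sketch. It models the constant 1/m as an exact rational number so that no rounding occurs; the function name make_modmul is hypothetical and the exact-rational representation is an assumption made for clarity, not the representation used in the hardware.

```python
from fractions import Fraction


def make_modmul(m: int):
    """Return a function computing A*B mod(m) according to Equation (5)."""
    inv_m = Fraction(1, m)            # the constant 1/m, computed only once

    def modmul(a: int, b: int) -> int:
        k = int(inv_m * a * b)        # int(1/m * A * B), exact for a, b >= 0
        return a * b - k * m          # A*B - k*m

    return modmul
```

Once make_modmul(m) has been called, every subsequent modulus multiplication with the same m uses only multiplications and one subtraction.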

The hardest part in a modulus multiplication is the modulus reduction. The paper Barrett, P., "Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor", in Odlyzko, M. (ed.) CRYPTO 86, LNCS 263, pp 311-323, Springer-Verlag, Berlin 1987, describes how the equation r = A*B mod m is solved as A*B=r+k*m, where the remainder r, 0<=r<m, and k are integers. The number k is computed by taking the number (A*B) and multiplying it with an initially computed constant (1/m) in such a way that k=INT((A*B)*(1/m)). This can be done with integer multiplication, by logically including a binary point at the right place.

We find that a modulus multiplication r=A*B mod m can thus be computed with three integer multiplications, see Fig 3:

W1=A*B (6)

W2=INT(W1*(1/m)) (7)

W3=m*W2 (8)

and finally the requested value r is obtained by performing a subtraction:

r=W1-W3 (9)

It is obvious that round-off errors cannot be tolerated in the operation (7). This can be solved by adding a half (i.e. 0.5) to the number (A*B) before the multiplication. We note that the number (A*B) is 2n bits long, and also that the number (1/m) is 2n bits long. Barrett also publishes formulas where the size of the operands to the multiplication (7) can be chosen to be less than 2n bits, namely n+1 bits. However, a new problem is then introduced: the number W2 is computed in an interval k0-3<=W2<=k0, where k0 is the correct k-value, defined as k0=INT((A*B+1/2)/m). From a general point of view we can set AB=A*B<m^2. Let us remove the n-1 least significant bits from AB:

W4=AB-u

and correspondingly for (1/m):

W5=((1/m)*2^(2n))-v

The factor 2^(2n), where n is the length of the modulus, rescales the (1/m)-number to an n-bit integer followed by a cyclic infinite binary expansion. The fraction expansion v for (1/m)*2^(2n) can be written as

v=(1/m)*2^(2n) - INT((1/m)*2^(2n))

The multiplication (7) then becomes:

W2=INT(W4*W5)=INT((AB-u)*(((1/m)*2^(2n))-v))

The number u, consisting of the removed bits in AB, cannot be too large. Put u<=2^(n-1)-1.

The Equations

INT((AB-u)*(((1/m)*2^(2n))-v)) <= k0 (10a)

k0-1 <= INT((AB-u)*(((1/m)*2^(2n))-v)) (10b)

can now be used to solve for the number v, i.e. to decide an interval in which (1/m)*2^(2n) should lie to make the multiplication (7) give one of two possible results: W2=k0 or W2=k0-1. The subtraction (9) is then performed as two simultaneous and parallel subtractions r1=W1-W3 and r2=r1-m. If r2 is negative, r=r1; otherwise r=r2. We note that the multiplier (7) must manage one bit or a few bits more than the nominal size n of the operators, which implies a slight increase in delay and used area if all three multipliers support this. We note also that the problems existing in modulus multiplication according to Montgomery's method can be replaced with a multiplication of two numbers, each with a length of approximately n+1 bits, where the exact relation can be extracted from Equations (10a) and (10b).
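The scheme with a truncated product, a scaled integer reciprocal and the two candidate remainders r1 and r2 can be sketched as follows. The scaling exponents s=2n+1 and t=n-2 are assumptions of this sketch, chosen so that the estimate W2 is provably either k0 or k0-1 when m has exactly n bits and A, B < m; they are not derived from Equations (10a) and (10b), and the function names are hypothetical.

```python
def barrett_setup(m: int):
    """Precompute the scaled reciprocal for a modulus m >= 2."""
    n = m.bit_length()               # modulus length; the top bit of m is set
    s, t = 2 * n + 1, n - 2          # assumed scaling (not from Eqs. 10a/10b)
    mu = (1 << s) // m               # INT((1/m) * 2^s), at most n+2 bits
    return s, t, mu


def barrett_modmul(a: int, b: int, m: int, params) -> int:
    """Compute a*b mod m for 0 <= a, b < m without any division."""
    s, t, mu = params
    w1 = a * b                           # W1 = A*B
    w2 = ((w1 >> t) * mu) >> (s - t)     # estimate of k0: either k0 or k0-1
    w3 = m * w2                          # W3 = m*W2
    r1 = w1 - w3                         # first candidate remainder
    r2 = r1 - m                          # second candidate, computed in parallel
    return r1 if r2 < 0 else r2
```

With these assumed constants both operands of the second multiplication are at most n+2 bits long, which is in line with the remark that the multiplier must handle a few bits more than the nominal size n.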

Fig 1 shows an embodiment of the invention, where only one multiplier is used iteratively with the aid of a memory for storage of intermediate results. In Fig 1, reference numeral 1 represents a serial arithmetic unit, in which a multiplication is performed by feeding one of the multiplicands serially at X and the other multiplicand in parallel form at Y. The result is output serially at Rl (Rlow), where the least significant half of the result is fed out, and Rh (Rhigh), where the most significant half of the result is fed out. A memory 2 is used to store intermediate results, but also works as input and output for the operators and the result, respectively. At least one serial/parallel and parallel/serial converter 3 receives and sends out serial numbers, but feeds parallel numbers into and out of the memory 2. In one embodiment of the invention three converters 3a, 3b, 3c are used. If two n-bit numbers are multiplied, the result is 2n bits long. To avoid timing problems, since the operators are read in n clock cycles while it takes 2n clock cycles to read out the result, the carry result stored in the chain is transferred in parallel form into another, parallel computing chain after n clock cycles. Because of this the least significant bits will be read out at Rl (Rlow) and the most significant bits at Rh (Rhigh). A new computation is started after n clock cycles and the computations are overlapping.

To prevent overlapping computations from being mixed up, a multiplexer 5 (MUX) is arranged at the input to the arithmetic unit 1. A subtraction should be performed according to Equation (5). It is performed separately in a simple adder 4, which is either arranged separately, as in Fig 1, or is integrated into the arithmetic unit 1. The serial arithmetic unit 1 comprises a predefined number of circuits 20, which are shown in Fig 2, each of them performing a serial addition of X and Yj, where Yj is bit number j of the number Y. In the preferred embodiment of the invention the arithmetic unit 1 consists of 512 circuits 20, but there can of course be fewer or more depending on the length of the numbers to be multiplied. The circuit 20 comprises a full adder 21, a half adder 22, an AND-gate 23 and four D-flip flops 24, 25, 26 and 27.

X and Yj are input values to the AND-gate 23. The input values to the full adder 21 are the output value from the AND-gate 23, the previous carry value through the D-flip flop 24 and the value of the sum from the circuit j-1 with higher bit weight through the D-flip flop 25. The values in the D-flip flops 24 and 25 are moved in parallel to the D-flip flops 26 and 27 after n clock cycles, when a new computation is started in the upper computational chain. The inputs to the half adder 22 are the deposited bits in the D-flip flops 26 and 27. In an alternative embodiment of the invention D-flip flops are arranged at the outputs Rl and Rh and also on the X-line after a predefined number of circuits 20. This is done in order to achieve a practical distribution of X to all circuits 20. An embodiment for this is shown in Fig 10.
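The behaviour of the serial arithmetic unit, seen from its terminals, can be sketched with a purely behavioural model. It is an assumption of this sketch that the combined sum and carry state of all circuits 20 may be represented by one integer; the model does not reproduce the individual flip flops or the overlap of two computations, and the function name is hypothetical.

```python
def serial_parallel_multiply(x: int, y: int, n: int):
    """Multiply two n-bit numbers; return (r_low, r_high), each n bits wide.

    x is consumed serially, least significant bit first; y is applied in
    parallel. 'acc' stands in for the sum/carry state held in the chain.
    """
    acc = 0
    r_low = 0
    for cycle in range(n):
        x_bit = (x >> cycle) & 1         # serial input at X
        if x_bit:
            acc += y                     # AND-gates pass Y to the adders
        r_low |= (acc & 1) << cycle      # one result bit leaves at Rl
        acc >>= 1                        # remaining state moves one position
    r_high = acc                         # state handed over to the second chain
    return r_low, r_high
```

After n clock cycles the least significant half has been emitted, and the state that remains is exactly the most significant half, which is why it can be moved to a second chain and shifted out while the first chain starts on new operands.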

In a preferred embodiment of the invention (shown in Fig 3) the algorithm according to the invention, see Equation (5), is performed according to the method described below:

1. The operators A, B and the constants 1/m, m are fed into the memory 2 (step 31);

2. The product A*B is computed (step 32) by feeding A in serial form at X in the arithmetic unit 1, utilising the converter 3, while B is fed in parallel form to Y in the serial arithmetic unit 1. After n clock cycles the least significant bits of the result P₁ have been fed into the converter 3b and the carry values are moved in parallel into the lower computational chain, whereupon the most significant bits of the result P₁ start to be fed into the converter 3a. The next computation is started at the same time as the carry values are moved. The result P₁ is stored in the memory 2 (step 33) through the converters 3a and 3b after a further n clock cycles;

3. In the same way as in paragraph 2 above the product Int[P₁*1/m] is then computed (step 34). The result P₂ is stored in the memory 2 (step 35);

4. The product P₂*m is computed (step 36). The result P₃ is stored in the memory 2 (step 37);

5. Finally the difference P₁-P₃ is computed in a simple full adder (step 38). The result is stored through the converter 3c in the memory 2 (step 39) and is taken as output data, and it is also used for the next computation according to the algorithm via the multiplexer (MUX) 5.

As a further development of this embodiment it is noted that the number 1/m in algorithm step 34 above is preferably chosen according to the Equations (10a) and (10b). The consequence of this is that the subtraction in algorithm step 38 is performed to give the two results

R=P1-P3

R=P1-P3-m

as has been described above.

In Fig 4 a modulus multiplier is shown as a block diagram, implemented as an exponentiation unit consisting of one input unit 4001, one modulus exponentiation unit 4002 and one output unit 4003.

Fig 5 shows schematically a block diagram of a modulus multiplier according to an embodiment of the invention based on Barrett's method. This embodiment comprises an A*B-multiplier 5021, an AB*(1/m)-multiplier 5022, a k*m-multiplier 5023 and a subtraction unit. The modulus multiplier in Fig 5 can be used for computation of A*B mod m, or for computing the square A*A mod m by putting A=B. A modulus exponentiation can be broken down into squares and multiplications with algorithms according to prior art, particularly Gordon, Daniel M., "A Survey of Fast Exponentiation Methods", Centre for Communications Research, San Diego, Dec 30, 1997.
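The breakdown of a modulus exponentiation into squares and multiplications can be illustrated with the ordinary left-to-right binary method. This is one of several prior-art methods and is shown here only as a sketch; the function names are hypothetical, and simple_modmul is a software stand-in for the modulus multiplier of Fig 5.

```python
def simple_modmul(a: int, b: int, m: int) -> int:
    """Stand-in for the hardware modulus multiplier: A*B mod m."""
    return (a * b) % m


def modexp(x: int, y: int, m: int, modmul=simple_modmul) -> int:
    """Compute x^y mod m using only calls to modmul (left-to-right binary)."""
    x %= m
    result = 1 % m
    for bit in bin(y)[2:]:                    # exponent bits, MSB first
        result = modmul(result, result, m)    # square: A*A mod m with A=B
        if bit == "1":
            result = modmul(result, x, m)     # multiply: A*B mod m
    return result
```

For an n-bit exponent this uses n squarings and at most n further multiplications, all of which are of the form computed by the modulus multiplier.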

Fig 6 shows a multiplication unit according to a preferred embodiment of the invention. An input unit 211 delivers to the multiplication unit 5021 bits Ai to be multiplied in serial form with the least significant bit first. For the multiplication unit 5021, this consists of the A-number from unit 211. The second number to be multiplied is delivered to the multiplication unit as bits Bi in parallel form, shown in schematic form with an input unit 212. The least significant bit in the number B is farthest to the right in the figure. The numbers are multiplied serially, one bit at a time, in the multiplication unit 41, 42, where preferably a special version of the embodiment is used in the first element. If the unit 42 is used as a first multiplication element, a complete addition unit is used for adding zero to a number. This can be simplified, which is shown as a unit 41 in Fig 6.

The unit 41 comprises in the shown embodiment an AND-gate 1601 that receives Ai and Bi as input and feeds its output to a delay unit 1602 in the form of a D-flip flop. The output signal from the delay unit 1602 is further fed to a first input of a multiplexer 1603, whose second input is fed with a logical zero. The output of the multiplexer is fed to another delay unit 1604 in the form of a D-flip flop. The bit Ai and the outputs from the delay units 1602 and 1604 are fed to the next series-connected multiplication unit.

After the actual multiplier 41, 42 there is a unit for post-processing, or selector unit 216, which is arranged to select and join the numbers that are fed out from the multiplier 42, in such a way that the connections to other units are not unnecessarily obstructed but are instead simplified.

Between the units 42 and 216 there is in the shown embodiment introduced a delay element 43, consisting of one single D-flip flop 605, with the purpose of introducing a phase shift to a signal, as will be described below, to simplify the selector unit 216. The usefulness and the advantage of a modulus multiplier according to the invention lie especially in the simplicity of its realisation, and therefore unnecessary complications in surrounding units should be avoided if possible. This is accomplished, as has been shown in the examples, by performing simple processing in the processing steps that interface with surrounding units.

In Fig 7 is shown, in schematic form, a chain of addition units, which together form a serial multiplier 42 (same reference number as in Fig 6). Preferably the first addition unit is simplified, as shown in Fig 6 module 41. For every clock cycle a part of the product is formed by a chain of addition units that each multiply and add one bit in the product. Preferably the chain has the same number of cells for addition, 421, as there are bits in the parallel number B (212).

The serial number A (211 in Fig 6) is fed in serial form to the multiplier, and shall simultaneously be used in all addition units 421. Since the chain can have a thousand or more elements, the signal path for the serial number will be long, and this limits the maximum available clock frequency. Therefore special delay elements 50 are preferably introduced in the chain, as shown in figure 1, which cut the module into segments. This means that the serial signal path only has to reach a limited number of addition units 421. For simplicity two such delay elements 50 are shown in Fig 7. The introduction of the modules 50 implies that the output signal is delayed by the same number of clock cycles as the number of elements 50 in the chain of addition units 421. This means that the response time has increased. It should however be noted that the throughput time has not been affected, and therefore (depending on the usage of the module) this is possibly of less importance. Owing to the simplicity of the chains 42 there is no requirement that the number of delay elements 50 should have any special relation to the total length of the chain. Therefore the maximum number of addition units 421 that are connected between the delay stages 50 can be chosen with respect to the maximum clock frequency that such a multiplier is required to support. A special case is noted when the multiplier has to be divided, due to special circumstances, into two parts that are placed at some mutual distance; two or more units 50 can then be placed directly after each other somewhere in the module 42 if so required.

Fig 8 shows an embodiment of the serialisation unit 211, which has the function of delivering a number (for example A) in serial form to the actual multiplier. Depending on the circumstances in which the multiplier is used, there can be a need for converting data of some kind into a serial form fitting the modulus multiplier. The serialisation unit 211 includes a converter where the numbers come in on a data-bus 81 (the bits A96...A94) and are loaded into a register, here realised as delay elements in the form of D-flip flops 83 and multiplexers (MUX-2) 82. The register is controlled by a control signal by which an external system can, via the multiplexers 82, put a number into the upper chain of D-flip flops 83. The number is then copied to the lower chain of multiplexers 84 and delay elements 85, controlled by a reset signal CLR 605, which delivers the number in serial form with the correct phase to the multiplier. A zero 86 is fed into the first multiplexer (to the left in Fig 8) in the lower chain to mark the end of the serial number. The serial input from the serialisation unit has the reference number 88.

Fig 9 shows an embodiment of a unit for parallelisation 212, which has the purpose of converting a number, for example from a data-bus, to a parallel form suitable for the multiplier. It should be noted that if the multiplier is used with minimal throughput time it is necessary to change the parallel number in synchronisation with the processing of the addition chain 42. Since delay units 50 are possibly used, a design is shown where an external signal LOAD 75 puts a number B40...B39 through a multiplexer 71 into an upper chain of delay elements 72. The multiplexer 71 feeds the upper number through when the control signal is a logical one, otherwise the lower number is fed through. The reset signal CLR, which is first named 605 and after the delay elements is named 632, then controls the exchange of the parallel number in a lower chain 74 through the multiplexer 73.

Fig 10 shows an embodiment of a delay element 50 consisting of a number of clocked (clock 624) delay elements. The signals that are delayed 621-626 are shown coming in from the left in Fig 10, and coming out being available on the corresponding outputs 631-636 to the right in the figure. The clock 624, here schematically shown, is normally not included as an explicit signal.

In Fig 11 there is schematically shown a preferred embodiment of an addition unit 421, comprising a chain of full adders, so-called FA 606, a chain of half adders 610, and an arrangement (604, 607) controlled by a control signal and arranged for copying the state of the FA-chain to a second chain of states (608, 612). This arrangement has the purpose of releasing the FA-chain 606 for the multiplication of a new number when the serial number Ain 601 has ended, simultaneously with the completion of the multiplication in the lower chain and the output of the most significant half of the product. The input signal Ain 601 is the incoming binary number for one of the factors in serial form, the input B6 602 is the second binary number in parallel form, and the output signal Ain 621 is a serial stream of output numbers. Further comprised are a multiplication element 603 for two binary numbers, a so-called AND-gate, an incoming control signal 605 which makes it possible to reset (CLR) the FA-chain 606, and connections for reset of the delay elements 604, 607. Furthermore there are an incoming clock 613, an incoming bit 614 which is added in 606, a delay element 607 to feed out the sum to the next unit 624 and a delay element 604 for the remainder.

For the FA unit 606 the following is valid: the sum S=u1 XOR u2 XOR u3, and the remainder (carry bit) C=(u1 AND u2) OR (u2 AND u3) OR (u3 AND u1). The upper signal path in the multiplexers (609, 611) is chosen when the control signal is a logical one. The half adder works in the same way as the full adder FA, except that it has only two inputs, and therefore the sum S=u1 XOR u2 and the remainder (carry) C=u1 AND u2.
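The logic equations for the full adder and the half adder can be written out directly and checked against ordinary addition. The sketch below is a plain software rendering of the stated equations; the function names are hypothetical.

```python
def full_adder(u1: int, u2: int, u3: int):
    """Return (S, C) for three input bits, as stated for the FA unit."""
    s = u1 ^ u2 ^ u3                            # S = u1 XOR u2 XOR u3
    c = (u1 & u2) | (u2 & u3) | (u3 & u1)       # carry: majority of the inputs
    return s, c


def half_adder(u1: int, u2: int):
    """Return (S, C) for two input bits."""
    return u1 ^ u2, u1 & u2                     # S = u1 XOR u2, C = u1 AND u2
```

In both cases S+2*C equals the arithmetic sum of the input bits, which is the property the addition chains rely on.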

The shown logical implementation of the serial adder for multiplication and addition of binary numbers is especially advantageous for implementations in ASIC and FPGA. However, there are many different types of FPGAs and it is suitable to mention that the implementations of the multiplier and the adder in a different number base than two, for example four, can give more efficient solutions. When using the base four the multiplier unit shall multiply two numbers (0..3)*(0..3), and thus it will have four binary inputs and four outputs. The serial and the parallel signal paths should then also be extended to the base four (two bits). Apparently this implementation is more complex, which is illustrated in the preferred embodiment of the multiplication and addition units 421 shown in the figures.

In Fig 12 there is shown a multiplication unit consisting of a first stage 41 (see also Fig 6) followed by two units of the type 421 as described in Fig 11. Thereafter there are a delay unit 50 and another unit 421. To the right in the figure is shown the mentioned delay 43 (see Fig 6), here in detail in its real context. Fig 12 shows how the unit 50 breaks up the horizontal signal paths. The unit in the picture is shown with separate control functions CLR and DOWN, where CLR resets the numbers in the FA-chain and where DOWN copies a state from the upper chain to the lower chain. The picture shows a direct reset of all delay units in the FA-chain, which thereby is done in one clock cycle. The same cycle is used for copying to the HA-chain in a so-called synchronous reset. When implemented in an FPGA all delay elements are implemented with a connection for a synchronous reset, thus this is not a particular problem. Thereby the two signals DOWN and CLR can preferably be combined into one signal, which gives a significant simplification when implementing in an FPGA, due to the lower number of control signals. When implementing in an ASIC it can be chosen to use reset only on particular delay elements, whereby CLR possibly needs to be active during many cycles, while the DOWN signal is active only during the first CLR-cycle. Hence the figure is drawn with separate control signals. The delay units related to unit 50 are shown with reference number 50.

In Fig 13 there is shown a multiplier unit 421B that is elaborated differently from 421, in that the units in the embodiment 421B lack half adders. This gives however an extra signal path. This embodiment can, depending on the type of FPGA or ASIC used, be more advantageous than the other.

In Fig 14 there is shown, as an example of an embodiment, an implementation in a chain according to the model in the previous figure (Fig 13), with double signal paths in the lower computational chain. To make the two methods comparable a FA-adder has been included at the end of the chain, which results in identical output data from the two alternatives. The delay units belonging to the unit 50 are shown with reference number 50. Two delay units, belonging to the last unit 421, are included and shown with reference number 421.

In Fig 15 there is shown a modulus multiplier arranged according to the invention corresponding to Fig 5, with three consecutive multiplication units (213,214), (217,218) and (222,223). The arrangement has a serial input Ain (226), for example arranged according to the serialisation unit 211 in fig 8, and a parallel input Bin, for example arranged according to the parallelisation unit 212 in Fig 9. The multiplication of the numbers A and B is performed in the multiplication unit 213, which gives the least significant half of the product to the switching unit 216A in serial form with the least significant bit first. The most significant half is computed by the multiplication unit 214, after copying with the signal 616 (see unit 421 in Fig 11).

The (n+1) least significant bits in the product (n bits from 213 and one bit from 214) are moved by the switching unit 216A to a memory (a so-called shift register) 219. The n+1 most significant bits (i.e. one bit from 213 and n bits from 214) are transferred as serial input data to the multiplication unit 217.

The parallel number in the input unit 216B, for example implemented as a parallelisation unit 212, is used for multiplication in unit 217. The switching unit 220 is set to feed the n+1 most significant bits to unit 222, and this is the number W2 in Equation (7). This gives a truncation of the n-1 least significant bits (unit 217). The unit 220 reads the serial input signal from unit 219, and sends it out in serial form to the unit 224 (this is the A*B-number).

Finally the number m is multiplied with W2 according to Equation (8) in the unit 222. The module 225 is arranged to form the difference between the n+1 least significant bits (n bits from 222 and one bit from 223) and the AB number, which is received from the unit 224. The subtraction is performed according to Equation (9). Furthermore one more subtraction with the number m is done in such a way that the two numbers r1 and r2 are computed (r1=W1-W3, r2=r1-m). One output 227 is drawn where the numbers r1 and r2 are fed out in serial form. We note that the subtraction can be performed in unit 225, with two serial subtraction units, simultaneously with the computation of the product k*m in unit 222.

The units 213, 217 and 222 can be realised as three similar units when multiplications are performed according to the invention. Moreover, the units 214, 218 and 223 can be realised as three similar units. If it is then chosen to implement the units 216A, 220 and 225 as a combination of these units, and to activate the function in the respective function units (216A, 220 and 225) according to the above description, it is found that a modulus multiplier can preferably be elaborated with three consecutive identical units.

Moreover, in an embodiment especially suited for implementation in FPGAs, the functionality in the units (216A, 220, 225) is increased in such way that these three units can work as delay units 50. Since the units (216A, 220, 225) only are using a fraction of the total area, a flexible modulus multiplier can be implemented as a long chain of units according to the figure.

Through configuration of the included units a logical unit (212, 213, 214, 215, 216A) is formed, with a length adapted to the length in bits of the processed numbers. The configuration of the other units can be done in the same way. This procedure implies that it is possible to flexibly divide a long chain into multiple independent units depending on the length of the processed numbers.

The throughput time, in an embodiment according to the figure, is approximately n+1 clock cycles for the multiplication A*B (result to 219). Thereafter follow approximately n+1 clock cycles for the multiplication (AB)*(1/m) in unit 217. Simultaneously the number (AB) is transferred from 219 to 224. Finally there follow approximately n+1 clock cycles when the product from 222 is subtracted from the number AB from 224, and the numbers r1 and r2 are formed. The word "approximately" in this context refers to the fact that the configuration should be according to Equations (10a) and (10b) and that no time for possible delay units 50 has been included.

It is important to note that in modulus multiplication according to Fig 15 each part of the multiplier is used only once, which means that an elaboration according to Fig 15 implements three logical modulus multipliers in one physical modulus multiplier, the three of which can operate independently. It turns out that the throughput time is a third of 3*(n+1) clock cycles, i.e. n+1 clock cycles, independent of the number of delay units 50 used. If a large number of delay units 50 are used, for example n+1 pieces, an apparatus according to the figure can be regarded as four independent logical modulus multipliers.

Fig 16 shows a bit-serial subtraction unit 70 that is included in embodiments of the invention where numbers in bit-serial form are subtracted, for example in Fig 1 unit 4, Fig 6 unit 216 or Fig 15 units 216A, 220 and 225. The subtraction unit 70 is put into a zero state with the reset control signal 702. There are arranged an input Z 703, another input X 704 and an output DIFF 705 in such a way that DIFF=Z-X. With 708 is meant an inverter, 707 is a delay unit, 709 is a multiplexer, 710 is a constant logical one and finally 706 is a three-input full adder.
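A bit-serial subtraction built from an inverter, a constant one and a three-input full adder can be sketched as a two's complement addition, Z-X = Z + NOT(X) + 1, processed one bit per clock cycle. How the multiplexer and the constant one are wired in Fig 16 is not spelled out in the text, so the sketch assumes that the reset loads the carry delay element with a logical one; the function name is hypothetical.

```python
def serial_subtract(z: int, x: int, n: int) -> int:
    """Return (z - x) mod 2^n, computed one bit per clock cycle, LSB first."""
    carry = 1                              # assumed reset state: constant one
    diff = 0
    for i in range(n):
        z_bit = (z >> i) & 1               # serial input Z
        x_inv = 1 - ((x >> i) & 1)         # serial input X through the inverter
        total = z_bit + x_inv + carry      # three-input full adder
        diff |= (total & 1) << i           # sum bit appears on DIFF
        carry = total >> 1                 # carry is delayed to the next cycle
    return diff
```

When z>=x and both fit in n bits the result is the ordinary difference; a negative difference appears in two's complement form, which is how the sign of r2=r1-m can be observed.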

The invention and its different elaborations have been described above with the aid of exemplifying embodiments and with reference to block diagrams and logical circuit diagrams of various level of complexity. The figures indicate how the different components are communicatively and/or by signals coupled.

Claims

1. A bit-serial computation method for modulus multiplication, comprising the steps to: - input as a first input value a multiplication factor in a serial form to a first computational chain; input as a second input value a multiplication factor in a parallel form to the first computational chain; to multiply in the first computational chain all bits of the second input value with all bits of the first input value and receive a first partial result; complete the multiplication by resetting the first input value and then computing a second partial result where the final result consists of the second partial result followed by the first partial result.

2. Method according to claim 1 , further comprising the steps to: transfer one in the first computational chain received carry-length to a second computational chain; possibly start a new computation with other input values in the first computational chain while the previous computation is completed in the second computational chain; output in a serial form a first partial result from the first computational chain, while simultaneously a second partial result is output from the second computational chain.

3. Method according to claim 2, further comprising - three consecutive sequences of the previous multiplication steps; where the first multiplication sequence has a first and a second input value; where the second multiplication sequence has a first input value that is a constant and a second input value that is the result from the first multiplication sequence; where the third multiplication sequence has a first input value that is a second constant and a second input value that is the result from the second multiplication sequence; where a result from the modulus multiplication is formed as the difference of the results from the first and the third multiplication sequences.

4. Method according to claim 3, wherein the first input value of the second multiplication sequence is input in a parallel form and the second input value is input in a serial form; the first input value of the third multiplication sequence is input in a parallel form and a second input value is input in a serial form.

5. Method according to claim 1, further comprising: - three consecutive sequences of the previous multiplication steps; where the first multiplication sequence has a first and a second input value; where the second multiplication sequence has a first input value that is a first constant and a second input value that is the result from the first multiplication sequence; where the third multiplication sequence has a first input value that is a second constant and a second input value that is the result from the second multiplication sequence; where a result from the modulus multiplication is formed as the difference of the results from the first and the third multiplication sequences.

6. Method according to claim 5, wherein the first input value of the second multiplication sequence is input in a parallel form and the second input value is input in a serial form; the first input value of the third multiplication sequence is input in a parallel form and a second input value is input in a serial form.

7. Method according to claim 1 , further comprising the steps to: compute a first partial result in a first multiplication sequence dependent on a first and a second input value; intermediately store the result from the first multiplication sequence in a memory; input to the second multiplication sequence a first input value that is a first constant and a second input value that is the result from the first multiplication sequence taken from said memory; intermediately store the result from the second multiplication sequence in a memory; input to the third multiplication sequence a first input value that is a second constant and a second input value that is the result from the second multiplication sequence taken from said memory; intermediately store the result from the third multiplication sequence in a memory; form a result from the modulus multiplication as the difference of the results from the first and the third multiplication sequences taken from said memory.

8. Method according to claim 1, where the modulus multiplication is expressed as A*B mod(m) = A*B - (int(1/m * A * B)) * m and is computed with the steps to: compute the product of a first (A) and a second (B) input value, wherein a first partial result (P₁) is stored; compute the integer part of the product of the first partial result (P₁) and a third input value (1/m), wherein a second partial result (P₂) is stored; compute the product of the second partial result (P₂) and a fourth input value (m), wherein a third partial result (P₃) is stored; compute the difference between the first partial result (P₁) and the third partial result (P₃), wherein a final result is obtained.
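The four steps of claim 8 can be sketched in software as follows. This is a minimal illustration, not the claimed hardware: the function name `modmul_reciprocal`, the fixed-point scaling of 1/m by a power of two, and the final correction loop are all assumptions introduced here, since the claim itself only states the expression A*B mod(m) = A*B - (int(1/m * A * B)) * m.

```python
def modmul_reciprocal(a: int, b: int, m: int) -> int:
    """Compute (a * b) mod m via multiplication by a precomputed reciprocal.

    Assumption: 1/m is held as a fixed-point constant scaled by 2**shift,
    so int(1/m * a * b) may come out one too small and is corrected below.
    """
    if m <= 0 or a < 0 or b < 0:
        raise ValueError("expects m > 0 and non-negative a, b")

    shift = 2 * m.bit_length()          # enough fraction bits when a, b < m
    inv_m = (1 << shift) // m           # hypothetical stored constant for 1/m

    p1 = a * b                          # step 1: first partial result P1
    p2 = (p1 * inv_m) >> shift          # step 2: integer part of P1 * (1/m)
    p3 = p2 * m                         # step 3: third partial result P3
    result = p1 - p3                    # step 4: difference P1 - P3

    # Correction for the truncated reciprocal (an addition of this sketch).
    while result >= m:
        result -= m
    return result
```

For example, `modmul_reciprocal(7, 9, 13)` returns 11, which equals 63 mod 13. Because the truncated reciprocal never overestimates the quotient, the difference P1 - P3 is never negative, so only subtractions are needed in the correction.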

9. Method according to claim 2, wherein the memory elements comprised in the first computational chain are set to zero in dependence of a reset signal.

10. Method according to claim 1, wherein the length of the signal paths in the serial arithmetic unit are limited with a delay element.

11. Method according to claim 3, wherein the length of the first constant is one or more bits longer than the length of the second constant.

12. Method according to claim 5, wherein the length of the first constant is one or more bits longer than the length of the second constant.

13. Method according to claim 3, further comprising the steps to: store a first and a second constant in a memory; store a third input value in a memory; store a crypto exponent for an asymmetric crypto system in a memory; set an intermediate result as said third input value; update said intermediate result by performing a modulus multiplication with the intermediate result as a first input value with either the intermediate result or said third input value as the second input value, dependent on the binary sequence in the crypto exponent; repeat the previous step, successively choosing bits from the binary sequence in the crypto exponent, and thereby obtain a modulus exponentiation; save the final intermediate result as the final result.
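One common reading of the exponentiation loop in claim 13 is left-to-right square-and-multiply. The sketch below follows that reading under stated assumptions: the name `modexp` is hypothetical, the exponent bits are scanned from the most significant end, and Python's `%` operator stands in for the modulus multiplication of the earlier claims.

```python
def modexp(x: int, e: int, m: int) -> int:
    """Left-to-right square-and-multiply, x**e mod m, for e >= 1."""
    if e < 1 or m < 1:
        raise ValueError("expects e >= 1 and m >= 1")

    r = x % m                      # intermediate result starts as the input value
    for bit in bin(e)[3:]:         # exponent bits after the leading 1
        r = (r * r) % m            # second input value = intermediate result
        if bit == "1":
            r = (r * x) % m        # second input value = stored input value
    return r
```

For example, `modexp(4, 13, 497)` returns 445. Each pass through the loop performs one or two modulus multiplications, depending on the current exponent bit.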

14. Method according to claim 13, where the sequence of choices of the second input value is performed such that the computational sequence is independent of the processed data by storing said intermediate results before said modulus multiplication and choosing the result as either the stored intermediate result or as the result achieved from the modulus multiplication.
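The data-independent variant of claim 14 can be illustrated by always performing the multiplication and then selecting between the stored and the multiplied value. This is a sketch under assumptions: the name `modexp_uniform` is hypothetical, `%` stands in for the modulus multiplication, and Python integers are not constant-time, so the code shows only the sequence of operations, not a side-channel-hardened implementation.

```python
def modexp_uniform(x: int, e: int, m: int) -> int:
    """x**e mod m with the same operation sequence for every exponent bit."""
    if e < 1 or m < 1:
        raise ValueError("expects e >= 1 and m >= 1")

    r = x % m
    for bit in bin(e)[3:]:             # exponent bits after the leading 1
        r = (r * r) % m                # squaring is always performed
        saved = r                      # store the intermediate result first
        product = (r * x) % m          # multiplication is always performed
        r = product if bit == "1" else saved   # keep one of the two values
    return r
```

Compared with a plain square-and-multiply loop, this version spends one extra multiplication for every zero bit of the exponent, in exchange for a computational sequence that does not depend on the bit values.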

15. Method for asymmetric encryption and/or decryption, comprising modulus multiplication according to any of the claims 1 - 14.

16. A bit serial modulus multiplier, comprising: a serial input for a multiplication factor as a first input value to a first computational chain; a parallel input for a multiplication factor as a second input value to the first computational chain; a serial arithmetic computational unit realised as said first computational chain arranged to multiply a current bit in the first input value at the serial input with all bits in the second input value at the parallel input; a serial output for a first partial result from said first computational chain.

17. Modulus multiplier according to claim 16, wherein said first computational chain comprises process elements each having a multiplier in the form of an AND-gate, two delay elements and a three-input full adder.
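The behaviour of a chain of such process elements can be simulated bit by bit. The sketch below assumes a conventional carry-save serial/parallel arrangement: each element ANDs the current serial bit with its own parallel bit, adds the neighbouring element's delayed sum and its own delayed carry in a full adder, and stores the new sum and carry in its two delay elements. The wiring, the least-significant-bit-first ordering, and the name `serial_parallel_multiply` are assumptions of this sketch, not details taken from the claims.

```python
def serial_parallel_multiply(a: int, b: int, n: int) -> int:
    """Multiply two n-bit values; a is fed serially (LSB first), b in parallel."""
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("a and b must fit in n bits")

    b_bits = [(b >> i) & 1 for i in range(n)]
    sums = [0] * n        # first delay element of each process element
    carries = [0] * n     # second delay element of each process element
    product = 0

    for t in range(2 * n):                       # n data clocks, n flush clocks
        a_bit = (a >> t) & 1 if t < n else 0     # current serial input bit
        new_sums, new_carries = [0] * n, [0] * n
        for i in range(n):
            s_in = sums[i + 1] if i + 1 < n else 0         # neighbour's delayed sum
            total = (a_bit & b_bits[i]) + s_in + carries[i]  # AND gate + full adder
            new_sums[i] = total & 1
            new_carries[i] = total >> 1
        sums, carries = new_sums, new_carries
        product |= sums[0] << t                  # serial output, LSB first
    return product
```

For example, `serial_parallel_multiply(3, 3, 2)` returns 9. The element at the most significant position has no neighbour to add, which fits the reduced element of claim 19.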

18. Modulus multiplier according to claim 17, wherein said delay elements are arranged to be put to zero with a reset control signal.

19. Modulus multiplier according to claim 16, wherein a first computational element in the first computational chain preferably only consists of an AND-gate and a delay element.

20. Modulus multiplier according to claim 19, wherein said delay element is arranged to be put to zero with a reset control signal.

21. Modulus multiplier according to claim 16, wherein a delay element is arranged at the end of the first computational chain.

22. Modulus multiplier according to claim 16, further comprising: a second computational chain connected to the first computational chain; a signal path to transfer a carry length received in the first computational chain to the second computational chain; a serial output for a second partial result from the second computational chain.
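The hand-over between the two chains can be illustrated functionally. The sketch below is an interpretation, not the gate-level structure of the claims: it reads the transferred "carry length" as the stored sum and carry bits of the first chain after the last serial input bit, and it reuses the same adder step to shift the high half out, whereas claim 23 describes the second chain in terms of delay elements and multiplexers. The names `chain_step` and `multiply_two_chains` are hypothetical.

```python
def chain_step(sums, carries, a_bit, b_bits):
    """One clock of a carry-save serial/parallel chain; returns the new state."""
    n = len(b_bits)
    new_sums, new_carries = [0] * n, [0] * n
    for i in range(n):
        s_in = sums[i + 1] if i + 1 < n else 0
        total = (a_bit & b_bits[i]) + s_in + carries[i]
        new_sums[i], new_carries[i] = total & 1, total >> 1
    return new_sums, new_carries


def multiply_two_chains(a: int, b: int, n: int):
    """Return (low, high) halves of a * b, each n bits wide."""
    b_bits = [(b >> i) & 1 for i in range(n)]
    zeros = [0] * n

    # First chain: n clocks with data produce the low half serially.
    sums, carries, low = list(zeros), list(zeros), 0
    for t in range(n):
        sums, carries = chain_step(sums, carries, (a >> t) & 1, b_bits)
        low |= sums[0] << t

    # Hand-over: the state is copied and the first chain is free to restart.
    s2, c2 = sums, carries
    sums, carries = list(zeros), list(zeros)   # first chain reset for a new product

    # Second chain (stand-in): n more clocks shift the high half out.
    high = 0
    for t in range(n):
        s2, c2 = chain_step(s2, c2, 0, zeros)
        high |= s2[0] << t
    return low, high
```

For example, `multiply_two_chains(3, 3, 2)` returns `(1, 2)`, and 1 + (2 << 2) equals 9. In hardware the two phases would overlap, so a new product could start every n clocks instead of every 2n.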

23. Modulus multiplier according to claim 22, wherein said second computational chain comprises process elements each having two delay elements and two multiplexers.

24. Modulus multiplier according to claim 23, wherein said two multiplexers are controlled by a common control signal.

25. Modulus multiplier according to claim 24, wherein a reset control signal is arranged for the delay elements in the first computational chain, and wherein said reset control signal is connected with and constitutes a control signal to said two multiplexers.

26. Modulus multiplier according to claim 16, further comprising a delay element arranged to delimit the lengths of the signal paths in the serial arithmetic computational unit.

27. Modulus multiplier according to claim 22, further comprising three consecutive modulus multipliers of the previously mentioned type; a memory for storage of constant values; where the first modulus multiplier has a first and a second input; where the second modulus multiplier has a first input that is communicatively coupled to said memory and a second input that is communicatively coupled to an output from the first modulus multiplier; where the third modulus multiplier has a first input that is communicatively coupled to said memory and a second input that is communicatively coupled to an output from the second modulus multiplier; a difference element that is communicatively coupled to an output from the first modulus multiplier and an output from the third modulus multiplier; an output from the difference element.

28. Modulus multiplier according to claim 27, where the first input of the second modulus multiplier is the parallel input and the second input is the serial input; the first input of the third modulus multiplier is the parallel input and the second input is the serial input.

29. Modulus multiplier according to claim 16, further comprising three consecutive modulus multipliers of the previously mentioned type; a memory for storage of constant values; where the first modulus multiplier has a first and a second input; where the second modulus multiplier has a first input that is communicatively coupled to said memory and a second input that is communicatively coupled to an output from the first modulus multiplier; where the third modulus multiplier has a first input that is communicatively coupled to said memory and a second input that is communicatively coupled to an output from the second modulus multiplier; a difference element that is communicatively coupled to an output from the first modulus multiplier and an output from the third modulus multiplier; an output from the difference element.

30. Modulus multiplier according to claim 16, wherein the first input of the second modulus multiplier is the parallel input and the second input is the serial input; the first input of the third modulus multiplier is the parallel input and the second input is the serial input.

31. Modulus multiplier according to claim 22, wherein said serial input to the modulus multiplier is connected to a multiplexer; said parallel input to the modulus multiplier is connected to a parallel to serial converter; said output from the first computational chain and said output from the second computational chain are connected to a parallel to serial converter; said parallel to serial converter is through a parallel port connected to a memory unit.

32. Modulus multiplier according to claim 22, arranged to: store a first and a second constant in a memory; store a third input value in a memory; store a crypto exponent for an asymmetric crypto system in a memory; set an intermediate result as said third input value; update said intermediate result by performing modulus multiplication between the intermediate result as a first input value and either the intermediate result or said third input value as the second input value, dependent on the binary sequence in the crypto exponent; repeat the previous step, successively choosing bits in the binary sequence in the crypto exponent, and thereby achieve modulus exponentiation; save the final intermediate result as the final result.

33. Modulus multiplier according to claim 32, wherein it is arranged such that both selections of the second input value always are performed in such way that the computational sequence is independent of the processed data.

34. Apparatus for asymmetric encryption and decryption, comprising a modulus multiplier according to any of the claims 16-33.

35. Method for modulus multiplication, wherein the multiplication is performed with serial arithmetic, characterized in that two parallel computations are performed simultaneously in a first and a second computational chain.

36. Method according to claim 35, characterized in that a carry length received in the first computational chain is transferred to the second computational chain, and thereafter a new computation is started in the first computational chain.

37. Method according to claim 36, characterized in that the modulus multiplication is expressed as A*B mod(m) = A*B - (int(1/m * A * B)) * m and is computed with the steps to: compute the product of a first (A) and a second (B) input value, wherein a first partial result (P1) is stored; compute the integer part of the product between the first partial result (P1) and a third input value (1/m), wherein a second intermediate result (P2) is stored; compute the product between the second intermediate result (P2) and a fourth input value (m), wherein a third intermediate result (P3) is stored; compute the difference between the first partial result (P1) and the third intermediate result (P3), wherein a final result is obtained.

38. Apparatus for modulus multiplication comprising a serial arithmetic unit (1) arranged to perform multiplication, characterized in that a number of circuits (20) in the serial arithmetic unit (1) comprises a first and a second computational chain that are arranged to perform at least two parallel computations simultaneously.

39. Arrangement according to claim 38, characterized in that the first computational chain is arranged to transfer a carry length received in the computation to the second computational chain and thereafter start a new computation.

40. Arrangement according to claim 38 or 39, characterized in that it further comprises a memory (2) connected to the arithmetic unit (1), wherein input data, output data and computed partial results are stored, and at least one unit (3) for conversion of data from a first format to a second format, connected to the memory (2) and the arithmetic unit (1).

41. Use of the method or the apparatus according to any of the claims 1 - 40 for asymmetric encryption and/or decryption.

AMENDED CLAIMS

[received by the International Bureau on 22 March 2002 (22.03.02); original claims 35-40 replaced by new claims 35-38; original claim 41 renumbered as claim 39; remaining claims unchanged (2 pages)]

35. Method for modulus multiplication, wherein the multiplication is performed with serial arithmetic, characterised in that the product of a first (A) and a second (B) input value is computed by means of a first computational chain in a serial arithmetic unit (1), wherein the least significant bits of the result (P₁) are transformed from a first format to a second format and are thereafter stored in a memory (2), in that a carry-length received in said first computational chain is transferred in parallel to a second computational chain in the arithmetic unit (1), wherein the most significant bits of the result (P₁) are transformed from the first format to the second format and are thereafter stored in the memory (2), in that a new computation of the first computational chain is started in parallel when the carry-length is transferred to the second computational chain.

36. Apparatus for modulus multiplication comprising a serial arithmetic unit (1) arranged to perform multiplication, characterised in that a plurality of circuits (20) in the serial arithmetic unit (1) comprises a first and second computational chain, in that the first computational chain is arranged to receive a first input value (A) having a first format and a second input value (B) having a second format and to compute the product of the input values, in that a conversion unit (3) is arranged to convert the least significant bits of the result (Ri) of the product from the first format to the second format and to store this in a memory (2), in that the first computational chain is arranged to transfer in parallel a carry-length received from the first computation to the second computational chain, in that the conversion unit (3) is arranged to convert the most significant bits (Rh) of the result from the first format to the second format and to store this into said memory, in that the first computational chain is arranged to start a new computation when the carry-length has been transferred to the second computational chain.

37. Apparatus according to claim 36, characterised in that the conversion unit (3) is a serial/parallel converter and in that the first format is a serial format and the second format is a parallel format.

38. Use of the method according to claim 36 for asymmetric encryption/decryption.

39. Use of the method or the apparatus according to any of the claims 1 - 38 for asymmetric encryption and/or decryption.