Performance Tuning Projects

Optimizing Crypto++ library for P4
Project Book

Submitted by: Tzachi Zidenberg
To: Koby Gottlieb

Contents

Project Background

Goal

Crypto++

Benchmarks

Initial VTune Sampling output

Explanations

Microarchitectural Problems

Addition

64bit-Add

64bit*2-Add

16bit*8-Add

Testing the functions.

Subtraction

RSA

Implementing threading

Results

Exponentiation

Background: Right-to-Left Sliding Window Exponentiation

Cryptopp's implementation of the exponentiation

Threading Intentions

Threading the function - version 1

Threading the function - version 2

Implementing Threading

Results

final program

Threading profiles

Threading checks

Speedup results

Source and executables

Appendix - optimal window size

Project Background

Goal

The goal of this project is to take an open-source program and speed it up using hardware-aware optimizations targeting the Intel Pentium 4 processor. Speedup attempts follow three directions:
Microarchitectural problems
For example, reducing branch mispredictions or cache misses.
SIMD instructions
Single Instruction Multiple Data instructions, either MMX or SSE2, may be used to perform the same calculation on different data at once.
Hyper-Threading
This feature of Pentium 4 processors allows two threads to run at the same time, meaning a certain speedup might be achieved by dividing the work into multiple threads. These attempts also give some thought to dual-processor machines, as the industry today seems to be heading in that direction.

Crypto++

Crypto++ is an open-source library implementing a very wide range of cryptographic algorithms. The library was written and placed in the public domain by Wei Dai. It implements much of the academic research done in the field to make its calculations as efficient as possible. The library has optional code paths that are activated according to whether or not the machine supports SSE2 instructions.
Crypto++ is one of the most appreciated software packages in the field of cryptography.
The library comes with an example/benchmark/testing program called cryptest.

Benchmarks

Several benchmarks were used to determine the performance of the program and its variants.
Original Benchmark
The program originally comes with this benchmark, which runs each tested action for a given number of seconds (default 1) and creates an HTML file detailing the number of runs, their duration, and the average throughput. The private keys used are read from files, and the data encrypted/signed is automatically generated once for each test. This benchmark was used only for the initial assessments and the final testing.
Runs-Based Benchmark
The main problem with the initial benchmark was that it did not have a fixed amount of work to do, so this variant was created, in which a test does not run for a given number of seconds but for a given amount of work. Speedup can then be seen directly as decreased CPU time. Furthermore, I believe these tests to be more accurate because the CPU clock is read only twice - at the beginning and at the end of the work.
When testing exponentiation threading techniques I used this benchmark with new tests that only exponentiate Integers or multiply Elliptic Curve points.
Test Vectors
Test vectors are files containing tests for the Crypto++ library (such as an encryption request and its expected result). Even though they were not intended as benchmarks, the workload of these tests is split in the same manner as the original benchmark. Their main advantage is that they run on precomputed data - making them the least "random" and most stable test.

Initial VTune Sampling output

Clockticks Instructions Retired 64k Aliasing conflicts Mispredicted Branches Retired MOB Loads Replays Retired Clockticks per Instructions Retired (CPI) 64K Aliasing Conflicts Performance Impact Branch Mispredict Performance Impact Mispredicted Branches per Instructions Retired Store Forward Performance Impact
Scale 1,000,000 1,000,000 10,000 10,000 1,000 1 1 1 1 1
Entire Program 500,214 545,604 220,829 130,608 175,288 0.917 5.298 5.222 0.002 1.752
P4_Mul 62,853 65,010 56,496 1,711 9,651 0.967 10.786 0.544 0 0.768
Add 49,332 63,687 17,828 24,305 4,158 0.775 4.337 9.853 0.004 0.421
Subtract 32,046 36,909 13,889 20,043 3,408 0.868 5.201 12.509 0.005 0.532
ProcessAndXorBlock 23,919 45,363 101 16 8,362 0.527 0.05 0.013 0 1.748
Multiply8 23,610 27,576 5,430 5,688 9,633 0.856 2.76 4.818 0.002 2.04
Transform 14,535 11,208 315 0 2,229 1.297 0.26 0 0 0.767
Multiply 12,597 14,655 2,889 8,345 810 0.86 2.752 13.25 0.006 0.321
Multiply4 11,079 8,262 2,253 166 149 1.341 2.441 0.299 0 0.067
AlmostInverse 10,719 11,721 150 10,013 3,884 0.915 0.168 18.684 0.009 1.812
RecursiveMultiply 10,503 13,170 3,661 7,494 2,663 0.797 4.182 14.27 0.006 1.268
MultiplicativeInverse 9,864 11,211 482 12,126 781 0.88 0.586 24.586 0.011 0.396
Divide 6,864 6,324 2,819 2,807 5,735 1.085 4.928 8.179 0.004 4.178
RawProcessBlock 6,507 8,985 6 0 49 0.724 0.011 0.001 0 0.038
Iterate 6,030 7,491 18,114 0 101 0.805 36.048 0 0 0.084

Explanations

These results came from the original benchmark. Only the functions with 6e9 clocktick events or more are shown. It is obvious that a very small number of functions accounts for the large majority of the running time. Most of them belong to the mathematical engine. P4_Mul is a function called many times in any large-number multiplication and takes most of a multiplication's workload. Add adds long integers. Both functions are usually reached through large-number modular exponentiations.
ProcessAndXorBlock is a function used by one of the symmetric-key algorithms. I have chosen to focus this project on the mathematical engine and on the public-key systems it supports.

Microarchitectural Problems

First of all, it is apparent that the program does not suffer from severe microarchitectural problems. This can be seen from the CPI (clockticks per instruction) ratio, which stands at 0.917 - considered a rather good result.

64k aliasing

64K aliasing happens when a procedure works on two data segments placed on cache lines exactly 64K apart. The problem is much more severe with mixed stores and loads than with loads alone. In this case it is not certain where exactly the collision is common (since the event is not precise), nor whether or not the problem is truly serious. Since the data is located on the heap in many different parts of the code, and large parts of it are allocated by the user, this was accepted as unavoidable.

Branch misprediction

Branch misprediction is a relatively severe problem in this code. Given the fact that the benchmark performs the same actions on the same data over and over, the problem might be even more serious than it appears here. The problem mostly exists inside the addition and subtraction functions. These functions were entirely rewritten later on.

MOB Load Replays

MOB load replays happen when a load could not be executed for a long time and is resent to the processor. The most common cause is data being read in a different manner than it was written shortly before - for example, two 32-bit variables written and soon after read as one 64-bit variable (this prevents the CPU from using a "store forward").
In Crypto++, this mostly happens because different parts of the code process the same long integer with different mechanisms - normal 32-bit registers, 64-bit MMX registers, or 128-bit XMM registers. This was addressed by changing the addition function to work with 64-bit registers instead of 32-bit ones. Attempts to read data into 128-bit registers and only later split it into two 64-bit registers did decrease the MOB Load Replays counter, but did not create any speedup. As the use of XMM and MMX instructions proved to be worthwhile, MOB load replays were also accepted as unavoidable.

Addition

The second most significant function in the code turned out to be addition of long integers. The function receives as arguments the two addends and the result - pointers to three long integers stored in memory - and their size (in words, which are 32-bit unsigned integers). The three operands are of the same size, which is always even. Memory for the result is assumed to have been previously allocated. The function returns the carry from the operation (of type word).
The Crypto++ library has three implementations of the function Add:
Portable
Written in portable C. This implementation performs word additions inside a double-word so the carry is not lost: the double-word's lower half holds the result and its upper half the carry. This solution is far less efficient than the following two.
Pentium Optimized
Written in assembly, using the "adc" instruction, which adds the carry from the previous addition along with the two operands. view code
CRYPTOPP_NAKED word Add_POpt(word *C, const word *A, const word *B, unsigned int N)
{

	AddPrologue

	// now: ebx = B, ecx = C, edx = A, esi = N
	AS2(	sub ecx, edx)	// hold the distance between C & A so we can add this to A to get C

	AS2(	xor eax, eax)	// clear eax

	AS2(	sub eax, esi)	// eax is a negative index from end of B

	AS2(	lea ebx, [ebx+4*esi])	// ebx is end of B

	AS2(	sar eax, 1)		// unit of eax is now dwords; this also clears the carry flag

	AS1(	jz	loopendAdd)		// if no dwords then nothing to do

	AS1(loopstartAdd:)
	AS2(	mov    esi,[edx])			// load lower word of A

	AS2(	mov    ebp,[edx+4])			// load higher word of A

	AS2(	mov    edi,[ebx+8*eax])		// load lower word of B

	AS2(	lea    edx,[edx+8])			// advance A and C

	AS2(	adc    esi,edi)				// add lower words

	AS2(	mov    edi,[ebx+8*eax+4])	// load higher word of B


	AS2(	adc    ebp,edi)				// add higher words
	AS1(	inc    eax)					// advance B


	AS2(	mov    [edx+ecx-8],esi)		// store lower word result

	AS2(	mov    [edx+ecx-4],ebp)		// store higher word result


	AS1(	jnz    loopstartAdd)			// loop until eax overflows and becomes zero

	AS1(loopendAdd:)
	AS2(	adc eax, 0)		// store carry into eax (return result register)


	AddEpilogue
}
hide code
Pentium 4 Optimized
Written in assembly. Comments inside the code claim that the "adc" instruction is very expensive on the Pentium 4. It is therefore replaced with the "cmovc" and "jc" instructions - "move if carry" and "jump if carry". view code
CRYPTOPP_NAKED word Add_P4Opt(word *C, const word *A, const word *B, unsigned int N)
{

	AddPrologue

	// now: ebx = B, ecx = C, edx = A, esi = N
	AS2(	xor		eax, eax)

	AS1(	neg		esi)
	AS1(	jz		loopendAddP4)		// if no dwords then nothing to do

	AS2(	mov		edi, [edx])

	AS2(	mov		ebp, [ebx])
	AS1(	jmp		carry1AddP4)

	AS1(loopstartAddP4:)
	AS2(	mov		edi, [edx+8])

	AS2(	add		ecx, 8)
	AS2(	add		edx, 8)

	AS2(	mov		ebp, [ebx])
	AS2(	add		edi, eax)

	AS1(	jc		carry1AddP4)
	AS2(	xor		eax, eax)

	AS1(carry1AddP4:)
	AS2(	add		edi, ebp)

	AS2(	mov		ebp, 1)
	AS2(	mov		[ecx], edi)

	AS2(	mov		edi, [edx+4])
	AS2(	cmovc	eax, ebp)

	AS2(	mov		ebp, [ebx+4])
	AS2(	add		ebx, 8)

	AS2(	add		edi, eax)
	AS1(	jc		carry2AddP4)

	AS2(	xor		eax, eax)

	AS1(carry2AddP4:)

	AS2(	add		edi, ebp)
	AS2(	mov		ebp, 1)

	AS2(	cmovc	eax, ebp)
	AS2(	mov		[ecx+4], edi)

	AS2(	add		esi, 2)
	AS1(	jnz		loopstartAddP4)

	AS1(loopendAddP4:)

	AddEpilogue
}
hide code
All three implementations calculate two words of the result in each loop iteration.

As I was working on a Pentium 4 PC, my initial results were for the Pentium 4 function. VTune analysis showed branch misprediction to be a major microarchitectural problem. I tried several solutions to the problem, all of them using MMX instructions, and tested them against each other both in isolation and inside the full program.

64bit-Add

64 bits is the maximum size that a single assembly instruction can currently add, using "paddq". The problem is that this addition by itself leaves no indication of whether or not an overflow occurred, and 64-bit comparisons are also unavailable. So I had to come up with a way to decide whether or not the addition produced a carry. After computing result = a + b + previous_carry, we can conclude that there was a carry if the msb of both a and b is set, or if the msb of the result is clear while the msb of a or b is set. This is true regardless of the value of previous_carry (which can only be '1' or '0'). So the carry is calculated from the msbs of a, b and result using boolean operations: carry = (a & b) | ((!res) & (a | b)). The result lands in the msb of the carry register, and it is shifted down to its lsb.
view code
CRYPTOPP_NAKED word Add_64bit(word *C, const word *A, const word *B, unsigned int N) {

	AddPrologue

	// now: ebx = B, ecx = C, edx = A, esi = N
	AS2(	xor		 eax, eax)

	AS2(	cmp      esi, 01h)
	AS2(	pxor     mm2, mm2)		//mm2=0 (carry)

	AS1(	jbe       AddLabelcd)		//N<2
	AS2(	shl		 esi, 02h)		//esi = N*4 = num. of bytes

	AS1(AddLabel3c:)	
	AS2(	movq     mm0, MMWORD PTR [edx + eax])	// mm0 = *A

	AS2(	movq     mm1, MMWORD PTR [ebx + eax])	// mm1 = *B

	AS2(	add      eax, 08h)				//
	AS2(	cmp		 eax, esi)

	AS2(	movq     mm3, mm0)
	AS2(	movq     mm4, mm0)

	AS2(	paddq    mm0, mm1)				// mm0 = *A + *B
	AS2(	paddq	 mm0, mm2)				// mm0 += carry (mm2)

	AS2(	por      mm3, mm1)				//mm3 = *A | *B
	AS2(	pand     mm4, mm1)				//mm4 = *A & *B (carry* in msb)

	AS2(	movq	 MMWORD PTR [ecx +eax - 8], mm0)	// *C = *A + *B

	AS2(	pandn	 mm0, mm3)				//mm0=mm3&!mm0		(carry** in msb)
	AS2(	por		 mm0, mm4)

	AS2(	psrlq    mm0, 03fh)				//mm0= carry* or carry **
	AS2(	movq     mm2, mm0)

	AS1(	jl      AddLabel3c)
	AS1(AddLabelcd:)
	AS3(	pextrw   eax, mm2, 00h)

	AS1(	emms)
	
	AddEpilogue
}
hide code

64bit*2-Add

An attempt to add using 128-bit XMM registers that are divided into two 64-bit integers. The carry from each 64-bit lane is retrieved as in the addition above, and the carry from the first lane is added into the second lane. Another possibility is carry propagation: a carry out of the second 64-bit lane that is created by adding in the carry from the first lane. This possibility is calculated separately.
The first version created for 64bit*2 addition assumes all data to be 128-bit aligned and does not return a value. Since this version did not create any speedup over the previous ones, a more complex version that would handle unaligned data and return a value was not created.
view code
CRYPTOPP_NAKED word Add_64bitx2(word *C, const word *A, const word *B, unsigned int N) {

	_asm{
			

	push     esi
	push	 ebx
	mov      edx, DWORD PTR [esp+0Ch]	//edx=C

	mov      ecx, DWORD PTR [esp+10h]	//ecx=A
	mov      eax, DWORD PTR [esp+014h]	//eax=B

	mov      esi, DWORD PTR [esp+018h]	//esi=N
	xor		 ebx, ebx
	cmp      esi, 01h
	pxor     xmm2, xmm2		//xmm2=0 (carry from previous)

	movdqu xmm7, add_64x2_carry_mask
	jbe       AddLabelcd		//N<2
	shl		 esi, 02h		//esi = N*4 = num. of bytes
AddLabel3c:	
	movdqa   xmm0, XMMWORD PTR [ecx + ebx]	// mm0 = *A

	movdqa   xmm1, XMMWORD PTR [eax + ebx]	// mm1 = *B
	add      ebx, 10h				//

	cmp		 ebx, esi
	movdqa     xmm3, xmm0
	movdqa     xmm4, xmm0
	paddq    xmm0, xmm1				// mm0 = mm0+mm1
	paddq	 xmm0, xmm2				//mm0+=carry(mm2)

	por      xmm3, xmm1				//mm3 = *A | *B
	pand     xmm4, xmm1				//mm4 = *A & *B (carry* in msb)
	pxor xmm1,xmm1
	movdqa   xmm5, xmm0				//xmm5 is result

	movdqa   xmm2, xmm0		//
	pandn	 xmm0, xmm3				//mm0=mm3&!mm0		(carry** in msb)
	por		 xmm0, xmm4				//xmm0 = both carrys in both msbs

	movdqa xmm6, xmm0		
	pslldq xmm0, 1		//xmm0 - carry from first to second
	pand   xmm0,xmm7
    paddq    xmm5, xmm0
	movdqa   XMMWORD PTR [edx + ebx -10h], xmm5	// *C = *A +*B + carry

	pandn xmm5, xmm2	//xmm5 - carry to next from adding the carry in msb
	por xmm6, xmm5
	psrlq xmm6, 03fh
	punpckhdq xmm6, xmm1
	movdqa xmm2, xmm6
	jl      AddLabel3c
AddLabelcd:

	pextrw   eax, xmm2, 00h
	pop		 ebx
	pop      esi
	ret
		}
}
hide code

16bit*8-Add

16 bits is the largest element size inside an XMM register that a single instruction can add with saturation. Saturation gives a simple way to detect a carry: compare the result of the saturated addition with that of the non-saturated addition - if they differ, there was an overflow. When adding elements of this type, carry propagation becomes a real problem: a carry from the first element added might change the result of any of the other elements, which can take up to 7 more additions of previous carries. However, carry propagation is quite rare. For that reason, the normal addition loop does not handle carry propagation; the case is merely detected inside the loop, and when it occurs a separate part of the code is activated.
If the data has a 64-bit tail, or if it is not 128-bit aligned, it is handled by the 64-bit addition code, since that is the code that proved best inside the program.
*Based on a function created by Koby Gottlieb.
view code
CRYPTOPP_NAKED word Add_16bitx8(word *res, const word *a, const word *b, word n)
{		


	AddPrologue

	AS2(	xor eax,eax)

	AS2(	pxor xmm5,xmm5) // last carry
	// now: ebx = B, ecx = C, edx = A, esi = N eax=i(0)
	AS2(	shl		 esi, 02h)		//esi = N*4 = num. of bytes

	AS2(	mov 	edi, ebx)
	AS2(	or  	edi, ecx)

	AS2(	or  	edi, edx)
	AS2(	test	edi, 15)

	AS1(	jnz		Done_128_part)

	AS2(	mov 	ebp, esi)

	AS2(	and 	ebp, -10h)		//ebp is maximal so that ebp<=esi, ebp mod 16 =0
	AS1(	jz		Done_128_part)

	AS2(	add eax,0)    // clear carry
	AS2(	pxor xmm6,xmm6) // zero mask

	AS2(	pcmpeqd xmm4,xmm4) // -1 mask
	AS1(L:)
	AS2(	movdqa xmm0,[edx+eax]) // load first number

	AS2(	movdqa xmm1,[ebx+eax]) // load second number
	AS2(	movdqa xmm2,xmm0)      // save copy of first

	AS2(	paddw xmm0,xmm1)       // regular 16 bit
	AS2(	paddusw xmm2,xmm1)     // saturated add

	AS2(	pcmpeqw xmm2,xmm0)     // -1 if same result (no carry)
	AS2(	pxor    xmm2,xmm4)     // xor with -1 so -1 if carry

	AS2(	movdqa xmm3,xmm2)      // duplicate
	AS2(	pslldq xmm2, 2)        // put carry in the next word

	AS2(	psrldq xmm3,14)        // save carry for next iteration
	AS2(	paddw  xmm2,xmm5)      // add carry from last iteration

	AS2(	movdqa  xmm5,xmm3)       // save carry for next iteration
	AS2(	psubw xmm0,xmm2)       // add carry (sub -1)

	AS2(	movdqa  xmm7,xmm0)       // save copy of result
	AS2(	pcmpeqw xmm7,xmm6)     // compare if zero

	AS2(	pand xmm7,xmm2)        // result is zero and we added carry so it is carry propagate
	AS2(	pmovmskb edi,xmm7)     // check if xmm7 is all zero if not loop for carry

	AS2(	test edi,edi)
	AS1(	jnz Carry_check)       // carry propagate case

	AS2(	add eax,16)
	AS2(	cmp eax,ebp)

	AS2(	movdqa [ecx+eax-16],xmm0)   // save the result

	AS1(	jl L)
	AS1(	jmp Done_128_part)
	AS1(Carry_check:)

	AS2(	por    xmm3,xmm7)
	AS2(	pslldq xmm7,2)        // Propagate carry

	AS2(	psubw  xmm0,xmm7)     // add carry
	AS2(	movdqa  xmm2,xmm0)    
	AS2(	pcmpeqw xmm2,xmm6)

	AS2(	pand xmm7,xmm2)
	AS2(	pmovmskb edi,xmm7) // check if xmm7 is all zero if not loop for carry

	AS2(	test edi,edi)
	AS1(	jnz Carry_check) 
	AS2(	movdqa [ecx+eax],xmm0)

	AS2(	psrldq xmm3,14)        // save carry for next iteration
	AS2(	por xmm5, xmm3)        // or the propagate carry with the add carry only one can happened.

	AS2(	add eax,16)
	AS2(	cmp eax,ebp)

	AS1(	jl L)

	AS1(Done_128_part:)

	AS2(	cmp      esi, eax)

	AS2(	movdq2q  mm2, xmm5)
	AS2(	psrlq    mm2, 0fh)

	AS1(	jbe       AddLabelcd)		//N<2
	AS1(AddLabel3c:)	
	AS2(	movq     mm0, MMWORD PTR [edx + eax])	// mm0 = *A

	AS2(	movq     mm1, MMWORD PTR [ebx + eax])	// mm1 = *B

	AS2(	add      eax, 08h)				//
	AS2(	cmp		 eax, esi)

	AS2(	movq     mm3, mm0)
	AS2(	movq     mm4, mm0)

	AS2(	paddq    mm0, mm1)				// mm0 = *A + *B
	AS2(	paddq	 mm0, mm2)				// mm0 += carry (mm2)

	AS2(	por      mm3, mm1)				//mm3 = *A | *B
	AS2(	pand     mm4, mm1)				//mm4 = *A & *B (carry* in msb)

	AS2(	movq	 MMWORD PTR [ecx +eax - 8], mm0)	// *C = *A + *B

	AS2(	pandn	 mm0, mm3)				//mm0=mm3&!mm0		(carry** in msb)
	AS2(	por		 mm0, mm4)

	AS2(	psrlq    mm0, 03fh)				//mm0= carry* or carry **
	AS2(	movq     mm2, mm0)

	AS1(	jl      AddLabel3c)
	AS1(AddLabelcd:)
	AS3(	pextrw   eax, mm2, 00h)

	AS1(	emms)

	AddEpilogue
 }

hide code

Testing the functions.

To determine which of the functions is the fastest, I created a new program that generates random numbers and adds them. To make sure that the order in which the functions run is of no importance, the random numbers are regenerated before each function runs, followed by a stalling sequence. The first two rows of the following table show the results retrieved from VTune when adding 12e8 numbers with each function. However, these results do not necessarily show how the functions handle the numbers generated inside the actual program. The four best functions - Pentium-optimized, P4-optimized, 64bit and 16bit*8 - were therefore tested against one another inside the Crypto++ test-vector benchmark. This benchmark was selected because it has the least variability. Numbers shown for it are averages over 3 consecutive runs. The difference between those runs is around 100-200 clocktick samples (for the entire program), and almost always smaller than the differences between runs using different functions.
All results are in clocktick samples (= milliseconds)

Addition tests results

P4-Optimized Pentium-Optimized 64bit 16bit*8 64bit*2
1088bit Additions (64bit Aligned) 11,673 9,716 9,808 10,305 -
1152bit Additions (128bit Aligned) 13,235 12,041 10,560 10,381 13,224
Function inside the program 10,663 10,053 10,076 12,236 -
Entire program 42,000 41,464 41,068 43,334 -

Conclusion

The 64bit addition function proved to be the most efficient one inside the program. Notably, on its own it beats the Pentium-optimized version only when the numbers are 128-bit aligned. Inside the program the function may gain an additional speedup simply from using 64-bit variables (see MOB Load Replays). Addition mostly takes place inside multiplications - meaning it is called right after and right before other functions using MMX technology.

Subtraction

Subtraction is quite similar to addition. For that reason, three versions of subtraction, equivalent to the best three versions of addition, were tested:
Pentium Optimized
Written in assembly, using the "sbb" instruction, which subtracts the borrow from the previous subtraction along with the two operands. view code
CRYPTOPP_NAKED word POpt_Subtract(word *C, const word *A, const word *B, unsigned int N)
{

	AddPrologue

	// now: ebx = B, ecx = C, edx = A, esi = N
	AS2(	sub ecx, edx)	// hold the distance between C & A so we can add this to A to get C

	AS2(	xor eax, eax)	// clear eax

	AS2(	sub eax, esi)	// eax is a negative index from end of B

	AS2(	lea ebx, [ebx+4*esi])	// ebx is end of B

	AS2(	sar eax, 1)		// unit of eax is now dwords; this also clears the carry flag

	AS1(	jz	loopendSub)		// if no dwords then nothing to do

	AS1(loopstartSub:)
	AS2(	mov    esi,[edx])			// load lower word of A

	AS2(	mov    ebp,[edx+4])			// load higher word of A

	AS2(	mov    edi,[ebx+8*eax])		// load lower word of B

	AS2(	lea    edx,[edx+8])			// advance A and C

	AS2(	sbb    esi,edi)				// subtract lower words

	AS2(	mov    edi,[ebx+8*eax+4])	// load higher word of B


	AS2(	sbb    ebp,edi)				// subtract higher words
	AS1(	inc    eax)					// advance B


	AS2(	mov    [edx+ecx-8],esi)		// store lower word result

	AS2(	mov    [edx+ecx-4],ebp)		// store higher word result


	AS1(	jnz    loopstartSub)			// loop until eax overflows and becomes zero

	AS1(loopendSub:)
	AS2(	adc eax, 0)		// store carry into eax (return result register)


	AddEpilogue

}
hide code
Pentium 4 Optimized
Written in assembly, using the "cmovc" and "jc" instructions. view code
CRYPTOPP_NAKED word P4Opt_Subtract(word *C, const word *A, const word *B, unsigned int N)
{

	AddPrologue

	// now: ebx = B, ecx = C, edx = A, esi = N
	AS2(	xor		eax, eax)

	AS1(	neg		esi)
	AS1(	jz		loopendSubP4)		// if no dwords then nothing to do

	AS2(	mov		edi, [edx])

	AS2(	mov		ebp, [ebx])
	AS1(	jmp		carry1SubP4)

	AS1(loopstartSubP4:)
	AS2(	mov		edi, [edx+8])

	AS2(	add		edx, 8)
	AS2(	add		ecx, 8)

	AS2(	mov		ebp, [ebx])
	AS2(	sub		edi, eax)

	AS1(	jc		carry1SubP4)
	AS2(	xor		eax, eax)

	AS1(carry1SubP4:)
	AS2(	sub		edi, ebp)

	AS2(	mov		ebp, 1)
	AS2(	mov		[ecx], edi)

	AS2(	mov		edi, [edx+4])
	AS2(	cmovc	eax, ebp)

	AS2(	mov		ebp, [ebx+4])
	AS2(	add		ebx, 8)

	AS2(	sub		edi, eax)
	AS1(	jc		carry2SubP4)

	AS2(	xor		eax, eax)

	AS1(carry2SubP4:)

	AS2(	sub		edi, ebp)
	AS2(	mov		ebp, 1)

	AS2(	cmovc	eax, ebp)
	AS2(	mov		[ecx+4], edi)

	AS2(	add		esi, 2)
	AS1(	jnz		loopstartSubP4)

	AS1(loopendSubP4:)

	AddEpilogue
}
hide code
64bit subtraction
As with the MMX addition, the borrow is determined from the msbs of the two operands and the result. Referring to the msbs of a, b and res, where a - b - previous_carry = res, in subtraction carry = ((a & b & res) | (!a & (b | res))). view code
CRYPTOPP_NAKED word Subtract_64bit(word *C, const word *A, const word *B, unsigned int N) {

	_asm{
			
	push     esi
	push	 ebx
	mov      edx, DWORD PTR [esp+0Ch]	//edx=C

	mov      ecx, DWORD PTR [esp+10h]	//ecx=A
	mov      eax, DWORD PTR [esp+014h]	//eax=B

	mov      esi, DWORD PTR [esp+018h]	//esi=N
	xor		 ebx, ebx
	cmp      esi, 01h
	pxor     mm2, mm2		//mm2=0 (carry)

	jbe       AddLabelcd		//N<2
	shl		 esi, 02h		//esi = N*4 = num. of bytes
AddLabel3c:	
	movq     mm0, MMWORD PTR [ecx + ebx]	// mm0 = *A

	movq     mm1, MMWORD PTR [eax + ebx]	// mm1 = *B
	add      ebx, 08h				//

	cmp		 ebx, esi
	movq     mm3, mm0
	movq     mm4, mm0
	psubq    mm0, mm1				// mm0 = *A - *B
	psubq	 mm0, mm2				// mm0 -= carry (mm2)

	pand     mm4, mm1				//mm4 = *A & *B
	movq	 mm5, mm1				//mm5 = *B
	movq	 MMWORD PTR [edx +ebx - 8], mm0	// *C = *A - *B

	por      mm1, mm0				//mm1 = *B | *res
	pand	 mm4, mm0				//mm4 = A & B & res (carry* in msb)

	pandn    mm3, mm1				//mm3 = !A & (B | res) (carry** in msb)
	por		 mm3, mm4
	psrlq    mm3, 03fh				//mm3= carry* or carry **

	movq     mm2, mm3
	jl      AddLabel3c
AddLabelcd:	
	pextrw   eax, mm2, 00h
	emms
	pop		 ebx
	pop      esi
	ret
		}
}

hide code
The Addition function used in all cases is the 64bit Addition function.

Subtraction tests results

P4-Optimized Pentium-Optimized 64bit
1152bit*12e8 Subtractions (128bit Aligned) 12,999 12,660 13,558
Function inside the program 10,076 9,680 9,939
Entire program 41,068 40,664 40,797

Conclusion

In this case the Pentium-optimized version shows the best result, both alone and inside the program. The difference between subtraction and addition might be caused by two factors. First, the boolean function calculating the borrow for subtraction requires 2 more instructions than the one used for addition (18 instructions inside the loop instead of 16). Moreover, subtraction might be used differently inside the program, in such a way that handling 64 bits at a time creates less of a speedup for it.
Replacing both the addition and subtraction functions created a speedup of 6% for the entire test-vectors benchmark.

RSA

Basic RSA encryption has 2 public parameters: e, the public exponent, and n, the modulus, and one private parameter: d, the private exponent. Encryption or signature verification computes Plaintext^e mod n = Ciphertext. Decryption or signing computes Ciphertext^d mod n = Plaintext.
However, Crypto++ implements RSA using a more efficient algorithm. In this variant the public exponent is chosen to be small, so public exponentiation does not take much time. Private exponentiation uses other parameters that can only be calculated by the holder of the private key: p and q, two large primes such that p*q = n. Most of the work in the exponentiation is then done in two separate exponentiations - one modulo p and one modulo q. p and q are about half the size of n (in bits), so both exponentiations are done on bases and exponents about half the size of the originals. The two exponentiations are completely independent, and together they take the large majority of the time of the RSA operation.
I have created a version in which those two exponentiations take place on different threads.
view function unthreaded
Integer ModularRoot(const Integer &a, const Integer &dp, const Integer &dq,

					const Integer &p, const Integer &q, const Integer &u)
{

	Integer p2 = ModularExponentiation((a % p), dp, p); //this will be done in one thread

	Integer q2 = ModularExponentiation((a % q), dq, q); //this in another

	return CRT(p2, p, q2, q, u);
}

hide function

Implementing threading

Threading was implemented with regard to Crypto++ being a library - with maximum adaptability for multiple purposes. Because it seemed beyond this project's scope, the threading mechanism was not made entirely "complete", but it is based on assumptions that will allow a more complete implementation to require minimal changes to the current code.

Portable threads

In order to allow simple porting across compilers/environments/threading libraries, the program does not use any specific threading system but a simplified portable one, with only the most basic features, which can be trivially implemented on any machine. An implementation for the Windows API was created.
view declarations
typedef HANDLE PortThread;

typedef HANDLE PortCondVar;
typedef HANDLE PortSemaphore;
typedef HANDLE PortMutex;
typedef unsigned long (__stdcall* ThrFuncPointer)(void*);

PortThread portCreateThread(const ThrFuncPointer initFunc, void* param);
void portKillThread(PortThread thrd);

PortCondVar portCreateCondVar();
void portWaitOnCondVar(PortCondVar cnd);
void portReleaseOneOfCondVar(PortCondVar cnd);

void portKillCondVar(PortCondVar cnd);

PortSemaphore portCreateSemaphore(int initVal);
void portWaitOnSemaphore(PortSemaphore smph);

void portReleaseOneOfSemaphore(PortSemaphore smph);
void portClearSemaphore(PortSemaphore smph);
void portKillSemaphore(PortSemaphore smph);

PortMutex portCreateMutex();
void portWaitOnMutex(PortMutex mtx);
void portReleaseMutex(PortMutex mtx);

void portKillMutex(PortMutex mtx);
hide declarations

HelperThread

A HelperThread is an abstract class that represents a thread meant to process some sort of data. A specific helper thread is created once (to avoid thread-creation overhead) and immediately deactivates itself. Whenever the thread is needed, its parameters are changed while it is still asleep, and it is then signaled to start working. When its work is done it releases a different signal and goes back to sleep - until it is needed again. view declaration
class HelperThread
{

 public:
	 HelperThread();
	 virtual ~HelperThread();
 protected:

	friend unsigned long __stdcall mulThreadMainThrFunc(void*);
	virtual void start();

	virtual void doYourStuff() = 0; //actuall work is here
	virtual void waitTillDone();

	PortCondVar startExec, endExec;
	PortThread theThread;
	static const ThrFuncPointer mainThrFunc;
};
hide declaration

An MRThread is a non-abstract child of HelperThread that holds the specific arguments and action used to calculate a modular root. This class also holds a static pointer to a default MRThread (explained later).
view threaded modular root function
Integer ModularRootThreaded(const Integer &a, const Integer &dp, const Integer &dq,

					const Integer &p, const Integer &q, const Integer &u, MRThread* myMRThread)
{

	Integer q2;
	Integer p2;
	if (myMRThread) {
		myMRThread->a=&a;

		myMRThread->p2=&p2;
		myMRThread->p=&p;
		myMRThread->dp=&dp;

		myMRThread->start();
		q2 = ModularUnthreadedExponentiation((a % q), dq, q);

		myMRThread->waitTillDone();
	} else {
		q2 = ModularExponentiation((a % q), dq, q);

		p2 = ModularExponentiation((a % p), dp, p);
	}

	return CRT(p2, p, q2, q, u);
}

InvertibleRSAFunction - Thread Safety

The class InvertibleRSAFunction implements the math of the private RSA operations. An MRThread pointer member was added to this class; it is the thread used for calculating the modular root whenever one is needed. Setting it to 0 means no threading will be used.
Any two RSA operations are allowed to use the same MRThread as long as they don't run in parallel. So, if all private RSA operations are done within the same thread, they can all use the default MRThread (via the empty constructor of InvertibleRSAFunction, or by manually setting the member MRThread to MRThread::getDefaultThread()). Otherwise, the threads used by InvertibleRSAFunction must be set to other threads, using either the constructor or the specific set function.
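To illustrate the idea, here is a self-contained sketch of the same split with std::thread standing in for an MRThread. The RSA parameters in the test are the classic textbook values (p=61, q=53, e=17, d=2753), my choice purely because they are small enough to verify by hand; all arithmetic is plain 64-bit since these moduli are tiny:

```cpp
#include <cassert>
#include <cstdint>
#include <thread>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) { return a * b % m; }

static uint64_t powmod(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1 % m;
    for (b %= m; e != 0; e >>= 1, b = mulmod(b, b, m))
        if (e & 1) r = mulmod(r, b, m);
    return r;
}

// Garner recombination: the x mod p*q with x = p2 (mod p) and x = q2 (mod q),
// where u = p^-1 mod q.
uint64_t CRT(uint64_t p2, uint64_t p, uint64_t q2, uint64_t q, uint64_t u) {
    uint64_t t = mulmod((q2 + q - p2 % q) % q, u, q);
    return p2 + p * t;
}

uint64_t ModularRootSketch(uint64_t a, uint64_t dp, uint64_t dq,
                           uint64_t p, uint64_t q, uint64_t u) {
    uint64_t p2 = 0, q2 = 0;
    std::thread helper([&] { p2 = powmod(a % p, dp, p); }); // helper's half
    q2 = powmod(a % q, dq, q);                              // main thread's half
    helper.join();                                          // i.e. waitTillDone()
    return CRT(p2, p, q2, q, u);
}
```

With a = 2790 (the encryption of 65 under these parameters, since dp = d mod (p-1) = 53, dq = d mod (q-1) = 49, and u = 61^-1 mod 53 = 20), the sketch recovers 65, matching a direct a^d mod pq computation.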
back to top

Results

The following are average results from six runs of each test on each version and computer. In brackets are the stdev values measured between those six runs. We can see around a 20% speedup on the Hyper-Threaded machine and around a 40% speedup on the dual-processor machine.

                      ---------- HT machine ----------    ---------- DP machine ----------
                      unthreaded   threaded     speedup   unthreaded   threaded     speedup
RSA 1024 Decryption   3.7(0.19)    2.86(0.08)   22.65%    4.35(0.06)   2.53(0.07)   41.92%
RSA 2048 Decryption   20.73(0.08)  15.47(0.61)  25.38%    24.45(0.09)  13.6(0.24)   44.4%
RSA 1024 Signature    3.52(0.27)   2.9(0.13)    17.59%    4.19(0.07)   2.53(0.07)   39.75%
RSA 2048 Signature    19.77(0.08)  15.94(0.44)  19.38%    23.26(0.12)  13.78(0.34)  40.76%
back to top

Exponentiation

In today's world of cryptography, and especially in public-key cryptosystems, exponentiation by large exponents is the main computational effort. A lot of academic research has been done to decrease that effort by reducing the number of group multiplications needed to reach the result. Crypto++ uses an exponentiation system known as "Right-to-Left Sliding Window Exponentiation". In my project I have tried to use threads to increase the performance of the system on HT or multi-processor computers.
back to top

Background: Right-to-Left Sliding Window Exponentiation

Let n be the size of the exponent, k the size of the window, and B a variable initialized to the base we are exponentiating. There are also 2^(k-1) buckets, one for each possible window value.
A window is a series of k bits from the exponent whose lsb is always '1'. No two windows overlap, and the entire set of windows that fit an exponent covers all the non-'0' bits the exponent holds. These windows are determined scanning from right to left.
In general, we scan the exponent from right to left (lsb to msb). For any bit we encounter that is the lsb of a window, we multiply the current value of B into the bucket that fits that window's contents. Then, for every bit we encounter, we square B (that is, multiply it by itself). The result can later be computed from the buckets using 2^k - 1 multiplications (0 if k=1). If the exponent is random, there should be a new window every k+1 bits on average (but never more than one every k bits). So, to calculate a random exponent we need on average n + n/(k+1) + 2^k - 1 multiplications (for k>1; 1.5n when k=1).
See the Appendix for an explanation of finding the optimal value of k.
A more detailed explanation of Right-to-Left Sliding Window Exponentiation is available
here
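The scan-and-bucket procedure described above can be sketched as follows. This is a simplified single-exponent version over machine integers (the names are mine, and the modulus is kept below 2^32 so all products fit in 64 bits); it uses the same suffix-product recombination as the Crypto++ code shown later:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) { return a * b % m; }

uint64_t SlidingWindowExp(uint64_t base, uint64_t exp, uint64_t m, unsigned k) {
    std::vector<uint64_t> buckets(1u << (k - 1), 1 % m);  // 2^(k-1) buckets
    uint64_t g = base % m;               // g = base^(2^position) as we scan
    while (exp != 0) {
        if (exp & 1) {                   // lsb of a window found
            uint64_t win = exp & (((uint64_t)1 << k) - 1);  // k bits, lsb '1'
            buckets[win >> 1] = mulmod(buckets[win >> 1], g, m);
            exp >>= k;                   // consume the window...
            for (unsigned i = 0; i < k; i++) g = mulmod(g, g, m); // ...keep g in step
        } else {
            exp >>= 1;
            g = mulmod(g, g, m);         // square g once per bit scanned
        }
    }
    // Combine: result = product of buckets[j]^(2j+1), via the suffix-product
    // trick (2^k - 1 multiplications, 0 when there is a single bucket).
    uint64_t r = buckets.back();
    if (buckets.size() > 1) {
        for (int j = (int)buckets.size() - 2; j >= 1; j--) {
            buckets[j] = mulmod(buckets[j], buckets[j + 1], m);
            r = mulmod(r, buckets[j], m);
        }
        buckets[0] = mulmod(buckets[0], buckets[1], m);
        r = mulmod(mulmod(r, r, m), buckets[0], m);
    }
    return r;
}

// Naive square-and-multiply reference for checking the sketch.
uint64_t NaiveExp(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t r = 1 % m;
    for (base %= m; exp != 0; exp >>= 1, base = mulmod(base, base, m))
        if (exp & 1) r = mulmod(r, base, m);
    return r;
}
```

With k = 1 this degenerates to plain right-to-left square-and-multiply, which is the 1.5n case mentioned above.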
back to top

Cryptopp's implementation of the exponentiation

Crypto++'s implementation is able to exponentiate the same base by multiple exponents. This allows using the same values of B for all the exponents. All the other variables are saved separately for each exponent. Almost all calls to the function are made with only one exponent, and in rare cases two.
The function's name is: AbstractGroup::SimultaneousMultiply
The template parameter is the type of the base and the results. AbstractGroup is, as its name suggests, abstract. It is later instantiated and inherited by a child class that implements the group operators (such as addition). The usual use of it will be through a MultiplicativeGroup, and so it will indeed exponentiate, rather than multiply as its name suggests.

// The names of the functions used might be confusing. The usual use will be through a
// MultiplicativeGroup, so the Add or Accumulate functions usually multiply, and Double
// multiplies the base by itself.

template <class T>

void AbstractGroup<T>::SimultaneousMultiply(T *results, const T &base, const Integer *expBegin, unsigned int expCount) const

{
	std::vector<std::vector<Element> > buckets(expCount);

	std::vector<WindowSlider> exponents;
	exponents.reserve(expCount);

	unsigned int i;

	for (i=0; i<expCount; i++)
	{

		assert(expBegin->NotNegative());
		exponents.push_back(WindowSlider(*expBegin++, InversionIsFast(), 0));

		exponents[i].FindNextWindow();
		buckets[i].resize(1<<(exponents[i].windowSize-1), Identity());
	}

	unsigned int expBitPosition = 0;
	Element g = base;

	bool notDone = true;

	while (notDone)
	{

		notDone = false;
		for (i=0; i<expCount; i++)
		{

			if (!exponents[i].finished && expBitPosition == exponents[i].windowBegin)
			{

				Element &bucket = buckets[i][exponents[i].expWindow/2];

				if (exponents[i].negateNext)
					Accumulate(bucket, Inverse(g));

				else
					Accumulate(bucket, g);
				exponents[i].FindNextWindow();
			}

			notDone = notDone || !exponents[i].finished;
		}

		if (notDone)
		{

			g = Double(g);
			expBitPosition++;
		}
	}

	for (i=0; i<expCount; i++)
	{

		Element &r = *results++;
		r = buckets[i][buckets[i].size()-1];

		if (buckets[i].size() > 1)
		{
			for (int j = buckets[i].size()-2; j >= 1; j--)
			{

				Accumulate(buckets[i][j], buckets[i][j+1]);

				Accumulate(r, buckets[i][j]);
			}
			Accumulate(buckets[i][0], buckets[i][1]);

			r = Add(Double(r), buckets[i][0]);
		}
	}
}
back to top

Threading Intentions

Threading seems to have bounded potential in this case. Most of the work in the function, as seen above, comes from multiplying B by itself n times. These multiplications can be neither threaded nor skipped: there is no way to calculate B^4 that is simpler than calculating B^2 first. So threading can only try to speed up the other n/(k+1) + 2^k - 1 multiplications, those of filling the buckets and retrieving the results out of them. Also, retrieval of the result from the buckets is a serial process.
For that reason, beyond the attempts made to create a speedup for Hyper-Threaded processors, some of the threading attempts were initially directed towards multi-processor environments, allowing extra work to take place while balancing the threads, in the hope that the overall result would be faster. Trends in today's CPU development seem to be going in the direction of dual-processor systems.
back to top

Threading the function - version 1

The idea in this version was to separate the function into two different threads with very little dependency between them, saving work for the main thread but creating a lot of extra work altogether. Each thread calculates the exponentiation of the base by a part of the exponent. When all the threads' work is done, the main thread multiplies the results from both threads to get the final result. The exponents for each thread are chosen such that exp1 + exp2 = exponent, so base^exp1 * base^exp2 = base^exponent.
One thread handles the beginning of the exponent; the other handles the end. The division point is set such that the main thread has the least possible amount of work while still having more work than the helper. This makes sure the main thread won't usually have to wait for the helper thread, keeping the main idea of minimum dependency between the threads.
Say we divide the exponent, of size n, into two parts: the first holds the first j bits of the exponent, and the second holds j zeros in its low bits, followed by the remaining n-j bits extracted from the original exponent. Say that the first exponent is calculated using a window length of q (optimal for j random bits), and the second using a window length of l (optimal for n-j bits).
The first exponent will be calculated using j + j/(q+1) + 2^q - 1 multiplications.
The second exponent will be calculated using n + (n-j)/(l+1) + 2^l - 1 multiplications.

So we're looking for a j that gives us:
j + j/(q+1) + 2^q - 1 ~ n + (n-j)/(l+1) + 2^l - 1
with the right expression a little larger than the left one. This can be simplified to:
j/(q+1) + 2^q = n - j + (n-j)/(l+1) + 2^l
We may assume that 2^l is relatively negligible (since j/n will surely be far closer to 1.0 than to 0.5). To make the calculation easier, and because we want the right expression to be a little larger rather than identical, we calculate j as follows:
j/(q+1) = (n-j)*A - 2^q
j = (n*A - 2^q) / (A + 1/(q+1))
A is a constant, A ~ 1, and its value determines the margin between the work done by the two threads. Calculating q may seem like a problem, because j is unknown and q depends on it. However, n > j > 0.5n, so if we let k be the optimal window size for an exponent of length n (n is known, and therefore so is k), q can only be either k or k-1.
j is therefore calculated in two iterations: in the first, k is used; then q is derived from the resulting j, and j is recalculated from the new value of q (the final window size used is later calculated from the final value of j).
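The two-iteration procedure can be sketched as follows; the window-size table is the worst-case one derived in the Appendix, and A = 1.0 is an assumed value (the text only fixes A ~ 1):

```cpp
#include <cassert>
#include <cmath>

// Worst-case optimal window size table (from the Appendix).
int optimalWindowSize(int expLen) {
    return expLen <= 6 ? 1 : expLen <= 24 ? 2 : expLen <= 96 ? 3 :
           expLen <= 320 ? 4 : expLen <= 960 ? 5 : expLen <= 2688 ? 6 : 7;
}

// Two-iteration computation of the division point j; A = 1.0 is assumed.
int splitPoint(int n, double A = 1.0) {
    // First iteration: use k, the optimal window size for the full n bits.
    int q = optimalWindowSize(n);
    double j = (n * A - std::pow(2.0, q)) / (A + 1.0 / (q + 1));
    // Second iteration: q derived from the resulting j can only be k or k-1.
    q = optimalWindowSize((int)j);
    j = (n * A - std::pow(2.0, q)) / (A + 1.0 / (q + 1));
    return (int)j;
}
```

For n = 1024 (with the assumed A = 1.0) this yields j = 850, leaving the main thread slightly more work than the helper, as intended.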

Summary:

The following are average workloads on the threads, in terms of group multiplications. To give an idea of the numbers, average values for a random exponent of size n = 1024 bits are shown in brackets.
original work: n + n/(k+1) + 2^k - 1 {1233}
main thread: n + (n-j)/(l+1) + 2^l - 1 {1070}
second thread: j + j/(q+1) + 2^q - 1 {1044}
back to top

Threading the function - version 2

The idea in this version was to divide the work into producer and consumer threads. No extra work is created except for unavoidable threading overhead. The produced/consumed values are the values of B mentioned earlier. One thread constantly multiplies B by itself and sends its value to the other thread whenever the beginning of a window is encountered. The other thread multiplies those values into the appropriate buckets. When the work is done, the main thread calculates the result from the buckets' values.

Variant - version 2.1

As we've seen before, the more buckets the exponentiation uses, the less work it has filling up the buckets, but the more it has multiplying these buckets into a result. The longer the exponent, the more buckets the program uses. However, when filling the buckets is done in another thread, we can use smaller windows to make the helper thread work harder and the main thread (which calculates the result) work less. This can decrease overall time because the helper thread has far less work than the main one (it does one multiplication for every k+1 the main thread does). However, this does create more overall work, so this variant is anticipated to work better in a multi-processor environment, where the threads don't battle for CPU time, than on an HT machine.
I've fixed the window size at 3; that way the main thread still works about 4 times as much as the helper thread, and post-computation only takes 7 multiplications. This value was also chosen via experimentation with different values.

Summary

The following are average workloads on the threads, in terms of group multiplications. To give an idea of the numbers, average values for a random exponent of size n = 1024 bits are shown in brackets.
original work: n + n/(k+1) + 2^k - 1 {1233}
main thread - v2: n + 2^k - 1 {1087}
second thread - v2: n/(k+1) {146}
main thread - v2.1: n + 7 {1031}
second thread - v2.1: n/4 {256}
back to top

Implementing Threading

Threading was inserted through another layer of inheritance. The class ThreadedMulGroup inherits from AbstractGroup and has its own implementation of the function SimultaneousExponentiate. Inheriting from ThreadedMulGroup rather than AbstractGroup makes the inheriting group use the threaded version. The threads used for exponentiation are implemented inside the class MulThread (another child of HelperThread).
Note: An AbstractRing carries out all its exponentiations through its member MultiplicativeGroup. MultiplicativeGroup inherits from ThreadedMulGroup, and therefore an AbstractRing uses threading when exponentiating even if it does not itself inherit from ThreadedMulGroup. An AbstractRing also contains functions and constructors handling the HelperThread attached to its member MultiplicativeGroup. This is also true for classes inheriting from AbstractRing.

Using Modular-Root Thread along with Multiplicative threads

Since I only have two CPUs, threading the multiplications done while calculating the two exponentiations will only result in unnecessary threading overhead. These multiplications are therefore done without threading. This was done by setting the MulThread of the ModularArithmetic objects created for these exponentiations to 0. Multiplications done outside the ModularRoot function will still be threaded.
back to top

Results

Four versions of the threading mechanism were tested using a new benchmark made for the purpose. The versions are: non-threaded, element-threaded (threading version 2), the DP variant of element threading (threading version 2.1), and exponent-threaded (threading version 1). The benchmark performs modular exponentiations, with a fixed modulus but random bases and exponents. The modulus is the same size as the bases.
Each version was run four times. Each set of four consecutive test runs covered the four different versions, to prevent as much as possible other CPU interference from damaging the results. Results shown here are the average milliseconds required per operation. In brackets are the stdev values between the four runs.
The second set of tests examined threading efficiency over elliptic-curve multiplications over GF(p). These multiplications use the same SimultaneousMultiply function. All these tests use the same ECP(155) field used in Crypto++'s benchmark.

Integer Exponentiation results

base size  exp. size  --------------------- HT results ---------------------      --------------------- DP results ----------------------
                      Non            elements       elements (DP)  exponent       Non            elements       elements (DP)  exponent
512bit     256bit     0.69 (0.01)    0.78 (0)       0.81 (0.01)    1.14 (0)       0.85 (0)       0.91 (0)       0.93 (0.01)    1.16 (0.07)
512bit     512bit     1.33 (0.04)    1.46 (0)       1.56 (0.01)    1.97 (0.01)    1.59 (0)       1.7 (0.01)     1.76 (0.01)    1.92 (0.07)
1024bit    128bit     2.19 (0.04)    2.15 (0.05)    2.13 (0.01)    3.69 (0.02)    2.72 (0.01)    2.59 (0.19)    2.45 (0.03)    3.77 (0.19)
1024bit    256bit     4.3 (0)        4.14 (0.09)    4.16 (0.02)    6.61 (0.03)    5.4 (0.05)     4.85 (0.31)    4.65 (0.02)    6.53 (0.42)
1024bit    512bit     8.39 (0.01)    8.29 (0.25)    8.43 (0.02)    12.4 (0.01)    10.49 (0.06)   9.99 (0.83)    9.39 (0.05)    11.77 (0.68)
1024bit    1024bit    16.45 (0.07)   16.05 (0.4)    16.6 (0.04)    23.67 (0.04)   20.54 (0.06)   19.48 (1.53)   18.51 (0.05)   22.54 (1.59)
2048bit    2048bit    50.94 (0.18)   49.23 (0.34)   52.5 (1.71)    72.35 (1.85)   63.96 (0.17)   56.65 (0.38)   56.65 (0.62)   65.65 (3.24)
3072bit    3072bit    229.87 (0.38)  222.05 (0.37)  234.09 (1.73)  322.74 (4.9)   289.35 (0.99)  263.95 (1.45)  264.06 (5.15)  299.26 (4.67)

Elliptic curve multiplication by constant results

constant   --------------------- HT results ---------------------      --------------------- DP results ----------------------
size       Non            elements       elements (DP)  exponent       Non            elements       elements (DP)  exponent
8bit       0.1 (0.02)     0.12 (0.02)    0.11 (0.03)    0.09 (0.05)    0.13 (0.03)    0.13 (0.04)    0.12 (0.03)    0.14 (0.02)
16bit      0.47 (0)       0.47 (0.01)    0.48 (0.01)    0.4 (0.19)     0.59 (0.01)    0.6 (0.01)     0.59 (0)       0.6 (0.01)
62bit      1.79 (0.02)    1.86 (0.02)    1.84 (0)       2.77 (1.36)    2.27 (0)       2.32 (0.02)    2.31 (0.01)    3.71 (0.13)
124bit     3.53 (0.03)    3.65 (0.04)    3.58 (0.03)    4.87 (2.4)     4.47 (0.03)    4.53 (0)       4.51 (0.04)    6.25 (0.31)
248bit     6.94 (0.03)    7.07 (0.04)    7.08 (0.04)    8.75 (4.31)    8.72 (0.06)    8.8 (0.04)     8.82 (0.03)    11.06 (0.86)

Conclusions

Exponent-threading creates too much overhead work, and produces no speedup over the non-threaded version. Element-threading does produce a measurable speedup on integer modular exponentiation, and the variant meant for a multi-processor system does work better on a DP machine than the original version.
Based on these results, I have cancelled the Elliptic Curve engine's use of the threaded multiplication.
back to top

final program

Threading profiles


These results were retrieved from the Intel Thread Profiler (using the Windows API), on a Hyper-Threaded Pentium 4 machine. These are the profiles of the critical path of the program when calculating an RSA 2048 decryption (on the right) and a 1024-bit base & exponent exponentiation (on the left). It is clear that the RSA calculation has much better balanced threads, and indeed we saw it creates a larger speedup.
back to top

Threading checks


These results were retrieved from the Intel Thread Checker. This tool is used to check, among other things, that different threads will not access the same data without blocking each other. Using the HelperThread mechanism helps ensure this from a software point of view: all access to "shared" data is done through this class, which, when used properly, provides the necessary thread safety. The check was done while executing a benchmark of the final program. Helper threads are sometimes not needed for a long period of time, which creates the warnings about long thread wait times. Since the main thread is the only thread in charge of other helper threads, it dies holding the related sync objects; this is also not a problem.
back to top

Speedup results

These results were retrieved by running the original against the final version, on an HT and a DP computer respectively. Each test was run for five seconds, and each benchmark was run 10 times. The results shown, and the speedups calculated from them, are the average time (in milliseconds) per operation. In brackets is the stdev calculated between the 10 retrieved results. Only tests that use the SimultaneousMultiply function are shown.

                                               ------- HT machine -------    ----------- DP machine -----------
Test                                           Original    final   speedup   Original    final (DP variant)   speedup
RSA 1024 Encryption 0.12(0) 0.12(0) 0% 0.15(0) 0.15(0) 0%
RSA 1024 Decryption 3.49(0.03) 2.79(0.01) 20% 5.68(1.16) 3.45(0.05) 39%
Rabin 1024 Decryption 4.16(0.19) 4.1(0.22) 1% 5.17(0.27) 5.14(0.3) 1%
DLIES 1024 Encryption 2.82(0.12) 2.88(0.1) -2% 3.55(0.22) 3.26(0.12) 8%
DLIES 1024 Encryption with precomputation 3.01(0.14) 2.99(0.07) 0% 3.78(0.17) 3.74(0.07) 1%
DLIES 1024 Decryption 9.1(0.13) 9.1(0.13) 0% 11.42(0.21) 10.66(0.18) 7%
RSA 2048 Encryption 0.29(0.01) 0.3(0.01) -1% 0.37(0.02) 0.37(0) 0%
RSA 2048 Decryption 18.28(0.12) 14.84(0.09) 19% 22.84(0.14) 17.53(1.88) 23%
Rabin 2048 Decryption 20.6(1.25) 20.34(1.24) 1% 25.83(1.4) 24.09(2.87) 7%
DLIES 2048 Encryption 12.45(0.21) 11.54(0.27) 7% 15.62(0.44) 13.03(0.17) 17%
DLIES 2048 Encryption with precomputation 11.79(0.27) 11.69(0.36) 1% 14.75(0.44) 14.57(0.17) 1%
DLIES 2048 Decryption 54.83(6.16) 51.22(0.53) 7% 71.57(14.11) 61.66(10.23) 14%
RSA 1024 Signature 3.61(0.06) 2.79(0.03) 23% 5.53(1.49) 3.44(0.08) 38%
RSA 1024 Verification 0.12(0) 0.12(0) 0% 0.15(0) 0.15(0) 0%
Rabin 1024 Signature 4.1(0.06) 4.05(0.06) 1% 5.09(0.05) 5.07(0.06) 0%
Rabin 1024 Verification 1.11(0.14) 1.11(0.14) 0% 1.42(0.17) 1.34(0.11) 6%
RW 1024 Signature 4.63(0.13) 3.43(0.05) 26% 4.32(0.08) 4.32(0.03) 0%
RW 1024 Verification 0.06(0) 0.06(0) 0% 0.08(0.01) 0.07(0.01) 4%
NR 1024 Signature 1.44(0.01) 1.46(0.03) -2% 1.81(0.09) 1.65(0.03) 9%
NR 1024 Signature with precomputation 0.73(0.03) 0.71(0.03) 2% 0.93(0.06) 0.89(0.03) 4%
NR 1024 Verification 1.64(0.07) 1.68(0.16) -2% 2.06(0.08) 2.08(0.08) -1%
NR 1024 Verification with precomputation 1.17(0.08) 1.13(0.12) 4% 1.47(0.17) 1.41(0.17) 4%
DSA 1024 Signature 1.41(0.04) 1.44(0.01) -2% 1.76(0.04) 1.65(0.02) 6%
DSA 1024 Signature with precomputation 0.7(0.02) 0.71(0.02) 0% 0.88(0.01) 0.88(0) 0%
DSA 1024 Verification 1.63(0.11) 1.64(0.12) -1% 2.03(0.07) 1.97(0.09) 3%
DSA 1024 Verification with precomputation 1.17(0.12) 1.12(0.13) 4% 1.47(0.19) 1.37(0.14) 7%
ESIGN 1023 Signature 0.33(0.01) 0.33(0) 0% 0.41(0.01) 0.41(0.01) 0%
ESIGN 1023 Verification 0.11(0) 0.11(0.01) -1% 0.14(0) 0.14(0.01) -1%
ESIGN 1536 Signature 0.65(0.02) 0.65(0.02) 0% 0.82(0.04) 0.81(0.02) 0%
ESIGN 1536 Verification 0.27(0.01) 0.26(0) 2% 0.33(0) 0.33(0.01) 0%
RSA 2048 Signature 18.21(0.33) 14.79(0.09) 19% 22.73(0.21) 17.25(1.14) 24%
RSA 2048 Verification 0.3(0.03) 0.29(0) 2% 0.36(0.02) 0.36(0.01) 0%
Rabin 2048 Signature 20.74(0.6) 20.47(0.17) 1% 25.86(0.32) 24.33(2.42) 6%
Rabin 2048 Verification 2.89(0.16) 2.83(0.15) 2% 3.59(0.33) 3.58(0.29) 0%
RW 2048 Signature 18.43(0.21) 18.21(0.12) 1% 23(0.21) 21.34(1.52) 7%
RW 2048 Verification 0.13(0.01) 0.13(0.01) 1% 0.17(0.01) 0.17(0.02) 1%
NR 2048 Signature 6.3(0.16) 5.85(0.1) 7% 7.86(0.09) 6.63(0.08) 16%
NR 2048 Signature with precomputation 2.1(0.1) 2.06(0.03) 2% 2.64(0.1) 2.61(0.08) 1%
NR 2048 Verification 7.18(0.56) 7.14(0.19) 1% 8.94(0.29) 8.98(0.33) 0%
NR 2048 Verification with precomputation 3.4(0.22) 3.27(0.23) 4% 4.27(0.28) 4.19(0.32) 2%
ESIGN 2046 Signature 0.77(0.02) 0.77(0.02) 0% 0.96(0.02) 0.97(0.02) 0%
ESIGN 2046 Verification 0.28(0) 0.28(0.01) 0% 0.35(0.02) 0.35(0) 1%
DH 1024 Key-Pair Generation 1.42(0) 1.44(0.01) -1% 1.78(0.02) 1.65(0.18) 7%
DH 1024 Key-Pair Generation with precomputation 1.45(0.01) 1.51(0.01) -4% 1.83(0.02) 1.91(0.12) -4%
DH 1024 Key Agreement 2.53(0.11) 2.53(0.09) 0% 3.19(0.1) 3.01(0.04) 6%
DH 2048 Key-Pair Generation 6.28(0.21) 5.89(0.67) 6% 7.8(0.06) 6.49(0.09) 17%
DH 2048 Key-Pair Generation with precomputation 5.96(0.23) 5.95(0.67) 0% 7.34(0.06) 7.32(0.07) 0%
DH 2048 Key Agreement 9.06(0.3) 8.54(0.12) 6% 11.3(0.22) 10.09(0.89) 11%
MQV 1024 Key-Pair Generation 1.4(0.02) 1.41(0.01) -1% 1.75(0.03) 1.6(0.14) 8%
MQV 1024 Key-Pair Generation with precomputation 0.7(0.03) 0.69(0.02) 1% 0.86(0.06) 0.87(0.02) -1%
MQV 1024 Key Agreement 2.63(0.04) 2.62(0.09) 0% 3.31(0.04) 2.91(0.05) 12%
MQV 2048 Key-Pair Generation 6.21(0.22) 5.75(0.13) 7% 7.75(0.21) 6.53(0.22) 16%
MQV 2048 Key-Pair Generation with precomputation 2.04(0.12) 2.02(0.07) 1% 2.54(0.1) 2.52(0.04) 1%
MQV 2048 Key Agreement 11.3(0.12) 10.46(2.18) 7% 14.23(0.41) 10.82(0.11) 24%
back to top

Source and executables

final code
executable for an HT machine
executable for a DP machine
original executable
dat files
test vectors
Thread activation/parametrization is done in the file helperthread.h. To create the DP variant of element threading (version 2.1), set threadWindowSize() to return 3 (or try a different fixed window size). To thread without creating extra work (version 2), make this function return 0.
The .dat files need to be in the same folder as the executable to run the benchmark (use " b"). To run a test vector use " tv ". The all.txt test vector runs all the tests.
back to top

Appendix - optimal window size

As seen before, calculating an exponentiation where the exponent is of size n, using a window size of k, takes n + n/(k+1) + 2^k - 1 multiplications on average (1.5n if k=1), or n + n/k + 2^k - 1 in the worst case (2n if k=1).
We can clearly see that if k is optimal for a certain n, then for anything less than n only k or less can be optimal, and k or more for anything above n. So, to find which k is optimal for any n, I searched for the transition points: values of n for which using window size k requires the same amount of multiplications as using window size k+1. I based my calculations on the worst-case scenario. This means that n can be found from this equation:
n + n/k + 2^k - 1 = n + n/(k+1) + 2^(k+1) - 1
n/k = n/(k+1) + 2^k
n(k+1) = nk + k(k+1)·2^k
n = k(k+1)·2^k
The transition between k=1 and k=2 is calculated separately:
n + n/2 + 2^2 - 1 = 2n
n = 6
And so the function retrieving the optimal window size for a given exponent length is:
//current program - using worst-case scenario:
return (expLen <= 6 ? 1 : (expLen <= 24 ? 2 : (expLen <= 96 ? 3 : (expLen <= 320 ? 4 : (expLen <= 960 ? 5 : (expLen <= 2688 ? 6 : 7))))));

//another possibility could be using average-case scenario:
return (expLen <= 17 ? 1 : (expLen <= 47 ? 2 : (expLen <= 159 ? 3 : (expLen <= 479 ? 4 : (expLen <= 1343 ? 5 : (expLen <= 3583 ? 6 : 7))))));
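As a sanity check, the thresholds in the worst-case table can be verified by brute force: for every n up to the next transition point (k(k+1)·2^k = 7·8·128 = 7168, where k = 8 would take over from k = 7), the table's k should minimize the worst-case cost. A sketch:

```cpp
#include <cassert>
#include <cmath>

// Worst-case multiplication count: n + n/k + 2^k - 1, or 2n when k = 1.
double worstCaseCost(double n, int k) {
    return k == 1 ? 2 * n : n + n / k + std::pow(2.0, k) - 1;
}

int optimalWindowSize(int expLen) {   // the program's worst-case table
    return expLen <= 6 ? 1 : expLen <= 24 ? 2 : expLen <= 96 ? 3 :
           expLen <= 320 ? 4 : expLen <= 960 ? 5 : expLen <= 2688 ? 6 : 7;
}

// True if no window size j beats the table's choice for any n <= maxLen.
// maxLen must stay below 7168, the k=7 to k=8 transition point.
bool tableIsOptimal(int maxLen) {
    for (int n = 1; n <= maxLen; n++) {
        int k = optimalWindowSize(n);
        for (int j = 1; j <= 8; j++)
            if (worstCaseCost(n, j) < worstCaseCost(n, k)) return false;
    }
    return true;
}
```

At each threshold (6, 24, 96, 320, 960, 2688) the two adjacent window sizes cost exactly the same, matching the n = k(k+1)·2^k derivation above.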

back to top