Faster Modular Exponentiation Using Double Precision Floating Point Arithmetic on the GPU Niall Emmart, Fangyu Zheng, Charles Weems Talk Overview Mod Exp is an important cryptographic primitive: RSA, digital signatures, prime generation, content centric networks, etc Look at a simpler problem: big number multiplication

Computations can be done in the integer domain or floating point domain Narrow Samples vs Wide Samples Optimizations Results Question / Conjecture Big Num Multiplication Narrow Samples Given MP values A, and B, use floating point arithmetic for samples and column sums:

Ensure the column sums do not overflow the exact integer range. Implies each samples uses less than half the floating point precision Big Num Multiplication Wide Samples Computing a Dot-Product Where a0 = 250, b0 = 250 and a1 = 1, b1 = 1 Desired result: H(sum) = 2100, L(sum) = 1

Wide Samples Use the full precision of the floating point value. Well known fused-multiply-add technique can be used to compute high and low products of a sample as follows: full_product(double a_sample, double b_sample) { double hi, lo; hi = __fma_rz(a_sample, b_sample, 0.0); lo = __fma_rz(a_sample, b_sample, -hi); return (hi, lo);

} Alignment Problem Does not work well for computing dot products (or column sums): Consider: a0 = 250, b0 = 250 and a1 = 1, b1 = 1 We have p0 = (2100, 0) and p1 = (1, 0) Component-wise addition of p0 + p1 gives (2100, 0) not the desired (2100, 1) Alignment Problem -- Solution Solution -- force the alignment of the high and low products:

Use 52 bit samples instead of 53 Add 2104 to high products and 252 to low products full_product(double a_sample, double b_sample) { double c1=2104, c2=2104+252, add; hi = __fma_rz(a_sample, b_sample, c1); // high(a_samp*b_samp) + add = c2 hi; lo = __fma_rz(a_sample, b_sample, add);

// low(a_samp*b_samp) + 2104 252 return (hi, lo); } Overflow Problem Returning to our example: a0 = 250, b0 = 250 and a1 = 1, b1 = 1

We have p0 = (2104+2100, 252+0) and p1 = (2104+0, 252+1) These are correctly aligned. But we cant use double precision for the sums it will immediately overflow. A Slow Approach full_product(double a_sample, double b_sample) { double c1=2104, c2=2104+252, add; hi = __fma_rz(a_sample, b_sample, c1);

add = c2 hi; lo = __fma_rz(a_sample, b_sample, add); lo = lo - 252; hi = (hi - 2104) * 2-52; return ((uint64_t)hi, (uint64_t)lo); } IEEE 754 Double Precision Representation

Sign bit 11 bit Exponent with bias of 1023 52 bit mantissa, with implicit high bit (i.e., not stored in the bit representation). If we store 2104+x, in a double (with rz rounding), where 0 x < 2104, then 2104 becomes the

implicit bit and the 52-bit mantissa is exactly filled with . Likewise, if we store 252 + y in a double, where 0 y < 252 then 252 becomes the implicit bit and the 52-bit mantissa is exactly filled with y. Overflow Problem -- Solution full_product(double a_sample, double b_sample) { double c1 = 2104, c2 = 2104+252, sub; uint64_t uhi, ulo, mask = 0xFFFFFFFFFFFFFull; hi = __fma_rz(a_sample, b_sample, c1);

sub = c2 hi; lo = __fma_rz(a_sample, b_sample, sub); uhi = to_u64(hi) & mask; ulo = to_u64(lo) & mask; return (uhi, ulo); } Returning to our example Returning to our example: a0 = 250, b0 = 250 and a1 = 1, b1 = 1. p0 = full_product(a0, b0)

p1 = full_product(a1, b1) We have p0 = (248, 0) and p1 = (0, 1). These can be added using component-wise addition, using unsigned 64-bit integers. Big Num Multiplication Wide Samples MP Product Using Wide Samples mp_product(double a_samples[], double b_samples[])

{ double c1 = 2104, c2 = 2104+252, sub; uint64_t cols[2N] = {0}, mask = 0xFFFFFFFFFFFFFull; for(i=0;i

lo = __fma_rz(a_sample[i], b_sample[j], sub); cols[i+j] += to_u64(lo) & mask; cols[i+j+1] += to_u64(hi) & mask; } } Optimization Top 12 Bits

Top 12 bits are always constant. For high products, its 0x467, which is 104 + 1023 (the bias). For low products it is always 0x433, which is 52 + 1023. Since they are constant, we can add them to the column sums and by tracking the number of high and low products, we can cancel them all out with a single subtractions. The subtraction can even by avoided by initializing the column sum with the right magic value. MP Product Optimized Version mp_product(double a_samples[], double b_samples[]) {

double c1 = 2104, c2 = 2104+252, sub; uint64_t cols[2*N]; // magic values to cancel high/low constants for(i=0;i

hi = __fma_rz(a_sample[i], b_sample[j], c1); sub = c2 hi; lo = __fma_rz(a_sample[i], b_sample[j], sub); cols[i+j] += to_u64(lo); cols[i+j+1] += to_u64(hi); } Initial Value Routine

uint64_t make_initial(int high_count, int low_count) { uint64_t value = 0x467*high_count + 0x433*low_count; return ((value & 0xFFF)<<52); } Mod Exp Implementation (lots of hand waving) Parallel implementation with 4-8 GPU threads assigned to each Mod Exp instance Uses a CIOS word-by-word Montgomery reduction

Take advantage of free bits to avoid Montgomery correction steps (See Orup [15] or Walter [16]) Modular exponentiation implemented with fixed window exponentiation Results Performance of Modular Exponentiation on a GTX Titan Black GPU: Comparison to Prior Work Question / Conjecture / Future Work

Consider the case of A * B, where A and B are chains of doubles: A = D0 -> D1 -> D2 -> -> Dn-1 B = D0 -> D1 -> D2 -> -> Dn-1 Can we beat the performance of the standard (53 bit) algorithm with normalization by using 52-bits samples, forcing Di -> Di+1 to be 52 bits apart and use the approach described in this paper? Conjecture: Yes! We can do this in O(n2) steps Thank you! Questions?