Motivation
  Maximal information with a given number of bits.
  Arithmetic with proper precision.
  Fairness of rounding.
  These features come at the expense of more complex operations.

Outline
  Standard
  Operations
  Exceptional Situations
  Rounding Modes
  Reference: Michael L. Overton, Numerical Computing with IEEE Floating Point Arithmetic, SIAM.

Standard
  Typically 2^32 bit patterns.
  Goal: dynamic range = largest # / smallest #.
  If the range is too large, there are holes between #s.

CSE 246

Standard
  ulp (unit in the last place): the difference between two consecutive values of the significand.
  3 parts: x = ± s × b^e (sign, significand, exponent)
    Sign bit
    23-bit significand
    8-bit exponent

Standard
  Bit layout: ± e1e2e3e4e5e6e7e8 s1s2s3...s22s23
    1.s1s2s3...s22s23   normalized number
    0.s1s2s3...s22s23   denormalized number

  e1e2e3e4e5e6e7e8   value   x
  00000000             0     x = 0.s1s2s3...s22s23 × 2^-126
  00000001             1     x = 1.s1s2s3...s22s23 × 2^-126
  00000010             2     x = 1.s1s2s3...s22s23 × 2^-125
  ...
  01111111           127     x = 1.s1s2s3...s22s23 × 2^0
  ...
  11111101           253     x = 1.s1s2s3...s22s23 × 2^126
  11111110           254     x = 1.s1s2s3...s22s23 × 2^127
  11111111           255     x = Inf if (s1...s23) = 0, NaN otherwise

  NaN: Not a Number

Standard
  0.01 × 2^-3 = 0.001 × 2^-2: the same number, so normalize to remove the redundancy.
  Use a default (implicit) 1 in front for one more bit of precision.
  Smallest number: 0.00...01 × 2^-126 = 1.0 × 2^-23 × 2^-126 = 1 × 2^-149
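
The encoding table above can be sketched as a small decoder. This is a minimal illustration, not a reference implementation: the function name decode_single and the choice to take the 32-bit pattern as a Python integer are assumptions made for this example. The loop at the end compares the result against the machine's own single-precision decoding via the standard struct module.

```python
import struct


def decode_single(bits):
    """Decode a 32-bit pattern (given as an int) using the table above.

    Hypothetical helper for illustration: sign bit, 8-bit exponent field,
    23-bit significand field.
    """
    sign = -1.0 if (bits >> 31) & 1 else 1.0
    exponent = (bits >> 23) & 0xFF       # e1...e8 as an unsigned integer
    fraction = bits & 0x7FFFFF           # s1...s23 as an unsigned integer

    if exponent == 255:                  # all ones: Inf or NaN
        return sign * float("inf") if fraction == 0 else float("nan")
    if exponent == 0:                    # denormalized: 0.s1...s23 x 2^-126
        return sign * (fraction / 2**23) * 2.0**-126
    # normalized: 1.s1...s23 x 2^(exponent - 127)
    return sign * (1 + fraction / 2**23) * 2.0**(exponent - 127)


# Cross-check a few patterns against the hardware format via struct.
for pattern in (0x00000001, 0x00800000, 0x3F800000, 0x7F7FFFFF, 0x7F800000):
    reference = struct.unpack(">f", struct.pack(">I", pattern))[0]
    assert decode_single(pattern) == reference
```

For instance, decode_single(0x00000001) gives the smallest denormalized number 2^-149, and decode_single(0x00800000) gives the normalized minimum 2^-126.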

Standard - Example
  ± eeeeeeee sssss sssss sssss sssss sss
  0 00000000 00000000000000000000000 =  0.00...00 × 2^-126
  1 00000000 00000000000000000000000 = -0.00...00 × 2^-126
  0 00000000 00000000000000000000001 =  0.00...01 × 2^-126 = 1.0 × 2^-149
  0 00000001 00000000000000000000000 =  1.00...00 × 2^-126   (normalized minimum)
  0 00000001 00000000000000000000001 =  1.00...01 × 2^-126
  ...
  0 01111111 00000000000000000000000 =  1.00...00 × 2^0
  0 01111111 00000000000000000000001 =  1.00...01 × 2^0
  0 10000000 00000000000000000000001 =  1.00...01 × 2^1

Standard - Example (cont.)
  0 11111110 00000000000000000000000 =  1.00...00 × 2^127
  0 11111110 00000000000000000000001 =  1.00...01 × 2^127
  0 11111110 11111111111111111111111 =  1.11...11 × 2^127   (normalized maximum)
  0 11111111 00000000000000000000000 =  Inf

  Nmin = 1.0 × 2^-126
  Nmax = (2 - 2^-23) × 2^127

Double Floating Point
  ± e1e2...e11 s1s2...s52
  0 00...000 s1s2...s52   x = 0.s1s2...s52 × 2^-1022
  0 00...001 s1s2...s52   x = 1.s1s2...s52 × 2^-1022
  ...
  0 01...111 s1s2...s52   x = 1.s1s2...s52 × 2^0
  0 10...000 s1s2...s52   x = 1.s1s2...s52 × 2^1
  ...
  0 11...110 s1s2...s52   x = 1.s1s2...s52 × 2^1023
  0 11...111 s1s2...s52   x = Inf if (s1...s52) = 0

Overflow/Underflow
  [Figure: number line from Nmin to Nmax. Underflow lies below Nmin and overflow above Nmax; representable numbers are denser near Nmin and sparser near Nmax.]

Addition/Multiplication
  (±s1 × b^e1) + (±s2 × b^e2) = ±s × b^e
      = ±s1 × b^e1 + (±s2 / b^(e1-e2)) × b^e1
      = (±s1 + ±s2 / b^(e1-e2)) × b^e1
  (±s1 × b^e1) × (±s2 × b^e2) = ±(s1 × s2) × b^(e1+e2)
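
The two formulas above can be tried out with exact rational arithmetic. The sketch below is only an illustration under stated assumptions: the names normalize, fp_add and fp_mul are hypothetical, significands are signed Fraction values, and no rounding to a fixed number of bits is performed, so it shows the exponent alignment and normalization steps but not the precision loss of real hardware.

```python
from fractions import Fraction


def normalize(s, e, b=2):
    """Rescale a nonzero significand s into [1, b) by adjusting exponent e."""
    s = Fraction(s)
    while abs(s) >= b:       # too large: shift right, bump the exponent
        s /= b
        e += 1
    while abs(s) < 1:        # too small: shift left, lower the exponent
        s *= b
        e -= 1
    return s, e


def fp_add(s1, e1, s2, e2, b=2):
    """(s1 x b^e1) + (s2 x b^e2) = (s1 + s2 / b^(e1-e2)) x b^e1."""
    if e1 < e2:              # make operand 1 the one with the larger exponent
        s1, e1, s2, e2 = s2, e2, s1, e1
    s = Fraction(s1) + Fraction(s2) / Fraction(b) ** (e1 - e2)
    if s == 0:
        return Fraction(0), 0
    return normalize(s, e1, b)


def fp_mul(s1, e1, s2, e2, b=2):
    """(s1 x b^e1) x (s2 x b^e2) = (s1 x s2) x b^(e1+e2)."""
    s = Fraction(s1) * Fraction(s2)
    if s == 0:
        return Fraction(0), 0
    return normalize(s, e1 + e2, b)


# 1.101b x 2^3 + 1.110b x 2^3 = 11.011b x 2^3 = 1.1011b x 2^4
print(fp_add(Fraction(13, 8), 3, Fraction(7, 4), 3))   # (Fraction(27, 16), 4)
# 1.1b x 2^1 times 1.1b x 2^1 = 10.01b x 2^2 = 1.001b x 2^3
print(fp_mul(Fraction(3, 2), 1, Fraction(3, 2), 1))    # (Fraction(9, 8), 3)
```

Negative significands stand in for the sign, so subtraction is fp_add with a negated second operand.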

Exceptions
  a / 0 = Inf       if a > 0
  a / Inf = 0       if a is finite
  a × 0 = 0         if a is finite
  a × Inf = Inf     if a > 0
  a + Inf = Inf     if a is finite
  0 × Inf = invalid operation (NaN)
  0 / 0 = invalid operation (NaN)
  Inf - Inf = NaN
  NaN op a = NaN

Rounding Mode
  Adder output = Cout z1 z0 . z-1 z-2 ... z-l G R S
    G: guard bit
    R: round bit
    S: sticky bit, the OR of all bits below bit R

      1.101  × 2^3
    + 1.110  × 2^3
    = 11.011 × 2^3
    = 1.1011 × 2^4    (normalize; the result then needs to be rounded)

Rounding
      1.110 × 2^3
    - 1.101 × 2^3
    = 0.001 × 2^3
    = 1.000 × 2^0     (normalize)

      1.101  × 2^3
    - 1.101  × 2^2
    = 1.101  × 2^3 - 0.1101 × 2^3    (the shifted-out bit is kept in the guard bit)
    = 0.1101 × 2^3
    = 1.101  × 2^2    (normalize)

Rounding
  1.10111 rounded to four fraction bits:
    Round to the nearest even   1.1100   (tie; the neighbor whose last bit is 0 is chosen)
    Toward 0                    1.1011
    Toward +Inf                 1.1100
    Toward -Inf                 1.1011

Conventional Rounding Error
  Value      Rounded   Error (in units of the last retained bit)
  1.10100    1.101       0
  1.10101    1.101      -0.25
  1.10110    1.110      +0.5
  1.10111    1.110      +0.25

  Average error = 0.5/4 = 0.125
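
The rounding modes and the average-error calculation above can be reproduced with a short sketch. Several things here are assumptions made for illustration: the function names round_binary and average_error_ulps, the mode strings, and the reading of "conventional rounding" as round-half-up. The eight-value range used to compare the two nearest modes is also an addition; the slide itself lists only four values.

```python
import math
from fractions import Fraction


def round_binary(x, frac_bits, mode):
    """Round x to frac_bits binary fraction bits under the given mode."""
    scaled = Fraction(x) * 2**frac_bits      # candidates are now integers
    lo, hi = math.floor(scaled), math.ceil(scaled)
    if mode == "toward_zero":
        n = lo if scaled >= 0 else hi
    elif mode == "toward_pos_inf":
        n = hi
    elif mode == "toward_neg_inf":
        n = lo
    elif mode == "half_up":                  # assumed "conventional" rounding
        n = math.floor(scaled + Fraction(1, 2))
    elif mode == "nearest_even":
        diff = scaled - lo
        if diff < Fraction(1, 2):
            n = lo
        elif diff > Fraction(1, 2):
            n = hi
        else:                                # exact tie: pick the even neighbor
            n = lo if lo % 2 == 0 else lo + 1
    else:
        raise ValueError(f"unknown mode: {mode}")
    return Fraction(n, 2**frac_bits)


def average_error_ulps(values, frac_bits, mode):
    """Mean of (rounded - exact), measured in units of the last retained bit."""
    errors = [(round_binary(v, frac_bits, mode) - v) * 2**frac_bits for v in values]
    return sum(errors) / len(errors)


x = Fraction(0b110111, 32)                   # 1.10111 in binary
for mode in ("nearest_even", "toward_zero", "toward_pos_inf", "toward_neg_inf"):
    print(mode, round_binary(x, 4, mode))

four = [Fraction(n, 32) for n in range(0b110100, 0b111000)]    # 1.10100 .. 1.10111
eight = [Fraction(n, 32) for n in range(0b110100, 0b111100)]   # 1.10100 .. 1.11011
print(average_error_ulps(four, 3, "half_up"))        # 1/8, the 0.125 on the slide
print(average_error_ulps(eight, 3, "half_up"))       # 1/8
print(average_error_ulps(eight, 3, "nearest_even"))  # 0
```

Over the eight-value range the two ties round in opposite directions under round to nearest even, so the errors cancel, whereas round-half-up keeps the positive bias of 0.125.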