# CSE 245 Lecture Notes

CSE 246: Computer Arithmetic Algorithms and Hardware Design, Fall 2006

Lecture 9: Floating Point Numbers

Instructor: Prof. Chung-Kuan Cheng

## Motivation

- Maximal information with a given number of bits.
- Arithmetic with proper precision.
- Fairness of rounding.
- These features come at the expense of the complexity of the operations.

## Topics: Floating Point Numbers (IEEE P754)

- Standard
- Operations
- Exceptional Situations
- Rounding Modes

Reference: Michael L. Overton, *Numerical Computing with IEEE Floating Point Arithmetic*, SIAM.

## Standard

- Typically 32 bits, hence $2^{32}$ bit patterns.
- Goal: dynamic range = largest # / smallest #.

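- If the dynamic range is too large, there are holes between the numbers.

ulp (unit in the last place): the difference between two consecutive values of the significand.

A number has three parts, $x = \pm s \times b^{e}$: sign, significand, exponent.

- Sign bit
- 23-bit significand
- 8-bit exponent

Bit layout: $\pm \;\; e_1 e_2 e_3 e_4 e_5 e_6 e_7 e_8 \;\; s_1 s_2 s_3 \ldots s_{22} s_{23}$

- $1.s_1 s_2 s_3 \ldots s_{22} s_{23}$: normalized number
- $0.s_1 s_2 s_3 \ldots s_{22} s_{23}$: denormalized number

| $e_1 e_2 e_3 e_4 e_5 e_6 e_7 e_8$ | Exponent field value | $x$ |
|---|---|---|
| 00000000 | 0 | $x = \pm 0.s_1 s_2 s_3 \ldots s_{22} s_{23} \times 2^{-126}$ |
| 00000001 | 1 | $x = \pm 1.s_1 s_2 s_3 \ldots s_{22} s_{23} \times 2^{-126}$ |
| 00000010 | 2 | $x = \pm 1.s_1 s_2 s_3 \ldots s_{22} s_{23} \times 2^{-125}$ |
| ⋮ | ⋮ | ⋮ |
| 01111111 | 127 | $x = \pm 1.s_1 s_2 s_3 \ldots s_{22} s_{23} \times 2^{0}$ |
| ⋮ | ⋮ | ⋮ |
| 11111101 | 253 | $x = \pm 1.s_1 s_2 s_3 \ldots s_{22} s_{23} \times 2^{126}$ |
| 11111110 | 254 | $x = \pm 1.s_1 s_2 s_3 \ldots s_{22} s_{23} \times 2^{127}$ |
| 11111111 | 255 | $x = \pm\text{Inf}$ if $(s_1 \ldots s_{23}) = 0$, NaN otherwise |

NaN: Not a Number.

$0.01 \times 2^{-3} = 0.001 \times 2^{-2}$ is the same number, so normalize to remove the redundancy. Use a default 1 in front for one more bit of precision.

Smallest number: $0.00\ldots01 \times 2^{-126} = 1.0 \times 2^{-23} \times 2^{-126} = 1 \times 2^{-149}$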
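The exponent table above can be turned into a small decoder. The following is a minimal sketch, not a reference implementation: `decode_single` and `hardware_value` are hypothetical helper names, and the cross-check assumes Python's `struct` module interprets the `f` format as IEEE single precision, which is the usual case.

```python
import math
import struct

def decode_single(bits):
    """Decode a 32-bit pattern by following the exponent table (hypothetical helper)."""
    sign = -1.0 if (bits >> 31) & 1 else 1.0
    exp_field = (bits >> 23) & 0xFF   # e1..e8 read as an unsigned integer
    frac = bits & 0x7FFFFF            # s1..s23 read as an unsigned integer
    if exp_field == 255:
        # All-ones exponent: Inf when the significand bits are zero, NaN otherwise.
        return sign * math.inf if frac == 0 else math.nan
    if exp_field == 0:
        # Denormalized: 0.s1...s23 x 2^-126
        return sign * (frac / 2**23) * 2.0**-126
    # Normalized: 1.s1...s23 x 2^(exp_field - 127)
    return sign * (1 + frac / 2**23) * 2.0**(exp_field - 127)

def hardware_value(bits):
    """Reinterpret the same 32 bits as a float via struct, for comparison."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

if __name__ == "__main__":
    for pattern in (0x00000001, 0x00800000, 0x3F800000, 0x7F7FFFFF, 0x7F800000):
        print(f"{pattern:08x}", decode_single(pattern), hardware_value(pattern))
```

Every single-precision value fits exactly in a Python float (a double), so the two helpers can be compared with `==` for any non-NaN pattern.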

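## Standard: Example

| Sign | Exponent (eeeeeeee) | Significand (23 bits) | Value |
|---|---|---|---|
| 0 | 00000000 | 00000000000000000000000 | $0.00\ldots00 \times 2^{-126}$ |
| 1 | 00000000 | 00000000000000000000000 | $-0.00\ldots00 \times 2^{-126}$ |
| 0 | 00000000 | 00000000000000000000001 | $0.00\ldots01 \times 2^{-126} = 1.0 \times 2^{-149}$ |
| 0 | 00000001 | 00000000000000000000000 | $1.00\ldots00 \times 2^{-126}$ (normalized minimum) |
| 0 | 00000001 | 00000000000000000000001 | $1.00\ldots01 \times 2^{-126}$ |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 0 | 01111111 | 00000000000000000000000 | $1.00\ldots00 \times 2^{0}$ |
| 0 | 01111111 | 00000000000000000000001 | $1.00\ldots01 \times 2^{0}$ |
| 0 | 10000000 | 00000000000000000000001 | $1.00\ldots01 \times 2^{1}$ |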

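| Sign | Exponent (eeeeeeee) | Significand (23 bits) | Value |
|---|---|---|---|
| 0 | 11111110 | 00000000000000000000000 | $1.00\ldots00 \times 2^{127}$ |
| 0 | 11111110 | 00000000000000000000001 | $1.00\ldots01 \times 2^{127}$ |
| 0 | 11111110 | 11111111111111111111111 | $1.11\ldots11 \times 2^{127}$ (normalized maximum) |
| 0 | 11111111 | 00000000000000000000000 | Inf |

$N_{\min} = 1.0 \times 2^{-126}$

$N_{\max} = (2 - 2^{-23}) \times 2^{127}$

## Double Floating Point

Bit layout: $\pm \;\; e_1 e_2 \ldots e_{11} \;\; s_1 s_2 \ldots s_{52}$

| Sign | Exponent (11 bits) | Significand | $x$ |
|---|---|---|---|
| 0 | 00…000 | $s_1 s_2 \ldots s_{52}$ | $x = 0.s_1 s_2 \ldots s_{52} \times 2^{-1022}$ |
| 0 | 00…001 | $s_1 s_2 \ldots s_{52}$ | $x = 1.s_1 s_2 \ldots s_{52} \times 2^{-1022}$ |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 0 | 01…111 | $s_1 s_2 \ldots s_{52}$ | $x = 1.s_1 s_2 \ldots s_{52} \times 2^{0}$ |
| 0 | 10…000 | $s_1 s_2 \ldots s_{52}$ | $x = 1.s_1 s_2 \ldots s_{52} \times 2^{1}$ |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 0 | 11…110 | $s_1 s_2 \ldots s_{52}$ | $x = 1.s_1 s_2 \ldots s_{52} \times 2^{1023}$ |
| 0 | 11…111 | $s_1 s_2 \ldots s_{52}$ | $x = \text{Inf}$ if $(s_1 \ldots s_{52}) = 0$ |

## Overflow/Underflow

[Figure: number line with regions labeled Underflow, Denser, Sparser, and Overflow.]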
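The "denser" and "sparser" labels can be made concrete by computing the gap between consecutive single-precision values at a given exponent. The sketch below is illustrative only: `gap`, `N_MIN` and `N_MAX` are hypothetical names, and exact rationals from `fractions` stand in for hardware arithmetic.

```python
from fractions import Fraction

FRACTION_BITS = 23   # s1..s23
BIAS = 127           # exponent field 127 corresponds to 2^0

def gap(exp_field):
    """Distance between consecutive values for one exponent field (0..254)."""
    # Field 0 (denormalized) shares the scale 2^-126 with field 1.
    true_exp = -126 if exp_field == 0 else exp_field - BIAS
    return Fraction(2) ** (true_exp - FRACTION_BITS)

N_MIN = Fraction(2) ** -126
N_MAX = (2 - Fraction(2) ** -23) * Fraction(2) ** 127

if __name__ == "__main__":
    for field in (0, 1, 127, 254):
        print(field, float(gap(field)))
    # One more step of the largest gap beyond N_MAX reaches 2^128,
    # which has no finite single-precision encoding.
    print(N_MAX + gap(254) == Fraction(2) ** 128)
```

The gap doubles each time the exponent field increases by one, so values crowd together near $N_{\min}$ and spread out toward $N_{\max}$.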

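In the number-line figure, $N_{\min}$ marks the boundary of the underflow region and $N_{\max}$ the boundary of the overflow region.

## Addition/Multiplication

$$
\begin{aligned}
(\pm s_1 \times b^{e_1}) + (\pm s_2 \times b^{e_2}) &= \pm s \times b^{e} \\
&= \pm s_1 \times b^{e_1} + \left(\pm \frac{s_2}{b^{e_1 - e_2}}\right) \times b^{e_1} \\
&= \left(\pm s_1 \pm \frac{s_2}{b^{e_1 - e_2}}\right) \times b^{e_1}
\end{aligned}
$$

$$
(\pm s_1 \times b^{e_1}) \times (\pm s_2 \times b^{e_2}) = \pm (s_1 \times s_2) \times b^{e_1 + e_2}
$$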
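The two identities above can be exercised on (significand, exponent) pairs. This is a minimal sketch under stated assumptions: `fp_add`, `fp_mul` and `value` are hypothetical helpers, the base is fixed at $b = 2$, significands are signed exact rationals, and no normalization or rounding is performed.

```python
from fractions import Fraction

B = 2  # base b; assumed binary here

def fp_add(s1, e1, s2, e2):
    """Align the operand with the smaller exponent, then add significands."""
    if e1 < e2:
        # Swap so that e1 is the larger exponent, as in the formula.
        s1, e1, s2, e2 = s2, e2, s1, e1
    return s1 + s2 / Fraction(B) ** (e1 - e2), e1

def fp_mul(s1, e1, s2, e2):
    """Multiply significands and add exponents."""
    return s1 * s2, e1 + e2

def value(s, e):
    """Exact value of the pair (s, e), i.e. s x b^e."""
    return s * Fraction(B) ** e

if __name__ == "__main__":
    a = (Fraction(13, 8), 3)   # 1.101 (binary) x 2^3
    c = (Fraction(14, 8), 3)   # 1.110 (binary) x 2^3
    print(fp_add(*a, *c), value(*fp_add(*a, *c)))
    print(fp_mul(*a, *c), value(*fp_mul(*a, *c)))
```

The sum's significand can leave the range $[1, 2)$, which is why a real adder follows this step with normalization and rounding.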

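## Exceptions

- $a / 0 = \text{Inf}$ if $a > 0$
- $a / \text{Inf} = 0$ for finite $a \neq 0$
- $a \times 0 = 0$
- $a \times \text{Inf} = \text{Inf}$ if $a > 0$
- $a + \text{Inf} = \text{Inf}$
- $0 \times \text{Inf}$ = invalid operation (NaN)
- $0 / 0$ = invalid operation (NaN)
- $\text{Inf} - \text{Inf} = \text{NaN}$
- $\text{NaN} \;\text{op}\; a = \text{NaN}$

## Rounding Mode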

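Adder output: $c_{out}\; z_1 z_0 . z_{-1} z_{-2} \ldots z_{-l}\; G\, R\, S$

- G: guard bit
- R: round bit
- S: sticky bit, the OR of all bits below bit R

Addition example, where normalizing the result creates the need to round:

$$
\begin{aligned}
1.101 \times 2^{3} + 1.110 \times 2^{3} &= 11.011 \times 2^{3} \\
&= 1.1011 \times 2^{4}
\end{aligned}
$$

Subtraction example, where the result must be normalized:

$$
1.110 \times 2^{3} - 1.101 \times 2^{3} = 0.001 \times 2^{3} = 1.000 \times 2^{0}
$$

Subtraction example, where the aligned operand needs a guard bit:

$$
\begin{aligned}
1.101 \times 2^{3} - 1.101 \times 2^{2} &= 1.101 \times 2^{3} - 0.1101 \times 2^{3} \\
&= 0.1101 \times 2^{3} \\
&= 1.101 \times 2^{2}
\end{aligned}
$$

Rounding modes:

- Round to the nearest even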

- Toward 0: 1.10111 → 1.1011
- Toward +Inf: 1.10111 → 1.1100
- Toward -Inf: 1.10111 → 1.1011

## Conventional Rounding Error

| Value | Rounded | Error (ulp) |
|---|---|---|
| 1.10100 | 1.101 | 0 |
| 1.10101 | 1.101 | -0.25 |
| 1.10110 | 1.110 | +0.5 |
| 1.10111 | 1.110 | +0.25 |

Average error = 0.5/4 = 0.125
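The rounding modes and the average-error table can be reproduced with exact rational arithmetic. The sketch below is an illustration, not the lecture's hardware scheme: `from_binary`, `round_to_bits` and `average_error_ulps` are hypothetical helpers, "conventional" is assumed to mean that ties round up, and inputs are assumed positive as in the slides.

```python
from fractions import Fraction
import math

def from_binary(text):
    """Parse a binary string such as '1.10111' into an exact Fraction."""
    whole, frac = text.split(".")
    return Fraction(int(whole + frac, 2), 2 ** len(frac))

def round_to_bits(x, frac_bits, mode):
    """Round a Fraction x to frac_bits fractional bits in the given mode."""
    scaled = x * 2 ** frac_bits
    lo, hi = math.floor(scaled), math.ceil(scaled)
    if mode == "toward_neg_inf":
        n = lo
    elif mode == "toward_pos_inf":
        n = hi
    elif mode == "toward_zero":
        n = lo if scaled >= 0 else hi
    elif mode == "conventional":
        # Assumed meaning: add half an ulp and truncate (ties go up).
        n = math.floor(scaled + Fraction(1, 2))
    elif mode == "nearest_even":
        rem = scaled - lo
        if rem == Fraction(1, 2):
            n = lo if lo % 2 == 0 else hi   # tie: pick the even neighbor
        else:
            n = lo if rem < Fraction(1, 2) else hi
    else:
        raise ValueError(mode)
    return Fraction(n, 2 ** frac_bits)

def average_error_ulps(values, frac_bits, mode):
    """Mean of (rounded - exact), measured in units of the last kept bit."""
    errors = [(round_to_bits(v, frac_bits, mode) - v) * 2 ** frac_bits
              for v in values]
    return sum(errors) / len(errors)

if __name__ == "__main__":
    x = from_binary("1.10111")
    for mode in ("toward_zero", "toward_pos_inf", "toward_neg_inf", "nearest_even"):
        print(mode, round_to_bits(x, 4, mode))
    table = [from_binary(b) for b in ("1.10100", "1.10101", "1.10110", "1.10111")]
    print(average_error_ulps(table, 3, "conventional"))  # 1/8, i.e. 0.125
```

On these four values alone, round to nearest even gives the same results, because the single tie (1.10110) already rounds to an even neighbor. Extending the input to the eight values 1.10000 through 1.10111, which goes beyond the table, leaves the conventional average at 1/8 while the nearest-even average drops to 0: the two ties then round in opposite directions.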