Representing float number in computer

Some stupid results always occurred in calculating the value of two float numbers. For example, in JavaScript, 0.1+0.2=0.30000000000000004. This wild result is the feature of the float number while representing in computer, which is not a bug. Storing float number in computer has to encode and decode, the algorithm is not the same as integer. IEEE754 is wildly used in computer, which use a formular to encode and decode float number.

Sign * Exponent * Fraction 

We can apply this formular to convert a decimal number 3.14 to:
(-1) * 10-2 * 314

Now, we can fill these three parts into a 32bits binary container(the number 10 can be ignored with convention).

[1bit][8bits][23bits]
[-1]  [2]    [314]

The same idea applies in IEEE754, but it has a little bit difference. IEEE754 based on power 2 and the fraction part must convert to binary. The fraction number always started with 1.xxxx, and it multiply the Exponent to get the decimal. The formular can be applied to:
Sign * 2n * (1+Fraction)

First of all, we should divide 3.14 by 2, to get the Exponent.
3.14 = 3.14 / 2 = 1.57 * 21

Next, we can apply the formular to:
(-1) * 21 * (1+0.57)
For convention, we don't need to store the base number 2 and the integer 1 in the fraction. Unfortunately, 0.57 is not a binary number. The next step, we have to convert 0.57 to binary. The calculating is simple, multiply it with 2, if the result large than 1 and set 1 to binary bit, if not, set 0. We can convert 0.57 to:

0.57 * 2 = 1.14  | 1
0.14 * 2 = 0.28  | 1
0.28 * 2 = 0.56  | 0
0.56 * 2 = 1.12  | 1
0.12 * 2 = 0.24  | 0
0.24 * 2 = 0.48  | 0
0.48 * 2 = 0.96  | 0
0.96 * 2 = 1.92  | 1
.....

This processing is infinity and finally get a approximate value. We will get the result 10010001111010111000011 to fill into the fraction part.
The next step is to fill the Exponent part. IEEE754 use a bias number(127) to represent -126 ~ 127 range in a 8bits binary number. If we want to store 21, we must add bias number 127 and then convert 128 to binary 1000000.
The third part is Sign, is the same as integer, 1 means negative and 0 means positive. Finally, we can store -3.14 in a float binary as three parts as:

1     10000000   10010001111010111000011
Sign  Exponent   Fraction

Converting IEEE754 to decimal

We can apply the same formular Sign * Exponent * (1+Fraction) to convert the above binary result to decimal. The first step is to divide it to three parts.

1     10000000   10010001111010111000011
Sign  Exponent   Fraction

We can map the Sign and Exponent in a simple way, but the fraction parts that we should convert each bit to multiply 2-n. So, we can convert the fraction part to:
10010001111010111000011 = 1x2-1+0x2-2+0x2-3+1x2-4 +....+ 1x2-23 = 0.57

Finally, we can apply the formular to calculate the decimal:
(-1) * 2128-127 * (1+0.57) = 3.14

The puzzles

There are two puzzles in IEEE754:

  1. Why does IEEE754 use a bias number?
  2. Why does IEEE754 use power 2 instead of power 10?
    For converting a negative number, integer use the first bit to indicate the sign, the same method applies in IEEE754. In the exponent part, we can still apply this method, but IEEE754 use a bias number to represent the negative number. When we need to compare two float numbers, storing exponent parts like integer must decode first and then compare them. Using a bias number can avoid decoding. For example, we want to compare 3.14 and -3.14. Their binary is representing:
11000000010010001111010111000011
10111111010010001111010111000011

We can compare them directly and don't need to decode them first.

IEEE754 use power 2 trade off the time and space against the precision. Because Calculating power 2 is fast than power 10, and storing the fraction to binary can reduce the space. For example, if we want to store 0.5, storing decimal to binary must occupy 3 bits. If we convert it to binary directly, we just need to occupy one bit:

101  //decimal fraction
1    //binary fraction