How can we represent numbers with a fractional part in a computer?

Fixed point numbers

We know that 32 bits can represent 2^32 different values, which gives us about 4.3 billion distinct numbers. An intuitive idea for representing numbers with a fractional part is to fix the position of the point. For example, we can use 4 bits to encode one decimal digit (0 to 9), so 32 bits hold 8 such digits. We then take the rightmost 2 digits as the fractional part and fix the point just before them. In this way, 32 bits can represent 100 million numbers, from 0 to 999999.99 in steps of 0.01.
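The packing described above can be sketched in Python. This is only an illustration: the helper names `to_bcd32` and `from_bcd32` are hypothetical, and the value is assumed to be passed in as an integer count of hundredths so that no rounding is involved.

```python
def to_bcd32(hundredths: int) -> int:
    """Pack an 8-digit decimal number into 32 bits, 4 bits per digit."""
    if not 0 <= hundredths <= 99_999_999:
        raise ValueError("value must fit in 8 decimal digits")
    word = 0
    for ch in f"{hundredths:08d}":        # most significant digit first
        word = (word << 4) | int(ch)      # append one 4-bit digit
    return word


def from_bcd32(word: int) -> str:
    """Unpack the 8 digits and put the point before the last 2 digits."""
    digits = "".join(str((word >> shift) & 0xF) for shift in range(28, -1, -4))
    return f"{digits[:6]}.{digits[6:]}"


word = to_bcd32(12345678)   # stands for 123456.78
print(hex(word))            # 0x12345678
print(from_bcd32(word))     # 123456.78
```

Because each hexadecimal digit is also 4 bits wide, the packed word reads like the decimal number itself when printed in hex.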

This binary encoding of decimal digits is called BCD (Binary-Coded Decimal). It is most often used in places such as supermarkets and banks, where amounts are decimals that go down to the cent.

However, the drawbacks are clear:
1. Waste of bits. In theory 32 bits can represent about 4.3 billion different numbers, but in this BCD scheme they represent only 100 million.
2. There is no way to represent very large and very small numbers at the same time. We can't have our computers deal with only a small, fixed range of numbers.
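A quick calculation makes the first drawback concrete. The variable names are illustrative only:

```python
bcd_values = 10 ** 8       # 8 decimal digits, 4 bits each
binary_values = 2 ** 32    # every possible 32-bit pattern

print(bcd_values, binary_values)             # 100000000 4294967296
print(f"{bcd_values / binary_values:.1%}")   # 2.3%
```

So only a small fraction of the available bit patterns ever stands for a number in this scheme.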

Floating point numbers

V = (-1)^s * M * 2^E

where
1. (-1)^s gives the sign, with s being the sign bit: when s=0, V is positive; when s=1, V is negative.
2. M is the significand (also called the mantissa), greater than or equal to 1 and less than 2.
3. 2^E is the power-of-two scale factor, with E being the exponent.
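To see the three parts for a concrete value, here is a minimal Python sketch. The function name `decompose` is hypothetical. It relies on the standard `math.frexp`, which returns a mantissa in [0.5, 1), so the result is rescaled to match the 1 <= M < 2 convention used here. Zero, infinities, and NaN are deliberately not handled.

```python
import math


def decompose(v: float):
    """Return (s, M, E) such that v == (-1)**s * M * 2**E with 1 <= M < 2."""
    if v == 0 or math.isinf(v) or math.isnan(v):
        raise ValueError("only finite, nonzero values are handled in this sketch")
    s = 1 if v < 0 else 0
    m, e = math.frexp(abs(v))    # abs(v) == m * 2**e with 0.5 <= m < 1
    return s, m * 2, e - 1       # rescale so that 1 <= M < 2


print(decompose(5.0))     # (0, 1.25, 2)   since 5 = 1.25 * 2^2
print(decompose(-0.75))   # (1, 1.5, -1)   since -0.75 = -1.5 * 2^-1
```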

IEEE 754 specifies that for 32-bit floating-point numbers, the highest bit is the sign bit s, the next 8 bits hold the exponent E, and the remaining 23 bits hold the significand M.

For 64-bit floating-point numbers, the highest bit is the sign bit s, the next 11 bits hold the exponent E, and the remaining 52 bits hold the significand M.
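The two layouts can be inspected with Python's standard `struct` module. The sketch below only splits the bit pattern into its three fields and prints the raw stored integers, which are not yet M and E themselves. The helper names `fields32` and `fields64` are hypothetical.

```python
import struct


def fields32(v: float):
    """Split a 32-bit float into (sign, 8-bit exponent field, 23-bit fraction field)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", v))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def fields64(v: float):
    """Split a 64-bit float into (sign, 11-bit exponent field, 52-bit fraction field)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", v))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


print(fields32(-0.75))   # (1, 126, 4194304)
print(fields64(-0.75))   # (1, 1022, 2251799813685248)
```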

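Note that IEEE 754 has some special rules for the significand M and the exponent E. Since 1 <= M < 2, the leading 1 of M is not stored, and only its fractional part occupies the 23 (or 52) bits. The exponent E is stored as an unsigned integer with a bias added: 127 for 32-bit numbers and 1023 for 64-bit numbers.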
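As a hedged sketch, a normal 32-bit value can be decoded by adding back the implicit leading 1 of M and subtracting the exponent bias of 127 that IEEE 754 uses for this format. The function name `decode32` is hypothetical. The special exponent patterns (all zeros and all ones, used for zero, subnormals, infinities, and NaN) are rejected rather than handled.

```python
import struct


def decode32(v: float) -> float:
    """Rebuild a normal 32-bit float from its stored fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", v))
    s = bits >> 31
    exp_field = (bits >> 23) & 0xFF
    frac_field = bits & 0x7FFFFF
    if exp_field in (0, 255):
        raise ValueError("zero, subnormals, infinities and NaN are not handled here")
    M = 1 + frac_field / 2**23   # restore the implicit leading 1
    E = exp_field - 127          # remove the bias
    return (-1) ** s * M * 2.0 ** E


print(decode32(-0.75))   # -0.75
print(decode32(6.5))     # 6.5
```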

Bfloat16

The bfloat16 format was developed by Google Brain, an artificial intelligence research group at Google. It is a truncated (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing.

The main difference is that bfloat16 keeps the 8 exponent bits of binary32 but supports only an 8-bit significand (7 stored bits plus the implicit leading bit) rather than the 24-bit significand of the binary32 format.
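Since bfloat16 is the upper half of a binary32 bit pattern, a conversion can be sketched by dropping the low 16 bits. This is an illustration under a stated assumption: it truncates, whereas real conversions may round instead. The helper names are hypothetical.

```python
import struct


def to_bfloat16_bits(v: float) -> int:
    """Keep the top 16 bits of the binary32 pattern (assumption: plain truncation)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", v))
    return bits >> 16


def from_bfloat16_bits(b: int) -> float:
    """Widen back to binary32 by filling the dropped low 16 bits with zeros."""
    (v,) = struct.unpack(">f", struct.pack(">I", b << 16))
    return v


b = to_bfloat16_bits(3.14159)
print(hex(b))                  # 0x4049
print(from_bfloat16_bits(b))   # 3.140625
```

The round trip shows the cost of the shorter significand: the exponent survives unchanged, but only two or three decimal digits of precision remain.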

Reference

Wiki: Computer number format\
Wiki: bfloat16