What does MATLAB actually do when switching precisions?

Suppose you have a single precision number a = randn('single') and you increase its precision to double with b = double(a). In other programming languages, the conversion does the following three things to the binary representation of a:
  1. Sign Bit: Remains the same.
  2. Exponent Bits: Extended to fill the larger space. In single precision, 8 bits are used for the exponent, while in double precision, 11 bits are used. The value of the exponent is adjusted accordingly to maintain the numerical value of the original number.
  3. Mantissa Bits: The mantissa is widened by padding it with zero bits at the end.
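The three steps above can be sketched by hand with integer bit operations. This is only an illustration, assuming a normal (not zero, subnormal, Inf or NaN) input; the variable names are made up for this sketch, and the test value is built from the bit pattern of a that appears later in this thread (hex BF2625C4) so that it does not depend on the random number generator.

```matlab
% Build the single from its bit pattern (hex BF2625C4).
a = typecast(uint32(hex2dec('BF2625C4')), 'single');
u = typecast(a, 'uint32');                      % raw 32 bits of a

% Split the single into its three fields.
s = bitshift(u, -31);                           % 1 sign bit
e = bitand(bitshift(u, -23), uint32(255));      % 8 exponent bits (bias 127)
m = bitand(u, uint32(2^23 - 1));                % 23 mantissa bits

% Re-bias the exponent: subtract 127, add 1023, i.e. add 896.
% (Adding 896 directly avoids saturation of unsigned arithmetic.)
e64 = uint64(e) + uint64(896);

% Assemble the 64-bit pattern: sign | 11 exponent bits | mantissa + 29 zeros.
w = bitor(bitor(bitshift(uint64(s), 63), bitshift(e64, 52)), ...
          bitshift(uint64(m), 29));
b_manual = typecast(w, 'double');

disp(isequal(b_manual, double(a)))              % compare with the built-in conversion
```

If the sketch is right, the manual result and double(a) hold exactly the same value, which is what the three steps claim.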
However, consider the following example in MATLAB:
rng(1); a = randn('single'); b = double(a); format long; disp(a); disp(b);
This gives
-0.6490138
-0.649013757705688
Now, let us use Julia (I could not find an easy way to convert from decimal to binary in MATLAB) to convert these numbers into their binary representations:
a = convert(Float32,-0.6490138);
b = convert(Float64,-0.649013757705688);
c = bitstring(a); println(a); println(c);
d = bitstring(b); println(b); println(d);
Then the result is
-0.6490138
10111111001001100010010111000101
-0.649013757705688
0100110001001011100001111111111111111111111111111100
Clearly, it does not simply add zeros at the end of the binary representation.
So what exactly does MATLAB do in this situation? It is clearly not inserting random bits, since the extended number remains the same if I change the random number generator's seed.

Accepted Answer

Matt J
Matt J on 10 Mar 2024
Edited: Matt J on 10 Mar 2024
I don't know Julia, but I don't believe the binary decomposition was done correctly. Below is the way to do the binary decomposition in Matlab. As you can see, the mantissa of b is simply a zero-padding of the mantissa of a.
rng(1); a = randn('single'); b = double(a);
[ea,ma]=decomp(a);
[eb,mb]=decomp(b);
ma,mb %mantissas
ma = '01001100010010111000100'
mb = '0100110001001011100010000000000000000000000000000000'
ea,eb %exponents
ea = '01111110'
eb = '01111111110'
The exponent of b is not a simple zero-padding of the exponent of a, because the exponent representation for single floats has an offset of 127, whereas for doubles the offset is 1023. However, we can easily verify that the exponents represent the same thing by accounting for these offsets:
bin2dec(ea)-127 == bin2dec(eb)-1023
ans = logical
1
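function [e,m]=decomp(q)
% Split a float into its exponent and mantissa bit strings.
switch class(q)
    case 'single'
        s = dec2bin(typecast(q,'uint32'),32); % pad to 32 bits so leading zeros (positive q) are kept
        e = s(2:9);    % 8 exponent bits
        m = s(10:end); % 23 mantissa bits
    case 'double'
        s = dec2bin(typecast(q,'uint64'),64); % pad to 64 bits
        e = s(2:12);   % 11 exponent bits
        m = s(13:end); % 52 mantissa bits
end
end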
  2 Comments
Clement
Clement on 10 Mar 2024
Thanks for your reply!!! This makes more sense! Probably I shall never change language during programming.
May I ask how you come up with using s = dec2bin(typecast(q,'uint64')); to transform a decimal representation to binary representation? I can't see easily how uint64 (integers) and binary64 (floating point numbers) are connected. Could you please provide some relevant reading?
Matt J
Matt J on 10 Mar 2024
Thanks for your reply!!!
You are quite welcome. If this answers your question, though, please Accept-click the answer.
May I ask how you come up with using s = dec2bin(typecast(q,'uint64')); to transform a decimal representation to binary representation?
You can find documentation for all Matlab commands online. typecast will convert the variable to a different data type, but with the same bit representation. By using it to convert to an integer type, we can then use dec2bin to obtain the bit string.
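To make the difference between converting a value and reinterpreting its bits concrete, here is a small sketch using single(1), whose IEEE bit pattern is hex 3F800000. The variable names are chosen only for this illustration.

```matlab
x = single(1);

v = uint32(x);               % value conversion: the number 1 stored as an integer
t = typecast(x, 'uint32');   % reinterpretation: the same 32 bits read as an integer

disp(v)                      % 1
disp(t)                      % 1065353216, i.e. hex 3F800000
disp(dec2bin(t, 32))         % 00111111100000000000000000000000 (sign, exponent, mantissa)

% typecast is reversible: reading the bits back as single recovers x.
disp(isequal(typecast(t, 'single'), x))
```

The width argument of dec2bin keeps the leading zero (the sign bit of a positive number), so the three fields can be read off at fixed positions.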

More Answers (1)

James Tursa
James Tursa on 10 Mar 2024
Edited: James Tursa on 10 Mar 2024
When converting from IEEE single to double, the sign bit is retained, the exponent bits are adjusted to match the same exponent value for a wider field and different bias, and the significand is 0 padded. That is the closest double to the single in question, and in fact it represents exactly the same value in binary and decimal. The decimal differences you are apparently seeing in the conversion are only a MATLAB display issue, nothing else. E.g., taking your example:
>> rng(1); a = randn('single'); b = double(a); format long; disp(a); disp(b);
-0.6490138
-0.649013757705688
>> [' ' dec2bin(hex2dec(num2hex(a)),32)] % front shifted so significand bits line up
ans =
' 10111111001001100010010111000100'
>> dec2bin(hex2dec(num2hex(b)),64)
ans =
'1011111111100100110001001011100010000000000000000000000000000000'
You can clearly see that 0 bits were simply padded in the double binary representation. The exponent bits don't match exactly, but that is to be expected because a different bit field width and bias are used for double vs single precision. But why do the values look different when displayed? Let's look at the exact conversion from binary floating point to decimal for these two cases:
>> fprintf('%40.25f\n',a)
-0.6490137577056884765625000
>> fprintf('%40.25f\n',b)
-0.6490137577056884765625000
Exactly the same. Again, the fact that the single display in MATLAB ends with an 8 and the double ends with a 75... is just a display issue because MATLAB stops the single display at a certain number of digits and rounds it for display, but for the double it prints more digits so the rounding doesn't occur until later. Both values are rounded for display, but at different points because of the variable type. Regardless, the underlying values of the single and double precision variables are exactly the same.
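A consequence of this display rounding is that typing the displayed digits back in does not, in general, reproduce the stored value. That would explain the bit strings in the question, which were built from the rounded digits rather than from the variables themselves. The sketch below illustrates this under that assumption; a is rebuilt from its bit pattern (hex BF2625C4) so the example does not depend on the random number generator, and the variable names are made up for the illustration.

```matlab
a = typecast(uint32(hex2dec('BF2625C4')), 'single');
b = double(a);                            % exact widening

% Re-entering the rounded digits shown by disp gives nearby but different numbers.
a_retyped = single(-0.6490138);
b_retyped = -0.649013757705688;

fprintf('%s  %s\n', num2hex(a), num2hex(a_retyped));
fprintf('%s  %s\n', num2hex(b), num2hex(b_retyped));
disp(a == a_retyped)
disp(b == b_retyped)

% Printing enough significant digits (9 for single, 17 for double)
% gives decimal text that converts back to the same stored value.
s9  = sprintf('%.9g',  a);
s17 = sprintf('%.17g', b);
disp(single(str2double(s9)) == a)
disp(str2double(s17) == b)
```

When moving values between languages, passing the hex string from num2hex (or enough decimal digits, as above) avoids this kind of mismatch.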
I can't imagine this happens any differently for IEEE in any other programming language because I would wager they all use the same or equivalent CPU op codes for the conversion in the background ... i.e. the CPU itself is doing the conversion as opposed to the language using custom code for the conversion. If you find such an example and can clearly demonstrate it is not simply a display issue please post it here.
I can think of a non-IEEE example that would not match, and that is converting from a 64-bit VAX D-FLOAT to a 64-bit VAX G-FLOAT. The 64-bit D-FLOAT is a hybrid format that has the same exponent bits as a single precision 32-bit F-FLOAT variable, just with 32 extra bits of significand. It is essentially a double precision type with the range of a single. But the 64-bit G-FLOAT type has a wider exponent bit field (with a range more like an IEEE double) and consequently has fewer significand bits than a 64-bit D-FLOAT has. So when converting from D-FLOAT to G-FLOAT there will be some rounding and precision loss in the significand bits, even though they are both double precision.

Release

R2023b
