Another unfortunate fact is that the maximum int (signed and unsigned) is also not exactly representable as a float. This means you cannot write a clamped ftoi conversion purely in floating point (because the clamp bound itself is not representable). This is why WebGPU (WGSL) does not fully saturate on ftoi:
https://www.w3.org/TR/WGSL/#floating-point-conversion
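To see the problem concretely, here's a small C sketch (just an illustration, not any particular implementation's clamp): INT_MAX rounds up to 2^31 when converted to float, so a float-only clamp has to bound against the largest float below 2^31 instead.

    #include <limits.h>
    #include <stdio.h>

    int main(void) {
        float f = (float)INT_MAX;   /* 2147483647 rounds up to 2147483648.0f */
        printf("%.1f\n", f);        /* prints 2147483648.0 */
        /* The largest float below 2^31 is 2^31 - 128, so a clamp written purely
           in float has to use that (or check the range some other way) rather
           than INT_MAX itself. */
        printf("%.1f\n", 2147483520.0f);
        return 0;
    }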
It is a shame we tend to teach floats as the computer version of reals. Thinking of them as "scientific numbers" really helps a ton with this.
I want to make a Birds Aren't Real[0] style t-shirt that says "Floats Aren't Real"
[0]: https://en.wikipedia.org/wiki/Birds_Aren%27t_Real
I would buy this T-shirt. I would also take "Floats Aren't Normal"
True, but we also have to be careful about teaching ints as the computer version of integers.
Unsigned ints are the non-negative integers mod 2^n.
Signed ints behave like the integers in some tiny subset of representable values. Maybe it's something like the interval (-sqrt(INT_MAX), sqrt(INT_MAX)).
Signed ints are also the integers mod 2^n. The beauty of modular arithmetics is that it's all equivalent. At least for all the operations that work in modular arithmetics in the first place. They just have different canonical representatives for their respective equivalence classes, which are used for the operations that don't work in modular arithmetics (like divisions, comparisons or conversions to string with a sign character).
Not in C. In C signed integer overflow is undefined behaviour that may or may not be compiled to the equivalent of mod arithmetic depending on the whims of the compiler.
C oddities should be relegated to a footnote, not define what computer science is.
That's one way of looking at them. You can also look at the signed integers as bounded 2-adic numbers.
"Bounded 2-adic integers" would only make sense if you were bounding the 2-adic norm. Integers mod 2^n would be closer to "approximate fixed-point 2-adic integers".
(Alas, most languages don't expose a convenient multiplicative inverse for their integer types, and it's a PITA to write a good implementation of the extended Euclidean algorithm every time.)
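For what it's worth, when the modulus is a power of two you can dodge extended Euclid entirely: a few Newton iterations invert any odd value. A C sketch (the function name is mine):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Multiplicative inverse of an odd value modulo 2^32. Each Newton step
       x = x * (2 - a*x) doubles the number of correct low bits, and x = a is
       already correct to 3 bits because a*a == 1 (mod 8) for any odd a. */
    static uint32_t inv_mod_2_32(uint32_t a) {
        assert(a & 1);            /* even values have no inverse mod 2^32 */
        uint32_t x = a;
        for (int i = 0; i < 4; i++)
            x *= 2 - a * x;       /* 3 -> 6 -> 12 -> 24 -> 48 correct bits */
        return x;
    }

    int main(void) {
        uint32_t inv = inv_mod_2_32(7u);
        printf("%u %u\n", inv, 7u * inv);   /* the product wraps around to exactly 1 */
        return 0;
    }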
wait, why?
Proper integers aren’t bounded. Computer ints are.
Unbounded integer types exist, which have infinite precision in theory and are only limited by available memory in practice.
You can make the argument that "proper" integers are also bounded in practice by limitations of our universe :)
Unbounded integer types aren't ints.
The important point is that the arithmetic operators on int perform modulo arithmetics, not the normal arithmetics you would expect on unbounded integers. This is often not explained when first teaching ints.
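A concrete C example of that wraparound, sticking to unsigned because signed overflow is undefined behaviour in C (as noted above):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t a = 4000000000u;
        uint32_t b = 1000000000u;
        /* Mathematically 5,000,000,000, but uint32_t arithmetic is mod 2^32,
           so the sum wraps to 5,000,000,000 - 4,294,967,296 = 705,032,704. */
        printf("%u\n", a + b);
        return 0;
    }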
In many languages (e.g. Python, so not even obscure ones), ints are unbounded.
That’s not the notion of ints the article, nor GP by “computer ints”, was referring to. Python is rather atypical in its nomenclature here. Arbitrary-precision integers are generally called “integer” or something like “bigint”.
Are they even reals? Math classes were a while ago at this point, but I'm fairly convinced they're just rationals. Not trying to be pedantic, just wondering.
I think the parent comment is saying it's confusing to associate floats with decimals like 0.123.
Instead it's more accurate to think of them as being in scientific notation like 1.23E-1.
In this notation it's clearer that they're sparsely populated because some of the 32 bits encode the exponent, which grows and shrinks very quickly.
But yes rationals are reals. It's clear that you can't represent, say, all digits of pi in 32 bits, so the parent comment was not saying that 32 bit floats are all of the reals.
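If it helps, here's a minimal C sketch that pulls the three fields out of a float32, to make the scientific-notation reading concrete (1 sign bit, 8 exponent bits, 23 fraction bits):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float f = 0.123f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);              /* reinterpret the 32 bits */
        unsigned sign     = bits >> 31;
        int      exponent = (int)((bits >> 23) & 0xFF) - 127;   /* remove the bias */
        unsigned fraction = bits & 0x7FFFFF;         /* the 23 stored mantissa bits */
        printf("sign=%u exponent=%d fraction=0x%06x\n", sign, exponent, fraction);
        /* i.e. 0.123f is stored as (1 + fraction/2^23) * 2^exponent, roughly 1.968 * 2^-4 */
        return 0;
    }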
Yeah, that's fair. Personally, I like to think of them as a log compressed way of expressing fractional values, like how one would record in log with a camera to capture rich dark scenes while maintaining highlight detail. I think the bigger problem with floats is that the types and operations around them are pretty loose and permissive, although maybe I just don't appreciate how well the usual compromises work. Was pretty cool to dig into arbitrary precision math libraries a while back though, found some fun stuff in there. Also found out that my Android phone's Calculator app is not calculating in base 10, unlike Windows' Calculator...
Which is what scientific notation is.
Fair, I was largely riffing off the theme in the post. That is, I took it as an implicit idea that floats represent real numbers.
That said, I'd argue that they are neither reals nor rationals. They are scientific numbers with no syntax to indicate "repeated" tails. It would be like asking someone to represent 1/3 using 6 digits with no bar notation. The best you can do is "0.33333", and that just isn't the same. Move to 6 digits with 2 of them being the exponent, and you are stuck with "3.333e-01", which is just different still.
Yeah, there's a similar comment in another subthread a bit below.
The way I like to frame it, and this will almost certainly not be mathematically rigorous, is that every number a float (as in, the formats defined in IEEE-754) can hold can be sufficiently described as at most a rational, though they cannot represent all rationals, and cannot represent anything "higher" than rationals. That's why I prefer to say they're representing rationals. It's in the sense that they're all representing some rational, not any arbitrary rational.
To do that would require infinite space. E.g. for one third, you'd need infinite binary digits (or even infinite decimal ones, as you say). It's just not how IEEE-754 floats work, as you mention.
This launched me into a research on "perfect numerical accuracy" a while back, and there I did find schemes where you store the numerator and the denominator separately, freeing you from this specific problem. It's a fun topic.
Ah, I should have looked at more of the responses. This amusingly blew up on me well after I stopped watching. :D
I don't tend to give much thought to real versus rational. Such that I still find it not that surprising to see people treat floats as reals. Especially since most languages make it hard to represent anything else. To that end, it will take me a bit to really internalize what you are saying here. I think I understand it.
That said, my favorite for the craziness of "old is new" is that the literal "1/3" works to represent a rational in Common Lisp. Indeed, I had originally thought that they had some specific constants for common fractions defined. Nope, they just support rational literals.
And I question if you need infinite space to represent repeated decimals. Strictly, you just need a way to indicate the repeating. No?
> Strictly, you just need a way to indicate the repeating. No?
It depends on your data structure, yes. You can also do the separate numerator / denominator thing, and I'm sure there are other ways too. Just that naively if you try representing it, that's when you need infinite space. Or if you do it the way IEEE-754 formats do.
Apologies, almost missed this response.
Agreed that there are other ways, almost certainly. I was going with the assumption that we wanted to store positional values for that question.
That is, it is only when you assume that you have to write out all decimal digits that you get people writing the silly things that computers do all of the time. "0.66666...7" is an easy example. The irony, to me, is that in written assignments you likely would have gotten points knocked off for not writing 0.(6), where () means an overbar. You definitely would have lost points for rounding at the end.
Repeating decimal fractions are just rationals with 9s in the denominator: .123123123... = 123/999. Fourth grade long division.
I'm not sure what you mean? 1/7 repeats, as well. As does 1/14. As do... a lot of fractions. https://projecteuler.net/problem=26 was a fun problem to explore for some that I remember. Pretty sure there were other problems on this general theme, as well.
My point was strictly that we have "bar notation" in writing to show that 1.3 is not the same as 1.33 or 1.333 or 4/3. No matter how many 3s you put at the end. I don't know of any similar scheme in computers. I'm assuming it has been tried.
1/7 is 142857/999999 from which representation you can readily infer that it is 0.142857 with a bar over it. The 9s are verbose but mean the same thing as a bar over as many fractional digits.
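Spelling out the step that makes the 9s appear, with a trivial C check of the arithmetic:

    #include <stdio.h>

    int main(void) {
        /* 0.142857 142857 ... = 142857 * (10^-6 + 10^-12 + ...)
                               = 142857 / (10^6 - 1)
                               = 142857 / 999999                  */
        printf("%d\n", 142857 * 7 == 999999);   /* 1: the fraction reduces to 1/7 */
        return 0;
    }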
Ah, I see your point, I think. You are just pointing out that any repeating decimal is a ratio. Which, fair, but my question was if a bar representation was ever tried. The idea being that you don't need to have infinite storage to represent a ratio in positional digits any more than you have to use infinite paper to write a ratio's value out.
That is, yes, I know that 1/6 can be used to represent 0.1(6), but if you are already storing something in positional digits, there may have been a benefit to keeping it in positional digits? I'm assuming there was not, in fact, any benefit?
Floats [if you ignore -0.0, infinities, and NaNs] are a subset of the rationals, themselves a subset of the real numbers.
It's generally accurate to consider floats an acceptable approximation of the [extended] reals, since it's possible to do operations on them that don't exist for rational numbers, like sqrt or exp.
> since it's possible to do operations on them that don't exist for rational numbers, like sqrt or exp
This kinda sent me on a spin, for a moment I thought my whole life was a lie and these functions don't take rationals as inputs somehow. Then I realized you mean rather that they typically produce non-rationals, so the outputs will be approximated.
They aren’t reals. They aren’t continuous and are bounded.
And the operators +, -, *, and / lack some of the properties of addition, subtraction, multiplication, and division.
Tom7 has a good video about this: https://www.youtube.com/watch?v=5TFDG-y-EHs
Yes, all floating-point numbers are rational. (It's also true that they are all reals, but I get your point.)
They aren’t pure rational either. They are a subset of rational numbers.
I think that's more of how one frames it, no? Like you won't be able to store any arbitrary rational in a float, as you'd need arbitrarily large storage for that. But all the numbers a float can store are rationals (so excluding all the fanciful IEEE-754 features of course).
It's not so much the need for arbitrary storage, the problem is that even easy rationals can't be expressed in the IEEE floats
Take realistic::Rational::fraction(1, 3), i.e. one third. Floats can't represent that, but we don't need a whole lot of space for it, we're just storing the numerator and denominator.
If we say we actually want f64, the 8 byte IEEE float, we get only a weak approximation, 6004799503160661/18014398509481984 because 3 doesn't go neatly into any power of 2.
Edited: An earlier version of this comment provided the 32-bit fraction 11184811/33554432 instead of the 64-bit one.
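You can recover that fraction yourself. A short C sketch; the scale constant is 2^54 because the double nearest to 1/3 lies in [1/4, 1/2), so it is an integer multiple of 2^-54:

    #include <stdio.h>

    int main(void) {
        double third = 1.0 / 3.0;
        double scale = 18014398509481984.0;    /* 2^54, exactly representable */
        /* Scaling by 2^54 is exact and the result fits in 53 bits, so this
           prints the exact numerator of the double closest to 1/3. */
        printf("%.0f / %.0f\n", third * scale, scale);
        /* prints 6004799503160661 / 18014398509481984 */
        return 0;
    }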
The pitfalls of floats were taught when I was a college math major. That was in the mid 80s.
Isn’t the `real` datatype in Fortran a float64? Or am I making that up?
It was on the Cray-1, although not in the format or with the operations you know today.
But in our IEEE-754 modern world, no, it’s not.
Makes sense - thanks for clarifying!
This should be obvious. There are the same number of 32-bit integers as 32-bit floats [0], so for every float that is not an int, there exists an int that is not a float. Clearly most floats cannot be represented as integers, so the converse must be true as well.
But I still see people building systems where implicit conversion of float to int is not allowed because "it would lose precision", but that allow int to float.
[0] don't reply to me about NaNs, please
It is shocking to think about it, but a lot of programmers don’t think about bits at all, and just think of floats as reals. Or maybe if you prod them enough, they’ll treat the floats as sometimes inaccurate reals.
Because int -> float is well defined. If you cast an int to a float, you’re getting a value that represents the first six significant figures of the original integer. Always.
The precision of casting float to int, on the other hand, depends on the input.
I cringe whenever I hear floats described as "decimals" :(
I try not to be too picky about this sort of stuff, but “decimals” is really bad, haha! It is like wrong in a way that is specific, which somehow makes it even more wrong than just saying “approximately reals.”
IEEE actually has a decimal float format defined. I don’t know if any system uses it, though (my suspicion is it must be pretty niche).
Sure but at least in my experience it's rare to convert 32 bit ints to 32 bit float. Usually the conversion is 32 bit int to 64 bit float, which is always safe.
Floats have a fixed accuracy everywhere. Ints have a variable accuracy across their range. When you convert from int to float, the number stays accurate to however many digits, say 8 digits for single precision. Ints have a fixed precision everywhere, whereas for floats it varies. A given float might be accurate to one part in 10^30. When you cast to int, you get accuracy to within 1 part in 1. So you lose far more orders of magnitude of precision converting from float to int than you lose in orders of magnitude of accuracy converting from int to float.
Floats have the same accuracy everywhere when measured in relative error, ints have the same accuracy everywhere when measuring absolutely.
Which of those is better? It depends on the application. All you can do is hope the person who gave you the numbers chose the more appropriate representation, the less painful way to lose fidelity in a representation of the real world. By converting to the other representation you now have lost that fidelity in both ways.
Again, by a similar argument to above, something like this has to be true. You have exactly 32 bits of information either way, and both conversions lose the same amount of information - you end up with a 32-bit representation, but it only represents a domain of 2^27 (or whatever it is) distinct numbers.
>both conversions lose the same amount of information
This is exactly what's untrue. I said "orders of magnitude", but I could have said bits - they're analogous concepts. Ints with the high 9 bits set to zero lose no information when converting to single precision floating point. The rest lose progressively more information, concentrated in the low order bits, up to 8 bits. Certainly regrettable, but usually manageable. Converting from floats to ints might be representable exactly, but if the number is, say, near 1, then it will be rounded to 1, losing 23 bits of information. If the number is near 0 then it will be rounded to 0, losing 31 bits of information.
> When you convert from int to float, the number stays accurate to however many digits
The point of the article is that this "however many digits" actually implies rounding many numbers that aren't that big. A single precision (i.e. 32-bit) float cannot exactly represent some 32-bit integers. For example 1_234_567_891f is actually rounded to 1_234_567_936. This is because there are only 23 bits of fraction available (https://en.wikipedia.org/wiki/Single-precision_floating-poin...).
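A one-line check of that rounding, in C:

    #include <stdio.h>

    int main(void) {
        float f = 1234567891.0f;
        /* 1234567891 needs 31 significant bits, but a float keeps only 24, so it
           is rounded to the nearest multiple of 128 (the float spacing between
           2^30 and 2^31). */
        printf("%.0f\n", f);   /* prints 1234567936 */
        return 0;
    }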
You have a typo: In your last sentence you effectively wrote «from int to float» twice in contradicting ways. «To float from int than (…) from int to float».
there was an error made when I went back to edit what I wrote...
> [0] don't reply to me about NaNs, please
The question is why they added so many NaNs to the spec, instead of just one. Probably for a signal, but who actually uses that?
For IEEE float16, the number of values lost to an entire exponent value being needlessly taken up by NaNs is actually quite blatant.
You could store the address of the offending instruction/code sequence that generated the NaN (think software-only implementation).
> but that allow int to float.
And don't get me started on what happens when people do `i64 as usize` and friends.
(This is one area where Pascal has it right, including the fact that you should do loops like `for I in low(nuts)..high(nuts)`.)
P.S.: 'nuts' was the autocorrect choice that somehow is topical here, so I kept it.
> This should be obvious
Yes
They are different types
They are different things
They are related concepts, that is all
I have been writing software for decades in areas ranging from industrial control to large business systems. I almost never use floats or doubles. In almost all cases 32 bit integers (and sometimes U64int) with some scaling suffices.
Perhaps it is the FORTRAN overhang of engineering education that predisposes folks to using floats when Int32/64 would be fine.
I shudder to think why, in JavaScript, they made Number a float/double instead of an integer. I constantly struggle to coerce JS to work with integers.
This is a fairly obvious one? Although I've mainly encountered the effect with long to double conversion.
On the other hand, floating point is the gift that never stops giving.
A recent wtf I encountered was partly caused by the silent/automatic conversion/casting from a float to double. Nearly all C-style languages do this even though it's unsafe in its own way. Kinda obvious to me now and when I state it like this (using C-style syntax), it looks trivial but: (double) 0.3f is not equal to 0.3d
The wtfness of it was mostly caused by other factors (involving overloaded method/functions and parsing user input) but I realized that I had never really thought about this case before - probably because floats are less common than doubles in general - and without thinking about it, sort of assumed it should be similar to int to long conversion for example (which is safe).
Float to double conversion is safe. The thing that's not safe in your example is to write `0.3f` (or `0.3d`) in the first place. The conversions from decimal to float or double is unsafe (inexact) and gives different results from floats than for doubles. But the conversion of float to double, in itself, is always exact.
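A minimal C illustration of that: both literals round the decimal 0.3, but at different precisions, and widening the float afterwards cannot restore the bits that were already rounded away.

    #include <stdio.h>

    int main(void) {
        float  f = 0.3f;   /* 0.3 rounded to the nearest float  */
        double d = 0.3;    /* 0.3 rounded to the nearest double */
        printf("%.17f\n", (double)f);    /* 0.30000001192092896 */
        printf("%.17f\n", d);            /* 0.29999999999999999 */
        printf("%d\n", (double)f == d);  /* 0 */
        return 0;
    }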
I do not consider it safe.
The float representation of 0.3 (e.g.) does not, when cast to double, represent 0.3 - in contrast the i32 representation of any number when cast to i64 represents the same number.
This sort of thing is why you never compare floating point numbers for equality. Always compare using epsilons appropriately chosen for the circumstances.
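One common shape for such a comparison, as a C sketch; the function name and tolerance values are placeholders of mine, and choosing them "appropriately for the circumstances" is the hard part:

    #include <math.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Equal enough: within an absolute floor (for values near zero) or within
       a relative fraction of the larger magnitude. */
    static bool approx_equal(double a, double b, double abs_tol, double rel_tol) {
        double diff = fabs(a - b);
        if (diff <= abs_tol)
            return true;
        return diff <= rel_tol * fmax(fabs(a), fabs(b));
    }

    int main(void) {
        printf("%d\n", 0.1 + 0.2 == 0.3);                          /* 0 */
        printf("%d\n", approx_equal(0.1 + 0.2, 0.3, 1e-12, 1e-9)); /* 1 */
        return 0;
    }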
Floats are Sneaky and Not to be Trusted. :D
Always... except when you're actually writing floating point operations. If I'm implementing `sqrt`, I'd better make sure the result is exactly the expected one. Epsilon is 0 in all my unit tests. ;)
This is true! As an implementor of floating point operations, you want to be exactly correct for all values (glares at Intel :P ) However, as a consumer of said floating point operations, I'm still gonna treat everything as Sneaky And Not To Be Trusted. :P Especially in C++ where this attribute is largely supplied by the compiler being /r/iamverysmart about this sort of thing.
... but people are in the habit of using doubles. Many languages, like Javascript, only support doubles and int32(s) do embed in doubles.
I have some notes for a fantasy computer which is maybe what would have happened if Chinese people [1] evolved something like the PDP-10 [2] Initially I was wanting a 24-bit wordsize [3] but decided on 48-bit [4] because you can fit 48 bits into a double for a Javascript implementation.
[1] There are instructions to scan UTF-8 characters and the display system supports double-wide bitmap characters that are split into halves that are indexed with 24-bit ints.
[2] It's a load-store architecture but there are instructions to fetch and write 0<n<48 bits out of a word even overlapping two words, which makes [1] possible; maybe that write part is a little unphysical
[3] I can't get over how a possible 24-bit generation didn't quite materialize in the 1980s, and find the eZ80 evokes a kind of nostalgia for an alternate history
[4] In the backstory, it started with a 24-bit address space like the 360 but got extended to have "wide pointers" qualified by an address space identifier (instead of the paging-oriented architecture the industry really took), as well as "deep pointers" which specify a bitmap. 48 bits is enough for a pointer to be deep and wide and have some tag bits. Address spaces can merge together contiguously or not depending on what you put in the address space table.
"I can't get over how a possible 24-bit generation didn't quite materialize in the 1980s, and find the eZ80 evokes a kind of nostalgia for an alternate history"
Well... it depends on how you look at it.
While the marketers tried to cleanly delineate generations into 8- and 16- and 32-bit eras, the reality was always messier. What exactly the "bits" were that were being measured was not consistent. The size of a machine word in the CPU was most common, and perhaps in some sense objectively the cleanest, but the number of bits of the memory bus started to sneak in at times (like the "64-bit" Atari Jaguar with the 32-bit CPU, because one particular component was 64 bits wide). In reality the progress was always more incremental, and there are some 24-bit things: the 286 can use 24 bits to access memory, and a lot of "32-bit graphics" is really 24 bits because 8 bits each for RGB gets you to 24. The lack of a "24-bit generation" is arguably more about the marketing rhetoric than a lack of things that were indeed based around 24 bits in some way.
Even today our "64-bit CPUs" are a lot messier than meets the eye. As far as I know, they can't actually address a full 64-bit address space; some of the higher bits are reserved. And depending on which extensions you have, modern CPUs may be able to chew on up to 512 bits at a time with a single instruction, and I could well believe someone snuck in something that can chew on 1024 bits without me noticing.
In the world of mainframes and minicomputers there were several 36-bit machines. They were chosen because you could pack six 6-bit char codes into one word. Yes, back then ASCII was effectively just the uppercase subset. Off hand I can't recall exactly how EBCDIC and 80-column card codes were mapped.
Something I found really annoying in the Avro spec is that they automatically convert between ints/longs and floats/doubles in their backwards compatibility system. That just seemed like an unforced error to me. (Maybe it's changed in newer versions of the standard?)
https://gist.github.com/deckar01/f77d98550eaf5d9b3a954eb0343...
Here is a visualization I made recently on the density of float32. It seems that float32 is basically just PCM, which was a lossy audio compression exploiting the fact that human hearing has logarithmic sensitivity. I’m not sure why they needed the mantissa though. If you give all 31 bits to the exponent, then normalize it to +/-2^7, you get a continuous version of the same function.
So, PCM isn't the thing you meant here. PCM just means Pulse-code modulation, you're probably thinking of a specific non-linear PCM and maybe that was even the default for some particular hardware or software you used, but that's not what PCM itself means and these days almost everything uses Linear PCM.
I think G.711 PCM is what the OP meant.
Wow. G.711 is extremely obsolete. Interpreting PCM as G.711 (which is from the 1970s) is about like somebody saying "Windows" but meaning Windows 2.x, the 1980s DOS-based Microsoft GUI. I guess I don't have to feel like I'm the oldest person reading HN.
In the telco industry, "PCM" or more precisely "PCMA" and "PCMU", refers to G.711. It's still the default fallback for VoIP applications.
> It seems that float32 is basically just PCM
float has higher accuracy around 1.0 than around 2^24. This makes it quite a bit different from PCM, which is fully linear. Which is probably why floating point PCM keeps its samples primarily between -1.0 and +1.0.
> which was a lossy audio compression
It's not lossy. Your bit depth simply defines the noise floor which is the smallest difference in volume you can represent. This may result in loss of information but at even 16 bits only the most sensitive of ears could even pretend to notice.
> If you give all 31 bits to the exponent, then normalize it to +/-2^7, you get a continuous version of the same function.
You'll extend the range but lose all the precision. This is probably the opposite of what any IEEE 754 user actually wants.
Here is what a 31-bit exponent 0-bit mantissa encoding looks like compared to float32:
https://gist.github.com/deckar01/3f93802329debe116b0c3570bed...
I don't have the time to fully analyze this but my concern would be here:

    exponent *= 2 ** (8 - E)

In the E=8 case this is just `* 1`. In the E=31 case this is now `* 2**-23`. Python is going to do all of this for you in the float64 domain. I think it's possible that you haven't graphed what you intended.

You also don't have subnormals, infinities or propagating NaNs. You manage to only retain the signed 0.
EDIT: And the midpoint of your system is 0.5. Which is a little uncomfortable.
> Which is probably why floating point PCM keeps it's samples primarily between -1.0 and +1.0.
No, it's just that it's more natural/intuitive to express algorithms in a normalized range if given the possibility.
Same with floating point RGBA (like in GPUs)
Without a mantissa, way too much precision is allocated to the near zero range and not enough to the "near infinity" range. Consider that without a mantissa, the second largest float is only half of the largest float. With a 23 bit mantissa, there are 2^23 floats from half the largest to the largest.
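You can count those 2^23 floats directly, since for positive floats the ordering of values matches the ordering of their bit patterns. A quick C check:

    #include <float.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float hi = FLT_MAX, lo = FLT_MAX / 2;   /* dividing by 2 is exact here */
        uint32_t hi_bits, lo_bits;
        memcpy(&hi_bits, &hi, sizeof hi_bits);
        memcpy(&lo_bits, &lo, sizeof lo_bits);
        /* Consecutive positive floats have consecutive bit patterns, so the
           difference counts the floats between FLT_MAX/2 and FLT_MAX. */
        printf("%u\n", (unsigned)(hi_bits - lo_bits));   /* 8388608 == 2^23 */
        return 0;
    }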
You could change the scaling factor to target any bounds you want. On average the precision is equal. The mantissa just adds linear segments to a logarithmic curve.
> On average the precision is equal. The mantissa just adds linear segments to a logarithmic curve.
Yes, exactly; the linear regions are needed to more evenly distribute precision, while the average precision remains the same. Alternatively, you can omit the mantissa, but use an exponent base much closer to 1 (perhaps 1 + 2⁻²³).
Well yes. Given N bits:
- Most floats are not ints.
- There are the same number of floats as ints.
- Therefore, most ints are not floats.
Pedantically, there are fewer floats because of NaNs and the double counting of zero.
I'll take the contrary position and argue that most ints are floats, because ints are not uniformly distributed. 0, 1, 10, etc. are far more common.
Far more common than what exactly? Because if you look at the range of the exponent, you should still come to the conclusion that floats can represent more non-integer numbers than integer numbers.
True but misleading. Like one of the other commenters also mentioned, most integers are small. For 32-bit you might run into integers above 2^23 once in a while, but for 64-bit it's really not that common to have integers above 2^53, unless it's a bit pattern rather than a natural number. So you could reasonably say "most integers _are_ 64-bit floats."
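The 2^53 boundary is easy to demonstrate in C:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int64_t below = (1LL << 53) - 1;   /* every integer up to here survives */
        int64_t above = (1LL << 53) + 1;   /* first integer a double can't hold */
        printf("%d\n", (int64_t)(double)below == below);   /* 1 */
        printf("%d\n", (int64_t)(double)above == above);   /* 0: rounds to 2^53 */
        return 0;
    }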
A little-known fact is that some Intel processors use 80-bit registers to store double values, which can result in programs behaving differently depending on the optimization level, because when values are spilled to memory, like when building for debug, they are rounded.
But all int32 values can be exactly represented by float64, which LuaJIT uses extensively.