Stack Overflow Popular Questions (11): What is the difference between float and double?

MWHLS • 2021/06/05 11:17 am • C, Other, Programming Languages

If you spot any translation problems, please point them out in the comments. Thank you.

What is the difference between float and double?

VaioIsBorn asked:
- I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?
Answers:
- kennytm - vote: 561
  - Huge difference.
  - As the name implies, a double has 2x the precision of float^[1]. In general a double has 15 decimal digits of precision, while float has 7.
  - Here's how the number of digits is calculated:
    - double has 52 mantissa bits + 1 hidden bit: log(2⁵³)÷log(10) = 15.95 digits
    - float has 23 mantissa bits + 1 hidden bit: log(2²⁴)÷log(10) = 7.22 digits
  - This precision loss can let larger round-off errors accumulate when calculations are repeated, e.g.
    - float a = 1.f / 81;
      float b = 0;
      for (int i = 0; i < 729; ++i)
          b += a;
      printf("%.7g\n", b); // prints 9.000023
  - while
    - double a = 1.0 / 81;
      double b = 0;
      for (int i = 0; i < 729; ++i)
          b += a;
      printf("%.15g\n", b); // prints 8.99999999999996
  - Also, the maximum value of float is about 3.4e38, but double is about 1.8e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.
  - During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
  - Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double^[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
  - Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use math.fsum. Otherwise, try to implement the Kahan summation algorithm.
  - ^{[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most compilers and architectures (gcc, MSVC; x86, x64, ARM) float is indeed an IEEE single-precision floating point number (binary32), and double is an IEEE double-precision floating point number (binary64).}
- Gregory Pakosz - vote: 57
  - Here is what the C99 (ISO-IEC 9899 6.2.5 §10) and C++2003 (ISO-IEC 14882-2003 3.9.1 §8) standards say:
    - There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
  - The C++ standard adds:
    - The value representation of floating-point types is implementation-defined.
  - I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic that covers the IEEE floating-point standard in depth. You'll learn about the representation details and you'll realize there is a tradeoff between magnitude and precision. The precision of the floating point representation increases as the magnitude decreases, hence floating point numbers between -1 and 1 are those with the most precision.
- Alok Singhal - vote: 29
  - Given a quadratic equation: x² − 4.0000000 x + 3.9999999 = 0, the exact roots to 10 significant digits are, r₁ = 2.000316228 and r₂ = 1.999683772.
  - Using float and double, we can write a test program:
    - #include <stdio.h>
      #include <math.h>

      /* Solve a*x^2 + b*x + c = 0 with the textbook formula in double precision. */
      void dbl_solve(double a, double b, double c){
          double d = b*b - 4.0*a*c;
          double sd = sqrt(d);
          double r1 = (-b + sd) / (2.0*a);
          double r2 = (-b - sd) / (2.0*a);
          printf("%.5f\t%.5f\n", r1, r2);
      }

      /* Same formula in single precision. */
      void flt_solve(float a, float b, float c){
          float d = b*b - 4.0f*a*c;
          float sd = sqrtf(d);
          float r1 = (-b + sd) / (2.0f*a);
          float r2 = (-b - sd) / (2.0f*a);
          printf("%.5f\t%.5f\n", r1, r2);
      }

      int main(void){
          float fa = 1.0f;
          float fb = -4.0000000f;
          float fc = 3.9999999f;
          double da = 1.0;
          double db = -4.0000000;
          double dc = 3.9999999;
          flt_solve(fa, fb, fc);
          dbl_solve(da, db, dc);
          return 0;
      }
  - Running the program gives me:
    - 2.00000 2.00000
      2.00032 1.99968
  - Note that the numbers aren't large, but still you get cancellation effects using float.
  - (In fact, the above is not the best way of solving quadratic equations using either single- or double-precision floating-point numbers, but the answer remains unchanged even if one uses a more stable method.)