In the first part of this series we looked at algorithms which guaranteed their order of operations vs. algorithms that don’t. On further reflection, there are two things missing from that article.

First of all, I should have included std::iota in the list of “order guaranteed” algorithms. I overlooked this partly because std::iota doesn’t take an input range, but also because the language used in the standard isn’t entirely clear. I am reasonably confident though that operator++ must be applied starting with the first value output, and continuing in sequence. (As always, if someone has some evidence I am wrong about this please tell me).

This raises the number of algorithms that guarantee ordering to 14.

Secondly, we didn’t look at “why”. Why should some algorithms guarantee their operation order while others don’t? Let’s look at the 14 algorithms in groups, starting with std::copy and std::move:

OutputIterator | copy | (InputIterator, InputIterator, OutputIterator) |

BidirectionalIterator2 | copy_backward | (BidirectionalIterator1, BidirectionalIterator1, BidirectionalIterator2) |

OutputIterator | move | (InputIterator, InputIterator, OutputIterator) |

BidirectionalIterator2 | move_backward | (BidirectionalIterator1, BidirectionalIterator1, BidirectionalIterator2) |

Copy and move have specific forward and backward versions. This is to handle overlapping input and output ranges. In order for copying or moving between overlapping ranges to work as expected, the order must be specified. The order will be different depending on whether the destination range overlaps the beginning or the end of the source range – hence the forward and backward versions.

(There was an issue raised in 1995 about the direction of std::copy. Read it here.)

Moving on, the next algorithm is std::for_each:

Function | for_each | (InputIterator, InputIterator, Function) |

As Kreft and Langer point out here, std::for_each was unusual among the algorithms because it allows its function object to have side effects. I say ** was** because the C++98 standard includes the phrase –

*op and binary_op shall not have any side effects*– in the description of std::transform. The C++11 standard weakens that requirement.

Regardless of the changes to the wording of std::transform, once your function object can have side effects, it is possible for those side effects to vary depending on the order in which the range is processed. To get deterministic results you need a deterministic processing order, hence (at least historically) the order requirement for std::for_each.

That leaves us with 9 algorithms to explain:

T | accumulate | (InputIterator, InputIterator, T) |

T | accumulate | (InputIterator, InputIterator, T, BinaryOperation) |

OutputIterator | adjacent_difference | (InputIterator, InputIterator, OutputIterator) |

OutputIterator | adjacent_difference | (InputIterator, InputIterator, OutputIterator, BinaryOperation) |

T | inner_product | (InputIterator1, InputIterator1, InputIterator2, T) |

T | inner_product | (InputIterator1, InputIterator1, InputIterator2, T, BinaryOperation1, BinaryOperation2) |

void | iota | (ForwardIterator, ForwardIterator, T) |

OutputIterator | partial_sum | (InputIterator, InputIterator, OutputIterator) |

OutputIterator | partial_sum | (InputIterator, InputIterator, OutputIterator, BinaryOperation) |

What do these algorithms have in common?

- They are all in the header numeric. In fact, they are the entire contents of the header numeric.
- They all involve combining elements of the input range in some way (or the output range for std::iota).

So we have “in order”, “numeric”, and “combining”. “Numeric” leads us to think of floating point values, and the combination of “in order” and “floating point” leads us to the fact that floating point arithmetic is neither associative nor commutative. [Edit: See John Payson’s comment where he points out that IEEE-754 floating point *is* commutative for addition and multiplication.] Running std::accumulate from front to back does not necessarily give us the same answer as running it from back to front. Once again, if we want a deterministic answer we need a deterministic processing order, hence the “in order” requirement for these algorithms.

As usual, if the standard algorithms do not supply us with what we want, we can write our own.

The C++11 standard also makes it clear that the compiler is limited in its rearrangement of operators:

[ Note: Operators can be regrouped according to the usual mathematical rules only where the operators really are associative or commutative.

**C++11 standard** *1.9*

In case anyone thinks that floating point commutativity [Edit: I should have said associative here] only matters for very large, very small or very accurate calculations, it doesn’t. One of my home projects is a music typesetting program. It lays out music onto a typical sheet of paper (A4 or 8.5 x 11) and it doesn’t need micron accuracy, however I still got caught by floating point operations.

At one point I had two functions (A and B) that computed what was nominally the same value, but in two different ways. I sorted a vector using a comparison operator that called function A. I then attempted to do a binary search on the vector using a comparison operator that called function B. Fortunately the debug version of the STL I was using detected attempts to use a binary search on a vector that was not sorted – the vector was sorted according to A, it was not sorted according to B. Yet another reason why “don’t repeat yourself” is such a good rule.

In part 1 I mentioned the possibility of having some of the algorithms perform their calculations in parallel. The C++11 standard has this to say:

Unless otherwise specified, C++ standard library functions shall perform all operations solely within the current thread if those operations have effects that are visible (1.10) to users.

[ Note: This allows implementations to parallelize operations if there are no visible side effects. —end note ]

**C++11 standard** *17.6.5.9*

I.e. no parallelism unless it’s invisible to the user.

There are some interesting proposals for explicitly parallel algorithms. For example, see section 6.6 *Associativity/Commutativity of Binary Operators* (p66) of A Parallel Algorithms Library which points out that some parallel algorithms will require associative and commutative operators.

Parallelizing the Standard Algorithms Library also looks at the topic of parallelization. For example, see section 5.2 *std::accumulate versus thrust::reduce* for a discussion of the need for associative and commutative operators.

Several of my references in this post came from this website which contains many papers from the C++ standards committee. If you have far too much time to kill, the site contains fascinating information – current and historical.

The example given for floating-point math not being commutative merely shows that it’s not associative (since it compares (0.1+0.2)+0.3 to (0.3+0.2)+0.1. Had the comparison instead been with (0.2+0.1)+0.3, 0.3+(0.1+0.2), or 0.3+(0.2+0.1), all three of the latter expressions would equal the first, because IEEE-754 floating-point math is commutative for addition and multiplication; it is anti-commutative for subtraction [x-y will be equivalent to -1.0 * (y-x)] except in the case where operands are equivalent; such cases yield positive zero, and swapping the identical operands will still yield positive zero, which when multiplied by -1.0 will yield negative zero].

I think the nastiest gotcha with floating-point math (the one aspect of IEEE-754 that I absolutely despise) is that the relational operators are non-transitive, and neither the == nor != operator tests an equivalence relation. If one has a list that contains any copies of a few numbers and wants to find all the unique values in the list, a floating-point equality test will work just fine *unless* of the duplicated values is NaN. I wonder if anyone at IEEE considered that even numerical operations often need to maintain things like sorted lists, and that the behavior of NaN requires code to either jump through a lot of hoops to deal with NaN or else yield meaningless results (not necessarily NaN) when comparisons behave nontransitively.

Thank you for the correction John. I have updated the post, and also added this to my list of “things I thought I knew about floating point arithmetic but really didn’t”.

One article I rather liked was How Java’s Floating-Point Hurts Everyone Everywhere in which Prof. W. Kahan (one of the major forces behind IEEE-754) decries Java’s approach to floating-point math, which since the article was written seems to have spread elsewhere as well. The approach to floating-point math which Prof. Kahan advocates (I do too) is to have implementations promote all floating-point operands to values to the longest type

available to the programmer, perform computations on that type, and then round the results back to a smaller type for storage. It’s important that the intermediate-result type be available to the programmer as a variable type, so that something like:long double t=d1+d2;

double resulta = t+d3a, resultb = t+d3b;

be equivalent to:

double resulta = d1+d2+d3a, resultb = d1+d2+d3b;

Unfortunately, many C compilers, for a reason described below, used 80-bit precision on intermediate results *but* made “long double” a synonym for a 64-bit double–a horrible behavior which caused many programmers to unfairly dislike 80-bit types.

If intermediate results are performed as an 80-bit extended double, then addition of double values be associative much more often than when every intermediate computation is rounded to a 64-bit double. Further, most 16-bit or 32-bit platform which lack floating-point support (e.g. today’s embedded systems using ARM Cortex-M0 or M3 processors) perform floating-point math by converting numbers to something equivalent to extended precision, then doing the operation, and then converting the result back. When performing multiple operations in sequence, keeping the intermediate results in an extended-double type would make things not only more precise but also *faster*. A true win-win.

IMHO, while Java put the nail in the coffin for proper floating-point math, much of the fault lies with C’s “printf” statement, combined with the inability for a “…” parameter in a prototype to indicate whether the code expects floating-point values to all be passed as 64-bit, 80-bit, or a mixture of both. Originally in C the longest type was “double”, and varargs arguments promoted to that type, so given float f; printf(“%6.2f %6.2f”, f*0.1, f*0.1f ); it wasn’t necessary for printf to worry about the difference between float and double. Having all floating-point values promote to long double (and having var-args extraction expect float, double, and long double to all be passed as long double) would have worked nicely except that any compiled binaries which expected floating-point varargs to be passed as 64-bit values would be incompatible with compilers that used an 80-bit long double. IMHO, the solution would have been to allow the prototype to specify how the values should be passed, but that’s not what ANSI did.

ANSI instead decided to have float and double pass as double, and long double pass as long double; this led to all sorts of compatibility nightmares on platforms where the two types differed. The solution chosen by many compiler vendors was alas to make “long double” simply be synonymous with “double”, which fixed that problem, but ruined the semantics of precise floating-point computation.

BTW, a nice example where eager promotion can make a huge difference: use Heron’s formula to compute the area of a triangle with sides 8388607f, 8388607f, and 3f. Computing the semi-perimeter with a 24-bit mantissa will yield a value of 8388608; using a 25-bit mantissa would yield 8388608.5. Computing sqrt(s*(s-a)*(s-b)*(s-c)) with the former value yields roughly 8388606.5, which is nowhere near the correct value of approximately 12582910.5. Adding just *one bit* of precision to the intermediate computation adds 23 bits of precision to the final result.