Algorithms and Variations 1

In this talk, Alex Stepanov states “Science is about classification”.

This series is going to look at different ways we can classify the standard algorithms. Classification involves finding patterns, and with any luck the patterns will reveal things that we didn’t previously know.

I’ll start by listing all of the algorithms – 144 in total:

T accumulate (InputIterator, InputIterator, T)
T accumulate (InputIterator, InputIterator, T, BinaryOperation)
OutputIterator adjacent_difference (InputIterator, InputIterator, OutputIterator)
OutputIterator adjacent_difference (InputIterator, InputIterator, OutputIterator, BinaryOperation)
ForwardIterator adjacent_find (ForwardIterator, ForwardIterator)
ForwardIterator adjacent_find (ForwardIterator, ForwardIterator, BinaryPredicate)
bool all_of (InputIterator, InputIterator, Predicate)
bool any_of (InputIterator, InputIterator, Predicate)
bool binary_search (ForwardIterator, ForwardIterator, T)
bool binary_search (ForwardIterator, ForwardIterator, T, Compare)
OutputIterator copy (InputIterator, InputIterator, OutputIterator)
BidirectionalIterator2 copy_backward (BidirectionalIterator1, BidirectionalIterator1, BidirectionalIterator2)
OutputIterator copy_if (InputIterator, InputIterator, OutputIterator, Predicate)
OutputIterator copy_n (InputIterator, Size, OutputIterator)
difference_type count (InputIterator, InputIterator, T)
difference_type count_if (InputIterator, InputIterator, Predicate)
bool equal (InputIterator1, InputIterator1, InputIterator2)
bool equal (InputIterator1, InputIterator1, InputIterator2, BinaryPredicate)
pair<ForwardIterator, ForwardIterator> equal_range (ForwardIterator, ForwardIterator, T)
pair<ForwardIterator, ForwardIterator> equal_range (ForwardIterator, ForwardIterator, T, Compare)
void fill (ForwardIterator, ForwardIterator, T)
OutputIterator fill_n (OutputIterator, Size, T)
InputIterator find (InputIterator, InputIterator, T)
ForwardIterator1 find_end (ForwardIterator1, ForwardIterator1, ForwardIterator2, ForwardIterator2)
ForwardIterator1 find_end (ForwardIterator1, ForwardIterator1, ForwardIterator2, ForwardIterator2, BinaryPredicate)
InputIterator find_first_of (InputIterator, InputIterator, ForwardIterator, ForwardIterator)
InputIterator find_first_of (InputIterator, InputIterator, ForwardIterator, ForwardIterator, BinaryPredicate)
InputIterator find_if (InputIterator, InputIterator, Predicate)
InputIterator find_if_not (InputIterator, InputIterator, Predicate)
Function for_each (InputIterator, InputIterator, Function)
void generate (ForwardIterator, ForwardIterator, Generator)
OutputIterator generate_n (OutputIterator, Size, Generator)
bool includes (InputIterator1, InputIterator1, InputIterator2, InputIterator2)
bool includes (InputIterator1, InputIterator1, InputIterator2, InputIterator2, Compare)
T inner_product (InputIterator1, InputIterator1, InputIterator2, T)
T inner_product (InputIterator1, InputIterator1, InputIterator2, T, BinaryOperation1, BinaryOperation2)
void inplace_merge (BidirectionalIterator, BidirectionalIterator, BidirectionalIterator)
void inplace_merge (BidirectionalIterator, BidirectionalIterator, BidirectionalIterator, Compare)
void iota (ForwardIterator, ForwardIterator, T)
bool is_heap (RandomAccessIterator, RandomAccessIterator)
bool is_heap (RandomAccessIterator, RandomAccessIterator, Compare)
RandomAccessIterator is_heap_until (RandomAccessIterator, RandomAccessIterator)
RandomAccessIterator is_heap_until (RandomAccessIterator, RandomAccessIterator, Compare)
bool is_partitioned (InputIterator, InputIterator, Predicate)
bool is_permutation (ForwardIterator1, ForwardIterator1, ForwardIterator2)
bool is_permutation (ForwardIterator1, ForwardIterator1, ForwardIterator2, BinaryPredicate)
bool is_sorted (ForwardIterator, ForwardIterator)
bool is_sorted (ForwardIterator, ForwardIterator, Compare)
bool is_sorted_until (ForwardIterator, ForwardIterator)
bool is_sorted_until (ForwardIterator, ForwardIterator, Compare)
void iter_swap (ForwardIterator1, ForwardIterator2)
bool lexicographical_compare (InputIterator1, InputIterator1, InputIterator2, InputIterator2)
bool lexicographical_compare (InputIterator1, InputIterator1, InputIterator2, InputIterator2, Compare)
ForwardIterator lower_bound (ForwardIterator, ForwardIterator, T)
ForwardIterator lower_bound (ForwardIterator, ForwardIterator, T, Compare)
void make_heap (RandomAccessIterator, RandomAccessIterator)
void make_heap (RandomAccessIterator, RandomAccessIterator, Compare)
const T& max (T, T)
const T& max (T, T, Compare)
T max (InitializerList)
T max (InitializerList, Compare)
ForwardIterator max_element (ForwardIterator, ForwardIterator)
ForwardIterator max_element (ForwardIterator, ForwardIterator, Compare)
OutputIterator merge (InputIterator1, InputIterator1, InputIterator2, InputIterator2, OutputIterator)
OutputIterator merge (InputIterator1, InputIterator1, InputIterator2, InputIterator2, OutputIterator, Compare)
const T& min (T, T)
const T& min (T, T, Compare)
T min (InitializerList)
T min (InitializerList, Compare)
ForwardIterator min_element (ForwardIterator, ForwardIterator)
ForwardIterator min_element (ForwardIterator, ForwardIterator, Compare)
pair<const T&, const T&> minmax (T, T)
pair<const T&, const T&> minmax (T, T, Compare)
pair<T, T> minmax (InitializerList)
pair<T, T> minmax (InitializerList, Compare)
pair<ForwardIterator, ForwardIterator> minmax_element (ForwardIterator, ForwardIterator)
pair<ForwardIterator, ForwardIterator> minmax_element (ForwardIterator, ForwardIterator, Compare)
pair<InputIterator1, InputIterator2> mismatch (InputIterator1, InputIterator1, InputIterator2)
pair<InputIterator1, InputIterator2> mismatch (InputIterator1, InputIterator1, InputIterator2, BinaryPredicate)
OutputIterator move (InputIterator, InputIterator, OutputIterator)
BidirectionalIterator2 move_backward (BidirectionalIterator1, BidirectionalIterator1, BidirectionalIterator2)
bool next_permutation (BidirectionalIterator, BidirectionalIterator)
bool next_permutation (BidirectionalIterator, BidirectionalIterator, Compare)
bool none_of (InputIterator, InputIterator, Predicate)
void nth_element (RandomAccessIterator, RandomAccessIterator, RandomAccessIterator)
void nth_element (RandomAccessIterator, RandomAccessIterator, RandomAccessIterator, Compare)
void partial_sort (RandomAccessIterator, RandomAccessIterator, RandomAccessIterator)
void partial_sort (RandomAccessIterator, RandomAccessIterator, RandomAccessIterator, Compare)
RandomAccessIterator partial_sort_copy (InputIterator, InputIterator, RandomAccessIterator, RandomAccessIterator)
RandomAccessIterator partial_sort_copy (InputIterator, InputIterator, RandomAccessIterator, RandomAccessIterator, Compare)
OutputIterator partial_sum (InputIterator, InputIterator, OutputIterator)
OutputIterator partial_sum (InputIterator, InputIterator, OutputIterator, BinaryOperation)
ForwardIterator partition (ForwardIterator, ForwardIterator, Predicate)
pair<OutputIterator1, OutputIterator2> partition_copy (InputIterator, InputIterator, OutputIterator1, OutputIterator2, Predicate)
ForwardIterator partition_point (ForwardIterator, ForwardIterator, Predicate)
void pop_heap (RandomAccessIterator, RandomAccessIterator)
void pop_heap (RandomAccessIterator, RandomAccessIterator, Compare)
bool prev_permutation (BidirectionalIterator, BidirectionalIterator)
bool prev_permutation (BidirectionalIterator, BidirectionalIterator, Compare)
void push_heap (RandomAccessIterator, RandomAccessIterator)
void push_heap (RandomAccessIterator, RandomAccessIterator, Compare)
void random_shuffle (RandomAccessIterator, RandomAccessIterator)
void random_shuffle (RandomAccessIterator, RandomAccessIterator, RandomNumberGeneratorenerator)
ForwardIterator remove (ForwardIterator, ForwardIterator, T)
OutputIterator remove_copy (InputIterator, InputIterator, OutputIterator, T)
OutputIterator remove_copy_if (InputIterator, InputIterator, OutputIterator, Predicate)
ForwardIterator remove_if (ForwardIterator, ForwardIterator, Predicate)
void replace (ForwardIterator, ForwardIterator, T, T)
OutputIterator replace_copy (InputIterator, InputIterator, OutputIterator, T, T)
OutputIterator replace_copy_if (InputIterator, InputIterator, OutputIterator, Predicate, T)
void replace_if (ForwardIterator, ForwardIterator, Predicate, T)
void reverse (BidirectionalIterator, BidirectionalIterator)
OutputIterator reverse_copy (BidirectionalIterator, BidirectionalIterator, OutputIterator)
ForwardIterator rotate (ForwardIterator, ForwardIterator, ForwardIterator)
OutputIterator rotate_copy (ForwardIterator, ForwardIterator, ForwardIterator, OutputIterator)
ForwardIterator1 search (ForwardIterator1, ForwardIterator1, ForwardIterator2, ForwardIterator2)
ForwardIterator1 search (ForwardIterator1, ForwardIterator1, ForwardIterator2, ForwardIterator2, BinaryPredicate)
ForwardIterator search_n (ForwardIterator, ForwardIterator, Size, T)
ForwardIterator search_n (ForwardIterator, ForwardIterator, Size, T, BinaryPredicate)
OutputIterator set_difference (InputIterator1, InputIterator1, InputIterator2, InputIterator2, OutputIterator)
OutputIterator set_difference (InputIterator1, InputIterator1, InputIterator2, InputIterator2, OutputIterator, Compare)
OutputIterator set_intersection (InputIterator1, InputIterator1, InputIterator2, InputIterator2, OutputIterator)
OutputIterator set_intersection (InputIterator1, InputIterator1, InputIterator2, InputIterator2, OutputIterator, Compare)
OutputIterator set_symmetric_difference (InputIterator1, InputIterator1, InputIterator2, InputIterator2, OutputIterator)
OutputIterator set_symmetric_difference (InputIterator1, InputIterator1, InputIterator2, InputIterator2, OutputIterator, Compare)
OutputIterator set_union (InputIterator1, InputIterator1, InputIterator2, InputIterator2, OutputIterator)
OutputIterator set_union (InputIterator1, InputIterator1, InputIterator2, InputIterator2, OutputIterator, Compare)
void shuffle (RandomAccessIterator, RandomAccessIterator, URandomNumberGeneratorenerator)
void sort (RandomAccessIterator, RandomAccessIterator)
void sort (RandomAccessIterator, RandomAccessIterator, Compare)
void sort_heap (RandomAccessIterator, RandomAccessIterator)
void sort_heap (RandomAccessIterator, RandomAccessIterator, Compare)
BidirectionalIterator stable_partition (BidirectionalIterator, BidirectionalIterator, Predicate)
void stable_sort (RandomAccessIterator, RandomAccessIterator)
void stable_sort (RandomAccessIterator, RandomAccessIterator, Compare)
ForwardIterator2 swap_ranges (ForwardIterator1, ForwardIterator1, ForwardIterator2)
OutputIterator transform (InputIterator, InputIterator, OutputIterator, UnaryOperation)
OutputIterator transform (InputIterator1, InputIterator1, InputIterator2, OutputIterator, UnaryOperation)
ForwardIterator unique (ForwardIterator, ForwardIterator)
ForwardIterator unique (ForwardIterator, ForwardIterator, BinaryPredicate)
OutputIterator unique_copy (InputIterator, InputIterator, OutputIterator)
OutputIterator unique_copy (InputIterator, InputIterator, OutputIterator, BinaryPredicate)
ForwardIterator upper_bound (ForwardIterator, ForwardIterator, T)
ForwardIterator upper_bound (ForwardIterator, ForwardIterator, T, Compare)

The standard already classifies the algorithms in various ways – e.g. those included in the header <algorithm> vs. the header <numeric> or mutating vs. non-mutating algorithms. I am sure that we can find some more classifications that are interesting.

Our first classification will be a simple one – algorithms that guarantee that the input elements will be operated on in a particular order vs. algorithms that do not make that guarantee. Actually, this ended up not being as simple as I thought which makes it even more instructive.

All we need to do is search the standard for the phrases “and proceeding to” and “in order”. It turns out that there are 13 algorithms that guarantee their order of operation:

T accumulate (InputIterator, InputIterator, T)
T accumulate (InputIterator, InputIterator, T, BinaryOperation)
OutputIterator adjacent_difference (InputIterator, InputIterator, OutputIterator)
OutputIterator adjacent_difference (InputIterator, InputIterator, OutputIterator, BinaryOperation)
OutputIterator copy (InputIterator, InputIterator, OutputIterator)
BidirectionalIterator2 copy_backward (BidirectionalIterator1, BidirectionalIterator1, BidirectionalIterator2)
Function for_each (InputIterator, InputIterator, Function)
T inner_product (InputIterator1, InputIterator1, InputIterator2, T)
T inner_product (InputIterator1, InputIterator1, InputIterator2, T, BinaryOperation1, BinaryOperation2)
OutputIterator move (InputIterator, InputIterator, OutputIterator)
BidirectionalIterator2 move_backward (BidirectionalIterator1, BidirectionalIterator1, BidirectionalIterator2)
OutputIterator partial_sum (InputIterator, InputIterator, OutputIterator)
OutputIterator partial_sum (InputIterator, InputIterator, OutputIterator, BinaryOperation)

[ Edit: I should have included std::iota in this list. See my follow-up post. ]

No other algorithms guarantee the operation order. Of course there are some algorithms where we wouldn’t expect a guaranteed order – we want std::lower_bound to use whatever order is most efficient. What about an algorithm like std::find though? Since that returns the first occurrence of a value surely it must access its input range from beginning to end. The standard does not guarantee it. I can’t think of a sensible algorithm that would do anything other than go from begin to end (unless we get into parallelism which opens up another can of worms), but the standard does not specifically make that guarantee.

Here’s another example that does not guarantee the operation order – std::transform. At first glance, that seems to make sense. Take this example:

std::vector< int > v0( { 1, 2, 3, 4, 5 } );
std::vector< int > v1( 5 );

std::transform( 
    std::begin( v0 ), 
    std::end( v0 ), 
    std::begin( v1 ), 
    []( int i )
{
    return i + 1;
} );

There are five results, each of which is computed entirely independently of the others. We know that the operation supplied to std::transform should not have any side effects so it doesn’t matter what order the results are computed in, they might even be computed in parallel. It doesn’t matter whether the result in position [0] is written before or after the result in position [1] (the results have to be written to the correct position but they don’t have to be written in a particular order).

Now let’s change the problem slightly. In this case we won’t pre-allocate the output vector and we’ll use std::back_inserter:

std::vector< int > v0( { 1, 2, 3, 4, 5 } );
std::vector< int > v1;

std::transform( 
    std::begin( v0 ), 
    std::end( v0 ), 
    std::back_inserter( v1 ), 
    []( int i )
{
    return i + 1;
} );

Now it matters what order the results are written in – the order they are written in defines the position they are written to. Doesn’t this conflict with the lack of guarantee about the order of computation?

Given that std::back_inserter is a common idiom it’s obvious that the answer to the previous question is “no, there is no conflict”. It isn’t so obvious why there is no conflict.

Let’s go back to the first example I showed:

std::vector< int > v0( { 1, 2, 3, 4, 5 } );
std::vector< int > v1( 5 );

std::transform( 
    std::begin( v0 ), 
    std::end( v0 ), 
    std::begin( v1 ), 
    []( int i )
{
    return i + 1;
} );

and the declaration of the single input range version of std::transform:

OutputIterator transform (InputIterator, InputIterator, OutputIterator, UnaryOperation)

My example is using std::vector – that gives us random access iterators that let us jump around all over the place. If you look at the declaration of std::transform though, it will work with input and output iterators. Input and output iterators do not allow random access – they don’t even allow you to iterate over the same range twice. Even though the algorithm will allow arbitrary access, the iterators don’t.

A quick look at the standard tells us that std::back_inserter returns std::back_insert_iterator which is an output iterator. If the destination of a std::transform call is an output iterator, the results must be written in the correct order. Internally they can still be calculated in any order (possible, even with input iterators, but unlikely to be efficient), but they must be written in the correct order.

So it is possible for a standard-conforming std::transform implementation to access the input elements in different orders depending on the type of iterators being used. There is definitely precedent for different algorithm implementations being used for different iterator types. Microsoft’s Stephan T. Lavavej demonstrates how template metaprogramnming can achieve this here.

However, in practice, I have never come across an implementation of std::transform that does anything other than iterate from beginning to end and I don’t expect to – at least not under the name std::transform. When I have been working with parallel code the parallelism has always been explicitly called out – I really don’t want the STL kicking off separate threads unless I tell it to. Intel’s TBB contains several explicitly parallel algorithms, but they all have the word “parallel” in their names.

In conclusion, I started off with what I though was going to be a simple classification. I went through a phase where I thought that it wasn’t so simple after all, then finally ended up at the simple place I started at – but perhaps a little wiser. I haven’t been able to find a lot of information about the access order of algorithms online, but I do recommend this 2001 DDJ article on the subject.

Quote of the week – teaching

Not exactly a quote – I don’t have the exact wording and I am not certain who said it (I am separated from the original by at least two degrees) – but it makes an important enough point that I think it’s worth writing about.

A well designed linear algebra library should teach the underlying mathematics of linear algebra.

Attributed to Jim Blinn

Jim Blinn is a big name in the computer graphics world. I heard this quote in the context of computer graphics (which uses a lot of linear algebra) but you can substitute anything you’d write a library for e.g. statistics, databases, containers, date and time calculations. I don’t know exactly what features of a library Jim Blinn had in mind, but that isn’t going to stop me speculating.

Why might it be important to follow the advice in the quote?

  • If you already know about the subject it will be easier to start using the library.
  • If you don’t know about the subject (but have to use the library) you’ll learn about the subject, and can also supplement that learning with other teaching material.
  • If you are writing a library, this advice gives you an additional design tool.

Moving on, what features of a library help teach the underlying theory? I can think of a few, I am sure there are many more – please add to this list in the comments.

  • Nomenclature. Naming comes up over and over again as an important topic. If there are already standard names for things, use them in your library.
  • Clarity of data. Is it obvious what data is required for a given function (or class or module) to do it’s job? I often find that there’s more data available than is required, and as a newcomer digging through the library it can take me some time to work out what is really needed. Simple functions that take exactly the data they require and return exactly the thing to be calculated are great. The STL algorithms do this – they only take the values that are necessary for them to do their job, and they are very specific about the types of iterators required.
  • No short cuts. In a 3D graphics library it is very tempting to make points and vectors the same thing. They both have x, y, z values, they have a fair number of operations in common and the ones that aren’t strictly in common usually have an obvious and reasonable meaning for them both. However, points and vectors are not the same thing. This is an example where using the type system helps.

Quote of the week – testing

Testing by itself does not improve software quality. Test results are an indicator of quality, but in and of themselves, they don’t improve it. Trying to improve software quality by increasing the amount of testing is like trying to lose weight by weighing yourself more often. What you eat before you step onto the scale determines how much you will weigh, and the software development techniques you use determine how many errors testing will find. If you want to lose weight, don’t buy a new scale; change your diet. If you want to improve your software, don’t test more; develop better.

Steve McConnell Code Complete: A Practical Handbook of Software Construction

This is a great quote, and one that’s had a lot of impact on me. I have often seen teams complain about not having enough time (and I have made that complaint many times myself). When you ask what the teams would do given more time the answer is usually “more testing”. As McConnell points out, more testing isn’t going to help if your software is badly written.

Over the past 15 years or so, two testing techniques have come into more common use – test driven development and the agile practice of having fast unit tests that are run every few minutes. These straddle the border between testing and development techniques – to use the weight loss analogy, it’s like being able to weigh yourself every few minutes throughout the day and get instant feedback on how the decisions you’re making (eat an apple vs. eat a cupcake; sit on the couch vs. go for a walk) will affect your weight (and yes, this is pushing the analogy too far).

For more Steve McConnell goodness take a look at this interview from 2003 where he talks about his career, how he writes books and the challenges of running a business. There’s also a great post on his blog where he puts his estimating techniques to the test while building a fort for his children.

Quote of the week – Atomic Accidents

fissionable materials always seemed capable of finding a flaw in the best intentions.

James Mahaffey Atomic Accidents

I have just finished reading Atomic Accidents – it’s a fascinating and frightening book. Don’t read it if your sleep is easily disturbed (I mean it – I have started having dreams where critical quantities of uranium are about to come together and I cannot stop them).

This post is going to draw some analogies between nuclear engineering and software engineering. Respecting Martin Fowler’s quote about analogies (discussed here) I am going to try to extract questions from the comparison, not answers. In any case, many of the answers from nuclear engineering are unlikely to help us in the software field. For example (off topic fact of the day), the shape of the vessel in which radioactive material is stored has a big impact on whether it will go critical or not. For safety you want as much surface area per unit volume as possible. Long, skinny tubes are safe, spheres and scaled up tomato soup cans are not (food cans are designed to use as little tin as possible compared to the volume that they hold). Given the contents of some office coffee machines you might want to stop using those nice cylindrical coffee mugs.

There are a number of themes that run through the book. I have picked out a few for discussion:

  1. An impossible situation. Something that could not possibly happen did. In line with Murphy’s Law, any container of the wrong shape will inevitably end up containing fissionable material, regardless of how many safety precautions are in place. In one accident, the container in question was a cooking pot from the kitchen that was not physically connected to anything else in the system.
  2. Complex systems. Complexity is mentioned often in the book, and never in a good way. Complexity kills, quite literally in the case of nuclear engineering. The more complex the system the more difficult it is to predict how it will react.
  3. Things are (unknowingly) going wrong even in apparently normal operation. The system appears to be working fine, however an excess of material is collecting in an unexpected place, or some amount of coolant is being lost. None of this matters until the moment it does.
  4. A lack of visibility into what is going wrong. Many of the accidents were made worse because the operators did not have a clear picture of what was happening. Or they did have a clear picture but didn’t believe it or understand it.
  5. Lessons are not always learned.

All of these apply in the context of software:

  1. An impossible situation. I have often seen an engineer (sometimes me) state that something “cannot be happening”, despite the fact that it is quite plainly happening. My rule is that I get to say “that’s impossible” once, then I move on, deal with reality and fix the problem.
  2. Complex systems. We have talked about complexity on this blog before. As if complexity wasn’t bad enough on its own it makes every other problem far worse.
  3. Things are going wrong even in normal looking operation. Every time I dive into a system that I believe is working perfectly I discover something happening that I didn’t expect. I might be running a debugger and see that some pointer is NULL when it shouldn’t be, or looking at a log file showing that function B is called before function A despite that being incorrect, or studying profiler output demonstrating that some function is called 1,000 times when it should only be called once. In Writing Solid Code, Steve Maguire suggests stepping through all new code in the debugger to ensure that it is doing the right thing. There is a difference between “looking like it’s working” and “actually working”.
  4. A lack of visibility into what is going wrong. Modern software systems are complicated and have many moving parts. Imagine a video delivery system that runs over the internet with a web front end, a database back end and a selection of different codecs used for different browsers on different devices. There are many places for things to go wrong, and too often we lack the tools to drill down to the core of the problem.
  5. Lessons are not always learned. I could rant about this for a very long time. At some point I probably will but the full rant deserves a post of its own. I have seen far too many instances in this industry of willful ignorance about best practices and the accumulated knowledge of our predecessors and experts. We’re constantly re-inventing the wheel and we’re not even making it round.

I am quite happy that I do not work on safety critical systems, although I did once write software to control a robot that was 12′ tall, weighed 1,000 pounds and was holding a lit welding torch. I stood well back when we turned it on.

In conclusion, one more quote from the book:

A safety disk blew open, and sodium started oozing out a relief vent, hit the air in the reactor building, and made a ghastly mess.

No one was hurt. When such a complicated system is built using so many new ideas and mechanisms, there will be unexpected turns, and this was one of them. The reactor was in a double-hulled stainless steel container, and it and the entire sodium loop were encased in a domed metal building, designed to remain sealed if a 500-pound box of TNT were exploded on the main floor. It was honestly felt that Detroit was not in danger, no matter what happened.

Atomic Accidents. Read it, unless you’re of a nervous disposition.

Quote of the week – consistency

A good architecture is consistent in the sense that, given a partial knowledge of the system, one can predict the remainder.

Fred Brooks The Design of Design

Fred Brooks is well known for The Mythical Man-Month, but his newer book The Design of Design is also worth reading. He takes general design principles and applies them to computer software, computer hardware, buildings large and small, aeroplanes, bridges and more.

One of the main themes of The Mythical Man-Month is conceptual integrity. He carries this drumbeat over to The Design of Design and explores it some more. For my own part, there is a joy in being able to predict where something is in a system, or what it’s called, because of the consistency running through the system.

Auto

I am deeply suspicious of auto and decltype. I acknowledge that I am on the opposite side of the argument to people such as Bjarne Stroustrup, Herb Sutter and Scott Meyers. This is not a comfortable place to be. I picked these three luminaries because it is easy to find quotes from them – I am sure there are plenty of other people who agree with them.

Unless there is a good reason not to, use auto in small scopes

Bjarne Stroustrup The C++ Programming Language 4th edition, 6.3.6.1

Almost Always Auto

Herb Sutter Almost Always Auto

Prefer auto to explicit type declarations

Scott Meyers Effective Modern C++ (Early release) Item 5

This is not an attempt to trash any of these authors – I have quoted their conclusions however, they have well thought through reasons backing up those conclusions. I highly recommend reading the entire pieces that I took these quotes from – as always with Stroustrup, Sutter and Meyers, even when I disagree with their conclusions (which isn’t that often) I learn something from their arguments. All three authors talk about the problems that can arise from auto – note that even in the quotes I have picked there are phrases like “unless there is a good reason not to”, “almost always” and “prefer”.

I think my main disagreement is with the weighting given to some of the disadvantages. My objection to auto can be summarized as “makes code easier to write, makes it harder to read”, and I weight this very highly, as someone who has spent a fair amount of his career trying to get up to speed with large codebases written by someone else, and trying to get new team members up to speed with large codebases written partially by me.

For example, I joined a team with a large, pre-existing, pre-C++11 codebase that used compiler extensions to fake auto and decltype. Learning the codebase was made far more difficult by the absence of explicitly named types in many functions. The code would not have compiled unless all the types were compatible, that’s good, however I want to know more than that. I want to know what the types are so that I can go and look them up and find out what else they can do for me. I want to look at a function and know what types are the same and what aren’t. Sooner or later I am going to make changes to that function – the changes I can make will be constrained by what the types let me do (and if the types constrain me too much I am going to have to either extend the types or use different types). I am saying “types” over and over again – they’re a big deal.

I want to know what type everything is, and I weight that desire very highly.

There will be IDEs which will make it easy to discover the types being deduced, however as Scott Meyers acknowledges in Item 4 of “Effective Modern C++” they don’t do a perfect job (although I assume that will improve), and not all of us are in a position to use IDEs anyway.

Type deduction is not new in C++ – we’ve been using templates for years, and I have fewer objections to using templates. I am trying to work out why I find templates more acceptable than auto.

Reason #1 has to be that I am more used to templates than I am to auto. It takes me time to get used to a new feature. However, there is more to it than that.

Template types still have a name. Imagine this introduction to a template class:

template< typename ITERATOR >

Within the class we don’t know exactly what the type of ITERATOR is, however the name gives us a strong hint that it should be some sort of iterator.

The fact that the template types have a name also lets us see what relation (if any) there is between types in a piece of code. Contrast this:

auto i( someFunction() );
auto j( someOtherFunction() );

with

ITERATOR i( someFunction() );
ITERATOR j( someOtherFunction() );

The template version makes it clear that i and j are the same type.

On the one hand I dislike the argument “this feature can be misused so we shouldn’t ever use it”, on the other hand I have already seen this feature be massively misused, and I suspect that it will be misused more often than not. Herb Sutter writes:

Remember that preferring auto variables is motivated primarily by correctness, performance, maintainability, and robustness – and only lastly about typing convenience.

Herb Sutter Almost Always Auto

I think this is fine advice, I think that everyone in the world should stick to it. I do not believe that they will. In my experience (and I really hope that other people have different experiences), saving some typing often outweighs everything else. Delayed gratification is an unknown concept in many programming shops.

With more experience I might come around to auto. There are certainly people who have thought about this more deeply than I have who are in favour of auto. There are also definitely places where auto will help in template code. I know I dismissed the “reduces typing” argument earlier, but it would be really nice to avoid have to type typename quite so often.


A little more follow up on the team using the compiler extension versions of auto and decltype. Their use of auto was problematic, it was also exacerbated by a number of other factors. The team loved macros and loved token pasting (##). There were many classes whose declaration and definition were created using macros, and the name of the class was generated using token pasting. Combine that with their use of auto and there were classes whose full name never appeared in the code. They were impossible to grep for and I doubt that there were many IDEs (which that team wasn’t using anyway) that would have helped. I did not last very long on that team.

Quote of the week

I have always found that plans are useless, but planning is indispensable.

Dwight Eisenhower

Eisenhower was making this comment in a military context, and it fits in well with Helmuth von Moltke’s much earlier quote:

No plan of battle ever survives contact with the enemy.

If we remove the quotes from their military context (although I have encountered plenty of people in my career who can legitimately lay claim to being “the enemy”), they talk about the usefulness of the planning process itself. A good project plan is rarely “useless”, but the planning that went into it should have brought many different aspects of the problem into play, and should have led to a far greater understanding of the problem space. This understanding helps when the plan makes contact with “the enemy” and has to be rethought.

I refer back to this post where the answer to the question “what was the most useful information you recorded during those design meetings?” was “the reasons why we didn’t do something”.

The two quotes also show the differences between “traditional” design processes and Agile design processes. Up front planning absolutely has its uses, but the plan inevitably has to change when you start implementing it.

Google’s C++ style guide – follow up

My recent post about Google’s C++ style guide attracted some great comments. Billy O’Neal corrected my ignorance about C++ formatting tools by alerting me to the existence of clang-format; John Payson came up with a way of getting piecemeal from the “no exceptions allowed” state to the “exceptions allowed and all code able to handle them” state, and Titus Winters (a maintainer of the Google style guide) mentioned a talk he gave at CppCon on the subject.

The talk is on YouTube here, and is worth watching in its entirety, but here are a few highlights I took away from it:

1:42 “How should we format our code? BORING QUESTION!”. Titus introduces clang-format.
3:14 What is the purpose of a style guide?
5:15 Google’s publicly available style guide is “probably optimizing for a different set of values than your organization needs”
6:48 Google has an indexer – Kythe – which should be available as open source soon.
9:39 Google plan to be around for at least 10 years. Almost everywhere I have worked has assumed that they are going to be around for 10 years (and several of those places went bust a long time short of 10 years), but very few of my employers have made long term decisions about their code – they might have assumed that they will be around for 10 years but on the code side they certainly didn’t plan for it.
10:55 The philosophies of the style guide.
14:45 A future version of the style guide will allow streams.
28:10 “No non-const references as function arguments” is one of the most controversial rules in the style guide. I am not surprised by this, but I am disappointed. The arguments in favour of no non-const references always seemed good to me.
38:25 “I don’t care about throwing out of memory exceptions”. There are certainly products where an out of memory exception just translates to “die instantly”, however that isn’t universally true. I have never worked on a project that was able to recover and continue after running out of memory but I have worked on products that have wanted to know about the out of memory case so they can save the data the user is working on. It’s just another reminder that guidelines depend on the products they are being used for.
57:20 Probably the best answer to any formatting question that I have heard.
Q: What about tabs vs. spaces? What’s Google’s take on that?
A: I don’t care … I just use clang-format

The theme of the day is clang-format. I spent some time over the weekend playing around with clang-format and it is excellent. I haven’t worked out how to get it to do quite everything I want, however the consistency benefits far outweigh any minor deviations from my preferred style. Also, if it bothers me that much I have the source code and can add more options.

On a related note, this is the first time I have been able to get Clang and LLVM (mostly) compiling on Windows without spending a ridiculous amount of time on it. Windows has long been the afterthought platform for LLVM and Clang – it looks like it’s getting better.

Google’s C++ Style Guide

I read through Google’s C++ style guide recently and found some interesting recommendations. I am trying to avoid saying “right” or “wrong” – there are plenty of things in a style guide that are matters of opinion or are tailored to the needs of a specific company – so I am focusing on things I find interesting.

The direct link to the guide is http://google-styleguide.googlecode.com/svn/trunk/cppguide.html.

The guide is stored in a subversion repository which can be accessed at
http://google-styleguide.googlecode.com/svn

I am looking at revision 138 in the repository, checked in on 8 September 2014. At the very bottom of the document there’s a version number – 4.45.

I have no insider knowledge apart from one conversation with a Google engineer who told me that the “no exceptions” rule was the most controversial part of the guide.

It is remarkable that Google have a company-wide C++ style guide. I have worked in smaller companies (16 people, roughly half of whom were engineers) where we could not agree on anything code related. In my career, style guides (aka coding guidelines or coding standards) have created more divisive arguments than almost anything else, and enforcing them is the quickest way to upset everyone.

I do like the fact that many of the entries include pros, cons and a discussion of why a particular solution was chosen. As I have said before on this blog, even if you believe that you have made the right choice you should know what the drawbacks of that choice are.

The guide also talks a lot about consistency. I like consistency. There are times where the choices are almost arbitrary, however, it is much nicer not to have the code randomly select different choices.

Let’s start with the most controversial guideline.


Exceptions

We do not use C++ exceptions.

At least the guideline is clear. The writers of the guide also admit:

Things would probably be different if we had to do it all over again from scratch.

Taking away exceptions removes a very powerful and useful way of handling errors. It reduces the need to have error returns everywhere (which are rarely checked). Using exceptions does require your code to be written in an exception-safe manner, however that exception safe manner often has a bunch of other benefits – RAII helps with early returns and means you can’t forget the “clean up” on one code path. Smart pointers are a great way to manage lifetime (and smart pointers are encouraged by the style guide). I have worked at a number of places that banned exceptions for a variety of reasons, some good, some bad. It is possible to write decent C++ code without exceptions, but their absence does make life harder.

I have two major problems though:

  1. According to the given reasoning, there will never be a time when exceptions are allowed in Google’s code. The guideline states:

     
    Because most existing C++ code at Google is not prepared to deal with exceptions, it is comparatively difficult to adopt new code that generates exceptions.

    If the guideline is followed there will be even more code at Google that does not deal with exceptions making them even less likely to be allowed in the future. There is no way to get from here (no exceptions) to there (exceptions allowed). It is one thing to ban a feature because its current use causes problems, but for there to be no path to unbanning the feature is short-sighted in the extreme.

    You can’t even state that all new code must be exception safe. If you want code to be exception safe it has to be fully tested in the presence of exceptions. That would involve an additional build target with different settings, different versions of libraries and some additional specific tests for exception handling. That build is never going to exist.

  2. The knock on effects are horrible, in particular the effect on constructors.

Doing Work in Constructors

Avoid doing complex initialization in constructors.

First of all, the prohibition on calling virtual functions in constructors is perfectly reasonable. It’s one of those things that fall into the “don’t do that” category. The few times I have thought that I needed to do it were caused by errors in my design.

Global variables with complex constructors also fall into the “don’t do that” category, or at least “don’t do that unless you really know what you’re doing”. C++ is a harsh language if you don’t know what you’re doing. It’s a harsh language even if you do know what you’re doing.

Those are the points where I agree with Google, let’s look at the disagreements.

One of the joys of good constructors is that once the constructor has finished running you have a fully initialized object in a well-defined state that is ready to be used. That’s a simple rule and it works in many different creation scenarios. We can create the object on the stack:

SomeClass someObject( arg0, arg1, arg2 );

We can create it on the heap:

SomeClass* pSomeObject( new SomeClass( arg0, arg1, arg2 ) );

We can create the object as a member variable of another class:

SomeOtherClass::SomeOtherClass( some arguments )
: someObject_( arg0, arg1, arg2 )
{
}

We can create a temporary object for passing to a function:

someFunction( SomeClass( arg0, arg1, arg2 ) );

Two stage construction makes all of these examples longer, more fragile and less maintainable. Creating a temporary object is basically not possible – you have to create a named object then call init before passing it to a function (yes we can wrap construction and init inside a create function but why should we have to?)

The style guide states:

If the work fails, we now have an object whose initialization code failed, so it may be an [sic] indeterminate state.

With Google’s suggested “solution” (a separate init function) once the constructor has finished you have an object in an indeterminate (or at least not-fully-initialized) state. You then have to call the init function and you have to check any error return value from the init function (Note – have to. The call and the error check are not optional and you’d better not forget them or allow any future modifications to the code to accidentally delete them, or even move them too far from the original construction). Initializing const members is far more problematic with two stage construction.

In my career I have rarely had a problem with errors in constructors, although Google are certainly dealing with problems of scale that I have never encountered. Even so, I believe that this rule is hurting Google, not helping them.


Names and Order of Includes

Having praised Google for their use of pros and cons they let me down here. Google’s guideline is to start with the related header (most specific), then to jump to the least specific headers (C and C++ library includes), then gradually move to more specific headers.

The only rationale I have seen for a particular include order chooses the opposite order to Google – start with most specific, then move to less specific so that you finish with the C and C++ library headers. The rationale is to make sure that your own headers include all of the standard system files that they require – if you order the includes the other way it is easier to hide incorrect includes (or rather, lack of includes). Also, if for some reason you decided to #define linear_congruential( i ) ( i * 2 ) (or define any other symbol in the standard libraries), having the standard libraries last will almost certainly produce an error. Of course you shouldn’t be using macros in the first place, but for some inexplicable reason, people do.

In practice, I haven’t had many problems that are related to the order of includes (and when I have the problem has nearly always been that an include file wasn’t including all of the headers it required).

I do like their suggestion for alphabetical order within each section. I once worked with a developer who was diligent enough to maintain all the functions in his .cpp files in alphabetical order. I know that broke a bunch of guidelines about keeping related functions together, but damn, it was easy to find a function.


Function Names

Regular functions have mixed case; accessors and mutators match the name of the variable: MyExcitingFunction(), MyExcitingMethod(), my_exciting_member_variable(), set_my_exciting_member_variable().

I said at the beginning that I was going to try to avoid using words like “wrong”, but my tolerance has run out. This guideline is wrong.

How is it wrong? Let me count the ways:

  1. It breaks encapsulation. The name of the member function tells you something about the internal implementation.
  2. If the internal implementation changes so that the value is no longer stored in a member variable are you going to change the accessor / mutation functions and everywhere in the code that calls them? (The answer to that question is almost certainly “no”)
  3. Assume two classes exist, both of which have a concept of some number of files that they know about. One class stores the number of files in a member variable (so will have a function called num_files()), the other class calculates the number of files when necessary (so has a function called NumFiles()). You now have a template class that needs to be instantiated with a class that can tell you how many files it knows about. You cannot write a single template class that will work with both num_files() and NumFiles().
  4. In practice I am sure that if any of the first three points cause problems the name of the accessor will be changed one way or the other. There is a deeper problem though. If you’re thinking about the class in terms of accessors and mutators (or getters and setters) you are fundamentally thinking about the design incorrectly.

    Allen Holub’s provocatively named piece “Why getter and setter methods are evil” (and if anyone is supposed to not be evil it’s Google) is well worth reading. He makes some points that are beyond the scope of this post (“Don’t ask for the information you need to do the work; ask the object that has the information to do the work for you” has some interesting consequences that I am not wild about) , but he makes a killer point about “adding unnecessary capabilities to the classes”. He talks in terms of the capabilities of a class – he does that in other places as well (http://www.holub.com/goodies/rules.html), and in his book “Holub on Patterns: Learning Design Patterns by Looking at Code” where he states:

    An object is a bundle of capabilities … an object is defined by what it can do, not how it does it.

    To go back to our example, is it reasonable for our file handling class to be able to tell you how many files it knows about? If the answer to that question is “yes”, then it is quite reasonable to add a NumFiles() method. One of the capabilities of the class is that it can answer the question “how many files do you know about”. If, for some reason, (and this is a made up example which means I don’t have a good reason) it is reasonable for the class to be told that the number of files has changed, it is reasonable to have a NumFiles( int ) method. The class has the capability to accept a notification that the number of files has changed. As a result of having this information this task can perform its functions correctly. None of this says anything about the implementation of those capabilities.

In practice, I am sure that if #2 or #3 turned out to be problems the guideline will be ignored and the same name for the accessor function would be used in both places. Even if the guideline were changed to preserve encapsulation, #4 will probably still be ignored. Most development shops I have worked in do not think about class design in terms of capabilities. Google are no worse than everyone else, on the other hand, it would be nice if they were better.

For more on the subject of being evil, see the C++ FAQ.


Casting

Use C++ casts like static_cast<>(). Do not use other cast formats like int y = (int)x; or int y = int(x);

I am in complete agreement. I just want to add that in the “cons” section they state:

The syntax is nasty

Yes. The syntax is nasty. That was by design. To quote Bjarne Stroustrup from his FAQ:

An ugly operation should have an ugly syntactic form. That observation was part of the reason for choosing the syntax for the new-style casts. A further reason was for the new-style casts to match the template notation, so that programmers can write their own casts, especially run-time checked casts.

Maybe, because static_cast is so ugly and so relatively hard to type, you’re more likely to think twice before using one? That would be good, because casts really are mostly avoidable in modern C++.


Inline Functions

I am disappointed that Google don’t suggest separating inline declarations from inline definitions. C++ requires separate declarations because it was required to work with the C compilation model (C# for example does not require separate declarations). The C++ model involves repeating information – in general that’s a Bad Thing, however having declarations in a separate header file allows a class declaration to serve as effective documentation for the class and its capabilities (yes, I am ignoring the fact that the protected and private parts of the class are also visible in the declaration – it is far from perfect). Putting an inline (or template) function directly in the declaration clutters up that list of public member functions. You can place the definition later in the header file to get inlining (or template definitions) but still have an uncluttered declaration.


Operator Overloading

Do not overload operators except in rare, special circumstances. Do not create user-defined literals.

I don’t have enough experience yet with user defined literals to comment on whether they are a Good Thing or a Bad Thing, but I am prepared to put operator overloading firmly into the Good Thing category. Yes, operator overloading can be misused, but so can everything in C++. If we’re going to start banning things because of the potential for misuse we’re not going to end up with much of a language.

Overloaded operators are more playful names for functions that are less-colorfully named, such as Equals() or Add().

Oww. That’s the way to be condescending towards overloaded operators. For “playful” I would substitute “clear, obvious, concise and well defined”. When you overload an operator you cannot change the associativity, precedence or number of arguments that it takes. Who knows how many additional (but defaulted) parameters there are in that Equals function?

In particular, do not overload operator== or operator< just so that your class can be used as a key in an STL container; instead, you should create equality and comparison functor types when declaring the container.

As far as I am concerned, being used as a key or being searched for is a great reason to overload operator == and operator <. My usual practice is that operator == compares every independent value in the class and operator < gives an ordering that also depends on every independent value. If I need something different (for example an equality test that just compares the names of two objects) I’ll write a separate function.


Auto

Google likes auto. I still have significant reservations about it. At some point I will write up my concerns, but they can be summarized as “auto makes code easier to write but harder to read”.


Formatting

Coding style and formatting are pretty arbitrary

In many ways formatting is arbitrary, but I have always liked the formatting to be consistent. Even when the formatting seems arbitrary, there are still smart people who can find reasons for preferring one format to another. For example, in Object-Oriented Software Construction, Bertrand Meyer has the following to say about spaces:

The general rule, for simplicity and ease of remembering, is to follow as closely as possible the practice of standard written language. By default we will assume this language to be English, although it may be appropriate to adapt the conventions to the slightly different rules of other languages.

Here are some of the consequences. You will use a space:

  • Before an opening parenthesis, but not after:
    f (x) (not f(x), the C style, or f( x))

The argument isn’t a slam dunk “this is obviously the only sensible way to lay out parenthesis and if it isn’t followed the world will fall apart”, on the other hand it is an argument with an actual reason attached (as opposed to “I prefer it this way”).

Steve McConnell’s Code Complete 2nd edition has lots to say about formatting, backed up by actual reasons. For example, sections 31.3 and 31.4 come up with an interesting way of looking at block and brace placement and indentation. Unfortunately the brace placement that I favour for my personal projects fares badly under McConnell’s arguments.

Google’s preferred function declaration and definition syntax looks like this:

ReturnType ClassName::ReallyLongFunctionName(Type par_name1, Type par_name2,
                                             Type par_name3) {
  DoSomething();
  ...
}

In section 31.7 McConnell points out that this approach does not hold up well under modification – if the length of ReturnType, ClassName or ReallyLongFunctionName change then you have to re-indent the following lines. As McConnell says:

styles that are hard to maintain aren’t maintained

Ever since I started programming I have always heard of the idea of having a tool that automatically formats code to the house standard when it is checked in, and formats it to the individual developer’s standard when it is checked out. I think this is a fabulous idea but I have never come across any development shop that is actually using it – I would love somebody to point me to a working example. Of course the situation is exacerbated in C++ because formatting the language is extremely difficult. I tried a number of formatting tools a few years ago and they all had significant problems so I haven’t investigated them since. I hear good things about astyle – I will take a look at it some time.

I am jealous of Go. It has a standard format and a tool (gofmt) to lay code out in that format.


In conclusion, there are many parts of the guide I agree with (I haven’t mentioned most of them because that doesn’t fit my definition of “interesting”), but there are some major places where it looks like Google are hampering themselves. I said that just having a company-wide style guide was an achievement, but that style guide really needs to be working for them, not against them.

To finish on a positive note, I particularly like this statement at the end of the style guide:

The point of having style guidelines is to have a common vocabulary of coding so people can concentrate on what you are saying, rather than on how you are saying it.

Hear hear.