Quotes of the week – hardware vs. software

… it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!

Donald Knuth (Quoted here)

Computer science is no more about computers than astronomy is about telescopes.

Edsger Dijkstra (although disputed)

Real computer scientists despise the idea of actual hardware. Hardware has limitations, software doesn’t. It’s a real shame that Turing machines are so poor at I/O.

Anon

Much as I hate to disagree with Donald Knuth, I think he is being overly harsh to the hardware manufacturers (at least on this particular topic). Chips are now being produced with 14 nm features. Chip designers are faced with some fundamental limits of the universe: quantum effects and the speed of light matter at this scale. They are also having to deal with billions of transistors on a chip. I am always amazed that anything computer based works at all.

I used Donald Knuth’s quote at a talk Chuck Rose and I gave at a domain specific languages conference a few years ago. The talk can be found here. It goes into detail about Pixel Bender, a high performance language for image processing (i.e. very fast, very cool special effects on images), that can use as much parallelism as is available.

As usual, Dijkstra makes the rest of us look like we are banging our heads against the keyboard and accepting whatever results as a valid program. The University of Texas maintains an archive of Dijkstra’s manuscripts. (See also a previous quote of the week).

And finally, while “real computer scientists” might “despise the idea of actual hardware”, some of us have to deal with actual hardware to make a living. I am immensely grateful for the fact that we have top notch thinkers like Knuth and Dijkstra, and I am also grateful that we have engineers at the chip companies doing the hard work necessary to give us processors that work.

Inconsistent sorting

A few years ago I was working on a prototype for some 3D drawing code. I was using the painters algorithm – draw the objects from far to near so that the close objects will occlude the distant objects. I had my objects in a vector, I sorted them according to their distance from the camera, then drew the objects. Then I noticed something odd. When I first started the program I had two objects that would flicker. They would swap which one was in front – sometimes object A would be drawn in front of object B, sometimes B was drawn in front of A.

My start up coordinates for the camera and the objects meant that A and B were exactly the same distance from the camera, so when I did the sort, A and B would compare equal. Since I wasn’t using std::stable_sort I wasn’t expecting any particular ordering for A and B, but I was expecting the ordering to be consistent between calls to std::sort. After some debugging I discovered that giving the exact same input to std::sort would not always give the same output – objects with identical distances might be sorted differently on each call.

I ended up digging into the standard library implementation. I think I was using Metrowerks C++, but I wouldn’t bet on it. The implementation of std::sort used Quicksort – nothing unusual there. One of the problems with Quicksort though is that you have to make good choices of pivot element – the element that you use to to split the input into two partitions. For example, if you use the first element as the pivot and hand Quicksort an already sorted sequence the runtime ends up being O(n^2). Quicksort has great average case behaviour but bad worst case behaviour. There are various techniques for picking the pivot element including:

first element
last element
middle element
median of the first, last and middle elements

This particular implementation had found a different method of picking the pivot – it used rand. Effective since the chances of providing the sort with a worst case sequence were low, but it led to this odd behaviour – the same input to std::sort would not necessarily lead to the same output. I switched to std::stable_sort and all my problems went away.

As far as I can tell from the standard (section 25.4.1.1), this behaviour for std::sort is entirely legal. std::sort does not guarantee the order in which equivalent elements are placed, and I don’t see anything that forces std::sort to be consistent. I have never found this implementation in any other standard library, but I have heard of one person who was caught out like I was.

Quote of the week – what’s important?

How the teacher reacts when something goes wrong tells the class what’s important.

Bruce Hamilton

Bruce is a dance teacher par excellence and originally said this in the context of teaching Scottish and English folk dancing. It also applies to any sort of teacher / student interaction, and any interaction where someone has power or influence over someone else. Let’s paraphrase the quote:

How the manager reacts when something goes wrong tells the team what’s important.

or perhaps:

How the senior developer reacts when something goes wrong tells the junior developers what’s important.

I have had the misfortune to work on several teams where one person (usually, but not always, the manager) had to have their ego maintained at all costs. If anything went wrong it was not their fault, and any attempt to get them to take responsibility was doomed to (very unpleasant) failure. It was very clear what was most important – their ego. Because of that, entry #2 on the importance scale was for the rest of the team to protect themselves from the egoist. If we were lucky, entry #3 might involve producing a quality product.

std::vector vs. tbb::concurrent_vector – API design

My current home project does a lot of processing and since I paid money for a 4 core processor I would like to do as much of that in parallel as possible. I have been using Intel’s Threading Building Blocks (TBB) for a while now and although some of what TBB does has been taken over by the thread library in C++11 there are still some useful features in TBB. In particular, TBB provides a set of concurrent containers:

map
set
hashmap
queue
bounded queue
priority queue
vector

One of the interesting things about the concurrent containers is that they don’t just consist of a clever implementation behind the same API as our regular containers, the API itself needs to be different. std::vector has been around for 20 years or so. I have made plenty of use of it, and written some vector-like containers (usually because I want different speed / space tradeoffs) that have used the same API as std::vector so they can be drop in replacements. However, once we start trying to create a concurrent version of a vector we run into problems with the API. In this post I am going to compare a C++98 vector with a TBB 4.2 (update 2) concurrent_vector and see what changes Intel have made to the API.

(Disclaimer: I have no insider knowledge of the Intel’s design process for TBB, these are simply my own thoughts based on using TBB, reading Intel’s documentation and reading James Reinders’ book “Intel Threading Building Blocks”)

I highly recommend reading Eric Lippert’s post What is this thing you call “thread safe”?. Eric’s conclusion is that “thread safe” is a vague term and we need to get much more specific about the behaviours that we claim are “thread safe”. For this post I am going to define “thread safe” as meaning “has an API that allows us to do useful things on multiple threads simultaneously”.

What if we do the simplest and most naïve thing possible? Let’s just take our implementation of std::vector, add a mutex and put a lock around the implementation of every method. We might argue that the vector is now “thread safe”. Depending on our definition of “thread safe”, we might even be correct. What we have not done is create something that’s useful. Take this simple piece of code:

void fn( std::vector< int >& v_i )
{
    v_i.push_back( 7 );
}

Assume that v_i is shared across threads. Once we have called push_back( 7 ) we know nothing. We don’t know what the index of the element “7” is – we can’t use the size of the vector before or after the push_back – another thread might have changed the size of the vector, or deleted the element that we just pushed. We don’t even know whether our “7” is still in the vector (yes, we can take steps to write our multi threaded code to ensure this doesn’t happen, but the whole point of trying to create a thread safe vector is to make it easier for us to access it from multiple threads).

This leads to a principle of concurrent APIs – if you have data and/or operations that are connected you usually cannot access them separately. A single threaded vector’s push_back method does not return a value – if we want to know where the element is in the vector we can just use size() – 1. A concurrent vector’s push_back must return the position of the newly appended element.

Even if we add a return value to push_back we haven’t solved the problem of another thread changing the size of the vector. We call push_back, we know where the element we just pushed ended up, but we do not know if it is still there when we come to use it. Erasing elements and inserting elements in the middle of the vector are problematic.

We can’t even access anything in the vector safely – we can check that our index is less than the size of the vector before we call operator [], but the size of the vector might change between our call to size and our call to operator []. We have to add range checking to operator [] and throw an exception if the index we pass in is out of range. This gives us another principle of concurrent APIs – many more methods have to be capable of returning errors – the concurrent nature of the API makes it difficult or impossible to guarantee that the values passed to a method will be valid when the method actually executes.

Accessing the vector via iterators is no help either, another thread can come along, push_back a few elements and cause the vector to resize and reallocate, invalidating all iterators. The thread that is using the iterators has no way to guard against this.

On top of all of this, just putting a lock around every function is unlikely to give us good performance. Using the vector now becomes a bottleneck even if we are just reading the vector – remember that we are doing this for performance reasons.

Let’s see what Intel did to avoid of these problems.

Access

reference operator[]( size_type index )
const_reference operator[]( size_type index ) const

reference at( size_type index )
const_reference at( size_type index ) const

reference front()
const_reference front() const

reference back()
const_reference back() const

I think it’s safe to say that the bare minimum we can expect from a concurrent vector is that its elements can be safely accessed from multiple threads. We have a set of access methods that are very similar to those in std::vector – operator [] does not throw an exception, at might throw an exception. The access methods can be used concurrently.

Concurrent growth

iterator push_back(const_reference item )

iterator grow_by( size_type delta );
iterator grow_by( size_type delta, const T& t );
template<typename ForwardIterator> 
iterator grow_by( ForwardIterator first, ForwardIterator last );

iterator grow_to_at_least( size_type n )
iterator grow_to_at_least( size_type n, const T& t )

There are three methods (plus a couple of variations) for appending new elements to the vector.

push_back is the function we know and love from std::vector, however it now returns an iterator to the newly appended element – when we push_back an element we will know where it has been placed.

grow_by appends a certain number of elements to the end of the vector – depending on which overloaded function we use we get default constructed elements, copies of a single element or a new set of elements from a given source sequence. It returns an iterator pointing to the start of the appended sequence.

grow_to_at_least makes sure we have at least n elements in the vector, and if necessary appends default constructed elements or copies of a given element. Again, it returns an iterator pointing to the start of the appended sequence.

As with the access methods, the growth methods can be used concurrently.

It looks like we have covered the basic properties of a vector however, even if we just use the accessors and the growth methods, there is the potential for problems. The Intel documentation states:

The methods described in this section [the access methods] may be concurrently invoked on the same vector as methods for concurrent growth. However, the returned reference may be to an element that is being concurrently constructed.

I.e. if you’re appending a sequence of elements on one thread and trying to access that sequence from another thread you might get back a reference to a partially constructed element. That seems strange. Let’s work out what is going on here by taking a look at the size method.

Size

size seems relatively benign, but the documentation tells us that size might include elements that have had their memory allocated but are still under construction by another thread. This is the same issue we saw earlier with the access methods – you can have a reference to a not-yet-constructed object. What’s going on?

I don’t know for sure why Intel made size work this way, but I can speculate. I’ll use a common technique – I’ll assume the opposite and see what happens. Let’s say that size must reflect the number of fully constructed objects – it will not include partially constructed objects.

Imagine that thread A uses grow_by to append 50 objects to the thread, while thread B appends 1000 objects. Assume that thread A “wins” – it gets access to the memory first, allocates space for 50 objects and starts constructing them. While thread A is constructing its objects, thread B can allocate space for its 1000 objects (because the whole point is to be able to do this concurrently) and thread B starts constructing its objects. Thread A finishes first (it only has 50 elements to construct), it updates size to reflect the additional 50 objects. Thread B finishes and it then updates size to reflect its 1000 objects. Everyone is happy.

What if things go differently? Let’s say that thread B “wins”, allocates space for its 1000 objects and starts constructing them. Thread A then allocates space for its 50 objects and starts constructing them. Thread A finishes first (only 50 objects). What should it update size to? Remember that in this hypothetical situation we want size to reflect the number of fully constructed objects. Unless thread A is prepared to wait until thread B has finished constructing its objects (which would kill our concurrent performance) it cannot change size because that would also include the partially constructed objects that thread B is working on. Since thread A cannot change size, thread B must change it when thread B has finished constructing its objects. Thread B needs to add on the number of elements it just appended and the number of elements that thread A appended. That’s going to add a bunch of complication internally to make sure that everything gets updated correctly, and I haven’t even started on threads C through Z which might also be growing and accessing the vector.

What isn’t there?

Notice what we haven’t seen. There are no erase methods and no insert methods. You cannot remove any elements from the vector (with the exception of clear which cannot be used concurrently) and you cannot insert elements in the middle – that means that you can’t change the position of any element once it is in the vector. There is only one place you can add elements to a concurrent vector – at the end.

Concurrent vector has the usual begin / end / rbegin / rend methods for getting iterators. It also guarantees that elements will never be moved – concurrent vector does not do the std::vector trick of allocating new memory and copying everything over (and invalidating iterators). Couple that with the lack of erase methods and you get the property that once you have a valid index, reference or iterator to an element, that index, reference or iterator will stay valid until the concurrent vector is cleared or destroyed. This means that one thread cannot invalidate the references on another thread – the API is giving us some safety that we didn’t have before.

Capacity

tbb::concurrent_vector has the normal operations around size and capacity:

size_type size() const
bool empty() const
size_type capacity() const
size_type max_size() const

To finish off the API there are the usual mixture of constructors and a destructor, none of which can be used concurrently (and it is possible to call constructors concurrently if you’re using placement new or have some half baked singleton implementation). There’s assignment (both operator = and assign methods), a specialization of swap, and a shrink_to_fit method (to match C++11’s shrink_to_fit method). There are also some parallel iteration methods (range) which make the concurrent vector easily usable in Intel’s concurrent algorithms.

So what’s the conclusion? Intel have stripped a lot of the functionality from std::vector but left in the core operations – growing the vector by appending elements, and accessing them. Even with that limited functionality it is still possible to do things incorrectly, however a little discipline around setup and tear down helps mitigate this. My code has an initial sequence that is single threaded, then it kicks off several threads, and on shutdown it ends up back on a single thread. I make sure that the non-concurrent methods are only called from a single thread.

Concurrent programming is hard to begin with, the fact that we need to rethink our APIs just makes it harder.

Quote of the week – (not) working

If there’s one thing worse than a program that doesn’t work when it should, it’s a program that does work when it shouldn’t.

Bob Archer

I am sure that someone said this long before I did, but I can’t find a quote expressing quite the same sentiments.

At least with a program that doesn’t work you know it doesn’t work. You might not know why it doesn’t work, or how to fix it but you still have the knowledge that it doesn’t work. A program that does work when it shouldn’t will, of course, work right up until that big demo to the CEO or a major customer. Or, it will start exhibiting a problem right before code freeze and, upon tracking down the source of the problem, the poor engineer will discover code that has never worked and needs to be completely rethought.

The interesting thing is why the code appears to work. Sometimes two bugs cancel each other out (mostly), sometimes (as in my most recent example) it’s accessing memory after it was freed (very shortly after it was freed) – that worked on 7 out of 8 platforms. Sometimes the code isn’t being called at all – I have often been surprised at a break point that wasn’t hit, or looked at the profiler results to discover that a function that should be called thousands of times hasn’t been called at all.

No raw loops – output

Let’s get back to raw loops, or rather the absence of raw loops. How do we write out the contents of a container so that each element appears on its own line? Assume we have some type T and a vector of T:

class T
{
   ...
};

std::vector< T > v_T;

This is the old style way to write the contents:

std::ostringstream ostr;
for( std::vector< T >::const_iterator i( std::begin( v_T ) ); 
    i != std::end( v_T ); ++i )
{
    ostr << *i << "\n";
}

The range-based for solution is straightforward:

for( T const& t : v_T )
{
    ostr << t << "\n";
}

Lambda functions also provide a solution:

std::for_each( 
    std::begin( v_T ), 
    std::end( v_T ), 
    [ &ostr ]( T const& t ){
        ostr << t << "\n";
    } );

adobe::for_each( 
    v_T, 
    [ &ostr ]( T const& t ){
        ostr << t << "\n";
    } );

Conveniently, output streams have iterators. We can use std::ostream_iterator as the destination for std::copy or adobe::copy :

std::copy( 
    std::begin( v_T ), 
    std::end( v_T ), 
    std::ostream_iterator< T >( 
        ostr, 
        "\n" ) );

adobe::copy( 
    v_T, 
    std::ostream_iterator< T >( 
        ostr, 
        "\n" ) );

Notice that we can pass a string (called the delimiter string in the standard) to the std::ostream_iterator constructor. The delimiter string is written out after every element.

Since std::ostream_iterator is an iterator with std::ostream as its underlying container, we might thank that we would be able to do this:

std::copy( 
    begin( v_T ), 
    end( v_T ), 
    std::begin( ostr ) );

We can't. The standard does not include a specialization of std::begin for std::ostream, and we are not allowed to add one ourselves because of Section 17.6.4.2.1 of the C++11 standard which states:

A program may add a template specialization for any standard library template to namespace std only if the declaration depends on a user-defined type and the specialization meets the standard library requirements for the original template and is not explicitly prohibited

In addition to that, std::begin does not allow for the additional delimiter string argument. Also, std::ostream_iterator is instantiated for the type of the source value which is not present in the single argument we pass to std::begin so cannot be deduced.

There is nothing stopping us writing our own function to construct a std::ostream_iterator from std::ostream. Since it's our function it can take an additional argument for the delimiter string, and so long as we don't put it into the standard namespace we can even call it begin. We do have to specify the type of our output when we call the function (the function cannot deduce it from the arguments because that type never appears in the arguments):

template< typename U >
std::ostream_iterator< U >
begin( std::ostream& ostr, const char* delimiterString )
{
    return std::ostream_iterator< U >( ostr, delimiterString );
}

std::copy( 
    begin( v_T ), 
    end( v_T ), 
    begin< T >( ostr, "\n" ) );

And an example that almost works. Since operator << is just a function, we can pass it to std::bind :

std::for_each(
    std::begin( v_T ), 
    std::end( v_T ),
    std::bind( operator <<, std::ref( ostr ), _1 ) );

This writes out the elements of the container, but it does not put each element on its own line - there is nowhere to plug the "\n" into. We can't achieve the each element appears on its own line requirement from our original problem statement.

We can write a function that takes an object of type T and a delimiter string and writes out both of them:

void
writePostText( std::ostream& o, T const& t, const char* text )
{
    o << t << text;
}

std::for_each(
    std::begin( v_T ), 
    std::end( v_T ),
    std::bind( writePostText, std::ref( ostr ), _1, "\n" ) );

Which of these solutions is the best? As usual there isn't a definitive answer to that question, but it looks like the range-based for solution looks has the lowest overhead (overhead being bits of boilerplate syntax that are not directly used to solve the problem), with adobe::for_each close behind.

There is one more thing to throw into the mix. The range-based for looks good, however it can only iterate over the entire container. Most of the time this isn't a problem, I estimate that 95% of my algorithm usage is over the entire container. In the case of writing things out though, there is a common use case where you want to iterate over a portion of the container. If you are writing out JSON, or some similar format where you need to put text between each pair of elements (i.e. the text cannot appear before the first element or after the last element), you have to be able to iterate over a sub range of the container.

A solution to the "separator text" problem might look like this:

void
writePreText( std::ostream& o, const char* text, T const& t )
{
    o << text << t;
}

if( !v_T.empty() )
{
    std::vector< T >::const_iterator i( std::begin( v_T ) ); 
    
    ostr << *i;

    ++i;
    std::for_each(
        i,
        std::cend( v_T ),
        std::bind( writePreText, std::ref( ostr ), ",\n", _1 ) );
}

We can't do this with range-based for. If we need to do both things within our source code - writing with text after each element and writing text between each element - do we want to do those two things differently - one of them using range-based for, one using a standard algorithm? There is a cost to being inconsistent, there is also a cost to being consistent but using a slightly larger solution for one of the cases.

I know that I am overanalyzing this. In practice I wrote a templated writeContainer function which lets you specify text that is written before the first element, text that is written after the last element and text that is written between elements. Problem solved and, with the exception of this blog post, I don't have to consider the subject any more.

Quote of the week – second best

… distortions in one market … can justify government intervention in other, related markets. Second-best arguments have a dubious reputation in economics, because the right policy is always to eliminate the primary distortion, if you can.

Paul Krugman, The big green test

Before jumping into my commentary, I want to emphasize that this is not intended to be a political “quote of the week”, and any comments that are political rather than technical will not be approved. Actually, since I haven’t put up the page with my moderation policy yet, I will point out that any comments that are religious rather than technical will also be consigned to /dev/null. Of course, if Pope Francis or President Obama ever venture an opinion on lambdas vs. std::bind feel free to comment on that subject.

I suggest reading the whole piece since I had to mangle the quote to get it into a reasonable form. The reason I am quoting Paul Krugman at all is that I think that the economic distortions he is talking about have an equivalent in code. Something big is wrong (call it X) and can’t be changed (or someone with some authority believes it can’t be changed) so the team spends the rest of the project working around it, usually with increasingly fragile hacks.

I believe that the cost of not changing X is routinely underestimated. The cost doesn’t come in one big lump, it is spread out over weeks and months of having to work around X – a colleague of mine referred to these problems as “sharp corners”. They don’t actually stop you from doing something but they make it unpleasant enough that you try to avoid the sharp corners. That leads to second best solutions which then lead to further sharp corners and more second best solutions. Changing X costs you in the short term, not changing it results in long term smaller, cumulative costs.

You need to know why you can’t change X. The reasons are important, the reasons might apply now, they might not apply in 6 months. Take the long view – do you still want to be dealing with X in 2 years (or 5 years or 10 years)? Even if you can’t change X right now, perhaps there are incremental changes you can make so that it will be possible to change X at a future date.

Think about what X would look like if you could change it. How close can you get to that? You might be able to wrap X, although wrapping comes with costs of its own. I have worked on systems that wrapped almost everything and they were a pain to deal with.

I have been on teams that fail to get to grips with their version of X, and I have been on teams where we did manage to handle X well. It is wonderful when you can take something that was causing constant pain and eliminate it – it’s like a toothache that you get treated.

Random programming