My current home project does a lot of processing and since I paid money for a 4 core processor I would like to do as much of that in parallel as possible. I have been using Intel’s Threading Building Blocks (TBB) for a while now and although some of what TBB does has been taken over by the thread library in C++11 there are still some useful features in TBB. In particular, TBB provides a set of concurrent containers:
- bounded queue
- priority queue
One of the interesting things about the concurrent containers is that they don’t just consist of a clever implementation behind the same API as our regular containers, the API itself needs to be different. std::vector has been around for 20 years or so. I have made plenty of use of it, and written some vector-like containers (usually because I want different speed / space tradeoffs) that have used the same API as std::vector so they can be drop in replacements. However, once we start trying to create a concurrent version of a vector we run into problems with the API. In this post I am going to compare a C++98 vector with a TBB 4.2 (update 2) concurrent_vector and see what changes Intel have made to the API.
(Disclaimer: I have no insider knowledge of the Intel’s design process for TBB, these are simply my own thoughts based on using TBB, reading Intel’s documentation and reading James Reinders’ book “Intel Threading Building Blocks”)
I highly recommend reading Eric Lippert’s post What is this thing you call “thread safe”?. Eric’s conclusion is that “thread safe” is a vague term and we need to get much more specific about the behaviours that we claim are “thread safe”. For this post I am going to define “thread safe” as meaning “has an API that allows us to do useful things on multiple threads simultaneously”.
What if we do the simplest and most naïve thing possible? Let’s just take our implementation of std::vector, add a mutex and put a lock around the implementation of every method. We might argue that the vector is now “thread safe”. Depending on our definition of “thread safe”, we might even be correct. What we have not done is create something that’s useful. Take this simple piece of code:
void fn( std::vector< int >& v_i )
v_i.push_back( 7 );
Assume that v_i is shared across threads. Once we have called
push_back( 7 ) we know nothing. We don’t know what the index of the element “7” is – we can’t use the size of the vector before or after the
push_back – another thread might have changed the size of the vector, or deleted the element that we just pushed. We don’t even know whether our “7” is still in the vector (yes, we can take steps to write our multi threaded code to ensure this doesn’t happen, but the whole point of trying to create a thread safe vector is to make it easier for us to access it from multiple threads).
This leads to a principle of concurrent APIs – if you have data and/or operations that are connected you usually cannot access them separately. A single threaded vector’s
push_back method does not return a value – if we want to know where the element is in the vector we can just use
size() – 1. A concurrent vector’s
push_back must return the position of the newly appended element.
Even if we add a return value to
push_back we haven’t solved the problem of another thread changing the size of the vector. We call
push_back, we know where the element we just pushed ended up, but we do not know if it is still there when we come to use it. Erasing elements and inserting elements in the middle of the vector are problematic.
We can’t even access anything in the vector safely – we can check that our index is less than the size of the vector before we call
operator , but the size of the vector might change between our call to size and our call to
operator . We have to add range checking to
operator  and throw an exception if the index we pass in is out of range. This gives us another principle of concurrent APIs – many more methods have to be capable of returning errors – the concurrent nature of the API makes it difficult or impossible to guarantee that the values passed to a method will be valid when the method actually executes.
Accessing the vector via iterators is no help either, another thread can come along,
push_back a few elements and cause the vector to resize and reallocate, invalidating all iterators. The thread that is using the iterators has no way to guard against this.
On top of all of this, just putting a lock around every function is unlikely to give us good performance. Using the vector now becomes a bottleneck even if we are just reading the vector – remember that we are doing this for performance reasons.
Let’s see what Intel did to avoid of these problems.
reference operator( size_type index ) const_reference operator( size_type index ) const reference at( size_type index ) const_reference at( size_type index ) const reference front() const_reference front() const reference back() const_reference back() const
I think it’s safe to say that the bare minimum we can expect from a concurrent vector is that its elements can be safely accessed from multiple threads. We have a set of access methods that are very similar to those in std::vector –
operator  does not throw an exception,
at might throw an exception. The access methods can be used concurrently.
iterator push_back(const_reference item ) iterator grow_by( size_type delta ); iterator grow_by( size_type delta, const T& t ); template<typename ForwardIterator> iterator grow_by( ForwardIterator first, ForwardIterator last ); iterator grow_to_at_least( size_type n ) iterator grow_to_at_least( size_type n, const T& t )
There are three methods (plus a couple of variations) for appending new elements to the vector.
push_back is the function we know and love from std::vector, however it now returns an iterator to the newly appended element – when we
push_back an element we will know where it has been placed.
grow_by appends a certain number of elements to the end of the vector – depending on which overloaded function we use we get default constructed elements, copies of a single element or a new set of elements from a given source sequence. It returns an iterator pointing to the start of the appended sequence.
grow_to_at_least makes sure we have at least n elements in the vector, and if necessary appends default constructed elements or copies of a given element. Again, it returns an iterator pointing to the start of the appended sequence.
As with the access methods, the growth methods can be used concurrently.
It looks like we have covered the basic properties of a vector however, even if we just use the accessors and the growth methods, there is the potential for problems. The Intel documentation states:
The methods described in this section [the access methods] may be concurrently invoked on the same vector as methods for concurrent growth. However, the returned reference may be to an element that is being concurrently constructed.
I.e. if you’re appending a sequence of elements on one thread and trying to access that sequence from another thread you might get back a reference to a partially constructed element. That seems strange. Let’s work out what is going on here by taking a look at the
size seems relatively benign, but the documentation tells us that
size might include elements that have had their memory allocated but are still under construction by another thread. This is the same issue we saw earlier with the access methods – you can have a reference to a not-yet-constructed object. What’s going on?
I don’t know for sure why Intel made
size work this way, but I can speculate. I’ll use a common technique – I’ll assume the opposite and see what happens. Let’s say that
size must reflect the number of fully constructed objects – it will not include partially constructed objects.
Imagine that thread A uses
grow_by to append 50 objects to the thread, while thread B appends 1000 objects. Assume that thread A “wins” – it gets access to the memory first, allocates space for 50 objects and starts constructing them. While thread A is constructing its objects, thread B can allocate space for its 1000 objects (because the whole point is to be able to do this concurrently) and thread B starts constructing its objects. Thread A finishes first (it only has 50 elements to construct), it updates
size to reflect the additional 50 objects. Thread B finishes and it then updates
size to reflect its 1000 objects. Everyone is happy.
What if things go differently? Let’s say that thread B “wins”, allocates space for its 1000 objects and starts constructing them. Thread A then allocates space for its 50 objects and starts constructing them. Thread A finishes first (only 50 objects). What should it update
size to? Remember that in this hypothetical situation we want
size to reflect the number of fully constructed objects. Unless thread A is prepared to wait until thread B has finished constructing its objects (which would kill our concurrent performance) it cannot change
size because that would also include the partially constructed objects that thread B is working on. Since thread A cannot change
size, thread B must change it when thread B has finished constructing its objects. Thread B needs to add on the number of elements it just appended and the number of elements that thread A appended. That’s going to add a bunch of complication internally to make sure that everything gets updated correctly, and I haven’t even started on threads C through Z which might also be growing and accessing the vector.
What isn’t there?
Notice what we haven’t seen. There are no erase methods and no insert methods. You cannot remove any elements from the vector (with the exception of
clear which cannot be used concurrently) and you cannot insert elements in the middle – that means that you can’t change the position of any element once it is in the vector. There is only one place you can add elements to a concurrent vector – at the end.
Concurrent vector has the usual
rend methods for getting iterators. It also guarantees that elements will never be moved – concurrent vector does not do the std::vector trick of allocating new memory and copying everything over (and invalidating iterators). Couple that with the lack of erase methods and you get the property that once you have a valid index, reference or iterator to an element, that index, reference or iterator will stay valid until the concurrent vector is cleared or destroyed. This means that one thread cannot invalidate the references on another thread – the API is giving us some safety that we didn’t have before.
tbb::concurrent_vector has the normal operations around size and capacity:
size_type size() const bool empty() const size_type capacity() const size_type max_size() const
To finish off the API there are the usual mixture of constructors and a destructor, none of which can be used concurrently (and it is possible to call constructors concurrently if you’re using placement new or have some half baked singleton implementation). There’s assignment (both
operator = and
assign methods), a specialization of
swap, and a
shrink_to_fit method (to match C++11’s
shrink_to_fit method). There are also some parallel iteration methods (
range) which make the concurrent vector easily usable in Intel’s concurrent algorithms.
So what’s the conclusion? Intel have stripped a lot of the functionality from std::vector but left in the core operations – growing the vector by appending elements, and accessing them. Even with that limited functionality it is still possible to do things incorrectly, however a little discipline around setup and tear down helps mitigate this. My code has an initial sequence that is single threaded, then it kicks off several threads, and on shutdown it ends up back on a single thread. I make sure that the non-concurrent methods are only called from a single thread.
Concurrent programming is hard to begin with, the fact that we need to rethink our APIs just makes it harder.