My current home project does a lot of processing and since I paid money for a 4 core processor I would like to do as much of that in parallel as possible. I have been using Intel’s Threading Building Blocks (TBB) for a while now and although some of what TBB does has been taken over by the thread library in C++11 there are still some useful features in TBB. In particular, TBB provides a set of concurrent containers:
- map
- set
- hashmap
- queue
- bounded queue
- priority queue
- vector
One of the interesting things about the concurrent containers is that they don’t just consist of a clever implementation behind the same API as our regular containers, the API itself needs to be different. std::vector has been around for 20 years or so. I have made plenty of use of it, and written some vector-like containers (usually because I want different speed / space tradeoffs) that have used the same API as std::vector so they can be drop in replacements. However, once we start trying to create a concurrent version of a vector we run into problems with the API. In this post I am going to compare a C++98 vector with a TBB 4.2 (update 2) concurrent_vector and see what changes Intel have made to the API.
(Disclaimer: I have no insider knowledge of the Intel’s design process for TBB, these are simply my own thoughts based on using TBB, reading Intel’s documentation and reading James Reinders’ book “Intel Threading Building Blocks”)
I highly recommend reading Eric Lippert’s post What is this thing you call “thread safe”?. Eric’s conclusion is that “thread safe” is a vague term and we need to get much more specific about the behaviours that we claim are “thread safe”. For this post I am going to define “thread safe” as meaning “has an API that allows us to do useful things on multiple threads simultaneously”.
What if we do the simplest and most naïve thing possible? Let’s just take our implementation of std::vector, add a mutex and put a lock around the implementation of every method. We might argue that the vector is now “thread safe”. Depending on our definition of “thread safe”, we might even be correct. What we have not done is create something that’s useful. Take this simple piece of code:
void fn( std::vector< int >& v_i )
{
v_i.push_back( 7 );
}
Assume that v_i is shared across threads. Once we have called push_back( 7 )
we know nothing. We don’t know what the index of the element “7” is – we can’t use the size of the vector before or after the push_back
– another thread might have changed the size of the vector, or deleted the element that we just pushed. We don’t even know whether our “7” is still in the vector (yes, we can take steps to write our multi threaded code to ensure this doesn’t happen, but the whole point of trying to create a thread safe vector is to make it easier for us to access it from multiple threads).
This leads to a principle of concurrent APIs – if you have data and/or operations that are connected you usually cannot access them separately. A single threaded vector’s push_back
method does not return a value – if we want to know where the element is in the vector we can just use size() – 1
. A concurrent vector’s push_back
must return the position of the newly appended element.
Even if we add a return value to push_back
we haven’t solved the problem of another thread changing the size of the vector. We call push_back
, we know where the element we just pushed ended up, but we do not know if it is still there when we come to use it. Erasing elements and inserting elements in the middle of the vector are problematic.
We can’t even access anything in the vector safely – we can check that our index is less than the size of the vector before we call operator []
, but the size of the vector might change between our call to size and our call to operator []
. We have to add range checking to operator []
and throw an exception if the index we pass in is out of range. This gives us another principle of concurrent APIs – many more methods have to be capable of returning errors – the concurrent nature of the API makes it difficult or impossible to guarantee that the values passed to a method will be valid when the method actually executes.
Accessing the vector via iterators is no help either, another thread can come along, push_back
a few elements and cause the vector to resize and reallocate, invalidating all iterators. The thread that is using the iterators has no way to guard against this.
On top of all of this, just putting a lock around every function is unlikely to give us good performance. Using the vector now becomes a bottleneck even if we are just reading the vector – remember that we are doing this for performance reasons.
Let’s see what Intel did to avoid of these problems.
Access
reference operator[]( size_type index ) const_reference operator[]( size_type index ) const reference at( size_type index ) const_reference at( size_type index ) const reference front() const_reference front() const reference back() const_reference back() const
I think it’s safe to say that the bare minimum we can expect from a concurrent vector is that its elements can be safely accessed from multiple threads. We have a set of access methods that are very similar to those in std::vector – operator []
does not throw an exception, at
might throw an exception. The access methods can be used concurrently.
Concurrent growth
iterator push_back(const_reference item ) iterator grow_by( size_type delta ); iterator grow_by( size_type delta, const T& t ); template<typename ForwardIterator> iterator grow_by( ForwardIterator first, ForwardIterator last ); iterator grow_to_at_least( size_type n ) iterator grow_to_at_least( size_type n, const T& t )
There are three methods (plus a couple of variations) for appending new elements to the vector.
push_back
is the function we know and love from std::vector, however it now returns an iterator to the newly appended element – when we push_back
an element we will know where it has been placed.
grow_by
appends a certain number of elements to the end of the vector – depending on which overloaded function we use we get default constructed elements, copies of a single element or a new set of elements from a given source sequence. It returns an iterator pointing to the start of the appended sequence.
grow_to_at_least
makes sure we have at least n elements in the vector, and if necessary appends default constructed elements or copies of a given element. Again, it returns an iterator pointing to the start of the appended sequence.
As with the access methods, the growth methods can be used concurrently.
It looks like we have covered the basic properties of a vector however, even if we just use the accessors and the growth methods, there is the potential for problems. The Intel documentation states:
The methods described in this section [the access methods] may be concurrently invoked on the same vector as methods for concurrent growth. However, the returned reference may be to an element that is being concurrently constructed.
I.e. if you’re appending a sequence of elements on one thread and trying to access that sequence from another thread you might get back a reference to a partially constructed element. That seems strange. Let’s work out what is going on here by taking a look at the size
method.
Size
size
seems relatively benign, but the documentation tells us that size
might include elements that have had their memory allocated but are still under construction by another thread. This is the same issue we saw earlier with the access methods – you can have a reference to a not-yet-constructed object. What’s going on?
I don’t know for sure why Intel made size
work this way, but I can speculate. I’ll use a common technique – I’ll assume the opposite and see what happens. Let’s say that size
must reflect the number of fully constructed objects – it will not include partially constructed objects.
Imagine that thread A uses grow_by
to append 50 objects to the thread, while thread B appends 1000 objects. Assume that thread A “wins” – it gets access to the memory first, allocates space for 50 objects and starts constructing them. While thread A is constructing its objects, thread B can allocate space for its 1000 objects (because the whole point is to be able to do this concurrently) and thread B starts constructing its objects. Thread A finishes first (it only has 50 elements to construct), it updates size
to reflect the additional 50 objects. Thread B finishes and it then updates size
to reflect its 1000 objects. Everyone is happy.
What if things go differently? Let’s say that thread B “wins”, allocates space for its 1000 objects and starts constructing them. Thread A then allocates space for its 50 objects and starts constructing them. Thread A finishes first (only 50 objects). What should it update size
to? Remember that in this hypothetical situation we want size
to reflect the number of fully constructed objects. Unless thread A is prepared to wait until thread B has finished constructing its objects (which would kill our concurrent performance) it cannot change size
because that would also include the partially constructed objects that thread B is working on. Since thread A cannot change size
, thread B must change it when thread B has finished constructing its objects. Thread B needs to add on the number of elements it just appended and the number of elements that thread A appended. That’s going to add a bunch of complication internally to make sure that everything gets updated correctly, and I haven’t even started on threads C through Z which might also be growing and accessing the vector.
What isn’t there?
Notice what we haven’t seen. There are no erase methods and no insert methods. You cannot remove any elements from the vector (with the exception of clear
which cannot be used concurrently) and you cannot insert elements in the middle – that means that you can’t change the position of any element once it is in the vector. There is only one place you can add elements to a concurrent vector – at the end.
Concurrent vector has the usual begin
/ end
/ rbegin
/ rend
methods for getting iterators. It also guarantees that elements will never be moved – concurrent vector does not do the std::vector trick of allocating new memory and copying everything over (and invalidating iterators). Couple that with the lack of erase methods and you get the property that once you have a valid index, reference or iterator to an element, that index, reference or iterator will stay valid until the concurrent vector is cleared or destroyed. This means that one thread cannot invalidate the references on another thread – the API is giving us some safety that we didn’t have before.
Capacity
tbb::concurrent_vector has the normal operations around size and capacity:
size_type size() const bool empty() const size_type capacity() const size_type max_size() const
To finish off the API there are the usual mixture of constructors and a destructor, none of which can be used concurrently (and it is possible to call constructors concurrently if you’re using placement new or have some half baked singleton implementation). There’s assignment (both operator =
and assign
methods), a specialization of swap
, and a shrink_to_fit
method (to match C++11’s shrink_to_fit
method). There are also some parallel iteration methods (range
) which make the concurrent vector easily usable in Intel’s concurrent algorithms.
So what’s the conclusion? Intel have stripped a lot of the functionality from std::vector but left in the core operations – growing the vector by appending elements, and accessing them. Even with that limited functionality it is still possible to do things incorrectly, however a little discipline around setup and tear down helps mitigate this. My code has an initial sequence that is single threaded, then it kicks off several threads, and on shutdown it ends up back on a single thread. I make sure that the non-concurrent methods are only called from a single thread.
Concurrent programming is hard to begin with, the fact that we need to rethink our APIs just makes it harder.
Another enlightening post, but I found it confusing for a while. If items in a tbb::concurrent_vector are never moved, then either the concurrent_vector can grow only to some pre-allocated capacity limit or concurrent_vector doesn’t necessarily store elements contiguously. I suspected it was the latter, and a trip to the tbb::concurrent_vector reference page confirmed that: a concurrent_vector is actually a collection of arrays.
For many, especially those familiar with STL, the term “vector” strongly implies contiguous storage, direct indexing, virtually unbounded growth, and certain performance characteristics. Given that, it seems odd that the tbb::concurrent_vector has “vector” right in the name. I have written similar containers (that never move an element but have less per-element overhead than a linked list). It never would have occurred to me to call such a thing a “vector.” Perhaps concurrent_piece_table would have been a better name.
Adrian, this is a great point and one that I hadn’t considered. I think I had been focused on the O(1) lookup, however there is precedent for O(1) lookup with non-contiguous storage – std::deque. I am not going to hold my breath waiting for Intel to change the name though.
A point of historical interest – the requirement that std::vector use contiguous storage was missed out of the C++98 standard, but added in C++03. More information here:
http://cplusplus.github.io/LWG/lwg-defects.html#69
http://herbsutter.com/2008/04/07/cringe-not-vectors-are-guaranteed-to-be-contiguous/
I’m not familiar with the implementation of TBB vectors, but what comes to mind when I think about random-access with a multiple-array vector is if the growth is clearly thought out then it’s probably possible to calculate exactly which array an element is in just from its index. For example, if you always grow by allocating a new array which doubles the capacity of the vector (ie the new array is the size of all the existing arrays combined), then the index of the array is simply log-base-2 of the index in the vector, and the index in the array is the index in the vector minus 2-to-the-power-of-the-index-of-the-array (I think). I’m not saying it works like this, I’m just giving an example algorithm to illustrate that it may do so. So it can have random access like a vector, and have almost identical amortized space overhead. In my mind if it quacks like a vector then it’s a vector. The unfortunate fact that the type system still allows you to accidentally discover the non-contiguity by using a pointer instead of their carefully crafted seemingly-contiguous iterators is simply a leak in the abstraction. (Bear in mind the contiguity of memory is already an abstraction on the potentially non-contiguous virtual memory pages, but thankfully that abstraction isn’t quite so leaky).
On the topic of API design, Michael Hunter (aka “Coder Mike”, aka “Mike” who comments on this blog from time to time) has an interesting post up on his blog talking about the problems of producing an API that handles sequences well. Take a look here. Right now it’s part 1 only, I’m looking forward to seeing future installments.