Quote of the week – outsmarting the compiler

Trying to outsmart a compiler defeats much of the purpose of using one.

Kernighan and Plauger – The Elements of Programming Style

I have seen many people try to outsmart compilers (sometimes that person was me) and, with rare exceptions, the compiler has won.

I have seen a programmer who had just read Mike Abrash’s “Zen of Assembly Language” code up their own memcpy routine using all the clever optimizations in the book, only to discover that the compiler writers know all of these optimizations too. I have seen people try to get clever with switch statements and implement them as jump tables. Steve Johnson’s portable C compiler optimized switch statements automatically in the 80s. I have seen a programmer who was an expert on the 68000 instruction set (and he genuinely was an expert) come unstuck because we were now compiling for the 68040 and the instruction timings had changed.

In the late 90s there was a discussion on one of the games programming newsgroups about using left shift instead of multiply. Rather than this:

j = i * 2;

you’d write:

j = i << 1;

and get a wonderful speed increase. I decided to run some tests on this. I coded up a little test program with two functions:

int multiplyByTwo( int i )
{
    return i * 2;
}
int shiftLeftByOne( int i )
{
    return i << 1;
}

I ran the program through Watcom with our normal release build settings, then looked at the assembler output. Watcom had replaced i * 2 with i << 1, realized that the two functions were identical and merged them, then inlined the merged function. Furthermore, since I was using constants to keep the program simple, it calculated the results at compile time. I think that the only thing left by the end was a call to printf with a couple of constants. Compilers are good at this stuff.

There's another problem – if we are going to try doing these replacements ourselves we have to make sure that we get them right. Replacing a multiply by two with a left shift is easy and safe to do; however, dividing by two is not necessarily the same as shifting right by one. For negative signed values, integer division rounds toward zero while a typical arithmetic shift right rounds toward negative infinity, so under some circumstances you can safely use the shift, but under others you can't. I have always found that a compiler is far more likely to get these things right than I am.
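Here's a minimal sketch of the difference (the function names are mine); on a typical implementation the two functions disagree for negative odd values:

int divideByTwo( int i )
{
    return i / 2;    // integer division rounds toward zero: -7 / 2 == -3
}
int shiftRightByOne( int i )
{
    // right-shifting a negative value is implementation-defined before C++20,
    // but is usually an arithmetic shift that rounds toward negative infinity:
    // -7 >> 1 == -4
    return i >> 1;
}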

The moral is for us to write code that expresses exactly what we want (if we want a multiplication then we should write a multiplication) and let the compiler do whatever it feels is appropriate. Compilers are really good at peephole optimizations. As programmers we are much better off using our time to replace that O(n^2) algorithm with an O(n log n) algorithm.

Include guards

In 1996 John Lakos published “Large-Scale C++ Software Design”, a book containing the lessons learned from a large scale C++ project developed at Mentor Graphics.

I want to make it clear that this post is not meant as a criticism of John Lakos or his book. First, I don’t have a copy of the book in front of me so I am working from memory – I might have things wrong. Second, John Lakos was describing his experience at a particular time and place. We are at a different time and place. Among other things, compilers, hard disks, processors, memory and networks have changed enormously over the last 20 years.

One of the lessons Lakos documents involves include guards. A common idiom is for the contents of every header file to be surrounded by include guards:

myheader.hpp

#ifndef MYHEADER_HPP
#define MYHEADER_HPP

// Contents of include file go here

#endif

Even if the file is #included multiple times it will only be processed once. Since the include guard is entirely internal to the header file this is known as an “internal include guard”. The problem is that a naive compiler will still have to open and read the file every time it is included in a translation unit. If opening a file is slow (perhaps because it is being opened over a network) there is a performance hit even though the contents of the file are not being processed. As a solution to this problem, Lakos suggested using external include guards as well as internal include guards:

myheader2.hpp

#ifndef MYHEADER_HPP
#include "myheader.hpp"
#endif

The external guards ensure that once myheader.hpp has been processed it doesn’t even have to be opened again. Keep the internal include guards to ensure accuracy but use the external include guards to improve compile performance.

It turns out that some compilers are way ahead of us – they implement the “include guard optimization”. When the compiler parses a header file it looks to see if the include guard idiom is in place. If it is, the compiler notes this fact and then does the equivalent of applying its own external include guards if the file is included a second time. In 1999 a discussion arose on comp.lang.c++.moderated about external include guards vs. internal include guards vs. the include guard optimization (I haven’t been able to track down the original discussion I’m afraid. Google’s Usenet search has me baffled). I decided to run some tests and put up a webpage with my results. Some other C++ programmers ran tests on their systems that allowed me to get results from compilers and systems that I don’t have access to.

The main page is here; if you want to skip straight to the results, go here.

There are a few patterns. GCC has had the optimization for a long time; VC++ has never had it (although the results for the latest VC++ compiler are strange – something odd is going on). Watcom had the optimization in the 1990s (incidentally, Watcom lives on as Open Watcom, although it looks like there hasn’t been a release for 4 years).

There are a couple of important points to note. All these tests are doing is testing the include times of a simple header – the header itself is very small so there is almost no time spent processing it. The tests are also being run with tens if not hundreds of thousands of includes of the same file – a situation I have never encountered in practice.

A few years ago I was on a team that was upset about its compilation times and wanted to look for speedups. Tests showed that the single greatest contributor to our build time was <windows.h>. The solution to that came in two parts – include <windows.h> in as few places as possible (and in particular try to keep it out of header files), and use precompiled headers. The include guards didn’t come close to being a problem.
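For what it's worth, the usual trick for keeping <windows.h> out of a header is to hide the Windows types behind a pointer to an implementation struct. A minimal sketch, with names invented purely for illustration:

// window_wrapper.hpp – no <windows.h> needed here
#ifndef WINDOW_WRAPPER_HPP
#define WINDOW_WRAPPER_HPP

#include <memory>

class WindowWrapper
{
public:
    WindowWrapper();
    ~WindowWrapper();
    void show();

private:
    struct Impl;                    // holds the HWND and friends
    std::unique_ptr< Impl > impl_;
};

#endif

// window_wrapper.cpp – the only file that pays for <windows.h>
#include "window_wrapper.hpp"
#include <windows.h>

struct WindowWrapper::Impl
{
    HWND hwnd = nullptr;
};

WindowWrapper::WindowWrapper() : impl_( new Impl ) {}
WindowWrapper::~WindowWrapper() = default;

void WindowWrapper::show()
{
    if( impl_->hwnd )
        ShowWindow( impl_->hwnd, SW_SHOW );
}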

As always, nothing takes the place of running tests on your own codebase, although I would start by measuring the processing time for several commonly used headers – that is far more likely to be the problem than include guards.

Unfortunately, making sure we include only the necessary headers is time consuming to do, and even more time consuming to maintain. I have never come up with the perfect solution to this problem – at times I have used imperfect solutions, but they have all been inconvenient and required serious amounts of run time.

Finally, John Lakos has a new book coming out in 2015 – Large-Scale C++ Volume I: Process and Architecture. It’ll be interesting to see what he finds to focus on in our current computing environment.

Quote of the week – hard things

There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.

Phil Karlton

I have seen a number of variations on this quote, and it looks as though the original quote only included two of the three hard things. Martin Fowler quotes Phil Karlton here and there is more discussion of the quote here.

I don’t know whether the original quote concerned hardware caching or software caching. Since I’m not a hardware engineer, I’ll talk about software caching. When implementing software caches I have found it useful to apply the Single Responsibility Principle and divide things up along these lines (there’s a small sketch after the list):

  • The thing that calculates (fetches, obtains, looks up etc.) the value we want.
  • The thing that actually stores the cached value.
  • The thing that decides whether the value should be cached or not.
  • The thing that decides whether the cache (or a given cache entry) should be invalidated.
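To make that separation concrete, here is a small sketch – all of the names are invented for illustration, and a real cache would also need locking and a proper eviction policy:

#include <functional>
#include <map>
#include <string>

// The thing that calculates (fetches, obtains, looks up...) the value we want.
typedef std::function< std::string ( const std::string& ) > Calculator;
// The thing that decides whether a given value is worth caching at all.
typedef std::function< bool ( const std::string&, const std::string& ) > CachePolicy;

class Cache
{
public:
    Cache( Calculator calculate, CachePolicy shouldCache )
        : calculate_( calculate ), shouldCache_( shouldCache ) {}

    std::string get( const std::string& key )
    {
        std::map< std::string, std::string >::iterator it = store_.find( key );
        if( it != store_.end() )
            return it->second;
        std::string value = calculate_( key );
        if( shouldCache_( key, value ) )
            store_[ key ] = value;
        return value;
    }

    // Invalidation is a separate decision, made by whoever knows the data changed.
    void invalidate( const std::string& key ) { store_.erase( key ); }
    void invalidateAll()                      { store_.clear(); }

private:
    Calculator calculate_;                        // calculates the value
    CachePolicy shouldCache_;                     // decides whether to cache it
    std::map< std::string, std::string > store_;  // actually stores the cached values
};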

I wish I were better at naming things. Naming is important: good naming can show that similar things are connected and dissimilar things are separate. Good names can show patterns in the code that are otherwise difficult to see. When we’re trying to get acquainted with a new codebase, names can make our lives heaven or hell.

It isn’t always easy to work out what “similar” means. In this comment on my tbb::concurrent_vector post, Adrian points out that std::vector guarantees contiguous storage whereas tbb::concurrent_vector doesn’t. What are the essential properties of a vector that must be maintained by anything calling itself a vector? The lack of contiguous storage in tbb::concurrent_vector hadn’t occurred to me as being noteworthy – hence I wrote a blog post that was confusing. Having a name that makes sense to you is one thing; having someone else look at it helps identify any assumptions you’ve made that you didn’t know you were making.

There is a particular joy in coming up with a good name. On the odd occasions I experience this joy it is usually because I have forgotten (in the sense of rote learning) the name of something but I’ve been able to reconstruct what the name is. At the very least it shows some consistency – my brain is working the same way now as it was when I originally came up with the name. Sometimes the quality of a name doesn’t come to light until we try and name something that’s similar – if the original name is good, we can easily derive a name for the new, similar, thing.

I am sure that off-by-one errors will always be with us. The half-open ranges used by many languages (including C and C++) make the situation better, but not perfect. My first job was programming in FORTRAN 77, where arrays start at index 1. I had to write more +1 and -1 adjustments in FORTRAN than I have ever had to in C and C++. Even better was when we were calling between C and FORTRAN. Yes, we knew the array indices were off by one, but it wasn’t always easy to remember whether it was +1 or -1.

I have mentioned this piece by Dijkstra before. As well as his theoretical analysis, he also mentions that the programming language Mesa allowed for all four different range conventions, and that the “half-open, start at zero” convention led to fewer errors.
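Part of the appeal of the half-open, start-at-zero convention is that the length of a range is simply end - begin, and splitting a range gives two pieces that neither overlap nor leave a gap. A trivial illustration (the function is mine, purely for illustration):

int sum( const int* data, int begin, int end )   // half-open range [begin, end)
{
    int total = 0;
    for( int i = begin; i < end; ++i )           // visits exactly end - begin elements
        total += data[ i ];
    return total;
}
// sum( data, 0, 5 ) == sum( data, 0, 2 ) + sum( data, 2, 5 ) – adjacent ranges share no elements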

Quote of the week – quality

… I am convinced that the quality of the product can never be established afterwards. Whether the correctness of a piece of software can be guaranteed or not depends greatly on the structure of the thing made. This means that the ability to convince users, or yourself, that the product is good, is closely intertwined with the design process itself.

Edsger Dijkstra

We don’t get quality by having a week, or even a two or four week, “quality sprint” at the end. We don’t get quality by just talking about quality. We don’t get quality by putting up motivational posters (although we did get a lot of laughs out of the wonderfully cynical graffiti that was added within 24 hours of their appearance).

We get quality by wanting it and engineering for it and caring about it at every stage in the process. We get quality by being prepared to admit that we’re doing something (or everything) wrong and finding a way to do it better. Quality software is bloody difficult to obtain even when we care about it deeply, follow all the industry best practices and invent a few best practices of our own. Quality software is impossible to obtain if we’re not doing everything we possibly can.

Dijkstra talks about the “ability to convince users, or yourself, that the product is good”. One of my first questions to a team is “does your software work”? The vast majority of teams will answer “yes”. Some of them might admit to a qualified “yes”. The next question is “how do you know your software works”? The answer to that question should be wide and deep. It should encompass testing of many flavours – unit, automated, user, white box, black box, integration, system. It should encompass development practices, communication structures, requirements gathering, change management and the attitude of everyone on the team.

The best team I was ever on had a white box QE who looked at every decision with a testing and quality mindset. Whenever we came up with a design for a component he asked how it was going to be tested, how we would know whether it worked or not (the definition of “works” can be surprisingly subtle) and what additional integration tests would be necessary because that component interacted with other components. There were times where that QE was the least popular person in the room. These were also the times when it was incredibly important to keep him in the room and make sure we were all listening to him. That in turn demanded a certain attitude from the rest of us.

Quote of the week – the org. chart

Don’t ship the org. chart

Steve Sinofsky

Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations. (For example, if you have four groups working on a compiler, you’ll get a 4-pass compiler)

Conway’s Law

If you want a product with certain characteristics, you must ensure that the team has those characteristics before the product’s development.

Jim McCarthy and Michele McCarthy – Software for your Head

Remember Sinofsky’s “don’t ship the org chart”? It is a lie. You cannot avoid it. You always ship the org chart. So the real question is, what is the org. going to look like so that we ship something good-looking? That’s the real purpose of Steve’s rant. No matter how much politicking or boosting you do for this important service-oriented architecture, it doesn’t work unless you have a service-oriented org chart. And Google does not, apparently.

Unknown Amazon engineer on Hacker News

There are many more variations on “don’t ship the org. chart”. The quote I find interesting is the one from the anonymous Amazon engineer who takes a different approach – you will ship the org. chart, so make sure that the org. chart is the one you want to ship.

“Steve’s rant” refers to this post by Steve Yegge. I have never been sure how seriously I should take Steve Yegge’s claim that it was posted accidentally, but it’s a great read.

Quotes of the week – hardware vs. software

… it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!

Donald Knuth (Quoted here)

Computer science is no more about computers than astronomy is about telescopes.

Edsger Dijkstra (although disputed)

Real computer scientists despise the idea of actual hardware. Hardware has limitations, software doesn’t. It’s a real shame that Turing machines are so poor at I/O.

Anon

Much as I hate to disagree with Donald Knuth, I think he is being overly harsh to the hardware manufacturers (at least on this particular topic). Chips are now being produced with 14 nm features. Chip designers are faced with some fundamental limits of the universe: quantum effects and the speed of light matter at this scale. They are also having to deal with billions of transistors on a chip. I am always amazed that anything computer based works at all.

I used Donald Knuth’s quote at a talk Chuck Rose and I gave at a domain specific languages conference a few years ago. The talk can be found here. It goes into detail about Pixel Bender, a high performance language for image processing (i.e. very fast, very cool special effects on images), that can use as much parallelism as is available.

As usual, Dijkstra makes the rest of us look like we are banging our heads against the keyboard and accepting whatever comes out as a valid program. The University of Texas maintains an archive of Dijkstra’s manuscripts. (See also a previous quote of the week).

And finally, while “real computer scientists” might “despise the idea of actual hardware”, some of us have to deal with actual hardware to make a living. I am immensely grateful for the fact that we have top notch thinkers like Knuth and Dijkstra, and I am also grateful that we have engineers at the chip companies doing the hard work necessary to give us processors that work.

Inconsistent sorting

A few years ago I was working on a prototype for some 3D drawing code. I was using the painter’s algorithm – draw the objects from far to near so that the close objects occlude the distant objects. I had my objects in a vector, I sorted them according to their distance from the camera, then drew the objects. Then I noticed something odd. When I first started the program I had two objects that would flicker. They would swap which one was in front – sometimes object A would be drawn in front of object B, sometimes B was drawn in front of A.

My start up coordinates for the camera and the objects meant that A and B were exactly the same distance from the camera, so when I did the sort, A and B would compare equal. Since I wasn’t using std::stable_sort I wasn’t expecting any particular ordering for A and B, but I was expecting the ordering to be consistent between calls to std::sort. After some debugging I discovered that giving the exact same input to std::sort would not always give the same output – objects with identical distances might be sorted differently on each call.

I ended up digging into the standard library implementation. I think I was using Metrowerks C++, but I wouldn’t bet on it. The implementation of std::sort used Quicksort – nothing unusual there. One of the problems with Quicksort though is that you have to make a good choice of pivot element – the element that you use to split the input into two partitions. For example, if you use the first element as the pivot and hand Quicksort an already sorted sequence the runtime ends up being O(n^2). Quicksort has great average case behaviour but bad worst case behaviour. There are various techniques for picking the pivot element including:

  • first element
  • last element
  • middle element
  • median of the first, last and middle elements

This particular implementation had found a different method of picking the pivot – it used rand. That is effective, since the chances of providing the sort with a worst case sequence are low, but it led to this odd behaviour – the same input to std::sort would not necessarily lead to the same output. I switched to std::stable_sort and all my problems went away.
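The fix looks roughly like this (the Object type is invented for illustration):

#include <algorithm>
#include <vector>

struct Object
{
    float distanceFromCamera;
    // mesh, textures, etc.
};

void sortBackToFront( std::vector< Object >& objects )
{
    // std::stable_sort preserves the relative order of elements that compare
    // equal, so two objects at exactly the same distance are drawn in the
    // same order every frame – no more flickering.
    std::stable_sort( objects.begin(), objects.end(),
        []( const Object& a, const Object& b )
        { return a.distanceFromCamera > b.distanceFromCamera; } );
}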

As far as I can tell from the standard (section 25.4.1.1), this behaviour for std::sort is entirely legal. std::sort does not guarantee the order in which equivalent elements are placed, and I don’t see anything that forces std::sort to be consistent. I have never found this implementation in any other standard library, but I have heard of one person who was caught out like I was.

Quote of the week – what’s important?

How the teacher reacts when something goes wrong tells the class what’s important.

Bruce Hamilton

Bruce is a dance teacher par excellence and originally said this in the context of teaching Scottish and English folk dancing. It also applies to any sort of teacher / student interaction, and any interaction where someone has power or influence over someone else. Let’s paraphrase the quote:

How the manager reacts when something goes wrong tells the team what’s important.

or perhaps:

How the senior developer reacts when something goes wrong tells the junior developers what’s important.

I have had the misfortune to work on several teams where one person (usually, but not always, the manager) had to have their ego maintained at all costs. If anything went wrong it was not their fault, and any attempt to get them to take responsibility was doomed to (very unpleasant) failure. It was very clear what was most important – their ego. Because of that, entry #2 on the importance scale was for the rest of the team to protect themselves from the egoist. If we were lucky, entry #3 might involve producing a quality product.

std::vector vs. tbb::concurrent_vector – API design

My current home project does a lot of processing and since I paid money for a 4-core processor I would like to do as much of that in parallel as possible. I have been using Intel’s Threading Building Blocks (TBB) for a while now and although some of what TBB does has been taken over by the thread library in C++11, there are still some useful features in TBB. In particular, TBB provides a set of concurrent containers:

  • map
  • set
  • hashmap
  • queue
  • bounded queue
  • priority queue
  • vector

One of the interesting things about the concurrent containers is that they don’t just consist of a clever implementation behind the same API as our regular containers; the API itself needs to be different. std::vector has been around for 20 years or so. I have made plenty of use of it, and written some vector-like containers (usually because I want different speed / space tradeoffs) that have used the same API as std::vector so they can be drop-in replacements. However, once we start trying to create a concurrent version of a vector we run into problems with the API. In this post I am going to compare a C++98 vector with a TBB 4.2 (update 2) concurrent_vector and see what changes Intel have made to the API.

(Disclaimer: I have no insider knowledge of Intel’s design process for TBB; these are simply my own thoughts based on using TBB, reading Intel’s documentation and reading James Reinders’ book “Intel Threading Building Blocks”.)

I highly recommend reading Eric Lippert’s post What is this thing you call “thread safe”?. Eric’s conclusion is that “thread safe” is a vague term and we need to get much more specific about the behaviours that we claim are “thread safe”. For this post I am going to define “thread safe” as meaning “has an API that allows us to do useful things on multiple threads simultaneously”.

What if we do the simplest and most naïve thing possible? Let’s just take our implementation of std::vector, add a mutex and put a lock around the implementation of every method. We might argue that the vector is now “thread safe”. Depending on our definition of “thread safe”, we might even be correct. What we have not done is create something that’s useful. Take this simple piece of code:

void fn( std::vector< int >& v_i )
{
    v_i.push_back( 7 );
}

Assume that v_i is shared across threads. Once we have called push_back( 7 ) we know nothing. We don’t know what the index of the element “7” is – we can’t use the size of the vector before or after the push_back – another thread might have changed the size of the vector, or deleted the element that we just pushed. We don’t even know whether our “7” is still in the vector (yes, we can take steps to write our multi threaded code to ensure this doesn’t happen, but the whole point of trying to create a thread safe vector is to make it easier for us to access it from multiple threads).

This leads to a principle of concurrent APIs – if you have data and/or operations that are connected you usually cannot access them separately. A single threaded vector’s push_back method does not return a value – if we want to know where the element is in the vector we can just use size() – 1. A concurrent vector’s push_back must return the position of the newly appended element.
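To illustrate that principle with the naïve locked vector from above, the append and the “where did it end up?” query have to happen as one operation. A sketch (this is my own illustration, not TBB’s implementation):

#include <cstddef>
#include <mutex>
#include <vector>

template< typename T >
class LockedVector
{
public:
    // Appending and reporting the new element's position must happen under
    // the same lock - calling size() - 1 afterwards would be a race.
    std::size_t push_back( const T& value )
    {
        std::lock_guard< std::mutex > lock( mutex_ );
        data_.push_back( value );
        return data_.size() - 1;
    }

private:
    std::mutex mutex_;
    std::vector< T > data_;
};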

Even if we add a return value to push_back we haven’t solved the problem of another thread changing the size of the vector. We call push_back, we know where the element we just pushed ended up, but we do not know if it is still there when we come to use it. Erasing elements and inserting elements in the middle of the vector are problematic.

We can’t even access anything in the vector safely – we can check that our index is less than the size of the vector before we call operator [], but the size of the vector might change between our call to size and our call to operator []. We have to add range checking to operator [] and throw an exception if the index we pass in is out of range. This gives us another principle of concurrent APIs – many more methods have to be capable of returning errors – the concurrent nature of the API makes it difficult or impossible to guarantee that the values passed to a method will be valid when the method actually executes.

Accessing the vector via iterators is no help either; another thread can come along, push_back a few elements and cause the vector to resize and reallocate, invalidating all iterators. The thread that is using the iterators has no way to guard against this.

On top of all of this, just putting a lock around every function is unlikely to give us good performance. Using the vector now becomes a bottleneck even if we are just reading the vector – remember that we are doing this for performance reasons.

Let’s see what Intel did to avoid these problems.


Access

reference operator[]( size_type index )
const_reference operator[]( size_type index ) const

reference at( size_type index )
const_reference at( size_type index ) const

reference front()
const_reference front() const

reference back()
const_reference back() const

I think it’s safe to say that the bare minimum we can expect from a concurrent vector is that its elements can be safely accessed from multiple threads. We have a set of access methods that are very similar to those in std::vector: operator [] does not throw an exception, at might throw an exception. The access methods can be used concurrently.


Concurrent growth

iterator push_back( const_reference item )

iterator grow_by( size_type delta );
iterator grow_by( size_type delta, const T& t );
template<typename ForwardIterator> 
iterator grow_by( ForwardIterator first, ForwardIterator last );

iterator grow_to_at_least( size_type n )
iterator grow_to_at_least( size_type n, const T& t )

There are three methods (plus a couple of variations) for appending new elements to the vector.

push_back is the function we know and love from std::vector; however, it now returns an iterator to the newly appended element – when we push_back an element we will know where it has been placed.

grow_by appends a certain number of elements to the end of the vector – depending on which overloaded function we use, we get default-constructed elements, copies of a single element, or a new set of elements from a given source sequence. It returns an iterator pointing to the start of the appended sequence.

grow_to_at_least makes sure we have at least n elements in the vector, and if necessary appends default constructed elements or copies of a given element. Again, it returns an iterator pointing to the start of the appended sequence.

As with the access methods, the growth methods can be used concurrently.
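A short usage sketch (assuming TBB 4.2, where these methods return iterators as described above; the function is invented for illustration):

#include "tbb/concurrent_vector.h"

void appendFromAnyThread( tbb::concurrent_vector< int >& v )
{
    // push_back tells us exactly where our element ended up...
    tbb::concurrent_vector< int >::iterator single = v.push_back( 7 );

    // ...and grow_by hands back the start of the block we just appended,
    // so we can fill it in without racing against other threads.
    tbb::concurrent_vector< int >::iterator block = v.grow_by( 3 );
    for( int i = 0; i < 3; ++i )
        block[ i ] = *single + i;
}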

It looks like we have covered the basic properties of a vector; however, even if we just use the accessors and the growth methods, there is the potential for problems. The Intel documentation states:

The methods described in this section [the access methods] may be concurrently invoked on the same vector as methods for concurrent growth. However, the returned reference may be to an element that is being concurrently constructed.

I.e. if you’re appending a sequence of elements on one thread and trying to access that sequence from another thread you might get back a reference to a partially constructed element. That seems strange. Let’s work out what is going on here by taking a look at the size method.


Size

size seems relatively benign, but the documentation tells us that size might include elements that have had their memory allocated but are still under construction by another thread. This is the same issue we saw earlier with the access methods – you can have a reference to a not-yet-constructed object. What’s going on?

I don’t know for sure why Intel made size work this way, but I can speculate. I’ll use a common technique – I’ll assume the opposite and see what happens. Let’s say that size must reflect the number of fully constructed objects – it will not include partially constructed objects.

Imagine that thread A uses grow_by to append 50 objects to the vector, while thread B appends 1000 objects. Assume that thread A “wins” – it gets access to the memory first, allocates space for 50 objects and starts constructing them. While thread A is constructing its objects, thread B can allocate space for its 1000 objects (because the whole point is to be able to do this concurrently) and thread B starts constructing its objects. Thread A finishes first (it only has 50 elements to construct), and it updates size to reflect the additional 50 objects. Thread B finishes and it then updates size to reflect its 1000 objects. Everyone is happy.

What if things go differently? Let’s say that thread B “wins”, allocates space for its 1000 objects and starts constructing them. Thread A then allocates space for its 50 objects and starts constructing them. Thread A finishes first (only 50 objects). What should it update size to? Remember that in this hypothetical situation we want size to reflect the number of fully constructed objects. Unless thread A is prepared to wait until thread B has finished constructing its objects (which would kill our concurrent performance) it cannot change size because that would also include the partially constructed objects that thread B is working on. Since thread A cannot change size, thread B must change it when thread B has finished constructing its objects. Thread B needs to add on the number of elements it just appended and the number of elements that thread A appended. That’s going to add a bunch of complication internally to make sure that everything gets updated correctly, and I haven’t even started on threads C through Z which might also be growing and accessing the vector.


What isn’t there?

Notice what we haven’t seen. There are no erase methods and no insert methods. You cannot remove any elements from the vector (with the exception of clear which cannot be used concurrently) and you cannot insert elements in the middle – that means that you can’t change the position of any element once it is in the vector. There is only one place you can add elements to a concurrent vector – at the end.

Concurrent vector has the usual begin / end / rbegin / rend methods for getting iterators. It also guarantees that elements will never be moved – concurrent vector does not do the std::vector trick of allocating new memory and copying everything over (and invalidating iterators). Couple that with the lack of erase methods and you get the property that once you have a valid index, reference or iterator to an element, that index, reference or iterator will stay valid until the concurrent vector is cleared or destroyed. This means that one thread cannot invalidate the references on another thread – the API is giving us some safety that we didn’t have before.


Capacity

tbb::concurrent_vector has the normal operations around size and capacity:

size_type size() const
bool empty() const
size_type capacity() const
size_type max_size() const

To finish off the API there is the usual mixture of constructors and a destructor, none of which can be used concurrently (and it is possible to call constructors concurrently if you’re using placement new or have some half-baked singleton implementation). There’s assignment (both operator = and assign methods), a specialization of swap, and a shrink_to_fit method (to match C++11’s shrink_to_fit). There are also some parallel iteration methods (range) which make the concurrent vector easily usable in Intel’s concurrent algorithms.

So what’s the conclusion? Intel have stripped a lot of the functionality from std::vector but left in the core operations – growing the vector by appending elements, and accessing them. Even with that limited functionality it is still possible to do things incorrectly; however, a little discipline around setup and tear down helps mitigate this. My code has an initial sequence that is single threaded, then it kicks off several threads, and on shutdown it ends up back on a single thread. I make sure that the non-concurrent methods are only called from a single thread.
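That discipline looks roughly like this (a sketch; the squaring loop stands in for the real work):

#include "tbb/blocked_range.h"
#include "tbb/concurrent_vector.h"
#include "tbb/parallel_for.h"

int main()
{
    // Single-threaded setup: non-concurrent methods such as reserve are fine
    // here because no other threads are touching the vector yet.
    tbb::concurrent_vector< int > results;
    results.reserve( 1000 );

    // Parallel phase: only the concurrency-safe methods (push_back, element
    // access, size) are used from the worker threads.
    tbb::parallel_for( tbb::blocked_range< int >( 0, 1000 ),
        [ &results ]( const tbb::blocked_range< int >& range )
        {
            for( int i = range.begin(); i != range.end(); ++i )
                results.push_back( i * i );
        } );

    // Single-threaded teardown: back on one thread, so clear and the
    // destructor can be called safely.
    results.clear();
    return 0;
}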

Concurrent programming is hard to begin with, the fact that we need to rethink our APIs just makes it harder.