As part of the project I'm currently involved in at university, I started (re)writing a Pin tool to gather run-time traces of applications parallelized with OpenMP. This tool has to support two modes: one to generate a single trace for the whole application and one to generate one trace per parallel region of the application.

In the initial versions of my rewrite, I followed the idea of the previous version of the tool: have a -split flag in the frontend that selects between the two modes described above. This flag was backed by an abstract class, Tracer, and two implementations: PlainTracer and SplittedTracer. The thread-initialization callback of the tool then allocated one of these objects for every new thread, and the per-instruction injected code used a pointer to the interface to call the appropriate specialized instrumentation routine. It looked roughly like this:
    void
    thread_start_callback(int tid, ...)
    {
        if (splitting)
            tracers[tid] = new SplittedTracer();
        else
            tracers[tid] = new PlainTracer();
    }

    void
    per_instruction_callback(...)
    {
        Tracer* t = tracers[PIN_ThreadId()];
        t->instruction_callback(...);
    }
I knew from the very beginning that such an implementation was going to be inefficient due to the pointer dereference at each instruction and the vtable lookup for the correct virtual method implementation. However, it was a very quick way to move forward because I could reuse some small parts of the old implementation.

There were two ways to optimize this. The first one involved writing two versions of per_instruction_callback, one for plain tracing and the other for split tracing, and then deciding which one to insert depending on the flag. The other way was to use template metaprogramming.
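A minimal sketch of the first option, with Pin left out so that it compiles on its own. The tracer bodies, the fixed-size arrays, the `tid` parameter and `select_callback` are all hypothetical stand-ins; in a real tool the selected function pointer would be the one handed to Pin when inserting the analysis call.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the two tracers; the real ones write trace data.
struct PlainTracer {
    std::size_t instructions = 0;
    void instruction_callback() { ++instructions; }
};

struct SplittedTracer {
    std::size_t instructions = 0;
    void instruction_callback() { ++instructions; }
};

const int MAX_THREADS = 4;  // assumed limit, for illustration only
PlainTracer plain_tracers[MAX_THREADS];
SplittedTracer split_tracers[MAX_THREADS];

// One specialized callback per mode: no base class, no virtual call.
void plain_instruction_callback(int tid)
{
    assert(tid >= 0 && tid < MAX_THREADS);
    plain_tracers[tid].instruction_callback();
}

void split_instruction_callback(int tid)
{
    assert(tid >= 0 && tid < MAX_THREADS);
    split_tracers[tid].instruction_callback();
}

typedef void (*InstructionCallback)(int);

// The flag is consulted once, when choosing what to inject,
// instead of on every executed instruction.
InstructionCallback select_callback(bool splitting)
{
    return splitting ? split_instruction_callback
                     : plain_instruction_callback;
}
```

The run-time flag survives in this variant, but it is only read once per injected call site rather than once per traced instruction.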

As you can imagine, this being C++, I opted to use template metaprogramming to heavily abstract the code in the Pin tool. Now, I have an abstract core parametrized on the Tracer type. When instantiated, I provide the correct Tracer class and the compiler does all the magic for me. With this design, there is no need to have a parent Tracer class (though I'd welcome having C++0x concepts available), and the callbacks can be easily inlined because there is no run-time vtable lookup. It looks something like this:
    template< class Tracer >
    class BasicTool {
        Tracer* tracers[MAX_THREADS];

        // Only allocation is virtual; it runs once per thread,
        // not once per instruction.
        virtual Tracer* allocate_tracer(void) const = 0;

    public:
        Tracer*
        get_tracer(int tid)
        {
            return tracers[tid];
        }
    };

    class PlainTool : public BasicTool< PlainTracer > {
        PlainTracer*
        allocate_tracer(void) const
        {
            return new PlainTracer();
        }

    public:
        ...
    } the_plain_tool;

    // This is tool-specific, non-templated yet.
    void
    per_instruction_callback(...)
    {
        // get_tracer returns a pointer, hence ->.
        the_plain_tool.get_tracer(PIN_ThreadId())->instruction_callback(...);
    }
What this design also does is force me to have two different Pin tools: one for plain tracing and another one for split tracing. Of course, I chose it to be this way because I'm not a fan of run-time options (the -split flag). Having two separate tools with well-defined, non-optional features makes testing much, much easier and... follows the Unix philosophy of having each tool do exactly one thing, and do it well!
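For anyone who wants to play with the idea outside of Pin, here is a small self-contained sketch of the static dispatch. It is not the tool's code: the tracer bodies, `on_instruction`, the explicit `tid` argument and the choice to store tracers by value instead of by pointer are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tracers. They share no base class; the template only
// requires that each one provides instruction_callback().
struct PlainTracer {
    std::size_t count = 0;
    void instruction_callback() { ++count; }
};

struct SplittedTracer {
    std::size_t count = 0;
    void instruction_callback() { ++count; }
};

template< class Tracer >
class BasicTool {
    static const int MAX_THREADS = 4;  // assumed limit, for illustration
    Tracer tracers[MAX_THREADS];       // stored by value in this sketch

public:
    Tracer&
    get_tracer(int tid)
    {
        assert(tid >= 0 && tid < MAX_THREADS);
        return tracers[tid];
    }

    // The concrete Tracer type is known at compile time, so this call
    // is resolved statically and is a candidate for inlining.
    void
    on_instruction(int tid)
    {
        get_tracer(tid).instruction_callback();
    }
};
```

Each tool then instantiates the core with its own tracer, for example `BasicTool< PlainTracer > the_plain_tool;` in one binary and `BasicTool< SplittedTracer > the_split_tool;` in the other. Passing a type without an `instruction_callback()` method is a compile-time error, which is the kind of check concepts would make explicit.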

Result: around a 15% speedup. And C++ was supposed to be slow? ;-) You just need to know what the language provides you and choose wisely. (Read: my initial, naive prototype had a run-time of 10 minutes to trace part of a small benchmark; after several rounds of optimizations, it's down to 1 minute and 50 seconds to trace the whole benchmark!)

Disclaimer: The code above is an oversimplification of what the tool contains. It is completely fictitious and omits many details. I will admit, though, that the real code is too complex at the moment. I'm looking for ways to simplify it.