dprintf

I've recently gone from being indifferent to varargs in C++ to thinking that they should be avoided at all costs. A varargs list is a collection of values (not references, and not objects passed by value) whose length and types you don't know, that you can't pass on directly to other variadic functions, and that can only be streamed over, not accessed randomly. Bleh! In the case of printf-like functions, which I expect are the most popular use of varargs, you add the information that is missing from the arguments themselves to the format string. This gives ample opportunity for mistakes, which some compilers will warn you about but others won't. For me, the conclusion has been that varargs are harmful and should not be used under any circumstances.

Of course, not having printf-like functions in C++ is a drag if you want to build strings, so I've been trying to come up with a replacement that isn't too painful. What the remainder of this post is about is that you can actually have your cake and eat it too: you can have something that looks just like printf and doesn't use varargs but takes a variable (but bounded) number of arguments, where each argument carries its own type information and can be accessed randomly. Where you would write this using classic printf:

int position = ...;
const char *value = ...;
printf("Found %s at %i", value, position);

this is how it looks with dprintf (dynamically typed printf):

dprintf("Found % at %", args(value, position));

You'll notice that dprintf is different from printf in that:

There is no type information in the format string; arguments carry their own type info. Alternatively you could allow type info in the format string and use the info carried by the arguments for checking.
The arguments are wrapped in a call to args. This is the price you have to pay syntactically.

To explain how this works I'll take it one step at a time and start with how to automatically collect type information about arguments.

Type Tagging

When calling dprintf each argument is coupled with information about its type. This is done using implicit conversion and static overloading. If you pass an argument to a function in C++ and the types don't match then C++ will look at the expected argument types and, if possible, implicitly convert the argument to the expected type. This implicit conversion can be used to collect type information:

enum TypeTag { INT, C_STR, STR, FLOAT, ... };

class TaggedValue {
public:
  // Each constructor records the value together with its type tag.
  TaggedValue(int v) : tag_(INT) { value_.u_int = v; }
  TaggedValue(const char* v) : tag_(C_STR) { value_.u_c_str = v; }
  TaggedValue(const string &v) : tag_(STR) { value_.u_str = &v; }
  ...
  TypeTag tag() const { return tag_; }
private:
  TypeTag tag_;
  union {
    int u_int;
    const char* u_c_str;
    const string *u_str;
  } value_;
};

void call_me(const TaggedValue &value) {
  printf("Type tag: %i\n", value.tag());
}

call_me(4); // Prints the value of INT
call_me("foo"); // Prints the value of C_STR

When calling call_me the arguments don't match, so C++ implicitly finds the int and const char* constructors of TaggedValue and calls them. Each constructor in TaggedValue does a minimal amount of processing to store the value and type info about the value. The important part here is that you don't see this when you invoke call_me, so as a user you don't have to worry about how this works; all you know is that type info is somehow being collected.

This has the limitation that it only works for a fixed set of types, the types for which there is a constructor in TaggedValue -- but on the other hand this is no different from the fixed set of formatting directives understood by printf. On the plus side it allows you to pass any value by reference and by value as long as TaggedValue knows about it in advance. The issue of automatic type tagging is actually an interesting one; in this case we only allow a fixed set of "primitive" types but there is no reason why tagging can't work for array types or other even more complex types¹.

The next step is to allow a variable number of these tagged values. Enter the args function.

Variable argument count

This is the nastiest part of it all since we have to define an args function for every number of arguments we want to allow. This function will pack the arguments up in an object and return it to the dprintf call:

// Abstract superclass of the actual argument collection
class TaggedValueCollection {
public:
  TaggedValueCollection(int size) : size_(size) { }
  virtual ~TaggedValueCollection() { }
  // Both accessors are const so they can be called through the
  // const reference that dprintf receives.
  int size() const { return size_; }
  virtual const TaggedValue &get(int i) const = 0;
private:
  int size_;
};

// Holds pointers to exactly n tagged values.
template <int n>
class TaggedValueCollectionImpl : public TaggedValueCollection {
public:
  TaggedValueCollectionImpl(int size)
    : TaggedValueCollection(size) { }
  virtual const TaggedValue &get(int i) const {
    return *(elms_[i]);
  }
  const TaggedValue *elms_[n];
};

static TaggedValueCollectionImpl<3> args(const TaggedValue &v1,
    const TaggedValue &v2, const TaggedValue &v3) {
  TaggedValueCollectionImpl<3> result(3);
  result.elms_[0] = &v1;
  result.elms_[1] = &v2;
  result.elms_[2] = &v3;
  return result;
}

The TaggedValueCollectionImpl holds the arguments and provides access to them. The TaggedValueCollection superclass is there to allow methods to access the arguments without knowing how many there are. The number of arguments is part of the type of TaggedValueCollectionImpl, so we can't use that type directly. The args function shown here is the one that handles 3 arguments, but the others all look like this: each creates a new collection and stores its arguments in it. While it is bad to have to define a function for each number of arguments, it is much better than varargs. Also, you only have to write these functions once, and then every function with printf-like behavior can use them.

Finally, the dprintf function looks like this:

void dprintf(string format, const TaggedValueCollection &elms) {
  // Do the formatting
}
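
To make the elided body a bit more concrete, here is a minimal sketch of what the formatting loop could look like. It is only an illustration under a few assumptions that are not part of the code above: a to_string method on TaggedValue, a two-argument args overload, and a function called dformat that returns the string instead of printing it (which makes it easy to check and avoids clashing with the POSIX function that is also called dprintf). Each % simply consumes the next argument.

```cpp
#include <cassert>
#include <string>

// Trimmed-down, hypothetical versions of the classes described above.
enum TypeTag { INT, C_STR, STR };

class TaggedValue {
public:
  TaggedValue(int v) : tag_(INT) { value_.u_int = v; }
  TaggedValue(const char *v) : tag_(C_STR) { value_.u_c_str = v; }
  TaggedValue(const std::string &v) : tag_(STR) { value_.u_str = &v; }
  // Assumed helper: renders the stored value according to its tag.
  std::string to_string() const {
    switch (tag_) {
      case INT: return std::to_string(value_.u_int);
      case C_STR: return value_.u_c_str;
      case STR: return *value_.u_str;
    }
    return "";
  }
private:
  TypeTag tag_;
  union {
    int u_int;
    const char *u_c_str;
    const std::string *u_str;
  } value_;
};

class TaggedValueCollection {
public:
  explicit TaggedValueCollection(int size) : size_(size) { }
  virtual ~TaggedValueCollection() { }
  int size() const { return size_; }
  virtual const TaggedValue &get(int i) const = 0;
private:
  int size_;
};

template <int n>
class TaggedValueCollectionImpl : public TaggedValueCollection {
public:
  TaggedValueCollectionImpl() : TaggedValueCollection(n) { }
  virtual const TaggedValue &get(int i) const {
    assert(i >= 0 && i < n);
    return *elms_[i];
  }
  const TaggedValue *elms_[n];
};

// Two-argument overload; the other arities follow the same pattern.
inline TaggedValueCollectionImpl<2> args(const TaggedValue &v1,
                                         const TaggedValue &v2) {
  TaggedValueCollectionImpl<2> result;
  result.elms_[0] = &v1;
  result.elms_[1] = &v2;
  return result;
}

// Hypothetical formatter: each '%' is replaced by the next argument.
// Once the arguments run out, any remaining '%' is copied verbatim.
inline std::string dformat(const std::string &format,
                           const TaggedValueCollection &elms) {
  std::string out;
  int next = 0;
  for (char c : format) {
    if (c == '%' && next < elms.size()) {
      out += elms.get(next++).to_string();
    } else {
      out += c;
    }
  }
  return out;
}
```

With value pointing at "needle" and position equal to 7, dformat("Found % at %", args(value, position)) produces "Found needle at 7". Note that the collection only holds pointers to temporaries, so it should be consumed within the same expression that creates it, as it is here.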

One advantage of having random access to the arguments is that the format string can refer to them in any order; they don't have to be accessed in sequence. I've used the syntax $i to access the i'th argument:

dprintf("Reverse order (2: $2, 1: $1, 0: $0)", args("a", "b", "c"));
// -> prints "Reverse order (2: c, 1: b, 0: a)"

Coupled with the fact that format strings don't have to contain type info this gives a lot more flexibility in dealing with format strings. Honestly I've never had any use for this but I could imagine that it could be useful for instance in internationalization where different languages have different word order.
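
As a rough illustration of how positional access might be implemented, the sketch below resolves $i by indexing into the collection. Several things in it are assumptions made for the example rather than a description of the real code: the to_string helper, the name dformat (it returns the string rather than printing it), the restriction to single-digit indices, and the choice to copy an out-of-range $i through unchanged.

```cpp
#include <cassert>
#include <string>

// Trimmed-down, hypothetical versions of the classes described above.
enum TypeTag { INT, C_STR };

class TaggedValue {
public:
  TaggedValue(int v) : tag_(INT) { value_.u_int = v; }
  TaggedValue(const char *v) : tag_(C_STR) { value_.u_c_str = v; }
  // Assumed helper: renders the stored value according to its tag.
  std::string to_string() const {
    if (tag_ == INT) return std::to_string(value_.u_int);
    return value_.u_c_str;
  }
private:
  TypeTag tag_;
  union {
    int u_int;
    const char *u_c_str;
  } value_;
};

class TaggedValueCollection {
public:
  explicit TaggedValueCollection(int size) : size_(size) { }
  virtual ~TaggedValueCollection() { }
  int size() const { return size_; }
  virtual const TaggedValue &get(int i) const = 0;
private:
  int size_;
};

template <int n>
class TaggedValueCollectionImpl : public TaggedValueCollection {
public:
  TaggedValueCollectionImpl() : TaggedValueCollection(n) { }
  virtual const TaggedValue &get(int i) const {
    assert(i >= 0 && i < n);
    return *elms_[i];
  }
  const TaggedValue *elms_[n];
};

inline TaggedValueCollectionImpl<3> args(const TaggedValue &v1,
    const TaggedValue &v2, const TaggedValue &v3) {
  TaggedValueCollectionImpl<3> result;
  result.elms_[0] = &v1;
  result.elms_[1] = &v2;
  result.elms_[2] = &v3;
  return result;
}

// Hypothetical formatter: "$i" (one digit) is replaced by argument i.
// Anything else, including an out-of-range "$i", is copied verbatim.
inline std::string dformat(const std::string &format,
                           const TaggedValueCollection &elms) {
  std::string out;
  for (std::string::size_type i = 0; i < format.size(); i++) {
    char c = format[i];
    bool is_ref = c == '$' && i + 1 < format.size()
        && format[i + 1] >= '0' && format[i + 1] <= '9';
    if (is_ref && (format[i + 1] - '0') < elms.size()) {
      out += elms.get(format[i + 1] - '0').to_string();
      i++;  // Skip the digit that was just consumed.
    } else {
      out += c;
    }
  }
  return out;
}
```

Because the arguments are looked up by index, the same args(...) call can be paired with format strings that mention the arguments in different orders, which is the property that might matter for translated messages.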

To sum up, the advantages of this scheme are:

Type information is automatically inferred from and stored with the arguments.
You can pass values by reference or by value, for any type that TaggedValue knows about.
The argument collections (the const TaggedValueCollection &s) can be passed around straightforwardly. You only have to define one function, there is no need for functions like vprintf.
Format strings can access arguments in random order.

I've switched to using this in my hobby project and it works like a charm. You can see a full implementation of this here: string.h and string.cc (see string_buffer::printf and note that the classes have different names than I've used here). To see how it is being used see conditions.cc for instance.

A final note: if you find this kind of thing interesting (and since you've come this far I expect you do) you'll probably enjoy Danvy's Functional Unparsing.

¹: To represent more complex types you just have to be a bit more creative with how you encode the type tag. A simple scheme is to let the lowest nibble specify the outermost type (INT, C_STR, ARRAY, ...) and then, if the outermost type is ARRAY, let the 28 higher bits hold the type tag of the element type. For more complex types, like pair<T1, T2>, the 28 remaining bits can be divided into two 14-bit fields that hold the type tags for T1 and T2 respectively. For instance, pair<const char*, int**> would be represented as

bits  28-31  24-27  20-23  16-19  12-15  8-11   4-7    0-3
data  -      -      c_str  -      int    array  array  pair

While this scheme is easy to encode and decode it can definitely be improved upon. For instance, there are many type tag values that are "wasted" because they don't correspond to a legal type. Also, it allows you to represent some types that you probably don't need very often while some types you are more likely to use cannot be represented. For instance, you can represent int****** but not pair<int, pair<int*, const char*>>. And of course, if you don't have exactly 16 different cases you're guaranteed to waste space in each nibble. But finding a better encoding is, as they say, the subject of future research.
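
For what it's worth, the encoding described in this footnote can be sketched in a few lines. The concrete tag values, the function names, and the exact placement of the two 14-bit fields (T2 in bits 4-17, T1 in bits 18-31) are assumptions chosen to be consistent with the table above, not a fixed layout.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical tag values; each has to fit in one nibble (0..15).
enum BaseTag { INT = 0, C_STR = 1, STR = 2, FLOAT = 3, ARRAY = 4, PAIR = 5 };

typedef std::uint32_t EncodedType;

// ARRAY goes in the lowest nibble, the element type in the 28 bits above.
inline EncodedType array_of(EncodedType element) {
  assert((element >> 28) == 0);  // The element tag must fit in 28 bits.
  return (element << 4) | ARRAY;
}

// PAIR goes in the lowest nibble, T2 in bits 4-17 and T1 in bits 18-31.
inline EncodedType pair_of(EncodedType t1, EncodedType t2) {
  assert((t1 >> 14) == 0 && (t2 >> 14) == 0);  // Each must fit in 14 bits.
  return (t1 << 18) | (t2 << 4) | PAIR;
}

// Decodes a tag back into a readable type name, outermost type first.
inline std::string describe(EncodedType tag) {
  switch (tag & 0xF) {
    case INT: return "int";
    case C_STR: return "const char*";
    case STR: return "string";
    case FLOAT: return "float";
    case ARRAY: return describe(tag >> 4) + "*";
    case PAIR:
      return "pair<" + describe(tag >> 18) + ", "
          + describe((tag >> 4) & 0x3FFF) + ">";
    default: return "?";
  }
}
```

With these values, pair_of(C_STR, array_of(array_of(INT))) comes out as 0x40445, and describe turns it back into "pair<const char*, int**>". The 14-bit limit is also what rules out nested pairs: an inner pair already needs its own low nibble plus two fields, so it trips the assertion in pair_of.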
