Friday, January 16, 2009 #

Named Format Strings

Phil Haack’s post about named format strings a week or so ago, and the followup discussion on Phil’s blog this week, got me intrigued at lunchtime today – intrigued enough to have a play. As Phil says, it’s a fairly simple problem space, but it does have some gnarly edge cases, and he’d put together a fairly thorough set of unit tests which tripped up some alternative implementations he’d come across.

As it happens, I’ve been nursing some thoughts about string formatting in the back of my mind ever since I came across F#’s strongly typed format expressions, which are actually parsed and typechecked by the compiler. Obviously C# can’t be hacked to do anything that clever at compile time, but it did make me think that it must be possible to do a one-time parse of a format string to turn it into a function that could subsequently be executed to generate format output for a particular set of variables.

So, for example, Phil’s sample expression "{pi} first, {date} second" would be compiled into a function which, given an object obj with properties ‘pi’ and ‘date’, returns, essentially, obj.pi + “ first, “ + obj.date + “ second” (although in reality, it should use a stringbuilder).

Compiling functions dynamically would, until recently, have meant delving into Reflection.Emit and brushing up on IL syntax. In .NET 3.5, however, it sounds like a job for System.Linq.Expressions, and so this looked like a perfect opportunity to have a hack around with dynamically constructing expression trees.

So, armed with Phil’s test suite, I set out to throw together a solution to see whether this approach was viable. Once I’d got the tests passing, and mindful of Phil’s admonition that this isn’t a speed contest, for a laugh I threw my implementation’s hat in the ring in Phil’s benchmark testbed. I was expecting my lambda generator to crawl in behind the rest, but was astonished instead when the results came in:

Hanselformat took 0.17128 ms
OskarFormat took 0.22646 ms
JamesFormat took 0.1473 ms
HenriFormat took 0.1085 ms
HaackFormat took 0.1262 ms
HartFormat took 0.04908 ms

Now, it’s not a speed contest, obviously, until you’re winning.

The thing that amazed me about this result was that in this version of the benchmark test, I wasn’t reusing the ‘compiled’ parsed format – I was parsing the format string every time through (though there was some upfront precompilation and reflection caching that I’ll come to later)

Reusing the same precompiled formatter each time round, the result looks almost unbelievable:

HartFormat - precompiled took 0.0076 ms

This even compares favourably (i.e., it’s in the same order of magnitude of execution time) with these alternatives:

ConcatFormat took 0.00476 ms
StringFormatFormat took 0.00582 ms
StringBuilderFormat took 0.00474 ms

The source for which is as follows:

MeasureFormatTime("ConcatFormat", 
        () => o.foo + " is a " + o.bar + " is a " + o.baz 
                + " is a " + o.qux.ToString("#.#") 
                + " is a really big " + o.fizzle);
 MeasureFormatTime("StringFormatFormat", 
        () => String.Format("{0} is a {1} is a {2} is a {3:#.#} is a really big {4}", 
                o.foo, o.bar, o.baz, o.qux, o.fizzle));
 MeasureFormatTime("StringBuilderFormat", () => new StringBuilder(o.foo, 128)
                .Append(" is a ")
                .Append(o.bar)
                .Append(" is a ")
                .Append(o.baz)
                .Append(" is a ")
                .Append(o.qux.ToString("#.#"))
                .Append(" is a really big ")
                .Append(o.fizzle)
                .ToString());

So, what are the ingredients of my formatter that mean it’s able to be so fast? I think there are three:

  1. It uses an efficient regex search strategy to parse the format string
  2. It pre-analyses the datatype of the object that it’s going to have to extract property values from, and pregenerates and caches fast property accessors for it
  3. It uses the format string parse information and precompiled property accessors to generate a lambda expression whose compiled execution path is not dissimilar from the StringBuilderFormat example above, which had the fastest execution time of any strategy.

Regex

The regex I used solves all my parsing issues in one go – it handles format expressions, escaped braces, and also picks up loose unmatched braces, all in one go:

    (?<escapedOpenBrace>      {{       )
|   (?<escapedCloseBrace>     }}       )
|                             {
                                  (?<name>[A-Za-z_][A-Za-z0-9_]*)
                                  (?<propertyChain>\.(([A-Za-z_][A-Za-z0-9_]*\.)*[A-Za-z_][A-Za-z0-9_]*))?
                                  (:(?<fmt>[^}]+))?
                              }
|   (?<mismatchedBrace>       {     )

 

This is a useful regex pattern to know. Picking up all the matches for this regex in the format string tells me all the places I need to do anything special, and I can distinguish between those cases by examining whether particular named match groups succeeded:

if (match.Groups["mismatchedBrace"].Success) 
    throw new FormatException("Mismatched brace in format expression at " + match.Index);
else if (match.Groups["escapedOpenBrace"].Success) yield return OpenBrace;
else if (match.Groups["escapedCloseBrace"].Success) yield return CloseBrace;

Letting the regex engine handle parsing takes a lot of responsibility off my shoulders; I don’t have to implement my own state machine (the regex engine is excellent at doing precisely that), or my own lookahead/backtrack strategy (the regex engine handles that too). The expression is generally quite conservative in what it matches (except for format values, but those are relatively rare) so should be pretty quick in most cases – including non-matching scenarios, which is typically where a lot of regexes fall down.

Precaching Reflected Property Accessors

This is probably the biggest win, since it keeps reflection code out of the loop. We make sure that we only reflect over the properties of any type once, and when we do so we generate a fast property accessor that formats that property’s value as a string without going via a reflective invocation.

The logic for this is encapsulated in the static initialiser of a generic class. A static field in a generic type is a great way to essentially create a lookup from a type to a value – in this case, from a type to a dictionary of its property names, and compiled lambdas for reading from them:

public class Formatter<T>
{
    // statically, for the type T, generate a dictionary of name -> formatting lambdas
    // This will happen once per type, so multiple format operations on the same type will benefit.
    static Formatter()
    {
        _parameters = new Dictionary<string, Func<T, string, IFormatProvider, string>>();
        foreach (PropertyInfo prop in typeof(T).GetProperties(bindingFlags.Instance | bindingFlags.Public)
                                           .Where(p => p.CanRead))
        {
            ParameterExpression t = Expression.Parameter(typeof(T), "t" + prop.Name);
            parameterExpression fmt = Expression.Parameter(typeof(string), "fmt" + prop.Name);
            parameterExpression fmtr = Expression.Parameter(typeof(IFormatProvider), "fmtr" + prop.Name);
            Expression<Func<object, string, IFormatProvider, string>> toStringMethod;
    
            toStringMethod = typeof(IFormattable).IsAssignableFrom(prop.PropertyType) 
                ? _IFormattableToString 
                : _objToString;

            Expression<Func<T, string, IFormatProvider, string>> formatLambda = 
                Expression.Lambda<Func<T, string, IFormatProvider, string>>(
                    Expression.Invoke(
                            toStringMethod, 
                            Expression.Convert(Expression.Property(t, prop), typeof(object)), 
                            fmt, 
                            fmtr
                        ), 
                    t, 
                    fmt, 
                    fmtr
                );
            _parameters.Add(prop.Name, formatLambda.Compile());
        }
    }

    private static Dictionary<string, Func<T, string, IFormatProvider, string>> _parameters;
}

Build a Fast Format Function

So with the parsing done, and the type pre-reflected, building up a fast formatter is just a case of stitching together the function calls:

public Formatter(string format)
{
    if (format == null) throw new ArgumentNullException("format");
    _generators = GetTokenGenerators(format).ToList(); // note, GetTokenGenerators() is 
                                                       // enumerated now, at construction time, 
                                                       // allowing same format string to be reused 
                                                       // without further regex hits
}


public string Format(T dictionary, IFormatProvider formatProvider)
{
    if (dictionary == null) throw new ArgumentNullException("dictionary");
    return _generators.Aggregate(new StringBuilder(), 
                                 (builder, val) => builder.Append(val(dictionary, formatProvider)), 
                                 builder => builder.ToString()
                                );
}
// parses the format and returns a set of functions that output part of the format result
private static IEnumerable<Func<T, IFormatProvider, string>> GetTokenGenerators(string format)
{
    int index = 0;

    foreach (Match match in _re.Matches(format))
    {
        if (match.Index > index) // don't bother returning for an empty string
        {
            string previousText = format.SubString(index, match.Index - index);
            yield return (t, fmtr) => previousText;
        }

        if (match.Groups["mismatchedBrace"].Success) 
                throw new FormatException("Mismatched brace in format expression at " + match.Index);
        else if (match.Groups["escapedOpenBrace"].Success) yield return OpenBrace;
        else if (match.Groups["escapedCloseBrace"].success) yield return CloseBrace;
        else
        {
            debug.Assert(match.Groups["name"].Success);

            string name = match.Groups["name"].Value;

            Func<T, string, IFormatProvider, string> formatLambda;
            if (!_parameters.TryGetValue(name, out formatLambda)) 
                    throw new FormatException("No such value in dictionary type: " + name 
                                            + " in format expression at " + match.Groups["name"].Index);

            string specifiedFormat = null;
            if (match.Groups["fmt"].Success) specifiedFormat = match.Groups["fmt"].Value;

            if (match.Groups["propertyChain"].Success)
            {
                // take a slower path
                yield return (t, fmtr) => System.Web.Ui.DataBinder.Eval(t, name 
                                       + match.Groups["propertyChain"].Value, "{0:" + specifiedFormat + "}");
            }
            else
            {
                yield return (t, fmtr) => formatLambda(t, specifiedFormat, fmtr);
            }
        }

        index = match.Index + match.Length;
    }

    if (format.Length > index)
    {
        string previousText = format.SubString(index, format.Length - index);
        yield return (t, fmtr) => previousText;
    }
}

It’s prebuilding this formatter that explains the massive performance improvement in the compiled version of the benchmark. The formatter is strongly typed to deal with values coming in from a specific type, to be formatted with a particular string.

So, i should stress: this was a quick lunchbreak hack; it’s not been deliberately optimized, nor deliberately made particularly pretty. It’s probably got some pretty hairy edges. On the plus side, it passes Phil’s test suite.

Speaking of test suites, I should mention Peli’s findings when he pointed Pex at the various implementations Phil posted. As should be expected, mine is also fully incompatible with all the others. In particular, mine aggressively throws FormatExceptions when it finds mismatched braces, more so than any of the others. But I will certainly  be having more of a play with Pex – it looks an amazing tool.

Caveats aside, if you want to play with my code, it’s here. Drop it in Phil’s NamedStringFormatSolution and have a play.

posted @ Friday, January 16, 2009 10:38 AM | Feedback (10)