Named Format Strings

Phil Haack’s post about named format strings a week or so ago, and the followup discussion on Phil’s blog this week, got me intrigued at lunchtime today – intrigued enough to have a play. As Phil says, it’s a fairly simple problem space, but it does have some gnarly edge cases, and he’d put together a fairly thorough set of unit tests which tripped up some alternative implementations he’d come across.

As it happens, I’ve been nursing some thoughts about string formatting in the back of my mind ever since I came across F#’s strongly typed format expressions, which are actually parsed and typechecked by the compiler. Obviously C# can’t be hacked to do anything that clever at compile time, but it did make me think that it must be possible to do a one-time parse of a format string to turn it into a function that could subsequently be executed to generate format output for a particular set of variables.

So, for example, Phil’s sample expression "{pi} first, {date} second" would be compiled into a function which, given an object obj with properties ‘pi’ and ‘date’, returns, essentially, obj.pi + “ first, “ + obj.date + “ second” (although in reality, it should use a stringbuilder).

Compiling functions dynamically would, until recently, have meant delving into Reflection.Emit and brushing up on IL syntax. In .NET 3.5, however, it sounds like a job for System.Linq.Expressions, and so this looked like a perfect opportunity to have a hack around with dynamically constructing expression trees.

So, armed with Phil’s test suite, I set out to throw together a solution to see whether this approach was viable. Once I’d got the tests passing, and mindful of Phil’s admonition that this isn’t a speed contest, for a laugh I threw my implementation’s hat in the ring in Phil’s benchmark testbed. I was expecting my lambda generator to crawl in behind the rest, but was astonished instead when the results came in:

Hanselformat took 0.17128 ms
OskarFormat took 0.22646 ms
JamesFormat took 0.1473 ms
HenriFormat took 0.1085 ms
HaackFormat took 0.1262 ms
HartFormat took 0.04908 ms

Now, it’s not a speed contest, obviously, until you’re winning.

The thing that amazed me about this result was that in this version of the benchmark test, I wasn’t reusing the ‘compiled’ parsed format – I was parsing the format string every time through (though there was some upfront precompilation and reflection caching that I’ll come to later)

Reusing the same precompiled formatter each time round, the result looks almost unbelievable:

HartFormat - precompiled took 0.0076 ms

This even compares favourably (i.e., it’s in the same order of magnitude of execution time) with these alternatives:

ConcatFormat took 0.00476 ms
StringFormatFormat took 0.00582 ms
StringBuilderFormat took 0.00474 ms

The source for which is as follows:

MeasureFormatTime("ConcatFormat", 
        () => o.foo + " is a " + o.bar + " is a " + o.baz 
                + " is a " + o.qux.ToString("#.#") 
                + " is a really big " + o.fizzle);
 MeasureFormatTime("StringFormatFormat", 
        () => String.Format("{0} is a {1} is a {2} is a {3:#.#} is a really big {4}", 
                o.foo, o.bar, o.baz, o.qux, o.fizzle));
 MeasureFormatTime("StringBuilderFormat", () => new StringBuilder(o.foo, 128)
                .Append(" is a ")
                .Append(o.bar)
                .Append(" is a ")
                .Append(o.baz)
                .Append(" is a ")
                .Append(o.qux.ToString("#.#"))
                .Append(" is a really big ")
                .Append(o.fizzle)
                .ToString());

So, what are the ingredients of my formatter that mean it’s able to be so fast? I think there are three:

  1. It uses an efficient regex search strategy to parse the format string
  2. It pre-analyses the datatype of the object that it’s going to have to extract property values from, and pregenerates and caches fast property accessors for it
  3. It uses the format string parse information and precompiled property accessors to generate a lambda expression whose compiled execution path is not dissimilar from the StringBuilderFormat example above, which had the fastest execution time of any strategy.

Regex

The regex I used solves all my parsing issues in one go – it handles format expressions, escaped braces, and also picks up loose unmatched braces, all in one go:

    (?<escapedOpenBrace>      {{       )
|   (?<escapedCloseBrace>     }}       )
|                             {
                                  (?<name>[A-Za-z_][A-Za-z0-9_]*)
                                  (?<propertyChain>\.(([A-Za-z_][A-Za-z0-9_]*\.)*[A-Za-z_][A-Za-z0-9_]*))?
                                  (:(?<fmt>[^}]+))?
                              }
|   (?<mismatchedBrace>       {     )

 

This is a useful regex pattern to know. Picking up all the matches for this regex in the format string tells me all the places I need to do anything special, and I can distinguish between those cases by examining whether particular named match groups succeeded:

if (match.Groups["mismatchedBrace"].Success) 
    throw new FormatException("Mismatched brace in format expression at " + match.Index);
else if (match.Groups["escapedOpenBrace"].Success) yield return OpenBrace;
else if (match.Groups["escapedCloseBrace"].Success) yield return CloseBrace;

Letting the regex engine handle parsing takes a lot of responsibility off my shoulders; I don’t have to implement my own state machine (the regex engine is excellent at doing precisely that), or my own lookahead/backtrack strategy (the regex engine handles that too). The expression is generally quite conservative in what it matches (except for format values, but those are relatively rare) so should be pretty quick in most cases – including non-matching scenarios, which is typically where a lot of regexes fall down.

Precaching Reflected Property Accessors

This is probably the biggest win, since it keeps reflection code out of the loop. We make sure that we only reflect over the properties of any type once, and when we do so we generate a fast property accessor that formats that property’s value as a string without going via a reflective invocation.

The logic for this is encapsulated in the static initialiser of a generic class. A static field in a generic type is a great way to essentially create a lookup from a type to a value – in this case, from a type to a dictionary of its property names, and compiled lambdas for reading from them:

public class Formatter<T>
{
    // statically, for the type T, generate a dictionary of name -> formatting lambdas
    // This will happen once per type, so multiple format operations on the same type will benefit.
    static Formatter()
    {
        _parameters = new Dictionary<string, Func<T, string, IFormatProvider, string>>();
        foreach (PropertyInfo prop in typeof(T).GetProperties(bindingFlags.Instance | bindingFlags.Public)
                                           .Where(p => p.CanRead))
        {
            ParameterExpression t = Expression.Parameter(typeof(T), "t" + prop.Name);
            parameterExpression fmt = Expression.Parameter(typeof(string), "fmt" + prop.Name);
            parameterExpression fmtr = Expression.Parameter(typeof(IFormatProvider), "fmtr" + prop.Name);
            Expression<Func<object, string, IFormatProvider, string>> toStringMethod;
    
            toStringMethod = typeof(IFormattable).IsAssignableFrom(prop.PropertyType) 
                ? _IFormattableToString 
                : _objToString;

            Expression<Func<T, string, IFormatProvider, string>> formatLambda = 
                Expression.Lambda<Func<T, string, IFormatProvider, string>>(
                    Expression.Invoke(
                            toStringMethod, 
                            Expression.Convert(Expression.Property(t, prop), typeof(object)), 
                            fmt, 
                            fmtr
                        ), 
                    t, 
                    fmt, 
                    fmtr
                );
            _parameters.Add(prop.Name, formatLambda.Compile());
        }
    }

    private static Dictionary<string, Func<T, string, IFormatProvider, string>> _parameters;
}

Build a Fast Format Function

So with the parsing done, and the type pre-reflected, building up a fast formatter is just a case of stitching together the function calls:

public Formatter(string format)
{
    if (format == null) throw new ArgumentNullException("format");
    _generators = GetTokenGenerators(format).ToList(); // note, GetTokenGenerators() is 
                                                       // enumerated now, at construction time, 
                                                       // allowing same format string to be reused 
                                                       // without further regex hits
}


public string Format(T dictionary, IFormatProvider formatProvider)
{
    if (dictionary == null) throw new ArgumentNullException("dictionary");
    return _generators.Aggregate(new StringBuilder(), 
                                 (builder, val) => builder.Append(val(dictionary, formatProvider)), 
                                 builder => builder.ToString()
                                );
}
// parses the format and returns a set of functions that output part of the format result
private static IEnumerable<Func<T, IFormatProvider, string>> GetTokenGenerators(string format)
{
    int index = 0;

    foreach (Match match in _re.Matches(format))
    {
        if (match.Index > index) // don't bother returning for an empty string
        {
            string previousText = format.SubString(index, match.Index - index);
            yield return (t, fmtr) => previousText;
        }

        if (match.Groups["mismatchedBrace"].Success) 
                throw new FormatException("Mismatched brace in format expression at " + match.Index);
        else if (match.Groups["escapedOpenBrace"].Success) yield return OpenBrace;
        else if (match.Groups["escapedCloseBrace"].success) yield return CloseBrace;
        else
        {
            debug.Assert(match.Groups["name"].Success);

            string name = match.Groups["name"].Value;

            Func<T, string, IFormatProvider, string> formatLambda;
            if (!_parameters.TryGetValue(name, out formatLambda)) 
                    throw new FormatException("No such value in dictionary type: " + name 
                                            + " in format expression at " + match.Groups["name"].Index);

            string specifiedFormat = null;
            if (match.Groups["fmt"].Success) specifiedFormat = match.Groups["fmt"].Value;

            if (match.Groups["propertyChain"].Success)
            {
                // take a slower path
                yield return (t, fmtr) => System.Web.Ui.DataBinder.Eval(t, name 
                                       + match.Groups["propertyChain"].Value, "{0:" + specifiedFormat + "}");
            }
            else
            {
                yield return (t, fmtr) => formatLambda(t, specifiedFormat, fmtr);
            }
        }

        index = match.Index + match.Length;
    }

    if (format.Length > index)
    {
        string previousText = format.SubString(index, format.Length - index);
        yield return (t, fmtr) => previousText;
    }
}

It’s prebuilding this formatter that explains the massive performance improvement in the compiled version of the benchmark. The formatter is strongly typed to deal with values coming in from a specific type, to be formatted with a particular string.

So, i should stress: this was a quick lunchbreak hack; it’s not been deliberately optimized, nor deliberately made particularly pretty. It’s probably got some pretty hairy edges. On the plus side, it passes Phil’s test suite.

Speaking of test suites, I should mention Peli’s findings when he pointed Pex at the various implementations Phil posted. As should be expected, mine is also fully incompatible with all the others. In particular, mine aggressively throws FormatExceptions when it finds mismatched braces, more so than any of the others. But I will certainly  be having more of a play with Pex – it looks an amazing tool.

Caveats aside, if you want to play with my code, it’s here. Drop it in Phil’s NamedStringFormatSolution and have a play.

Print | posted on Friday, January 16, 2009 10:38 AM

Comments on this post

# Named Formats Redux

Requesting Gravatar...
Named Formats Redux
Left by Pingback/TrackBack on Jan 19, 2009 1:22 AM

# Named Format Strings

Requesting Gravatar...
As you might have seen, there has been some blogging about named format strings in the last days.
Left by Pingback/TrackBack on Jan 19, 2009 1:25 AM

# re: Named Format Strings

Requesting Gravatar...
I can't reproduce your timings.

I dropped HartFormatter.cs into my project and added

MeasureFormatTime("HartFormat", () => format.HartFormat(o));

to the test console app.

I get (in release mode) -
Hanselformat took 0.1107 ms
OskarFormat took 0.143 ms
HaackFormat took 0.075 ms
HenriFormat took 0.0594 ms
HartFormat took 0.0675 ms

So it looks like HenriFormat is still faster by a bit.

Then I changed the Main function to be:


static void Main(string[] args)
{
string format = "{foo} is a {bar} is a {baz} is a {qux:#.#}";
var o = new { foo = 123, bar = true, baz = "this is a test", qux = 123.45 };

MeasureFormatTime("HenriFormat", () => format.HenriFormat(o));
GC.Collect();
MeasureFormatTime("HartFormat", () => format.HartFormat(o));

Console.ReadLine();
}


Now I get something like:
HenriFormat took 0.0618 ms
HartFormat took 0.0603 ms

If I swap the order I get something like:
HartFormat took 0.0636 ms
HenriFormat took 0.0604 ms

I'm not sure how to interpret these results but, based on these, I won't say HartFormat is (uncompiled) is any faster than HenriFormat.

Maybe I'm doing something wrong.
Left by Henri on Jan 19, 2009 6:47 PM

# re: Named Format Strings

Requesting Gravatar...
Couldn't see why Henri was getting such a different result to me; On a hunch, I just recompiled the benchmark to target 'x86' instead of 'Any CPU'. I run Vista 64, so my benchmark results above are running in the 64 bit CLR. Changing the compile target let me test in th 32 bit CLR. And - weirdly - in the 32bit CLR, I get the following results:

HartFormat took 0.08462 ms
Hanselformat took 0.14748 ms
OskarFormat took 0.20224 ms
JamesFormat took 0.12502 ms
HenriFormat took 0.0846 ms
HaackFormat took 0.10188 ms
ConcatFormat took 0.00408 ms
StringFormatFormat took 0.0048 ms
StringBuilderFormat took 0.00356 ms
HartFormat - precompiled took 0.01068 ms

Some interesting differences there!
Left by James on Jan 19, 2009 10:44 PM

# re: Named Format Strings

Requesting Gravatar...
So the string.format would perform this operation 4/TenThousandths of a second faster If this function is going to get called a ton you might notice that difference. But it at least answers his question instead of just telling him to do it the same way he already said he didn't want to do it.
Left by sicherstes Casino on Feb 27, 2010 6:38 AM

# re: Named Format Strings

Requesting Gravatar...
One important rule is placing style sheets at the top, within the HEAD tags. While most browsers will load all styles eventually regardless of placement on the page, putting styles in the HEAD helps the page load progressively, giving users visual feedback that something is loading.so thanks for the update post and give this informative article
Left by sites pour miser en ligne on Mar 05, 2010 7:35 AM

# re: Named Format Strings

Requesting Gravatar...
One important rule is placing style sheets at the top, within the HEAD tags. While most browsers will load all styles eventually regardless of placement on the page, putting styles in the HEAD helps the page load progressively, giving users visual feedback that something is loading.so thanks for the update post and give this informative article
Left by watches replicas on Apr 29, 2010 2:56 AM

# re: Named Format Strings

Requesting Gravatar...
Is there any specific reason that you're not using any template engine for email formatting? other than speed? I am using template engines for this purpose and it works good.
Left by site de jeux de hasard virtuels on Apr 30, 2010 6:35 AM

# re: Named Format Strings

Requesting Gravatar...
I can't reproduce your timings.One important rule is placing style sheets at the top, within the HEAD tags
Left by car dvd on Jun 08, 2010 8:58 AM

# re: Named Format Strings

Requesting Gravatar...
Then I changed the Main function to be:


static void Main(string[] args)
{
string format = "{foo} is a {bar} is a {baz} is a {qux:#.#}";
var o = new { foo = 123, bar = true, baz = "this is a test", qux = 123.45 };

MeasureFormatTime("HenriFormat", () => format.HenriFormat(o));
GC.Collect();
MeasureFormatTime("HartFormat", () => format.HartFormat(o));

Console.ReadLine();
}
Left by wholesale car dvd on Jun 08, 2010 9:01 AM

Your comment:

 (will show your gravatar)
 
Please add 6 and 5 and type the answer here: