Last month, just before I left on vacation, I posted three sets of data to help implementors of the HTML5 parser specification optimise their code. There are several implementations coming along, for example those that are part of the html5lib project and the one behind Validator.nu.
The three sets of data that I posted are all derived from parsing several billion documents from Google's Web search index using a parser I wrote in Sawzall.
The first set of data gives the relative aggregate distribution of invocations of the "in head", "in body", and "in table" insertion modes, for each of the insertion modes. This allows implementors to determine, for instance, that invoking the "in body" code while in a cell must be very efficient, while invoking the "in body" code from the "after frameset" code need not be as efficient, in case the implementor has a strategy that optimises one at the cost of another. See: documentation, data.
The second set of data gives the relative aggregate distribution of tokens for each phase/insertion mode pair. This can help implementors who are using a cascade of if statements decide on the right order for their statements. For instance, the most common token type seen in the "in body" insertion mode is character data, and the second most common token is the start tag token for an a element, but the isindex start tag was almost never seen. This tells implementors that they should check for characters and a start tags long before checking for isindex tags. See: documentation, data.
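To make the ordering idea concrete, here is a minimal sketch in Python of an "in body" handler whose checks follow the frequencies mentioned above. The Token class, the handle_in_body function, and the returned action names are all hypothetical, invented for illustration; a real parser would have many more branches and would act on the tree rather than return a string.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical token shape, not taken from any real parser."""
    kind: str       # e.g. "characters", "start_tag", "end_tag"
    name: str = ""  # tag name for tag tokens; empty for character tokens


def handle_in_body(token: Token) -> str:
    """Return a label for the action taken, checking common cases first."""
    # Most common token in "in body": character data.
    if token.kind == "characters":
        return "insert-characters"
    # Second most common: the start tag of an a element.
    if token.kind == "start_tag" and token.name == "a":
        return "start-a"
    # ... the remaining checks would go here, ordered by measured frequency ...
    # Almost never seen, so it is checked last.
    if token.kind == "start_tag" and token.name == "isindex":
        return "start-isindex"
    return "other"
```

With this ordering, the overwhelmingly common tokens leave the function after one or two comparisons, and only rare tokens pay for the whole cascade.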
The last set of data examines the number of attributes per element. It allows implementors to decide on the optimum memory allocation strategy for attributes. For example, since most elements have 9 or fewer attributes, the data structure that stores attributes can be optimised for holding up to 9 attributes, using little memory, and if an element has more than this number of attributes, the implementation can switch to a separate data structure that is more memory-heavy but is optimised for large numbers of attributes. See: data.
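The two-tier strategy described above might look something like the following Python sketch. The AttributeStore class and its method names are hypothetical, and the threshold of 9 is simply the figure quoted above. In Python the memory savings would not really materialise, so this only illustrates the switching logic; an implementation in a lower-level language would use a fixed-size inline array for the small case.

```python
SMALL_LIMIT = 9  # threshold taken from the figure quoted in the text


class AttributeStore:
    """Hypothetical attribute container: compact list first, dict when large."""

    def __init__(self):
        self._small = []    # list of (name, value) pairs, searched linearly
        self._large = None  # becomes a dict once SMALL_LIMIT is exceeded

    def set(self, name, value):
        if self._large is not None:
            self._large[name] = value
            return
        for i, (existing, _) in enumerate(self._small):
            if existing == name:
                self._small[i] = (name, value)
                return
        if len(self._small) < SMALL_LIMIT:
            self._small.append((name, value))
        else:
            # Too many attributes: migrate everything to the dict.
            self._large = dict(self._small)
            self._large[name] = value
            self._small = []

    def get(self, name, default=None):
        if self._large is not None:
            return self._large.get(name, default)
        for existing, value in self._small:
            if existing == name:
                return value
        return default

    def uses_large_storage(self):
        return self._large is not None

    def __len__(self):
        if self._large is not None:
            return len(self._large)
        return len(self._small)
```

A linear scan over at most 9 pairs is cheap, so the common case avoids the overhead of a hash table, and only the rare element with many attributes pays for the larger structure.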
I hope this data is useful!