Google HTML Analysis

Jan 25

Google Code: Web Authoring Statistics: Google parsed a billion Web pages and pulled some stats out of the HTML.

We can now add to this data. In December 2005 we did an analysis of a sample of slightly over a billion documents, extracting information about popular class names, elements, attributes, and related metadata. The results we found are available below. We hope this is of use!

Some random notes:

  • The most common META tags specified:

    1. keywords
    2. description
    3. robots
    4. generator (thanks, FrontPage)
    5. author
  • The BODY tag is a huge repository of non-CSS badness (bgcolor, margin, link, etc.)

    Very few people put an “id” on the BODY tag. I do this for pages that directly relate to an identifiable object in the system, so that I can make per-object CSS changes, if necessary (having id="object_232" on your BODY tag is handy like you wouldn’t believe).

  • Very few people use COLGROUP. People should use it more.

  • Most popular class names for elements:

    1. footer
    2. menu
    3. title
    4. small
    5. text

    Google notes that these class names map “very well to the elements being proposed in HTML5.”

  • They single out GoLive for crappy HTML:

    GoLive’s footprints are all over the Web. A scary number of pages use <table gridx="" gridy="" showgridx="" showgridy="">, not to mention the multitude of <csscriptdict>, <csactiondict>, and <csobj> elements.

    We have made this same distinction: “Adobe GoLive: Evil Incarnate.” Those people should be shot.

  • There were enough misspellings of the “language” attribute for the SCRIPT tag that four of them registered appreciably in the analysis.
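
If you want to run the same kind of tally on your own pages, here is a minimal sketch using only Python's standard library. It is not how Google did their analysis (they don't say), and the TagStats class and SAMPLE_PAGE below are made up for illustration. It counts META names and class names, and records any “id” found on the BODY tag.

```python
from collections import Counter
from html.parser import HTMLParser


class TagStats(HTMLParser):
    """Hypothetical helper: tally META names, class names, and BODY ids."""

    def __init__(self):
        super().__init__()
        self.meta_names = Counter()
        self.class_names = Counter()
        self.body_ids = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            # Normalize case so "Description" and "description" count together.
            self.meta_names[attrs["name"].lower()] += 1
        if tag == "body" and attrs.get("id"):
            self.body_ids.append(attrs["id"])
        # A class attribute may hold several space-separated names.
        for name in (attrs.get("class") or "").split():
            self.class_names[name] += 1


# Made-up sample page, not taken from the Google data.
SAMPLE_PAGE = """
<html>
  <head>
    <meta name="keywords" content="html, statistics">
    <meta name="Description" content="A made-up sample page">
  </head>
  <body id="object_232" bgcolor="#ffffff">
    <div class="menu small">Home</div>
    <div class="footer small">Contact</div>
  </body>
</html>
"""

stats = TagStats()
stats.feed(SAMPLE_PAGE)
stats.close()

print(stats.meta_names.most_common())
print(stats.class_names.most_common(1))
print(stats.body_ids)
```

To cover a whole site, you would feed each page into the same TagStats instance (or merge the Counters) before calling most_common().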

Really interesting stuff. There are hundreds and hundreds of observations in here worth reading if you have to deal with HTML on a daily basis.


