Web Site Management

Making Wordpress Zippy

Over the last couple of weeks, we’ve had fun working with the gang at Federated Media to put together their Holiday Gadget Guide. Deane’s one of the contributing authors, so I’m sure he’ll post a little more about it later on.

It was a fun little site to put together and our team had a good time, but reality hit hard last night when BoingBoing posted it, traffic poured in, and everything slowed to a crawl. It became obvious that The Long Tail you hear about on blog posts can be connected to a very large dog, and Wordpress wasn’t keeping up with demand.

Enter a truly excellent Wordpress plugin, WP-Cache2. WP-Cache2 installs into Wordpress, walks you through all of the setup via the admin interface, and provides a friendly, easy-to-use caching system. With that, plus a few server tweaks, we were able to get things humming along again in no time. Definitely one for the bookmarks file if you run a Wordpress site and ever worry about a SlashDotting.

The Pointlessness of Page Views

Pageviews are Obsolete: It’s about time that this point is evangelized. Page views are a fairly pointless measurement these days with the advent of Ajax, RSS, and widely varying site designs which can have dramatic effects on how “hungry” a site is for page view stats.

But Ajax is only part of the reason pageviews are obsolete. Another one is RSS. About half the readers of this blog do so via RSS. I can know how many subscribers I have to my feed, thanks to Feedburner. And I can know how many times my feed is downloaded, if I wanted to dig into my server logs.

But what do you replace this statistic with? Via Boing Boing.

Silktide Sitescore v 1.7.2

Silktide’s Sitescore is kind of a neat tool. Plug in your website, and it gives you a 1-10 score on…

  • How well marketed and popular the website is.
  • How well designed and built the website is.
  • How accessible the website is, particularly to those with disabilities.
  • How satisfying the website is likely to be.

Gadgetopia did pretty well with 8.2 points. Marketing was 9.6; design, 9.8; and experience, 9.7. The one aspect that hurt us was accessibility, which topped out at 5.6 points.

This website appears to be in violation of the British Disability Discrimination Act. All pages were found in violation of the current W3C Web Content Accessibility Guidelines.

This website is probably unlawful in Britain from the 1st October 2004. The British Disability Discrimination Act makes it unlawful to discriminate against a disabled person by refusing to provide any service provided to members of the public - including websites.

Careful, or we might find ourselves locked up in the Tower of London. Since Silktide is a web development business, I’m sure they’d be willing to help us fix this little deficiency, for a tidy sum.

Google Analytics

Google Analytics: All your traffic are belong to us.

Google Analytics tells you everything you want to know about how your visitors found you and how they interact with your site. You’ll be able to focus your marketing resources on campaigns and initiatives that deliver ROI, and improve your site to convert more visitors.

Via Joseph Scott.

Spiders are Stupid

I’ve been monitoring the 404s on this site. I changed our URL pattern a while back, so I have a page that catches all the 404s, resolves the old pattern against the new one, then redirects. Anything that doesn’t resolve gets logged, and I have an RSS feed where I can watch them all.

Which brings me to my point: Web spiders are pretty stupid. Ninety-nine percent of 404s to this site are from spiders. They’re looking for URLs that:

  • …they couldn’t possibly have derived from any other page on the site.
    Oftentimes they screw up relative vs. absolute URLs. I usually go check, just in case I forgot to put “http://” in front of something, but I usually find everything is in order and it must just be the spider that’s confused.
  • …existed a long, long time ago.
    I still get spiders coming in for pages with URLs that haven’t been around for three years. They must have them stored somewhere because every once in a while I’ll get about 300 consecutive requests from the same spider for the same old pattern, like it was reading them from a file somewhere.
  • …are obviously munged.
    Spiders truncate a lot, or insert random spaces in URLs. I finally modified my lookup script to remove spaces from the target URL first and, if it can’t find what they want, to try matching what they asked for against the front of a known URL, so I can catch truncations.
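
The two-step cleanup described in that last item can be sketched roughly as follows. This is Python rather than whatever the real lookup script is written in, and the function name and sample paths are made up for illustration. It also assumes a prefix match should only count when exactly one known URL fits.

```python
from typing import Optional, Sequence
from urllib.parse import unquote

# Hypothetical sample of valid paths; a real list would come from the database.
KNOWN_PATHS = [
    "/2005/04/15/EasyJavaScriptAutocompleteIntellisenseScript.html",
    "/2005/07/09/IsPerlStillRelevant.html",
]


def resolve_munged(requested: str, known_paths: Sequence[str]) -> Optional[str]:
    """Return the known path a munged request most likely meant, or None."""
    # Step 1: decode %20 and drop any whitespace inserted into the URL.
    cleaned = "".join(unquote(requested).split())
    if not cleaned:
        return None
    if cleaned in known_paths:
        return cleaned
    # Step 2: treat the request as a truncation and match it against the
    # front of each known path. Accept only an unambiguous match.
    candidates = [p for p in known_paths if p.startswith(cleaned)]
    return candidates[0] if len(candidates) == 1 else None
```

Refusing ambiguous prefixes is a design choice: a request for “/2005/” matches everything, so it is safer to log it than to guess.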

I’ve also noticed a lot of one-off spiders that I’ve never seen before. They come out of colleges a lot, it seems.

And, of course, there are hack attempts galore. Trying to hack the XMLRPC vulnerability that was revealed a few months ago is pretty common, and I get scads of long, long requests for things in “_vti” directories.

That said, monitoring your 404s is a really handy thing to do as it alerts you to a lot of problems. We have over 4,500 entries now, and by watching bad requests, I find out all the time about bad links, missing images, etc. It’s really a good, simple way to give you an extra leg up on fighting content rot.

But don’t think the spiders are the smart ones. You’d think since they were programmed by (supposed) professionals, and have everything in a database somewhere, that they’d be pretty on top of things. My experience, however, indicates that a bunch of two-year-olds mashing on the keyboard would probably come up with more valid URLs than your average Web spider.

Gadgetopia Screen Resolution Stats

I’m using a new stat tracking app called Mint. It has a plug-in called “User Agent 007” (ho, ho — what wit) that captures browser stats. Interesting stats:

  • almost 90% of Gadgetopia visitors are at 1024 x 768 or greater
  • almost 20% of visitors are at higher resolution than 1024 x 768

Here’s the entire list of resolutions and their penetration. At what point do you abandon the 800 x 600 crowd and start using the extra space for those that have it, I wonder?

But even with more space, you always run into the problem that “screen resolution” and “browser viewing area” are two very different things. With the sidebars that people run these days, you don’t get anywhere near the full width to work with. I have my resolution at 1024 right now, but I think I use 200 pixels in the bookmark sidebar in Firefox.

Email All Your Users Day Re-visited

Email All Your Users Day: I posted this two years ago today. I still think this is a great idea.

[…] I hereby proclaim December 1 as “Email All Your Users Day.” On that day, everyone who runs a service that has user accounts should email ALL their users to remind them they have an account, what the account is for, and where the login screen is. The member can then decide what he or she wants to do with it.

Do Anonymous Domain Registration Outfits Actually Work?

About anonymity …: Think you’re safe if you register your domain name “anonymously”? Apparently not:

Despite paying Domains by Proxy an additional fee to register foetry.com anonymously, they responded to a letter from a personal injury lawyer, and canceled my registration without notifying me of a complaint. Let that sink in: a personal injury lawyer’s letter is all it took for DBP to cancel my anonymity. Furthermore, the attorney’s ignorance of Internet Law didn’t even faze Domains by Proxy. (I have a copy of the attorney’s letter and I know more about Internet law than he).

Boise State University Professor Janet Holmes simply hired the lawyer to write a letter. That’s all. There was no subpoena. No chance of a case against me. Domains by Proxy never emailed me and never telephoned. They simply canceled the anonymity and my confidential information suddenly became available. My initials, my address, and my phone number became freely available to anyone with an internet connection.

Via Metafilter.

Robots.txt Survey

Robots.txt, The Big Crawl: These guys grabbed 75,000 robots.txt files, and found a few problems:

[…] we found a wide array of problems with people’s robots.txt files. We found more than 5% of the robots.txt used bad style and up to 2% were so badly formed that they would not be recognized by any spider.

One of the most common mistakes is backwards syntax […] A large number of people had multiple directories per line […] Another common mistake, is editing your robots.txt in DOS mode

Not only do they tell you the problems they found, but they explain how various spiders would interpret the problems. Some of the “problems” are correct per the spec, but spiders don’t always follow the spec…
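
Two of the mistakes in that excerpt are easy to check for mechanically. What follows is only a rough sketch, not the survey’s own tooling. The function name is made up, and it assumes that “DOS mode” means CRLF line endings and that “multiple directories per line” means several paths on one Disallow line. The “backwards syntax” case is left out because the excerpt doesn’t spell it out.

```python
from typing import List


def lint_robots_txt(raw: bytes) -> List[str]:
    """Report two of the mistakes named in the survey excerpt."""
    problems = []
    # Assumption: "DOS mode" editing shows up as carriage returns in the file.
    if b"\r" in raw:
        problems.append("file contains carriage returns (DOS line endings)")
    text = raw.decode("utf-8", errors="replace")
    for number, line in enumerate(text.splitlines(), start=1):
        # Strip trailing comments and surrounding whitespace.
        rule = line.split("#", 1)[0].strip()
        field, sep, value = rule.partition(":")
        if not sep:
            continue
        # A Disallow line should carry at most one path.
        if field.strip().lower() == "disallow" and len(value.split()) > 1:
            problems.append(f"line {number}: more than one path on a Disallow line")
    return problems
```

Feed it the raw bytes of a robots.txt file. An empty list means neither of these two problems turned up, not that the file is valid.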

Unresolved 404 Patterns

I changed the URL scheme of this Web site over the weekend. I had been meaning to do it for a while, but some problems with Movable Type 3.2 kind of forced the issue. (I have got to stop rushing into every beta that presents itself…)

To make everything backwards compatible, I built a simple redirect system — I have a table in the database with every single permalink from the old site (all 9,000 of them — including entry RSS feeds and category pages) mapped to every single new URL.

If someone looks for a page which has moved, the 404 page does a lookup on this table, “resolves” the old URL against a new one, then redirects with a “301 Moved Permanently.” It seems to work well.
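
The resolve-then-redirect logic boils down to something like the sketch below. It is written in Python purely for illustration, with a dictionary standing in for the database table. The paths, the function name, and the in-memory log are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the database table of old-to-new permalinks.
REDIRECTS: Dict[str, str] = {
    "/old/about.html": "/new/about",
}

# Requests that matched nothing; the real system would write these to a log.
unresolved_log: List[str] = []


def handle_404(path: str) -> Tuple[int, Optional[str]]:
    """Return (status, location) for a request that matched no real page."""
    new_path = REDIRECTS.get(path)
    if new_path is not None:
        # Old URL found in the table: send a "301 Moved Permanently".
        return 301, new_path
    # Not in the table: a genuine 404, recorded for later review.
    unresolved_log.append(path)
    return 404, None
```

Anything that comes back as a 404 here is exactly what ends up in the “unresolved” list discussed next.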

A side benefit of this system is that I can watch for “unresolved 404s,” meaning 404s that were not in my lookup table — a genuine 404, if you will. I’ve noticed some interesting phenomena:

  • I get hammered by referrer spam. We’ve talked about this before: this is spam created by a bot hitting any page with a fake referrer string in the hopes that you’re displaying your referrers on your site (a la Dean Allen’s Refer or similar tool).

    This results in fully half the unresolved 404s on this site coming from casino bots hitting URLs that are three years old. I know they’re that old because they use the very first URL scheme I had for this site — the default Movable Type archives URL: “archives/000355.html”, etc.

    They must be working off a very old list of URLs, which I find quite funny, and quite interesting. Why would they keep an old list of URLs lying around? Why not just re-spider? Do spammers sell lists of URLs like they do lists of emails?

  • Browsers and spiders sometimes mangle HREFs. I see impossible URLs that can only result from a mis-interpretation of the HREF in the link. IE 5.x on the Mac, for example, has problems with background images coded in CSS. I see that browser try to get this a lot:

    /'/bin/images/header.jpg'

    It's just mangled the URL of the image.

    Others, however, are more mysterious. Just two minutes ago, a spider tried to access a URL that it could only have hit if it missed the leading "/" in the HREF. Coming from this page...

    /2005/07/09/IsPerlStillRelevant.html

    …the spider tried to hit:

    /2005/07/09/4131

    I just checked that page and there’s no way it pulled that URL out of the code. The correct URL was…

    /4131

    But the URL it bounced off of could have only happened if it had a bug of some kind or if the HTML got mangled on the way down.

    I also get hits to things like this:

    /2005/04/15/EasyJavaScriptAutocompleteI

    No mystery here — that’s just a truncated version of this:

    /2005/04/15/EasyJavaScriptAutocompleteIntellisenseScript.html

    Truncation, it seems, happens a lot. The Ask Jeeves/Teoma spider, for instance, has been trying all day to get at URLs that are all truncated at 39 characters. Add “http://www.gadgetopia.com” in front of that, and you get 64 characters.

    Why is this, I wonder? Was that the size of the database field they stored the URL in? More importantly, does it explain why I’ve never done so well in that index? I’m wondering now if my previously-long URLs have hurt my engine placement in other indexes besides Google.

  • As implied by the preceding two points, the vast majority of 404s are from bots. I’m sure this is true for all sites, but I never realized it so much until now.

  • Hack attempts abound. There are lots of attempts to hit DLL files in the (non-existent) “MSOffice” and “_vti/” directories. These are people trying to hack Outlook Web Access and various Web-enabled Microsoft Office technologies.

  • Spiders don’t crawl and index in the same pass. I changed the URL pattern late Friday night, then changed my mind about which pattern to use when I woke up the next morning. This means the site was accessible under a certain pattern for about eight hours.

    In the following 48 hours, I saw attempt after attempt by bots to get to files under that pattern. This tells me that a crawler made a pass at the site during that eight hour window and stored the URLs it found. Then an indexer used that list to come back through the site a day later and index the text (sadly, in this case, the pattern had changed — I’ve since put in a RewriteRule to catch those).
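
A pattern like the 39-character truncation mentioned above is easy to miss by eye. One way to surface it is to tally the lengths of unresolved requests that are a clean prefix of a real URL. This is only a sketch with a made-up function name. It assumes you have the unresolved paths and the list of valid paths on hand as plain strings.

```python
from collections import Counter
from typing import Iterable, Sequence


def truncation_lengths(requests: Iterable[str], known_paths: Sequence[str]) -> Counter:
    """Count, by length, the requests that are proper prefixes of a known path."""
    lengths: Counter = Counter()
    for requested in requests:
        # A proper prefix: the start of a known path, but not the whole thing.
        if any(p != requested and p.startswith(requested) for p in known_paths):
            lengths[len(requested)] += 1
    return lengths
```

If one length dominates the tally, that points at a fixed-width limit somewhere in the spider rather than random mangling.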

AdSense and Borders

If you have Google AdSense on your site, here is the best piece of advice I can give you: don’t put borders around your ads. I had a border around my skyscraper banner on the right here, so it sat in its own little box.

A friend told me to take the border off. I figured it couldn’t hurt to try it, so I made the border white, so it just fades into the background. Nothing else was changed. I did it in the middle of the month, so the first half was with the border, the second half without.

The result in terms of clickthrough rate? A seventy percent increase.

Email All Your Users Day Re-Visited

Email All Your Users Day: This is one of those posts that I think never got the attention it deserved. I still maintain that this is a good idea, and now that Gadgetopia has some more traffic, I re-submit it to the blogosphere. Who’s with me?

I hereby proclaim December 1 as “Email All Your Users Day.” On that day, everyone who runs a service that has user accounts should email ALL their users to remind them they have an account, what the account is for, and where the login screen is. The member can then decide what he or she wants to do with it.

Snazzy 404s

404’s 4 U: this post over at Metafilter has some great links to various 404s. I liked the Zork one.

404 Research Lab. Not that I’m sorry for the double post, but I was inspired by this 404 and went searching for some more. Some of them are funny, some let you play games, some are just creepy.

Official MT Hosting

Movable Type Hosting Partner Program: Movable Type has two “hosting partners” offering pre-installed MT hosting: $5.95 and $9.95 a month.

Protecting Content Editors From Themselves

Say you put together a nice, static site for a client. There’s a lot of CSS, a fair amount of scripting (in whatever language — we’ll assume PHP here), a handful of images, and a lot of HTML. The client is going to manage the site with a WYSIWYG editor.

What’s the biggest danger to your site? The person you hand it over to, of course. Invariably, they’ll get into files they shouldn’t, delete images they shouldn’t, or embark on CSS “upgrades” that they shouldn’t.

Shortly thereafter, you’ll get a call that begins, “The site doesn’t look right…”

How do you prevent this? Well, with a lot of hosts, you can finagle a few ways to prevent them from messing with things they shouldn’t by using additional FTP users and some Apache directives.

Many *nix-based Web hosting companies will allow you to set up additional FTP users with their own FTP directories. I’m going to use Plesk in this example, because that’s the platform we use at Gadgetopia. Other systems have similar ends, but the file paths will be different.

Consider this structure for a virtual host:

/
  httpdocs
  conf
  cgi-bin
  web_users
    editor

“/” is the root of the Apache virtual host. The master FTP account logs into this directory. There are a lot of things in here that you don’t want messed with: the virtual host configuration files in “conf,” and the Perl scripts in “cgi-bin,” to name but two.

With Plesk, when you create a new FTP user, they get a directory in “web_users.” In this instance, we’ve created “editor.” This user’s files would be accessible with a URL of “www.site.com/~editor/”. The “editor” directory, then, is their own virtual root.

Let’s say that our site has 10 HTML pages. When you’re done developing everything, put these pages in the “web_users/editor” directory instead of the virtual root and give your editor FTP credentials to that directory only.

Then, in the configuration file for the virtual root, add some lines like this:

Alias /about_us.html [...]/web_users/editor/about_us.html

(“[…]” would be replaced with the path to the Apache virtual root, be it “/home/httpd/vhosts/domain_name” as with Plesk or whatever.)

This means, when a visitor requests the “About Us” page, Apache pulls it from the “editor” directory — to which the user has all rights.

(Yes, this page can also be accessed like this:

/~editor/about_us.html

If that stresses you out, this directive…

AliasMatch ^/~editor/.*$ /doesnt_exist.html

…will send direct requests to the editor directory spinning off into 404 land. An ugly, but effective, solution.)

To manage the HTML content, the editor will FTP into the “editor” directory (they’ll be deposited there when they use their credentials) and see only the HTML files in there. The “editor” directory will be the “top” directory the editor can get to. The editor won’t see any of the PHP files you use to make the site run, nor will he or she be able to get into the cgi-bin, the configuration directory, the SSL source directory, etc.

If you just have a handful of pages, you can enter an Alias rule for each one. If you have a lot, or if the user can create more on his or her own, have a rule like this:

AliasMatch ^/([^.]+)\.html$ [...]/web_users/editor/$1.html

This will take a request for…

/[whatever].html

…and pull it from…

[...]/web_users/editor/[whatever].html

You could do it using rewrites as well:

RewriteEngine On
RewriteRule ^/([^.]+)\.html$ /~editor/$1.html

If you want to give the editor the ability to create subfolders and such, it gets a little more difficult, but not much. Just figure out what folders you have in the (actual) root of the site, and send anything else to the “editor” directory.

For instance, say you have folders called “static_images” and “php_bin,” plus a stylesheet (“style.css”), in the actual root of the site. This rule…

RewriteRule ^/([^static_images|php_bin|style.css].*)$ /~editor/$1

…will use the “editor” directory for any request not bound for those two directories or the stylesheet. (This is how I did it, but it occurs to me that there’s probably a much better way. I just stuck with the first thing that worked. If you know of a better method, please comment.)
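
One thing worth knowing about the rule above: the square brackets make a negated character class, so the pattern rejects any path whose first character is one of the listed letters, not just the three named items. The sketch below uses Python’s regex engine to show the difference. The negative-lookahead version is a hypothetical alternative, not a tested Apache rule. It assumes an Apache with a PCRE-based regex engine (2.x); on older versions, a RewriteCond with a negated pattern would be the place to look.

```python
import re

# The bracket expression from the rule, transcribed as a Python regex.
# "[^...]" is a negated character class: it rejects one character, not a word.
char_class = re.compile(r"^/([^static_images|php_bin|style.css].*)$")

# Hypothetical alternative: a negative lookahead that names the reserved paths.
lookahead = re.compile(r"^/(?!static_images/|php_bin/|style\.css$)(.*)$")


def goes_to_editor(pattern: re.Pattern, path: str) -> bool:
    """True if the pattern would hand this path to the editor directory."""
    return pattern.match(path) is not None


for path in ("/about.html", "/faq.html", "/static_images/logo.gif", "/style.css"):
    print(path, goes_to_editor(char_class, path), goes_to_editor(lookahead, path))
```

With the character class, “/about.html” is skipped because it starts with an “a” (one of the letters in “static_images”), while “/faq.html” goes through. The lookahead version sends both to the editor directory and still leaves the two folders and the stylesheet alone.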

Finally, if you have a script-happy editor, and you need to prevent them from writing any PHP, try this:

<Directory [...]/web_users/editor>
  php_admin_flag engine off
</Directory>

This will kill the PHP engine for anything coming out of their directory. Be prepared for them to be mighty irritated.

Using the Alias* and Rewrite* set of directives in Apache, there are any number of different ways you can set this up. mod_rewrite in particular is amazingly feature-rich.

Example: there is a way to configure mod_rewrite to check two roots for a file. If it doesn’t find a file in the actual Web root, it will check the alternate Web root (the editor’s directory). Thus, you, as the developer, have “first rights” to any URL (“/search.php” for instance).

Your editor can have any URLs that you haven’t used. They can create “search.php” pages in their root all day long, but any request to “/search.php” will still pull from the page in the actual Web root, which only you have access to. (Yes, they can just change where the search form points, but this is a much more deliberate, and incriminating, action.)
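
The “first rights” idea is simple enough to model outside Apache. Here is a sketch of the lookup order in Python. It is not mod_rewrite configuration, and the function name and directory arguments are made up for illustration.

```python
from pathlib import Path
from typing import Optional


def resolve_two_roots(url_path: str, web_root: Path, editor_root: Path) -> Optional[Path]:
    """Return the file to serve, preferring the developer's web root."""
    relative = url_path.lstrip("/")
    # Check the actual Web root first, then fall back to the editor's root.
    for root in (web_root, editor_root):
        candidate = (root / relative).resolve()
        # Only accept real files that live inside the root being checked.
        if root.resolve() in candidate.parents and candidate.is_file():
            return candidate
    return None
```

Because the developer’s root is always checked first, an editor-created “search.php” never shadows yours. It only gets served if you have nothing at that URL.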

To be sure, editors can still screw themselves. But you’ve sealed off some of the more obvious ways, and eliminated that many more headaches from your job.

NOAA Web Site Traffic

Public storms NOAA site: Ivan hasn’t been the only thing the National Oceanic and Atmospheric Administration has had to worry about. Keeping their Web site up hasn’t been a rose garden either.

Also collecting up-to-the-minute, high-resolution images, NOAA’s Web site has received a record number of hits during this hurricane season. In the first eight days of September, the site received 200 million hits — equivalent to one-third of the total traffic for all of 2003, when the United States was hit by one hurricane, Isabel.

ClickSpotting

ClickSpotting: This is a random Google AdWord I followed, but it looks like a pretty clever approach to web site analysis. Their twist is that they don’t just look at the fact that users go from page A to page B, but rather exactly where on the page people clicked.

ClickSpotting is a new approach in web analysis. Traditional web analysis informs you about how a visitor navigates from one page to another but not about what the visitor actually does on a page of the website.

ClickSpotting’s results help you understand your visitors on page behavior. It allows you to use the information in a very intuitive way and measure the impact of site optimization like you have not seen before.

This really looked gimmicky to me, until I watched the demo. A very intuitive and natural way to see how your site is being used. Too bad it’s WAY too expensive for all but the largest sites ($500 setup, $250 for a report for one page. Ouch).

Fighting Content Rot

If you manage a Web site for more than a few months, you run into problems of content rot. You’ll be cruising through some old pages, and you’ll find stuff that’s…off, for one reason or another.

For instance, when this blog first started, I was anal-retentive about enclosing BLOCKQUOTEd text in quotes. It was a quote, after all. I would go through all the text I quoted, find double quotes, convert them to singles, then surround the entire thing in double-quotes before BLOCKQUOTEing the entire thing.

Now, this was very admirable of me, but when I started inviting others to blog with me, that whole concept broke down. Not everyone was doing it, and since it wasn’t consistent, I didn’t want to do it at all. However, there are still a thousand or so entries sitting out there with quotes around them.

Just recently, we started to standardize the code fragments we post by using the CODE tag and the SimpleCode script. There remain, however, a hundred or so posts with code hacked up in BLOCKQUOTEs or DIVs or God knows what.

These aren’t isolated cases: there are styles that we’ve since abandoned, double-dashes that haven’t been replaced with the &mdash; entity, etc. I try to nail these things as entries hit the site, but I miss some. On top of all this, throw in link rot (links that just 404 over time) and comments. Ugh, comments…

I try to stay on top of comment spam, but I’m sure some get through. Additionally, there are stupid comments that slip by (why do people insist on testing my comment form with ‘fgfgfgfgfgf’ all the time?), and comments that aren’t relevant any longer — people complaining about bad links that I’ve fixed or mis-spellings that I’ve corrected.

Categorization is another thing. I added the Temple of Mac category at about entry #1,600. However, I didn’t bother to go back through all the old entries and move all the Mac-related entries to the new category.

Mix all this together, and you have a site that doesn’t really age well. I’m sure if I tooled through 100 old entries, I’d have something that needed to be fixed or corrected in at least 40 of them. How do you handle this? Gadgetopia is hurtling toward entry number 3,000, and that’s a lot of volume.

I’ve often thought that I should create a script that just generated 10 random entries a day for me to review. Each morning, I’d get an email with 10 entries in it that I need to look over and touch up. But how do you make sure you get them all before you start getting duplicates? I suppose you could log them all in a table and then join the entries table against it to filter out entries that had already been covered. Like this:

SELECT e.entry_id
FROM mt_entries e
LEFT JOIN already_reviewed r ON e.entry_id = r.id
WHERE r.id IS NULL
ORDER BY RAND() LIMIT 10

(I haven’t tested this SQL, mind you.) Wrap some PHP around this, schedule it for the middle of the night, and you’d have 10 entries every morning that you can tune up. Perhaps I’d send 10 to myself, and three or so to each of the rest of the authors.
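
For anyone who wants to try the idea, here is a rough, self-contained sketch of the picker. It uses Python and SQLite instead of PHP and the real Movable Type database, so the random function is spelled RANDOM() rather than RAND(). The function name is made up, and it assumes “already_reviewed” is a simple one-column table of entry IDs.

```python
import sqlite3
from typing import List


def pick_entries_for_review(conn: sqlite3.Connection, count: int = 10) -> List[int]:
    """Pick random entries not yet reviewed and record them as reviewed."""
    # The anti-join keeps only entries with no row in already_reviewed.
    rows = conn.execute(
        """
        SELECT e.entry_id
        FROM mt_entries e
        LEFT JOIN already_reviewed r ON e.entry_id = r.id
        WHERE r.id IS NULL
        ORDER BY RANDOM()
        LIMIT ?
        """,
        (count,),
    ).fetchall()
    ids = [row[0] for row in rows]
    # Log the picks so the next run skips them.
    conn.executemany("INSERT INTO already_reviewed (id) VALUES (?)", [(i,) for i in ids])
    conn.commit()
    return ids
```

Each call returns a fresh batch, and once every entry has been logged it returns an empty list. At that point you would clear the table and start another lap. The emailing part is left out.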

I think, however, I’m going to try something different. I’m on the verge of putting another sidebar on the front page called “One Year Ago Today” that lists the things we were talking about a year ago (see the OnThisDay plugin). I’ll schedule an automatic rebuild of the front page every morning at 1:00 a.m., then check the year-old entries while I’m eating my Crunchy Corn Bran in the morning.

Maybe this will work, maybe it won’t. If someone wants to take a stab at the mailer script (or if you already have), please post a link. If anyone else has any thoughts about content rot, let’s hear them.

How to Obscure URLs

How to Obscure Any URL: Great, great page on how spammers and scammers obscure URLs so most people don’t know where they’re going.

These tricks are known to the spammers and scammers, and they’re used freely in unsolicited mails. You’ll also see them in ad-related URLs and occasionally on web pages where the writer hopes to avoid recognition of a linked address for whatever reason. Now, I’m making these tricks known to you.

Also worth noting is that this is a great page dedicated to substance over style. One page, very long, full of information with no worries about overly-frilly presentation. We need more pages like this.

Via Don Park.

It just won't die

Netcraft has written a short but interesting article that shows that a surprisingly large number of sites are still running on NT4, Satan’s own operating system (“all of the problems of Windows, without all the perks”). Microsoft named it a dead horse last summer. There are a number of interesting facts inside about server software adoption.

At present Microsoft is the only member of the Fortune 100 hosting its main public site on Windows 2003 Server, while Tesco Stores is the lone FTSE 100 company on Win2003. Recent data from our SSL Survey suggests that many e-commerce operations previously on WinNT4 are now shifting to Windows 2003 Server.

We thankfully finished moving our web servers at work from NT4 to Win2K last year, and the reset buttons are no longer as shiny from all the fingers pressing them.