With the huge sites we build over time with and for our clients, one of the most painful parts is keeping the broken links out.
There are online checkers that can handle small sites (up to about 50 or 100 pages) but when you want to scan a site with 300 or 500 pages or more, you need a desktop application.
We use and recommend the Site Audit tool in WebCEO for advanced website checking, but a lot of the time, WebCEO is overkill. We don't care if our images have alt attributes. We don't care if our pages are considered slow right now. We just want to catch and fix the broken links.
In cases like this, we use Xenu Link Sleuth. Xenu Link Sleuth is a labour of love, created by an anti-Scientology programmer (every report contains a banner ad against Scientology). It's fast and reliable. Really fast.
For Xenu to do the maximum amount of good without giving you too much useless information, you need to get the settings right. Here are the ones we use:
Xenu Link Sleuth settings
Why these settings?
From the top:
- Parallel threads should be reduced to 10 or less. Five is even better. With thirty threads, there is a good chance you will overwhelm your shared server.
- Apply to all jobs should be checked: you don't want to have to change these settings for every project.
- Ask for password or certificate when needed will allow you to spider hidden parts of the site. Check whether you are logged in with Internet Explorer first, or Xenu might go through your CMS. A properly written CMS shouldn't delete content without a confirmation dialog, but this is an option to be careful about.
- Redirections as errors should be off. While I do consider redirections errors for the most part, they are less urgent to address than broken links, especially internal ones.
- FTP and gopher URLs should be checked. If you have these links, it would be good to know if they are working or not. I haven't had any large FTP links on any of our sites, so I don't know if Xenu downloads the whole file or just touches it to make sure there is something on the other end. Checking the documentation, Xenu apparently only gives a list of FTP files, which is useful enough for a hand check.
- Valid text URLs will give you a full list of all the URLs in your site. You don't need this.
- Site Map will create a sort of sitemap based on site structure. It has generally not been satisfactory for modern, sophisticated dynamic sites, and is more confusing than anything else. Leave it turned off.
- Statistics will give you a very good summary of your scan.
- Orphan Files should always be left turned off. It can't handle ID-type anchors, which means it reports a lot of correctly working anchors as broken. The orphan files option has never given me any worthwhile information.
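To make the thread and redirection settings above more concrete, here is a minimal Python sketch of a link check that keeps only a few requests in flight at once. It is not how Xenu works internally; the names `MAX_THREADS`, `head_status`, `check_links` and `broken` are made up for this illustration. It assumes a plain HEAD request is enough to tell whether a link is alive. Because `urlopen` follows redirects on its own, a redirected link is reported with the status of its final target rather than as an error, which roughly mirrors leaving Redirections as errors turned off.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Hypothetical setting, mirroring the "five is even better" advice above.
MAX_THREADS = 5


def head_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response was received."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except (URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout or malformed URL.
        return None


def check_links(urls, fetch=head_status, max_threads=MAX_THREADS):
    """Check urls with at most max_threads requests running at the same time."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        statuses = list(pool.map(fetch, urls))
    return dict(zip(urls, statuses))


def broken(results):
    """Keep only the links that failed outright or returned a 4xx/5xx status."""
    return {
        url: status
        for url, status in results.items()
        if status is None or status >= 400
    }
```

Calling `broken(check_links(list_of_urls))` would return only the links that need attention. The `fetch` parameter is there so the function can be tried with a fake fetcher before pointing it at a live server.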
Here is the short version of the statistics:
Xenu Link Sleuth Statistics
Very nice. Very simple. We aren't doing too badly here, at over 99% ok.
One reason this looks so good is thanks to Xenu Link Sleuth itself.
To get the best use out of Xenu Link Sleuth, you'll want to set it to check external links, but make sure to add a list of URLs not to check, in the same format as here (with http://):
Xenu Starting Point dialog
The example above is only applicable to our sites. You'll have to include your own tracking services yourself. If you don't get this right, you'll get errors on every page and your reports will be next to useless. Make sure to include the http:// and then the full base URL of each service. Including shorthand like "google" or "statcounter" won't work. Trust us. We've tried it.
The simple solution to false errors on external links is to turn off Check external links. This way the off-site trackers are not checked. But external links aren't checked either. It's worth the extra trouble to get it right. It might take you two or three tries, but once you've figured it out, you will be able to run Xenu trouble-free in the future (although sometimes I've had trouble getting the Do not check any URLs preference to stick).
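Why the full base URL matters is easier to see in code. The sketch below assumes the exclusion list is matched against the beginning of each URL, which is consistent with the behaviour described above but is not taken from Xenu's source. The function names and the two tracker URLs are placeholders for illustration only; substitute the services your own sites use.

```python
# Placeholder exclusion list: full base URLs, including the http:// scheme.
EXCLUDED_PREFIXES = [
    "http://www.google-analytics.com/",
    "http://www.statcounter.com/",
]


def is_excluded(url, excluded_prefixes=EXCLUDED_PREFIXES):
    """Return True if url begins with any of the excluded prefixes."""
    return any(url.startswith(prefix) for prefix in excluded_prefixes)


def links_to_check(urls, excluded_prefixes=EXCLUDED_PREFIXES):
    """Drop every URL that matches an exclusion prefix, keeping the original order."""
    return [url for url in urls if not is_excluded(url, excluded_prefixes)]
```

With prefix matching, shorthand such as "statcounter" never matches, because no URL starts with that string. Every tracker link would still be checked and would still show up as an error on every page.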
Other worthwhile link-checking alternatives to Xenu include:
- the W3C link checker. Online. Simple, straightforward, free. Times out after 100 pages.
- the SEOMoz Crawl test. Online. Unpaid version 5 pages. Paid version 50 pages. Very detailed reports. Nice formatting.
- WebCEO. Desktop application. Most comprehensive reports. Unlimited crawling. Paid, multipurpose tool. Can be depressing as all get out - it finds every flaw in your website.