I have a confession to make. I occasionally fall prey to a most basic mistake: I think a tool will solve a problem before I even understand the problem. Here’s what happened this time…
Like any company, we have an intranet. Ours is a motley collection of internal websites and tools: we have a wiki, a bug tracker, documents on file shares, and a number of other services of varying importance. Finding information is hard if you don’t know where to look. So I figured the best way to solve the problem was a Google Mini.
A Google Mini is your own private Google rolled into a single rack unit. It crawls your network, indexes your content (including PDF and various Microsoft formats), and comes with the familiar Google search interface. It sounded like the perfect thing to solve our little problem.
So I ordered one. The device arrived less than 24 hours after the order, which was impressive given that it was shipped from the UK to Finland. Setting it up was easy and everything worked as advertised.
The obvious next step was to let the bot loose to crawl our network. I could hardly wait to get all the information in our intranet at my fingertips. What could go wrong?
Lesson 1: Some services choke when crawled.
It hadn’t occurred to me that our internal web servers had never been crawled before. Oops.
Some pages would take several seconds of CPU time to generate, and googlebot was pinning the CPU at 100% on those servers.
Some servers generated never-ending lists of URLs, so googlebot was constantly finding “new” pages. It quickly exceeded the number of documents the device supports.
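In hindsight, a quick pass over the web server logs would have flagged these URL factories before the crawl did. Here is a rough sketch of the idea: count distinct URLs under each path prefix and flag the prefixes that keep growing. The function name and threshold are mine, not anything from the Mini.

```python
def runaway_prefixes(paths, depth=1, threshold=100):
    """Return {prefix: distinct-URL count} for path prefixes whose
    distinct-URL count is at or above the threshold."""
    seen = {}
    for path in paths:
        # Keep the first `depth` path components as the grouping key.
        prefix = "/".join(path.split("/")[:depth + 1])
        seen.setdefault(prefix, set()).add(path)
    return {p: len(urls) for p, urls in seen.items() if len(urls) >= threshold}

# Synthetic example: a calendar view that mints a new URL for every date
# looks like an endless stream of "new" pages to a crawler.
paths = [f"/calendar/day?date=2006-{m:02d}-{d:02d}"
         for m in range(1, 13) for d in range(1, 29)]
paths += ["/wiki/FrontPage", "/wiki/Help"]
print(runaway_prefixes(paths))  # {'/calendar': 336}
```

Anything that shows up here with thousands of distinct URLs under one prefix is a candidate for the blacklist.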
Most of these problems were solved by blacklisting certain URL patterns in the Google Mini configuration. Also, instead of a “crawl everything” approach, I switched to a “crawl only selected servers” approach.
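For the curious, that configuration is just lists of URL patterns in the admin console: a “crawl only these” list plus a “do not crawl” list. Roughly (hostnames are hypothetical, and I’m paraphrasing the pattern syntax from memory):

```
Follow and Crawl Only URLs with the Following Patterns:
  wiki.internal/
  bugs.internal/

Do Not Crawl URLs with the Following Patterns:
  contains:action=edit
  contains:calendar
```

The first list implements the “crawl only selected servers” approach; the second knocks out the expensive edit pages and the endless calendar views.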
Lesson 2: Some services cannot be crawled at all.
Most of our internal web services were never designed to be crawled. They may require logging in, for example, which the Google Mini has trouble handling. None had a proper robots.txt or sitemap. I’m still working on getting some of that content indexed at all.
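Part of the fix will be giving each service at least a basic robots.txt, and ideally a sitemap. A minimal example of what I have in mind, with a hypothetical internal hostname:

```
# robots.txt for wiki.internal (hostname hypothetical)
User-agent: *
Disallow: /login
Disallow: /action/edit/

Sitemap: http://wiki.internal/sitemap.xml
```

This keeps the crawler away from login and edit pages while the sitemap points it at the content actually worth indexing.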
Lesson 3: Indexing bad content won’t improve anything.
It’s like shining a bright light on a big pile of crap. It’s still a pile of crap, but you can see it more clearly.
Our intranet is hardly built using good SEO practices, so it’s natural that googlebot has trouble figuring out which pages and keywords are important. But mostly the problem is just an abundance of obsolete and outdated content. The solution will be to get rid of most of it.
Don’t get me wrong: our Google Mini works very well, it’s already proven useful, and I’m quite happy with it. I can actually find stuff from our wiki now. But it won’t turn a collection of obsolete Word documents from ‘02 into useful content.