The Silver Bullet That Wasn’t

by Ville Laurikari on Tuesday, December 8, 2009


I have a confession to make. I occasionally fall prey to a most basic mistake: I think a tool will solve a problem before I even understand the problem.  Here’s what happened this time…

Like any company, we have an intranet. Ours is a motley collection of internal websites and tools: we have a wiki, a bug tracker, documents on file shares, and a number of other services of varying importance. Finding information is hard if you don’t know where to look. So I figured the best way to solve the problem was a Google Mini.

A Google Mini is your own private Google rolled into a single rack unit. It crawls your network, indexes your content (including PDF and various Microsoft formats), and comes with the familiar Google search interface. It sounded like the perfect thing to solve our little problem.

So I ordered one. The device arrived less than 24 hours after the order, which was impressive given that it was shipped from the UK to Finland. Setting it up was easy and everything worked as advertised.

The obvious next step was to let the bot loose to crawl our network.  I could hardly wait to get all the information in our intranet at my fingertips.  What could go wrong?

Lesson 1: Some services choke when crawled.

It hadn’t occurred to me that our internal web servers had never been crawled before. Oops.

Some pages took several seconds of CPU time to generate, and googlebot was pinning the CPU at 100% on these servers.

Some servers generated never-ending lists of URLs (think of a calendar view that always links to the next month), so googlebot kept finding “new” pages.  It quickly exceeded the number of documents the device can index.

Most of these problems were solved by blacklisting certain URL patterns in the Google Mini configuration.  I also switched from a “crawl everything” approach to a “crawl only selected servers” approach.
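
For the record, the blacklist is nothing fancy: the admin console takes a list of “do not crawl” URL patterns. Something along these lines did the trick (the patterns below are made-up examples rather than our actual configuration; the exact pattern syntax is in the Mini’s documentation):

    # Dynamic report pages that eat CPU for breakfast
    contains:/report?
    # History views that generate never-ending lists of URLs
    regexp:/viewvc/.*\?rev=
    # A server that simply can't take the load
    slow-server.example.com/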

Lesson 2: Some services cannot be crawled at all.

Most of our internal web services were not designed to be crawled at all. Some require logging in, for example, and the Google Mini may have trouble handling that. None of them had a proper robots.txt or a sitemap.  I’m still working on getting some of that stuff indexed at all.
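
For what it’s worth, “proper” doesn’t mean much: a robots.txt that steers the crawler away from the expensive parts, and a sitemap listing the pages worth indexing. Here’s a minimal sketch with hypothetical hostnames and paths (the Mini’s crawler identifies itself as gsa-crawler, if I recall correctly):

    # http://wiki.example.com/robots.txt
    User-agent: gsa-crawler
    Disallow: /admin/
    Disallow: /search
    Sitemap: http://wiki.example.com/sitemap.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://wiki.example.com/FrontPage</loc>
        <lastmod>2009-12-01</lastmod>
      </url>
      <!-- one <url> entry per page worth finding -->
    </urlset>

I haven’t verified that the Mini actually reads sitemaps on its own, so treat that part as wishful thinking.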

Lesson 3: Indexing bad content won’t improve anything.

It’s like shining a bright light on a big pile of crap. It’s still a pile of crap, but you can see it more clearly.

Our intranet is hardly built using good SEO practices, so it’s natural that googlebot has trouble figuring out which pages and keywords are important.  But mostly the problem is simply an abundance of obsolete and outdated content.  The solution will be to get rid of most of it.

Don’t get me wrong: our Google Mini works very well, it’s already been quite useful, and I’m happy with it.  I can actually find stuff on our wiki now.  But it won’t turn a collection of obsolete Word documents from ‘02 into useful content.



Comments

Life Tester December 9, 2009 at 02:22

Maybe you should also try http://status.net (internal microblogging :-) plus twhirl (its desktop client).

Optionally (a bit late now that you already have the G-mini): http://dataparksearch.org.
A G-mini vs. DataparkSearch overview is at http://blog.dataparksearch.org/42

Ville Laurikari December 9, 2009 at 03:57

The microblogging thing sounds interesting. Like a lot of dev shops, we use IRC, but it’s hard to get non-developers on there. Something easier to start with and a little prettier would help. Thanks for the tip!

Michel Billard December 9, 2009 at 04:53

I didn’t even know that thing existed (Google Mini). We’re also trying to figure out how to manage all the different applications we use (like you, we use a wiki, a bug tracker, and a few custom tools), so our information is all over the place. We’re looking into Jira for our project management and case management. The problem is finding solutions that we can host ourselves for security reasons (although I’d feel perfectly safe having our data elsewhere, my bosses wouldn’t).

The tools need to be appropriate, but it’s how we use them that makes them useful or not.

Ville Laurikari December 9, 2009 at 11:14

Michel, sounds familiar. We’ve found that linking our systems helps somewhat (version control ↔ bug tracking & project management, buildbot ↔ IRC, etc.).

Now, if you will excuse me, I need to go perform some internal SEO to get even more out of our Google Mini.

Susanna Kaukinen March 21, 2010 at 22:31

Well, you wrote elsewhere that the code is your enemy. Perhaps documentation can be the enemy in much the same way. You would just need to throw out most of it. And yes, I can imagine what people will say if you even suggest it.

