
A Small File That Can Make A Big Difference


agentBOOST
05-07-2007, 05:56 AM
If you have a website, you know how important search engine placement is to driving new clients to your site. What you may not realize is that you can control which pages the search engines see by uploading a simple file to your site.

Before I get into the details, I think it's important to talk a little bit about how search engines work. Each of the major search engines (Google, Yahoo, MSN, and Ask) uses what are called "spiders" or "robots" to try to visit every web page on the internet and add each one to its index. Once the spider adds your site to the index, the search engine then decides where each page will rank for certain terms.

The first thing a spider does when it visits a website is to look for a "robots.txt" file. This file tells the spider the areas of your site where it's not allowed to go. If you don't have one (or if yours is blank), you are telling the spider, "Please index my entire site."
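
To make that default concrete, here is a minimal sketch using Python's standard urllib.robotparser module. It only illustrates how the rules are read, not how any particular search engine's spider is built, and the example.com URLs and the /contact/ path are placeholders I made up.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A blank robots.txt places no restrictions on any spider.
blank_ok = is_allowed("", "http://example.com/contact/")

# A single Disallow rule is enough to fence off a directory.
rules = "User-agent: *\nDisallow: /contact/\n"
contact_ok = is_allowed(rules, "http://example.com/contact/")
home_ok = is_allowed(rules, "http://example.com/")

print(blank_ok, contact_ok, home_ok)  # True False True
```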

Believe it or not, this is a problem!

It may seem counterintuitive to want to block the search engines from accessing certain areas of your site, but otherwise the spiders are going to spend a considerable amount of time indexing pages that will never rank, never bring you traffic, and never bring a client through your door. Blocking these pages from the spiders also funnels your site's PageRank to your optimized pages, which means they'll rank higher in the SERPs.

What type of pages should we block from the spiders? Anything that isn't optimized for search engine placement. A typical list would include contact pages, image galleries, policy pages, etc.

It may help to look at an example, so let's take a look at a robots.txt file I am most familiar with (I wrote it): agentBOOST.com/robots.txt.

The first line of the file looks like this:

User-Agent: *

This line is telling the spiders that the rules that follow apply to all spiders (the * means every spider). For an example where individual spiders are addressed, take a look at activerain's robots.txt file (activerain.com/robots.txt), which gives specific instructions to Googlebot (Google's spider) and ShopWiki.
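
A rough sketch of how per-spider groups behave, again with Python's standard urllib.robotparser. The rules below are hypothetical and are not copied from activerain's file; "SomeOtherBot" and the paths are invented for illustration. Note that a spider with its own group follows that group and does not fall through to the * group, which is why /drafts/ is repeated.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a stricter group for Googlebot, and a general
# group for every other spider.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/
Disallow: /archive/

User-agent: *
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own group, so /archive/ is off limits to it.
googlebot_archive = parser.can_fetch("Googlebot", "http://example.com/archive/2007/")

# Any other spider falls back to the * group, which only blocks /drafts/.
other_archive = parser.can_fetch("SomeOtherBot", "http://example.com/archive/2007/")
other_drafts = parser.can_fetch("SomeOtherBot", "http://example.com/drafts/")

print(googlebot_archive, other_archive, other_drafts)  # False True False
```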

Going back to agentBOOST.com's robots.txt file, the ten lines after the User-Agent line all begin with "Disallow:" which is then followed by a directory on our site. It should come as no surprise that each of these lines is telling the spiders that they are not allowed access to a certain directory.

The first two lines (/terms/ and /privacy/) disallow the search engine spiders from indexing agentBOOST's Terms of Service and Privacy Policy. While both of these pages are important to our users, I don't see the benefit of having search engines wasting their time or our bandwidth/PageRank on these pages; we don't aspire to rank for the term "Privacy Policy"!

The next five lines (/user/, /agent/, /bid/, /property/, and /logout/) block the spiders from trying to index areas of the site that were built for our registered members to navigate the site, but not for search engines. This seems like a good time to point out a very important, powerful, and dangerous aspect of the robots.txt file:

When you disallow a directory in your robots.txt file it also blocks all the subdirectories under that directory!

We don't have to add lines for /user/register/ or /user/password/, for instance, because these are subdirectories of /user/. Just make sure you don't abuse this power by adding "Disallow: /", which will block your entire site!
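
Here is a small sketch of that behaviour using Python's standard urllib.robotparser. The /user/ paths come from the description above; the example.com host and the helper name `blocked` are just assumptions for the illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked(robots_txt: str, path: str) -> bool:
    """Return True if the rules keep every spider ("*") away from `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", "http://example.com" + path)


USER_ONLY = "User-agent: *\nDisallow: /user/\n"
WHOLE_SITE = "User-agent: *\nDisallow: /\n"

# Disallowing /user/ also covers everything underneath it.
register_blocked = blocked(USER_ONLY, "/user/register/")
password_blocked = blocked(USER_ONLY, "/user/password/")
blog_blocked = blocked(USER_ONLY, "/blog/")

# "Disallow: /" matches every path, so the whole site is blocked.
site_blocked = blocked(WHOLE_SITE, "/blog/")

print(register_blocked, password_blocked, blog_blocked, site_blocked)
# True True False True
```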

The next two lines (/blog/category/ and /blog/feed/) block the spiders from indexing areas of our blog that may be considered duplicate content. The last line (/blog/subscribe) disallows our blog's subscribe page, which isn't optimized for anything in particular.

Remember, search engines have finite resources and billions (trillions?) of pages to index. When the spider comes to visit your site don't let it waste time on pages that aren't going to do you any good! Utilizing a robots.txt file is a great way to hold the spider's hand and bring them to the content you worked so hard to optimize.

I hope you found this quick tutorial on robots.txt helpful and informative.

If you'd like us to show you how to get the most cost-effective real estate leads, with no monthly fee and no percentage of your commission, please visit us at agentBOOST.com.

Chris
agentBOOST.com

justicewhite
05-08-2007, 06:46 AM
...

Remember, search engines have finite resources and billions (trillions?) of pages to index. When the spider comes to visit your site don't let it waste time on pages that aren't going to do you any good! Utilizing a robots.txt file is a great way to hold the spider's hand and bring them to the content you worked so hard to optimize.

...
Are you implying that not allowing the spiders to index as many pages from your site as possible is a bad thing for search engine optimisation? If so, what kind of pages do you advise people to allow for indexing and what type of pages to disallow?

agentBOOST
05-08-2007, 07:07 AM
Hi justicewhite.

Are you implying that not allowing the spiders to index as many pages from your site as possible is a bad thing for search engine optimisation?

Yes, that's exactly what I am saying.

If so, what kind of pages do you advise people to allow for indexing and what type of pages to disallow?

As I stated in the article, any pages that aren't optimized for search engines should be blocked from the index. Your Contact, Privacy Policy, and Terms and Conditions pages are all important pages for your users, but do no good in the search engines.

A good test for determining whether or not you should block a page is to ask yourself, "if this page ranked for the keywords on it, would it bring me business?" For instance, ranking for the terms "Privacy Policy" won't yield too many new clients!

I should also mention that Google recently released a robots.txt tool in their webmaster console, which further substantiates how important this is.

spanishproperty
05-14-2007, 07:49 AM
Dynamic pages are also good candidates for your robots.txt file, as spiders find it hard to exit them.

Robots.txt files are not the be-all and end-all of a site; I don't think they are important at all. Some of my sites have them and some don't, and I don't see any relevance at all.

I have one site that is number one for its search term in all three SEs, and that site doesn't have a robots.txt file and never has.

Monkeyleg
05-27-2007, 09:52 PM
spanishproperty, I look at the robots.txt file as just another tool in the toolbox.

From looking at my stats programs for my sites, the bots are reading the robots.txt files.

Does it make a difference? I honestly don't know. But I'd rather spend three minutes writing a robots.txt file and have it, rather than not have it and find out later that I should.

And I also have pages on some sites that I absolutely do not want indexed or followed. So I look at the robots.txt files as backup for the <meta name="robots" content="noindex, nofollow"> tags.
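
If you want to check that a page really carries that tag, a minimal sketch with Python's standard html.parser might look like this. The class and function names are my own, and the sample page is made up.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow">'
    "</head><body>Private page</body></html>"
)
directives = robots_directives(page)
print(sorted(directives))  # ['nofollow', 'noindex']
```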

orlandorealestate
06-22-2007, 06:54 AM
I think the robots.txt file is good to have on your site, but I agree it is just a drop in the bucket. If you really do not want a page indexed, I would place a noindex tag on the page to complement the robots.txt file.

spanishproperty
06-28-2007, 01:29 AM
Or you can use a nofollow link as well if you don't want a certain page to be spidered.

Monkeyleg, I am not saying it is bad or not needed. I am just saying there are a hell of a lot more things that contribute to a good site and good ideas for SEO than a robots.txt file.

ventasman
08-27-2007, 05:41 AM
Robots.txt is a really good resource, but what happens when you have some 800 pages and all of them are useful?

Monkeyleg
08-27-2007, 11:25 AM
ventasman, you just allow all of your 800 pages in the robots.txt file.
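
As a sketch of what "allow everything" can look like: a Disallow line with nothing after it blocks nothing. The check below uses Python's standard urllib.robotparser; the /listing/ URLs are invented stand-ins for the 800 pages.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow value means "nothing is off limits".
ALLOW_ALL = "User-agent: *\nDisallow:\n"

parser = RobotFileParser()
parser.parse(ALLOW_ALL.splitlines())

# Hypothetical URLs standing in for the 800 useful pages.
urls = [f"http://example.com/listing/{n}/" for n in range(1, 801)]
allowed = [url for url in urls if parser.can_fetch("*", url)]

print(len(allowed))  # 800
```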

twalters84
09-30-2007, 02:19 PM
Hey there,

I have a few comments about your post:

It may seem counterintuitive to want to block the search engines from accessing certain areas of your site, but otherwise the spiders are going to spend a considerable amount of time indexing pages that will never rank, never bring you traffic, and never bring a client through your door. Blocking these pages from the spiders also funnels your site's PageRank to your optimized pages, which means they'll rank higher in the SERPs.

You have to be very careful when you are blocking content from spiders. Let me tell you why.

Let's say you have four pages on your website that are all linked from your home page, so 5 pages total. Your home page has PR8. That would be nice, right? From the Google PageRank documentation, each of your subpages would be getting a PR2 vote from your homepage.

PR Home page / 4 Outgoing Links = PR2 vote

Let's say one of these pages is your privacy statement that your home page links to. In addition, let's put that file in the robots.txt file so we can see what happens.

You are losing a PR2 vote from your homepage, so you are wasting your PageRank potential by blocking the page in your robots.txt file. Your privacy statement could instead be linking back to your homepage so that it distributes PageRank throughout your website.

I do agree that this page isn't nice to have in the search results. That is why there is a noindex meta tag that you can include on your privacy statement. If this page does get indexed, you can delete it in the Google webmaster tools. Yahoo also has a utility to delete indexed pages.

The next two lines (/blog/category/ and /blog/feed/) block the spiders from indexing areas of our blog that may be considered duplicate content. The last line (/blog/subscribe) disallows our blog's subscribe page, which isn't optimized for anything in particular.

You do not have to worry about this for your blog - if your content type for the feed is text/xml. Crawlers will see this differently than ordinary webpages. If you block your RSS feeds, then it is not worth having them at all because you are blocking crawlers that look for frequently updated content in them.

Bottom line, be careful what you block in your robots.txt file.

Sincerely,
Travis Walters

the-ref
11-19-2007, 02:14 PM
Another thing to keep in mind when it comes to using robots files is that there is no law that states search engines need to follow them. While it hasn't really happened yet with any of the big search engines, it will be interesting to see what happens when one of the big players stops reading robots files.

Greg
11-20-2007, 04:57 AM
IMHO, why would anyone have pages on their site that are not optimized? That is just foolish or lazy. I have pages that I thought would never go anywhere and now they have a PR3 and are picking up long tail hits. I have registration forms that show up #1 for their optimized kw. How cool is that?

The more pages the SE sees the better. IMHO

MAAOnline
02-06-2008, 11:34 AM
This is really interesting, but it looks like it's a lot to remember. Does anyone have any experience with using the robots.txt file on their websites? How has it affected your rating?