-
05-07-2007, 06:56 AM #1
Renter
- Join Date
- May 2007
- Posts
- 5
A Small File That Can Make A Big Difference
If you have a website, you know how important search engine placement is to driving new clients to your site. What you may not realize is that you can control which pages the search engines see by uploading a simple file to your site.
Before I get into the details, I think it's important to talk a little bit about how search engines work. Each of the major search engines (Google, Yahoo, MSN, and Ask) uses what are called "spiders" or "robots" to try to visit every web page on the internet and add each one to its index. Once the spider adds your site to the index, the search engine then decides where each page will rank for certain terms.
The first thing a spider does when it visits a website is to look for a "robots.txt" file. This file tells the spider the areas of your site where it's not allowed to go. If you don't have one (or if yours is blank), you are telling the spider, "Please index my entire site."
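If you want to check this behaviour yourself, Python's standard urllib.robotparser module applies the same kind of rules a well-behaved spider does. The sketch below is only an illustration: www.example.com and the helper name allowed are made up, and real spiders may differ in the details.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A blank robots.txt places no restrictions on any spider.
blank_result = allowed("", "http://www.example.com/contact/")

# One Disallow line is enough to keep spiders out of that directory.
blocked_result = allowed(
    "User-agent: *\nDisallow: /contact/\n",
    "http://www.example.com/contact/",
)

print("blank file allows /contact/:", blank_result)
print("with Disallow rule allows /contact/:", blocked_result)
```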
Believe it or not, this is a problem!
It may seem counterintuitive to want to block the search engines from accessing certain areas of your site, but otherwise the spiders are going to spend a considerable amount of time indexing pages that will never rank, never bring you traffic, and never bring a client through your door. Blocking these pages from the spiders also funnels your site's PageRank to your optimized pages, which means they'll rank higher in the SERPs.
What type of pages should we block from the spiders? Anything that isn't optimized for search engine placement. A typical list would include contact pages, image galleries, policy pages, etc.
It may help to look at an example, so let's take a look at the robots.txt file I am most familiar with (I wrote it): agentBOOST.com/robots.txt.
The first line of the file looks like this:
User-Agent: *
This line is telling the spiders that the rules that follow apply to all spiders (the * means every spider). For an example where individual spiders are addressed, take a look at activerain's robots.txt file (activerain.com/robots.txt), which gives specific instructions to Googlebot (Google's spider) and ShopWiki.
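Here is a rough sketch of how per-spider sections behave, again using Python's urllib.robotparser. The rules and paths below are invented for illustration and are not copied from activerain's file or any other real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one section for Googlebot, one for everybody else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

site = "http://www.example.com"  # placeholder domain

# Googlebot follows its own section...
googlebot_drafts = parser.can_fetch("Googlebot", site + "/drafts/a.html")
# ...and, in this parser, does not also inherit the * section.
googlebot_tmp = parser.can_fetch("Googlebot", site + "/tmp/a.html")
# Any spider without its own section falls back to the * rules.
other_tmp = parser.can_fetch("SomeOtherBot", site + "/tmp/a.html")

print(googlebot_drafts, googlebot_tmp, other_tmp)
```

The point to notice is that a spider with its own section uses only that section, so any rule it should also obey has to be repeated there.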
Going back to agentBOOST.com's robots.txt file, the ten lines after the User-Agent line all begin with "Disallow:" which is then followed by a directory on our site. It should come as no surprise that each of these lines is telling the spiders that they are not allowed access to a certain directory.
The first two lines (/terms/ and /privacy/) disallow the search engine spiders from indexing agentBOOST's Terms of Service and Privacy Policy. While both of these pages are important to our users, I don't see the benefit of having search engines waste their time or our bandwidth/PageRank on these pages; we don't aspire to rank for the term "Privacy Policy"!
The next five lines (/user/, /agent/, /bid/, /property/, and /logout/) block the spiders from trying to index areas of the site that were built for our registered members to navigate the site, but not for search engines. This seems like a good time to point out a very important, powerful, and dangerous aspect of the robots.txt file:
When you disallow a directory in your robots.txt file it also blocks all the subdirectories under that directory!
We don't have to add lines for /user/register/ or /user/password/, for instance, because these are subdirectories of /user/. Just make sure you don't abuse this power by adding "Disallow: /", which will block your entire site!
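The subdirectory behaviour is easy to confirm with a small test. This is a minimal sketch using Python's urllib.robotparser; the domain and the /about/ page are placeholders, while /user/register/ and /user/password/ are the examples from the paragraph above.

```python
from urllib.robotparser import RobotFileParser


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt text into a parser we can ask questions of."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


site = "http://www.example.com"  # placeholder domain

# Disallowing /user/ also covers everything underneath it.
user_blocked = build_parser("User-agent: *\nDisallow: /user/\n")
register_ok = user_blocked.can_fetch("*", site + "/user/register/")
password_ok = user_blocked.can_fetch("*", site + "/user/password/")
about_ok = user_blocked.can_fetch("*", site + "/about/")

# "Disallow: /" matches every path on the site.
everything_blocked = build_parser("User-agent: *\nDisallow: /\n")
about_ok_when_root_blocked = everything_blocked.can_fetch("*", site + "/about/")

print(register_ok, password_ok, about_ok, about_ok_when_root_blocked)
```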
The next two lines (/blog/category/ and /blog/feed/) block the spiders from indexing areas of our blog that may be considered duplicate content. The last line (/blog/subscribe) disallows our blog's subscribe page, which isn't optimized for anything in particular.
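Putting the Disallow lines described above together gives something like the file built below. Treat it as a reconstruction from the description in this post, not a copy fetched from agentBOOST.com; the test URLs and the crawlable helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The ten directories named in the walkthrough above.
DISALLOWED = [
    "/terms/", "/privacy/",
    "/user/", "/agent/", "/bid/", "/property/", "/logout/",
    "/blog/category/", "/blog/feed/",
    "/blog/subscribe",
]

robots_txt = "User-Agent: *\n" + "".join(f"Disallow: {path}\n" for path in DISALLOWED)
print(robots_txt)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def crawlable(path: str) -> bool:
    """True if a spider obeying the file above may fetch `path`."""
    return parser.can_fetch("*", "http://www.example.com" + path)


# Ordinary blog posts stay open; the feed and policy pages do not.
for path in ["/", "/blog/", "/blog/some-post/", "/blog/feed/", "/privacy/"]:
    print(path, "->", "crawl" if crawlable(path) else "skip")
```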
Remember, search engines have finite resources and billions (trillions?) of pages to index. When the spider comes to visit your site don't let it waste time on pages that aren't going to do you any good! Utilizing a robots.txt file is a great way to hold the spider's hand and bring them to the content you worked so hard to optimize.
I hope you found this quick tutorial on robots.txt helpful and informative.
If you'd like us to show you how to get the most cost-effective real estate leads, with no monthly fee and no percentage of your commission, please visit us at agentBOOST.com.
Chris
agentBOOST.com
-
05-08-2007, 07:46 AM #2
Condominium
- Join Date
- Jan 2005
- Location
- England
- Posts
- 123
-
05-08-2007, 08:07 AM #3
Renter
- Join Date
- May 2007
- Posts
- 5
Hi justicewhite.
Quote: Are you implying that not allowing the spiders to index as many pages from your site as possible is a bad thing for search engine optimisation?
Yes, that's exactly what I am saying.
Quote: If so, what kind of pages do you advise people to allow for indexing and what type of pages to disallow?
As I stated in the article, any pages that aren't optimized for search engines should be blocked from the index. Your Contact, Privacy Policy, and Terms and Conditions pages are all important pages for your users, but do no good in the search engines.
A good test for determining whether or not you should block a page is to ask yourself, "if this page ranked for the keywords on it, would it bring me business?" For instance, ranking for the terms "Privacy Policy" won't yield too many new clients!
I should also mention that Google recently released a robots.txt tool in their webmaster console, which further substantiates how important this is.
-
05-14-2007, 08:49 AM #4
Condominium
- Join Date
- Dec 2006
- Location
- Torrevieja, Spain
- Posts
- 243
Dynamic pages are also good candidates for your robots.txt file, as spiders can find it hard to exit them.
Robots.txt files are not the be-all and end-all of a site, and I don't think they are important at all. Some of my sites have them and some don't, and I don't see any relevance at all.
I have one site that is number one for its search term in all three SEs, and that site doesn't have a robots.txt file and never has.
-
05-27-2007, 10:52 PM #5
Fixer Upper
- Join Date
- May 2007
- Posts
- 48
spanishproperty, I look at the robots.txt file as just another tool in the toolbox.
From looking at my stats programs for my sites, the bots are reading the robots.txt files.
Does it make a difference? I honestly don't know. But I'd rather spend three minutes writing a robots.txt file and have it, rather than not have it and find out later that I should.
And I also have pages on some sites that I absolutely do not want indexed or followed. So I look at the robots.txt files as backup for the <meta name="robots" content="noindex, nofollow"> tags.
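For anyone who wants to confirm that a page really carries that tag, here is a small sketch using Python's standard html.parser module. The class and function names are made up for this example, and the sample page is just the tag quoted above wrapped in a bare HTML skeleton.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # "noindex, nofollow" -> {"noindex", "nofollow"}
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow">'
    "</head><body>Private page</body></html>"
)
directives = robots_directives(page)
print(sorted(directives))
```

Keep in mind that a spider which obeys a Disallow rule never downloads the page, so it never gets to read the meta tag on it.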
-
06-22-2007, 07:54 AM #6
Fixer Upper
- Join Date
- Jun 2007
- Location
- Orlando, FL
- Posts
- 42
I think the robots.txt file is good to have on your site, but I agree it is just a drop in the bucket. If you really do not want a page indexed, I would place a noindex tag on the page to complement the robots.txt file.
-
06-28-2007, 02:29 AM #7
Condominium
- Join Date
- Dec 2006
- Location
- Torrevieja, Spain
- Posts
- 243
Or you can use a nofollow link as well if you don't want a certain page to be spidered.
Monkeyleg, I am not saying it is bad or not needed. I am just saying there are a hell of a lot more things that contribute to a good site and good ideas for SEO than a robots.txt file.
-
08-27-2007, 06:41 AM #8
Fixer Upper
- Join Date
- Aug 2007
- Location
- Panama City, Panama
- Posts
- 18
Any advice on joomla sites?
Robots.txt is a really good resource, but what happens when you have some 800 pages and all of them are useful?
-
08-27-2007, 12:25 PM #9
Fixer Upper
- Join Date
- May 2007
- Posts
- 48
ventasman, you just allow all of your 800 pages in the robots.txt file.
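In robots.txt terms, "allow everything" is usually written as a Disallow line with nothing after it. The sketch below checks that with Python's urllib.robotparser; the 800 URLs are invented stand-ins for the real pages.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow value means nothing is off limits.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ALLOW_ALL.splitlines())

# 800 made-up page URLs standing in for the site's real pages.
pages = [f"http://www.example.com/page-{n}.html" for n in range(1, 801)]
blocked = [url for url in pages if not parser.can_fetch("*", url)]

print(len(pages), "pages,", len(blocked), "blocked")  # prints: 800 pages, 0 blocked
```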
-
09-30-2007, 03:19 PM #10
Fixer Upper
- Join Date
- Sep 2007
- Posts
- 21
Robots File can be Dangerous
Hey there,
I have a few comments about your post:
Quote: It may seem counterintuitive to want to block the search engines from accessing certain areas of your site, but otherwise the spiders are going to spend a considerable amount of time indexing pages that will never rank, never bring you traffic, and never bring a client through your door. Blocking these pages from the spiders also funnels your site's PageRank to your optimized pages, which means they'll rank higher in the SERPs.
You have to be very careful when you are blocking content from spiders. Let me tell you why.
Let's say you have four pages on your website that are all linked from your home page, so five pages in total. Your home page has PR8. That would be nice, right? Going by the Google PageRank documentation, each of your subpages would be getting a PR2 vote from your home page.
PR8 home page / 4 outgoing links = PR2 vote
Let's say one of these pages is your privacy statement, which your home page links to. In addition, let's put that page in the robots.txt file so we can see what happens.
You are losing a PR2 vote from your home page, so you are wasting your PageRank potential by blocking that page in your robots.txt file. Left unblocked, your privacy statement could link back to your home page and so distribute PageRank throughout your website.
I do agree that this page isn't nice to have in the search results. That is why there is a noindex meta tag that you can include on your privacy statement. If this page does get indexed, you can delete it in Google Webmaster Tools. Yahoo also has a utility to delete indexed pages.
Quote: The next two lines (/blog/category/ and /blog/feed/) block the spiders from indexing areas of our blog that may be considered duplicate content. The last line (/blog/subscribe) disallows our blog's subscribe page, which isn't optimized for anything in particular.
You do not have to worry about this for your blog if the content type for your feed is text/xml. Crawlers will see this differently than ordinary web pages. If you block your RSS feeds, then it is not worth having them at all, because you are blocking the crawlers that look for frequently updated content in them.
Bottom line, be careful what you block in your robots.txt file.
Sincerely,
Travis Walters


