How to use Robots.txt
This information will hopefully help you when setting up your own robots.txt. It lets you control spiders from search engines as well as robots from more malicious sources. The only tools you need are Notepad and access to your root folder via FTP.
What is Robots.txt?
Robots.txt is a special text file that can be uploaded to a server, and its primary function is to control search engine robots.
What are robots, you ask? Most search engine robots are commonly referred to as spiders. These spiders crawl through your website via links and make records of everything they find. This data is then used in different ways by the search engines to generate search results.
Why Would You Want to Control Spiders?
Good question: since the robots are generally doing good things, why would you want to hinder or stop them? The truth is there are lots of reasons why you would want to control the spiders. Here are a few examples of where robots.txt can be handy.
Under Construction
When your site is under construction, it's not really much use to users or spiders. Prevent them from indexing half-built or broken sites by using robots.txt, so you can create a positive first impression when you're ready to launch.
Bad Robots
Unfortunately not all robots are friendly. Robots.txt can help you turn away content and email scrapers, as well as specific search engine spiders that you don't want checking your content or that are overloading your server. Bear in mind that it only works on robots that choose to obey it.
Search Engines Look for Them
Most search engines I know of check for a robots.txt. Google even gives you information on what it has found in your robots.txt file in Webmaster Tools.
Duplicated or Sensitive Content
There are plenty of legitimate reasons for duplicate content, both internally and externally. If you feel this is enough to endanger your site in certain search engines, you may want to stop them crawling the affected pages. There is also a possibility that the content on some of your pages is sensitive or irrelevant, and you may wish to exclude these pages as well.
How to use Robots.txt
Robots.txt is pretty easy to use: to create one you just use Notepad and follow the right syntax. Luckily this is also very simple and is comprised of only a few elements. The file is made up of one or more records, and each record features a user-agent section and a disallow section. You may choose to include comments with the use of a "#" at the start of the line. The "User-agent" section defines which robots should follow the commands.
To list multiple robots, simply use multiple User-agent lines. You can also use the wildcard character "*" to address ALL robots with the following commands. An example of this line is below.
User-agent: *
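If you would like to see how a robot might read these User-agent lines, here is a minimal sketch using Python's standard urllib.robotparser module. The robot names BadBot, EmailScraper and FriendlyBot and the /private/ folder are made up for illustration, and real robots may interpret a file differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the first record names two made-up robots,
# the second record uses the wildcard to address every other robot.
ROBOTS_TXT = """\
# Keep two (made-up) scrapers out of the whole site
User-agent: BadBot
User-agent: EmailScraper
Disallow: /

# Everyone else may crawl anything except /private/
User-agent: *
Disallow: /private/
"""


def load_rules(text):
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
page = "http://www.example.com/index.html"

# Collect the robots that are NOT allowed to fetch the home page.
blocked = [bot for bot in ("BadBot", "EmailScraper", "FriendlyBot")
           if not rules.can_fetch(bot, page)]
print(blocked)
```

The two named robots share one record because their User-agent lines sit directly above the same Disallow line, while FriendlyBot falls through to the wildcard record.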
The "Disallow" section is used for specifying the directory or file that should not be accessed. It is fairly simple syntax: you just need to include the directory (excluding your base URL) or file, similar to the following example.
Disallow: /folder1/
Again, you can have multiple lines to disallow a selection of files or directories.
The "Allow" command can also come in useful: it lets you specify a specific file to allow in a directory you may have disallowed. It still needs to be paired with a User-agent command, but can be added in with Disallow commands. Here is an example of a full record using all of the above.
User-agent: *
Disallow: /folder1/
Disallow: /folder2/
Disallow: /folder3/
Allow: /folder1/important.html
Allow All
Happy with all the robots getting to your site?
Just use this simple snippet to open the gates and ensure you're not blocking anything.
User-agent: *
Disallow:
Block All
Site down for a while for construction? This will block everything from spidering. Don't forget to remove it when you're done!
User-agent: *
Disallow: /
Linking to your Sitemap
A handy little trick that many robots allow is to link to your sitemap inside your robots file. Simply use the following example and modify the URL to suit your own domain.
Sitemap: http://www.example.com/sitemap.xml
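To check how the records above behave before uploading anything, here is a rough sketch with Python's standard urllib.robotparser module. The robot name AnyBot and the page other.html are assumptions for illustration. One caveat: Python's parser applies the first rule that matches a path, so the Allow line is listed before the Disallow lines in this sketch. Search engines may pick the winning rule differently (Google, as far as I know, documents that the most specific rule wins), so treat this as a local sanity check only.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text):
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Full record plus a Sitemap line. Allow comes first because Python's
# parser stops at the first rule that matches the requested path.
FULL_RECORD = """\
User-agent: *
Allow: /folder1/important.html
Disallow: /folder1/
Disallow: /folder2/
Disallow: /folder3/
Sitemap: http://www.example.com/sitemap.xml
"""
ALLOW_ALL = "User-agent: *\nDisallow:\n"   # empty Disallow blocks nothing
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # "/" blocks the whole site

site = "http://www.example.com"
full = load_rules(FULL_RECORD)

# True means the (hypothetical) robot AnyBot may fetch the URL.
results = {
    "important page": full.can_fetch("AnyBot", site + "/folder1/important.html"),
    "rest of folder1": full.can_fetch("AnyBot", site + "/folder1/other.html"),
    "allow-all file": load_rules(ALLOW_ALL).can_fetch("AnyBot", site + "/folder1/"),
    "block-all file": load_rules(BLOCK_ALL).can_fetch("AnyBot", site + "/"),
}
print(results)

# Sitemap URLs listed in the file (site_maps() needs Python 3.8 or newer).
print(full.site_maps())
```

Only the rest of folder1 and the block-all file should come back as not fetchable, and the Sitemap line is reported separately from the allow and disallow rules.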
Uploading Your File
Once you're ready you can upload the file to your server. It MUST be placed in the root of the domain. If you want to test that the right robots have access to the right pages, Google Webmaster Tools has a good tool for this. Hope this has helped you understand the many things you can do with robots.txt, and that it will help improve SEO and keep unwanted crawlers off your blog site.
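A small footnote on the "root of the domain" rule: robots look for the file at /robots.txt on the host they are crawling, whatever page they started from. The helper below is a minimal sketch of that lookup. The function name robots_url and the example page address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap in /robots.txt, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/blog/2009/07/post.html?ref=feed"))
```

Because the host is part of the address, a subdomain such as blog.example.com needs its own robots.txt in its own root.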

{ 6 comments… read them below or add one }
Great read John! Ironically, I just added a robots.txt file to my social network about 8 weeks ago, and since the PR update I got a PageRank of 4. I get a lot of spam on the upcoming.php page and told the SEs not to index or follow that link. I also added other pages that showed duplicate content.
This is my file here,
# All robots will spider the domain
User-agent: *
Disallow: /templates/
Disallow: /3rdparty/
Disallow: /libs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /internal/
Disallow: /backup/
Disallow: /thickbox/
Disallow: /api/
Disallow: /evb/
Disallow: /avatars/
Disallow: /admin_index.php
Disallow: /admin
Disallow: /login.php
Disallow: /js/
Disallow: /img/
Disallow: /upcoming.php
p.s. I engaged this article!
bbrian017´s last blog ..7 Days 7 Colours Thailand | Thailand Art Photography
Some great tips here, robots.txt has always mystified me.
Twitter: tycoonblogger
(7 comments) July 2, 2009 at 10:46 pm
That is way over my head. I think I will outsource this to someone. Thanks for breaking it down though, as I was not familiar with this.
Tycoon Blogger @Make Money Blogging´s last blog ..Twitter Voyeurism
I'm not too shy to say that I didn't know anything at all about robots.txt, but after reading your post I got to know some basics. I am still in search of more information about robots.txt and am heading to the forum discussion! Thanks for such a perfect topic. Robots.txt is very helpful for controlling the bots from the search engines!
Twitter: kikolani
(50 comments) July 3, 2009 at 4:47 pm
I have a tricky thing I’d like to use the robots for, but haven’t figured out how yet. Basically, my client has both the .net and .com of his domain. The site files are all hosted under .com, but .net is the primary domain that shows up in search results. However, if you type in any page.com, it will pull up the same content as any page.net, which seems like a duplicate content issue. Can I block the robots from crawling the .com site, even if the files are hosted on the .com site? I know, pretty odd.
~ Kristi
Kikolani´s last blog ..Fetching Friday – Resources Mashup, #FollowFriday, & Some Tennis Love
Kikolani,
If the .com and .net have identical content, you should just use a 301 redirect on the .com to avoid the issue you talked about. (I assume you are using virtual hosts?) If you are using Apache, this is pretty easy to do with .htaccess.