Joho the Blog » robots.txt question

robots.txt question

[Note: This post is part of the Be Dumb in Public program, of which I am a lifetime member.]

There’s lots of good info on the Web about how to create a robots.txt file that will keep the major search engines from spidering your site. But I haven’t found instructions aimed at my precise level of ineptitude. So, here goes…

Let’s say my “hideme.com” directory exists at root level. That is, my host won’t let me go any further down than that. I see hideme.com plus all the other directories I own. Let’s say I want to put in a robots.txt file to protect the contents of hideme.com, but I want to leave the rest of my directories open to search engines.

1. Is this the right robots.txt content:

User-agent: *
Disallow: /hideme.com/

I’m especially concerned about getting the slashes right.

2. Where exactly do I put the robots.txt file? At the same level as the hideme.com directory, where I can see all my directories? Or inside the hideme.com directory? Or elsewhere? Thanks in advance. And have pity: I was a Humanities major.


10 Responses to “robots.txt question”

  1. The key is to spell things from the robot’s perspective. You want this:

    User-agent: *
    Disallow: /

    and you want it inside the “hideme.com” directory. Then when the robot visits hideme.com/robots.txt, it sees it shouldn’t look at anything under this domain.

    You may need a second robots.txt file. Is the “root” directory served out by the web host? I.e., would http://maindomain.com/hideme.com/ pull up the hideme.com home page? In that case, you would *also* want the robots.txt file in your post (slashed as you have it) in the “root” directory, in order to prevent robots from finding those files via maindomain.com.

  2. I’m a little bit confused by your description, so let me ask some qualifying questions. Do you actually have a directory called “hideme.com”? Are the directories you want crawlers to find inside “hideme.com” (e.g. “hideme.com/first”, “hideme.com/second”) or at the same level (i.e. “hideme.com”, “first” and “second” all at the top)?

    If the former, what is it you’re trying to hide? Individual files? If that’s the case, the simpler answer is to create a dummy web page called “hideme.com/index.html”. That way any crawler that accesses “hideme.com” will get that page and will have no way to scan the directory for other files.

    If the latter, the syntax you’re using will tell crawlers to skip “hideme.com” and anything inside it but lets them look at “first” and “second”. Again, that assumes you don’t have an “index.html” or “index.cgi” or anything else at the top level. Because if you do, the crawler has no ability to scan the directory; it can only find filenames that are mentioned explicitly in the pages it can read (like that “index.html” itself).

    “robots.txt” is a pretty rough hammer; it doesn’t give you fine control over what is and isn’t accessible. For example, there’s no way to disallow a directory but allow directories inside it. And anyway, it relies on the crawler being a good citizen. The only thing that keeps a crawler from ignoring the “robots.txt” is good manners.

    Hope that helps,
    Hank
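That reading of the rule can be checked with Python’s standard-library urllib.robotparser. This is just a sketch: example.com, “first”, and “second” are hypothetical stand-ins for the layout described in the comment.

```python
import urllib.robotparser

# Simulate a robots.txt containing the rule from the post
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /hideme.com/",
])

# Anything under hideme.com is off-limits to a compliant crawler...
print(rp.can_fetch("*", "http://example.com/hideme.com/secret.html"))  # False

# ...but sibling directories like "first" and "second" stay open:
print(rp.can_fetch("*", "http://example.com/first/page.html"))  # True
```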

  3. are you american man?

  4. I respectfully disagree with David D. All specs that I have seen require robots.txt to be at the root level of the domain (example http://hyperorg.com/robots.txt). Scanning my Apache logs, I don’t see one single instance where any robot has ever looked any place else.

    If you are in a position where you can’t put something at the root level, there are alternatives. For HTML pages, you can use meta tags:

    http://www.robotstxt.org/wc/faq.html#noindex

    P.S. The current robots.txt at hyperorg denies Atomz nothing, but denies everybody else access to nine subdirectories of the following URI: http://hyperorg.com/hyperorg.com/www/, none of which actually exist.
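For reference, the meta-tag alternative mentioned above looks like this; it goes in the head of each individual page you want kept out of the index (a sketch of the standard robots meta tag, not specific to any one search engine):

```html
<!-- In the <head> of each page to hide: tells compliant robots
     not to index this page and not to follow its links -->
<meta name="robots" content="noindex, nofollow">
```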

    Don’t know if you worked out your problem, but as far as I can understand your description of it, you’re getting some advice of mixed quality above (I was confused by the advice; it might be my fault).

    From what you write I understand your setup to be so that URL

    http://www.mydomain.com/myfile.html
    corresponds to
    /myfile.html
    and
    http://www.mydomain.com/hideme.com/myotherfile.html
    corresponds to
    /hideme.com/myotherfile.html

    Working from that assumption here are some pointers:
    1. The file the bots load is http://www.mydomain.com/robots.txt, so you should store yours as
    /robots.txt

    it should look like this
    User-agent: *
    Disallow: /hideme.com/

    – If hideme.com corresponds to an entire domain so http://hideme.com/myotherfile.html corresponds to /hideme.com/myotherfile.html
    you should go with comment no 1 instead and place a robots.txt containing

    User-agent: *
    Disallow: /

    in hideme.com

    – If I misunderstood your setup and you still have a problem, don’t hesitate to email me.

  6. I got this email from Tim Bray a couple of days ago and have asked him permission to post it:

    ——
    1. It’s strict prefix matching. So

    User-agent: *
    Disallow: /hideme.com/

    will work either with or without the slash, but I think putting the slash in is better practice.

    robots.txt *must* be at the root of your web-site. So, for example, the URL of the one at http://www.tbray.org/ must be http://www.tbray.org/robots.txt, right after the first single slash. It turns out that, the way that computer is set up, from Unix’s point of view it’s /home/tbray.org/www/html/robots.txt, but the important thing is where it appears in URL-space, which has to be at the top level of your site.

    So in your case it has to be http://www.hyperorg.com/robots.txt

    NOT http://www.hyperorg.com/blogger/robots.txt

    Fortunately, hyperorg.com is at the root of its own web server. Suppose you have webspace at http://big-bureaucratic-company.com/people/my-own-webspace – if you want to control robots, you have to persuade whoever operates the big bureaucratic company to put it in http://big-bureaucratic-company.com/robots.txt; no robots.txt in your own webspace will ever get looked at. Sad but true.

    -Tim
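Tim’s point about strict prefix matching can be sketched with Python’s standard-library urllib.robotparser. Note how the slash-less rule also catches look-alike paths that merely begin with the same characters, which is why adding the trailing slash is the better practice (example.com is a hypothetical host):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /hideme.com",  # note: no trailing slash
])

# Prefix matching blocks the directory itself...
print(rp.can_fetch("*", "http://example.com/hideme.com/page.html"))  # False

# ...but also any path that happens to start with the same characters:
print(rp.can_fetch("*", "http://example.com/hideme.com-old/page.html"))  # False
```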

  7. Two points not entirely covered above:

    1- remember that (as Sam Ruby illustrated) humans can view your robots.txt file too, so if your mission is to host private files on a public website and prevent people from finding them, publishing the name of the directory where they live is probably a bad idea. Not that obscurity buys you much anyway; if those private pages link anywhere offsite, and you click through, the HTTP_REFERER logs on the destination sites will get a copy of your “private” URL.

    If you’re looking to prevent public access, password-protect the entire directory. This would prevent indexing, too.

    2- As Sam Ruby also pointed out, the current robots.txt file on hyperorg.com appears to be broken. The URL patterns in the robots.txt file should be relative to the public document root — which is probably more easily illustrated than explained:

    Current rule:
    Disallow: /hyperorg.com/www/smallpieces/

    What you probably want instead:
    Disallow: /smallpieces/

    My suggested rule assumes this URL is valid and is the one you want to be skipped by the spiders:
    http://hyperorg.com/smallpieces/

    (These comments assume “hideme.com” is a directory name, rather than the docroot of a wholly other website that happens to live within the hyperorg.com website. I think the reason all the comments on this page are somewhat confusing is that the “hideme.com” example can be interpreted either way.)

  8. It seems that many webmasters still don’t know the concept of “website’s root directory”. That’s where your index file is and where your robots.txt file must be.

    The disallow directives inside a robots.txt file accept only directory and file paths starting from the website’s root directory.

    /hyperorg.com/www/ has nothing to do with the WEBSITE’s root directory. The website’s root directory is simply “/”.

    Also, check any robots.txt file with a robots.txt checker.
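One such checker is Python’s standard-library urllib.robotparser. This sketch contrasts the broken rule quoted in earlier comments (a path written relative to the server’s filesystem) with a rule written relative to the website’s root directory:

```python
import urllib.robotparser

# The broken rule: the path reflects the server's filesystem layout,
# not the website's URL space
broken = urllib.robotparser.RobotFileParser()
broken.parse(["User-agent: *", "Disallow: /hyperorg.com/www/smallpieces/"])
print(broken.can_fetch("*", "http://hyperorg.com/smallpieces/"))  # True -- not hidden!

# The corrected rule, relative to the website's root ("/")
fixed = urllib.robotparser.RobotFileParser()
fixed.parse(["User-agent: *", "Disallow: /smallpieces/"])
print(fixed.can_fetch("*", "http://hyperorg.com/smallpieces/"))  # False -- hidden
```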

  9. how can i find owner of this site?

  10. If you want to deny a whole site, it would look like:
    User-agent: *
    Disallow: /

    or a certain directory:
    User-agent: *
    Disallow: /directory/

    A lot of times robots don’t listen, though.
