Joho the Blog » robots.txt question

robots.txt question

[Note: This post is part of the Be Dumb in Public program, of which I am a lifetime member.]

There’s lots of good info on the Web about how to create a robots.txt file that will keep the major search engines from spidering your site. But I haven’t found instructions aimed at my precise level of ineptitude. So, here goes…

Let’s say my “hideme.com” directory exists at root level. That is, my host won’t let me go any further down than that. I see hideme.com plus all the other directories I own. Let’s say I want to put in a robots.txt file to protect the contents of hideme.com, but I want to leave the rest of my directories open to search engines.

1. Is this the right robots.txt content:

User-agent: *
Disallow: /hideme.com/

I’m especially concerned about getting the slashes right.

2. Where exactly do I put the robots.txt file? At the same level as the hideme.com directory, where I can see all my directories? Or inside the hideme.com directory? Or elsewhere? Thanks in advance. And have pity: I was a Humanities major.


10 Responses to “robots.txt question”

  1. The key is to spell things from the robot’s perspective. You want this:

    User-agent: *
    Disallow: /

    and you want it inside the “hideme.com” directory. Then when the robot visits hideme.com/robots.txt, it sees it shouldn’t look at anything under this domain.

    You may need a second robots.txt file. Is the “root” directory served out by the web host? I.e., would http://maindomain.com/hideme.com/ pull up the hideme.com home page? In that case, you would *also* want the robots.txt file in your post (slashed as you have it) in the “root” directory, in order to prevent robots from finding those files via maindomain.com.

  2. I’m a little bit confused by your description, so let me ask some qualifying questions. Do you actually have a directory called “hideme.com”? Are the directories you want crawlers to find inside “hideme.com” (e.g. “hideme.com/first”, “hideme.com/second”) or at the same level (i.e. “hideme.com”, “first” and “second” all at the top)?

    If the former, what is it you’re trying to hide? Individual files? If that’s the case, the simpler answer is to create a dummy web page called “hideme.com/index.html”. That way any crawler that accesses “hideme.com” will get that page and will have no way to scan the directory for other files.

    If the latter, the syntax you’re using will tell crawlers to skip “hideme.com” and anything inside it but lets them look at “first” and “second”. Again, that assumes you don’t have an “index.html” or “index.cgi” or anything else at the top level. Because if you do, the crawler has no ability to scan the directory; it can only find filenames that are mentioned explicitly in the pages it can read (like that “index.html” itself).

    “robots.txt” is a pretty rough hammer; it doesn’t give you fine control over what is and isn’t accessible. For example, there’s no way to disallow a directory but allow directories inside it. And anyway, it relies on the crawler being a good citizen. The only thing that keeps a crawler from ignoring the “robots.txt” is good manners.

    Hope that helps,
    Hank
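That reading of the rule can be checked with Python’s standard-library urllib.robotparser. This is just a sketch: example.com, “first”, and “second” are hypothetical stand-ins for the layout described in the comment.

```python
import urllib.robotparser

# Simulate a robots.txt containing the rule from the post
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /hideme.com/",
])

# Anything under hideme.com is off-limits to a compliant crawler...
print(rp.can_fetch("*", "http://example.com/hideme.com/secret.html"))  # False

# ...but sibling directories like "first" and "second" stay open:
print(rp.can_fetch("*", "http://example.com/first/page.html"))  # True
```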

  3. are you american man?

  4. I respectfully disagree with David D. All specs that I have seen require robots.txt to be at the root level of the domain (example http://hyperorg.com/robots.txt). Scanning my Apache logs, I don’t see one single instance where any robot has ever looked any place else.

    If you are in a position where you can’t put something at the root level, there are alternatives. For HTML pages, you can use meta tags:

    http://www.robotstxt.org/wc/faq.html#noindex

    P.S. The current robots.txt at hyperorg denies Atomz nothing, but denies everybody else access to nine subdirectories of the following URI: http://hyperorg.com/hyperorg.com/www/, none of which actually exist.
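For reference, the meta-tag alternative mentioned above looks like this; it goes in the head of each individual page you want kept out of the index (a sketch of the standard robots meta tag, not specific to any one search engine):

```html
<!-- In the <head> of each page to hide: tells compliant robots
     not to index this page and not to follow its links -->
<meta name="robots" content="noindex, nofollow">
```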

    Don’t know if you worked out your problem, but as far as I can understand your description of it, you’re getting some advice of mixed quality above (I was confused by the advice; it might be my fault).

    From what you write I understand your setup to be so that URL

    http://www.mydomain.com/myfile.html
    corresponds to
    /myfile.html
    and
    http://www.mydomain.com/hideme.com/myotherfile.html
    corresponds to
    /hideme.com/myotherfile.html

    Working from that assumption here are some pointers:
    1. The file the bots load is http://www.mydomain.com/robots.txt, so you should store yours as
    /robots.txt

    it should look like this
    User-agent: *
    Disallow: /hideme.com/

    – If hideme.com corresponds to an entire domain so http://hideme.com/myotherfile.html corresponds to /hideme.com/myotherfile.html
    you should go with comment no 1 instead and place a robots.txt containing

    User-agent: *
    Disallow: /

    in hideme.com

    – If I misunderstood your setup and you still have a problem, don’t hesitate to email me.

  6. I got this email from Tim Bray a couple of days ago and have asked him permission to post it:

    ——
    1. It’s strict prefix matching. So

    User-agent: *
    Disallow: /hideme.com/

    will work either with or without the slash, but I think putting the slash in is better practice.

    robots.txt *must* be at the root of your web-site. So, for example, the URL of the one at http://www.tbray.org/ must be http://www.tbray.org/robots.txt, right after the first single slash. It turns out that, the way that computer is set up, from Unix’s point of view it’s /home/tbray.org/www/html/robots.txt, but the important thing is where it appears in URL-space, which has to be at the top level of your site.

    So in your case it has to be http://www.hyperorg.com/robots.txt

    NOT http://www.hyperorg.com/blogger/robots.txt

    Fortunately, hyperorg.com is at the root of its own web server. Suppose you have webspace at http://big-bureaucratic-company.com/people/my-own-webspace – if you want to control robots, you have to persuade whoever operates the big bureaucratic company to put it in http://big-bureaucratic-company.com/robots.txt; no robots.txt in your own webspace will ever get looked at. Sad but true.

    -Tim
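Tim’s point about strict prefix matching can be sketched with Python’s standard-library urllib.robotparser. Note how the slash-less rule also catches look-alike paths that merely begin with the same characters, which is why adding the trailing slash is the better practice (example.com is a hypothetical host):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /hideme.com",  # note: no trailing slash
])

# Prefix matching blocks the directory itself...
print(rp.can_fetch("*", "http://example.com/hideme.com/page.html"))  # False

# ...but also any path that happens to start with the same characters:
print(rp.can_fetch("*", "http://example.com/hideme.com-old/page.html"))  # False
```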

  7. Two points not entirely covered above:

    1- remember that (as Sam Ruby illustrated) humans can view your robots.txt file too, so if your mission is to host private files on a public website and prevent people from finding them, publishing the name of the directory where they live is probably a bad idea. Not that obscurity buys you much anyway; if those private pages link anywhere offsite, and you click through, the HTTP_REFERER logs on the destination sites will get a copy of your “private” URL.

    If you’re looking to prevent public access, password-protect the entire directory. This would prevent indexing, too.

    2- As Sam Ruby also pointed out, the current robots.txt file on hyperorg.com appears to be broken. The URL patterns in the robots.txt file should be relative to the public document root — which is probably more easily illustrated than explained:

    Current rule:
    Disallow: /hyperorg.com/www/smallpieces/

    What you probably want instead:
    Disallow: /smallpieces/

    My suggested rule assumes this URL is valid and is the one you want to be skipped by the spiders:
    http://hyperorg.com/smallpieces/

    (These comments assume “hideme.com” is a directory name, rather than the docroot of a wholly other website that happens to live within the hyperorg.com website. I think the reason all the comments on this page are somewhat confusing is that the “hideme.com” example can be interpreted either way.)

  8. It seems that many webmasters still don’t know the concept of “website’s root directory”. That’s where your index file is and where your robots.txt file must be.

    The disallow directives inside a robots.txt file accept only directory and file paths starting from the website’s root directory.

    /hyperorg.com/www/ has nothing to do with the WEBSITE’s root directory. The website’s root directory is simply “/”.

    Also, check any robots.txt file with a robots.txt checker.
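One such checker is Python’s standard-library urllib.robotparser. This sketch contrasts the broken rule quoted in earlier comments (a path written relative to the server’s filesystem) with a rule written relative to the website’s root directory:

```python
import urllib.robotparser

# The broken rule: the path reflects the server's filesystem layout,
# not the website's URL space
broken = urllib.robotparser.RobotFileParser()
broken.parse(["User-agent: *", "Disallow: /hyperorg.com/www/smallpieces/"])
print(broken.can_fetch("*", "http://hyperorg.com/smallpieces/"))  # True -- not hidden!

# The corrected rule, relative to the website's root ("/")
fixed = urllib.robotparser.RobotFileParser()
fixed.parse(["User-agent: *", "Disallow: /smallpieces/"])
print(fixed.can_fetch("*", "http://hyperorg.com/smallpieces/"))  # False -- hidden
```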

  9. how can i find owner of this site?

  10. If you want to deny a whole site, it would look like:
    User-agent: *
    Disallow: /

    or a certain directory:
    User-agent: *
    Disallow: /directory/

    A lot of times robots don’t listen, though.
