Voyager: The Kosmix Web Crawler

Voyager is Kosmix Corporation's web crawling robot. It fetches documents from the web to build the index for the Kosmix search engine (http://www.kosmix.com/). On this page, you'll find answers to the most frequently asked questions about the behavior of the Kosmix crawler.

Frequently Asked Questions

1. What is your crawler's HTTP user-agent string?

voyager/1.0

2. How often will Voyager access my web site?

Voyager attempts to access each web server no more than once every few seconds. This rate may occasionally increase due to network delays. It may also increase periodically as we test new crawler software while running our operational crawl at the same time.

3. How do I request that Voyager not crawl parts or all of my site?

The Robot Exclusion Standard provides a way for web site administrators to restrict robot access to their web server by specifying crawler directives in a file called /robots.txt. Voyager caches a copy of /robots.txt for each web server and refreshes it every 24 hours, so it may take up to 24 hours for any changes to be picked up.
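
If you want to check your rules before publishing them, the following is a minimal sketch using Python's standard urllib.robotparser module. It shows how that parser interprets a rule set, which is not necessarily identical to how Voyager interprets it. The /private/ path, the sample rules, and the helper name voyager_may_fetch are hypothetical and used only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep Voyager out of /private/, allow everything else.
ROBOTS_TXT = """\
User-agent: voyager
Disallow: /private/
"""


def voyager_may_fetch(robots_txt: str, path: str) -> bool:
    """Return True if Python's parser would let voyager/1.0 fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("voyager/1.0", path)


if __name__ == "__main__":
    print(voyager_may_fetch(ROBOTS_TXT, "/private/report.html"))  # False
    print(voyager_may_fetch(ROBOTS_TXT, "/index.html"))           # True
```

To block the whole site instead, the Disallow line would be "Disallow: /".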

4. How can I control how frequently Voyager visits my site?

Voyager respects a new directive in the /robots.txt file called "Crawl-delay". The syntax is "Crawl-delay: xx", where "xx" is the delay in seconds between successive crawler visits. If Voyager's access rate is problematic for your server, you can throttle it back to, say, once every 10 seconds with the following lines:

User-agent: voyager
Crawl-delay: 10

As with all /robots.txt changes, it will take up to 24 hours for Voyager to pick up the change.
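
To confirm that the directive is written in a form a parser can read, here is a small sketch using Python's standard urllib.robotparser module (Python 3.6 or later). It reports the delay that Python's parser extracts, assuming that matches what Voyager would read. The helper name crawl_delay_for is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The same two lines shown above.
ROBOTS_TXT = """\
User-agent: voyager
Crawl-delay: 10
"""


def crawl_delay_for(robots_txt: str, user_agent: str = "voyager/1.0"):
    """Return the Crawl-delay (in seconds) that applies to `user_agent`, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(crawl_delay_for(ROBOTS_TXT))                  # 10
    print(crawl_delay_for(ROBOTS_TXT, "otherbot/2.0"))  # None
```

A result of None for Voyager would suggest the directive is misspelled or placed outside the "User-agent: voyager" group.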

5. Why is Voyager trying to access a file called robots.txt that isn't on my server?

/robots.txt is a file that contains directives for web robots that restrict access to all or part of a web site. For information on how to create a /robots.txt file, see The Robot Exclusion Standard. If you just want to prevent the "file not found" error messages in your web server log, you can create an empty file named /robots.txt.

6. Why is Voyager attempting to download incorrect or non-existent links from my server?

Voyager discovers web pages by extracting links from other web pages that it already knows about. Often a page is removed from a web site, but links to it remain on other pages. Incorrect page references may also be created directly by a web page author through a typo or misspelling. When Voyager discovers these bogus links, it will attempt to crawl them.
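
To illustrate why stale links get requested, here is a rough sketch of link extraction using Python's standard html.parser module. It is not Voyager's actual implementation; the class LinkExtractor, the helper extract_links, and the example.com URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url: str, html: str) -> list:
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return extractor.links


if __name__ == "__main__":
    page = '<p><a href="/about.html">About</a> <a href="old-page.html">Old</a></p>'
    for url in extract_links("http://www.example.com/docs/index.html", page):
        print(url)
```

Nothing in the markup reveals whether old-page.html still exists, so a crawler can only find out by requesting it.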

7. Why isn't Voyager respecting my robots.txt file?

For efficiency reasons, Voyager caches a copy of the /robots.txt file locally, which it refreshes every 24 hours. It can therefore take up to 24 hours for changes in a /robots.txt file to get picked up by the crawler.

If the /robots.txt file is not in the proper location, it won't get picked up. Make sure you're following the Robot Exclusion Standard exactly.

If your web server is configured to block access to /robots.txt, the crawler won't be able to read it and will assume access to your entire site is disallowed.

8. I'd like to filter my logs. What IP addresses does Voyager crawl from?

We recommend filtering on the user-agent string rather than on IP address, because Voyager's IP addresses vary over time.
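
As one way to do this, the sketch below drops Voyager's requests from a list of log lines. It assumes your server writes the common "combined" log format, in which the user-agent is the last quoted field on each line, and that the user-agent string begins with the crawler's name. The helper names and sample log lines are hypothetical. The prefixes include the older names listed elsewhere in this FAQ.

```python
import re

# In the combined log format, the last quoted field is the user-agent.
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')

# Current and historical Kosmix crawler names, lowercased.
KOSMIX_AGENT_PREFIXES = ("voyager/", "voyager-hc/", "cfetch/")


def is_voyager(line: str) -> bool:
    """Return True if the log line's user-agent looks like a Kosmix crawler."""
    match = _LAST_QUOTED_FIELD.search(line)
    if match is None:
        return False
    return match.group(1).lower().startswith(KOSMIX_AGENT_PREFIXES)


def drop_voyager(lines):
    """Return only the log lines that were not produced by a Kosmix crawler."""
    return [line for line in lines if not is_voyager(line)]


SAMPLE_LOG = [
    '192.0.2.10 - - [21/Nov/2005:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 404 0 "-" "voyager/1.0"',
    '192.0.2.20 - - [21/Nov/2005:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for line in drop_voyager(SAMPLE_LOG):
        print(line)
```

If your log format differs, only the regular expression that locates the user-agent field should need to change.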

9. What other user-agents are or were used by the Kosmix crawler?

cfetch/1.0 and voyager-hc/1.0

On November 21st, 2005, we changed the name of our crawler from cfetch/1.0 to voyager/1.0 to be in line with our company naming scheme.

voyager-hc/1.0 was used for a test project from November 2007 to January 2008.

Please use voyager/1.0 in your /robots.txt file to specify rules for our crawler.

10. Why is Voyager retrieving the same page on my site multiple times?

Voyager keeps track of how frequently pages change so that it can maintain a fresh copy of each page. Pages that change frequently get crawled frequently.

11. I have additional questions or comments about Voyager. Who should I contact?

Please contact us with questions.