Hello folks,
We (@Google) launched Sitemaps to optimize how crawlers work with
webservers, moving from a hit-or-miss approach to something more directed.
Currently, webcrawlers (including ours) do not know about all pages on a
webserver, or when they change. (The FTP world has a simple "ls -lR"; the
web world has no equivalent.) Instead, our crawlers crawl pages that are
linked to from other pages and periodically check whether they have
changed, like a random web surfer.
Some of the key aspects of our proposal include (a) a simple XML protocol
we released under a Creative Commons 2.0 license so all webservers,
webmasters and search engines could benefit from a common approach, and
(b) an open-source sitemap generator in Python (@sourceforge) that produces
Sitemaps automatically for some common use cases.
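
To make the idea concrete, here is a minimal sketch of what such a generator
might do: walk a document root (the "ls -lR" step) and emit a Sitemap listing
each URL with its last-modified date. This is not the code of sitemap_gen.py;
the function names (entries_from_docroot, build_sitemap) are hypothetical, and
the namespace URI in SITEMAP_NS is an assumption that should be checked against
the protocol page before use.

```python
import os
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from urllib.parse import quote

# Assumption: schema namespace URI; verify against the published protocol.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def entries_from_docroot(docroot, base_url):
    """Yield (url, last_modified) for every file under docroot.

    Hypothetical helper: assumes files map one-to-one onto URLs below
    base_url, which is only true for simple static sites.
    """
    for dirpath, _dirnames, filenames in os.walk(docroot):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, docroot).replace(os.sep, "/")
            mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
            yield base_url.rstrip("/") + "/" + quote(rel), mtime


def build_sitemap(entries, namespace=SITEMAP_NS):
    """Return a Sitemap XML string for an iterable of (url, datetime or None)."""
    urlset = ET.Element("urlset", {"xmlns": namespace})
    for url, modified in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url  # special characters are XML-escaped
        if modified is not None:
            ET.SubElement(node, "lastmod").text = modified.strftime("%Y-%m-%d")
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Example paths only; substitute a real document root and site URL.
    print(build_sitemap(entries_from_docroot(".", "http://www.example.com/")))
```

A crawler that fetches this file learns every URL and when it last changed
in one request, instead of rediscovering pages by following links.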
It's been about 4 months since we launched, and webmasters have been using
the Sitemaps protocol (and client) to give us URLs for both small sites
(e.g., 100 URLs) and large sites (e.g., 10M+ URLs), so we figured it is time
to ping you guys. How would the Apache webserver folks react to something
like the Sitemaps protocol being supported in Apache "out of the box" (e.g.,
as a mod_sitemap), or to shipping the sitemap_gen.py tool (or some variant)
through http://httpd.apache.org/docs/2.1/programs/ as a support program
(similar to htdigest or htdbm)? And, in general, to offering additional
mechanisms for webservers to help webcrawlers (an increasing fraction of
webserver activity) much more directly?
thanks,
- shiva
------------------------------------------------------------------------
Some links...
1. About Sitemaps --
http://www.google.com/webmasters/sitemaps/docs/en/about.html
2. Sitemaps protocol --
http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
3. Google's open-source sitemap_gen.py --
http://www.google.com/webmasters/sitemaps/docs/en/sitemap-generator.html
4. Third-party sitemap generators for webservers/CMSes that currently
support Sitemaps -- http://code.google.com/sm_thirdparty.html