httpd-dev mailing list archives

From Shiva Shivakumar <shi...@gmail.com>
Subject Re: apache/sitemaps?
Date Sat, 08 Oct 2005 02:38:19 GMT
Hi, thanks for the responses so far about a potential mod_sitemap OR using
the current sitemap_gen tool. Since there are some questions about why this
is relevant to httpd or to Apache-running webmasters, here are some thoughts
on where we are coming from (apologies for the long email):

1. Some of us believe that webservers have two different audiences: (a)
regular users with browsers who come to a site requesting a single page or
doing some light browsing, and (b) webcrawlers that visit these servers to
crawl through them and periodically check whether pages have changed.
Current webservers are excellent at servicing the 1st kind -- you know a
URL, and you get the page back. However, since there is no real support for
crawlers that visit these sites regularly, crawlers do dumb things like
"follow-links" like a regular random surfer and
"periodically-check-if-page-changed."

2. Why not have a listing service on all webservers, so that crawlers can
check in one place for the list of ALL URLs that are available, along with
the corresponding metadata? (Metadata that can be easily computed
automatically, like the lastmod date, etc.) This is clearly not a new idea;
it is what ftp servers do :)

What is a Sitemap? A text file (like robots.txt) that gets auto-computed
with a listing of all known URLs and their lastmod times. The text file is
based on XML, for some structure and as an easy way to have required and
optional attributes. It is also structured so that it scales from a few URLs
to millions (without requiring massive downloadable files), and it has
log-structured semantics to support a variety of use cases (in terms of
generation and updates). Currently it is materialized as a text file on disk
rather than computed at request time, which could be expensive.
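
To make the format concrete, here is a rough sketch of how such a file
could be auto-computed by walking the document root. This is NOT the real
sitemap_gen.py; the DOCROOT, BASE values and the schema version in the
namespace URI are just illustrative assumptions:

#!/usr/bin/env python
# Rough sketch of auto-computing a sitemap by walking the document
# root -- not the actual sitemap_gen.py tool.  DOCROOT, BASE and the
# schema version in the namespace URI are illustrative assumptions.
import os
import time
from xml.sax.saxutils import escape

DOCROOT = "/var/www/html"            # hypothetical DocumentRoot
BASE = "http://www.example.com"      # hypothetical site URL

def w3c_datetime(epoch):
    """Format a file mtime as the W3C datetime used for <lastmod>."""
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(epoch))

def generate(out="sitemap.xml"):
    f = open(out, "w")
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">\n')
    for dirpath, dirnames, filenames in os.walk(DOCROOT):
        for name in filenames:
            if not name.endswith(".html"):
                continue                 # only list what we want crawled
            path = os.path.join(dirpath, name)
            url = BASE + path[len(DOCROOT):].replace(os.sep, "/")
            f.write("  <url>\n")
            f.write("    <loc>%s</loc>\n" % escape(url))
            f.write("    <lastmod>%s</lastmod>\n"
                    % w3c_datetime(os.path.getmtime(path)))
            f.write("  </url>\n")
    f.write("</urlset>\n")
    f.close()

if __name__ == "__main__":
    generate()

The point is that everything in the file (URL list, lastmod) can be
computed mechanically from data the webserver already has.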

If a webserver has an auto-computed sitemap, a crawler can learn the full
list of URLs that the webserver has, crawl them, and index the most relevant
pages (instead of random pages that happen to be linked through hrefs).
Also, the crawler can put less load on the webserver by only requesting
pages that have changed (for example, by using the lastmod date).
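
The crawler-side logic could be as simple as the sketch below. The
sitemap_entries structure, the last_crawled store, and the fetch routine
are assumptions for illustration, not part of any existing tool:

# Sketch of a crawler using sitemap lastmod data to skip unchanged
# pages.  sitemap_entries and last_crawled are assumed data structures;
# the actual HTTP fetch is left to the crawler's own machinery.
import calendar
import time

def changed_since_last_crawl(lastmod, last_crawl_epoch):
    """lastmod is the W3C datetime string from the sitemap (UTC)."""
    changed_at = calendar.timegm(time.strptime(lastmod, "%Y-%m-%dT%H:%M:%SZ"))
    return changed_at > last_crawl_epoch

def recrawl(sitemap_entries, last_crawled, fetch):
    """sitemap_entries: list of (url, lastmod) pairs parsed from the sitemap.
       last_crawled:    dict mapping url -> epoch seconds of our last fetch.
       fetch:           callable that downloads and indexes one URL."""
    for url, lastmod in sitemap_entries:
        if changed_since_last_crawl(lastmod, last_crawled.get(url, 0)):
            fetch(url)                     # only hit the server when needed
            last_crawled[url] = time.time()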

A few questions about the validity of the above argument:
1. Are webcrawlers useful as an audience for a webserver? We think search
engines should be comprehensive and able to parse through all pages. Without
some sort of listing support like sitemaps, search engines will be
incomplete. (And we could debate whether search engines are useful or not,
in terms of getting people to a webserver in the first place.)

2. Are webcrawlers sending that much traffic to webservers, compared to
regular web users? We think there is a lot of crawler activity on the web
right now, and we have anecdotal evidence that crawlers account for a pretty
large fraction of webserver activity. I am curious what you think from your
own experience -- perhaps stats on apache.org would be useful.
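
As a rough starting point, something like the snippet below could estimate
that fraction from an Apache combined-format access log. The list of
user-agent substrings is just a guess at what counts as a crawler:

# Rough estimate of how much of an access log is crawler traffic.
# Assumes the combined log format, where the user-agent is the last
# quoted field; the BOT substrings are only examples.
import sys

BOTS = ("googlebot", "slurp", "msnbot", "spider", "crawler", "bot")

total = bots = 0
for line in open(sys.argv[1]):
    total += 1
    try:
        ua = line.rsplit('"', 2)[-2].lower()
    except IndexError:
        continue                       # malformed line, skip it
    if any(b in ua for b in BOTS):
        bots += 1

print("%d of %d requests (%.1f%%) look like crawler traffic"
      % (bots, total, 100.0 * bots / max(total, 1)))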

comments/insults?
- shiva


On 10/7/05, Joshua Slive <joshua@slive.ca> wrote:
>
> Greg Stein wrote:
>
> > Ignore the mod_sitemap suggestion for now. As Shiva stated in his
> > note, there is also sitemap_gen.py and its related docco [which exists
> > today]. What are the group's thoughts on that?
>
> I think the basic question is: how would this benefit our users? It
> seems like sitemap_gen.py is easy enough to grab from google.
>
> Joshua.
>
