tomcat-users mailing list archives

From "Len Popp" <len.p...@gmail.com>
Subject Re: Web spiders - disabling jsessionid
Date Fri, 01 Dec 2006 19:02:55 GMT
On 12/1/06, Christopher Schultz <chris@christopherschultz.net> wrote:
> Mikolaj,
>
> Back to the original question...
>
> Mikolaj Rydzewski wrote:
> > As you may know url rewriting feature is not a nice thing when spiders
> > come to index your site -
> > http://gabrito.com/post/javas-seo-blunder-jsessionid.
>
> So, the problem is that your URLs contain ";jsessionid=...", right? When
> does that become a problem?
>
> That becomes a problem when Google (or whoever) crawls your site on
> different days and sees the same content with "different" URLs. Well, I
> have a couple of thoughts about that.
>
> 1. A semi-colon is listed in the HTTP specification as being a valid
>    delimiter, despite pretty much every major web server out there
>    ignoring it and thinking that it's part of the path.
>    This is partially the crawler's fault for not following the HTTP
>    specification. The ";" character is not technically a valid URL
>    character outside of its role as a delimiter, just like "&" or "?".

Whether or not you consider it part of the URL, Google treats it that
way, and so we have to live with it.

> 2. If you strip off the jsessionid argument for all of these URLs,
>    you will end up with thousands of sessions being created for
>    each URL requested by the google bot. Do you think that's a good
>    idea?

As far as I can see, that's not a problem - I don't get anywhere near
a thousand live sessions from Google. In fact, Google's crawler seems
to limit itself to about one page per minute (according to my logs), so
with Tomcat's default 30-minute session timeout there won't be more
than a few dozen sessions alive at any one time.

> 3. If you don't want googlebot to get a session, why are you allocating
>    one? If you need sessions to manage site navigation, then you
>    cannot turn them off and have things work correctly... can you?

On my site (as on many others) you can browse the site without a
session, but if you want to log in (to add content or to use
personalized settings) you need a session. Sessions aren't required
for site navigation or crawling, but they are required for other
reasons.

> 4. Consider instructing googlebot not to crawl certain portions of your
>    site (those which require a session) by using a robots.txt file.

Not an option if that would mean not indexing the interesting parts of the site.

The best solution I could find is to use a filter and
HttpServletResponse wrapper, as others have described. An
implementation of the wrapper class can be found here:
http://mail-archives.apache.org/mod_mbox/struts-user/200311.mbox/%3c3FBA8A39.8010907@fiskars.com%3e
The result is that to log in to the site you need a browser that
supports session cookies, but I can accept that. And Google can now
index the site without crawling all over it repeatedly with different
jsessionids. Yay.
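
For anyone who wants to see the shape of the idea without following the
link, here is a minimal sketch. It is not the linked implementation. To
keep it runnable without the servlet API on the classpath, it uses a
hypothetical stand-in interface (UrlEncodingResponse) and a fake
container response (ContainerResponse); both names are made up for
illustration. In a real webapp the wrapper would presumably extend
javax.servlet.http.HttpServletResponseWrapper and be passed down the
chain from a Filter's doFilter().

```java
import java.util.Objects;

public class NoSessionIdSketch {

    /** Hypothetical stand-in for the relevant part of HttpServletResponse. */
    interface UrlEncodingResponse {
        String encodeURL(String url);
        String encodeRedirectURL(String url);
        String getContentType();
    }

    /** Fake container response: appends ;jsessionid=... as a container may do without cookies. */
    static class ContainerResponse implements UrlEncodingResponse {
        private final String sessionId;

        ContainerResponse(String sessionId) {
            this.sessionId = Objects.requireNonNull(sessionId);
        }

        public String encodeURL(String url) { return rewrite(url); }
        public String encodeRedirectURL(String url) { return rewrite(url); }
        public String getContentType() { return "text/html"; }

        // The session ID goes after the path and before any query string.
        private String rewrite(String url) {
            String suffix = ";jsessionid=" + sessionId;
            int q = url.indexOf('?');
            return q < 0 ? url + suffix : url.substring(0, q) + suffix + url.substring(q);
        }
    }

    /** The wrapper: delegates everything except URL encoding, which becomes a no-op. */
    static class NoRewriteResponse implements UrlEncodingResponse {
        private final UrlEncodingResponse delegate;

        NoRewriteResponse(UrlEncodingResponse delegate) {
            this.delegate = Objects.requireNonNull(delegate);
        }

        public String encodeURL(String url) { return url; }          // never add jsessionid
        public String encodeRedirectURL(String url) { return url; }  // same for redirects
        public String getContentType() { return delegate.getContentType(); }
    }

    public static void main(String[] args) {
        UrlEncodingResponse raw = new ContainerResponse("ABC123");
        UrlEncodingResponse wrapped = new NoRewriteResponse(raw);
        System.out.println(raw.encodeURL("/list.jsp?page=2"));
        System.out.println(wrapped.encodeURL("/list.jsp?page=2"));
    }
}
```

The trade-off is the one described above: with rewriting disabled, the
session only survives if the browser accepts the session cookie.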
-- 
Len
