From: Pid
Reply-To: pid@pidster.com
Date: Tue, 09 Feb 2010 17:14:30 +0000
To: Tomcat Users List
Subject: Re: JSESSIONID and impact on google

On 09/02/2010 16:32, Marian Simpetru wrote:
> jsessionid in URLs returned around 79 million search results.

Yep. I know they're there.

> google search on jsessionid SEO will give you lots of examples.
>
> On a question asked to google, they reply by explaining the algorithm
> (multiple URL with same content -> lower ranking, JSESSIONID=zzz ->
> multiple URLS)
>
> I can see there is a penalty in google webmaster tools. Can't say on
> other websites...

This I also know. But as I said, it would be surprising *to me* to find
that Google weren't trying to filter this type of noise out of their URL
indexes.

Having thought about it a little more, I would like to add that we
implement XML Sitemaps on our site and this may be having an effect on
matters.

  http://sitemaps.org/protocol.php

When I look in my logs I can see sequential(ish) requests for URLs from
all of the bots hitting our site and they do not have session id
parameters appended.

Of 68600 URLs appearing in the Google index of the site I have in mind,
only 46 match a search for jsessionid and some of those appear because
the HTML contains a URL to another site with the parameter present.

The total number of URLs referenced in the XML sitemaps is somewhat
below the total indexed on this domain and the difference is markedly
larger than 46.

I, perhaps hastily, have concluded that search engines are somehow
storing pages without the session id parameter present in the URL.
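For anyone who wants to run the same kind of check against their own URL lists or access logs, here is a minimal sketch of the idea, not anything Tomcat or Google ships. The class name SessionIdUrls and its methods are made up for illustration, and the pattern assumes the session id is appended as a ";jsessionid=..." path parameter, which is how it appears in rewritten URLs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tomcat): spots and removes the
// ";jsessionid=..." path parameter that URL rewriting appends.
final class SessionIdUrls {

    // Assumption: the id runs until the next '?', '#', ';' or '/'.
    private static final Pattern JSESSIONID =
            Pattern.compile(";jsessionid=[^?#;/]*", Pattern.CASE_INSENSITIVE);

    // True if the URL carries a session id path parameter.
    static boolean hasSessionId(String url) {
        return JSESSIONID.matcher(url).find();
    }

    // Returns the URL with any session id path parameter removed.
    static String strip(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    // Counts how many URLs in the list carry a session id.
    static long countWithSessionId(List<String> urls) {
        return urls.stream().filter(SessionIdUrls::hasSessionId).count();
    }

    public static void main(String[] args) {
        // Made-up sample URLs, not taken from any real log.
        List<String> sample = Arrays.asList(
                "/shop/item.jsp;jsessionid=0123456789ABCDEF?id=7",
                "/shop/item.jsp?id=7",
                "/index.jsp;jsessionid=FEDCBA9876543210");
        System.out.println(countWithSessionId(sample)); // prints 2
        for (String url : sample) {
            System.out.println(strip(url));
        }
    }
}
```

Stripping the parameter maps the first two sample URLs onto the same address, which is the de-duplication I am guessing the search engines perform.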

p

> Marian
>
> On Tue, 2010-02-09 at 16:07 +0000, Pid wrote:
>> On 09/02/2010 15:46, Christopher Schultz wrote:
>> > -----BEGIN PGP SIGNED MESSAGE-----
>> > Hash: SHA1
>> >
>> > Marian,
>> >
>> > On 2/9/2010 9:31 AM, Marian Simpetru wrote:
>> >> Google act as a non cookie browser and hence he is served with non
>> >> unique URLs (because of session ID is appended to URL).
>> >
>> > I heard at one point that Google's crawler *did* support cookies. I
>> > never verified that, but it sounds like they currently do not support them.
>> >
>> >> Question is: Is there a way to configure tomcat to only use cookies (not
>> >> append jsessionid to URL for cookie-less browsers).
>> >
>> > It's not a Tomcat configuration, but you can always write a filter like
>> > this:
>> >
>> > public class NoURLRewriteFilter
>> >   implements Filter
>> > {
>> >   public void doFilter(...) {
>> >     chain.doFilter(request, new HttpServletResponseWrapper(response) {
>> >       // Return every URL unchanged so no jsessionid is appended.
>> >       public String encodeURL(String url) { return url; }
>> >       public String encodeUrl(String url) { return url; }
>> >       public String encodeRedirectURL(String url) { return url; }
>> >       public String encodeRedirectUrl(String url) { return url; }
>> >     });
>> >   }
>> > }
>> >
>> > Now, this will likely cause an explosion in the number of sessions
>> > generated by Google's crawler. You might want to couple this with a
>> > separate filter (or just create a GoogleCrawlerFilter that does all
>> > this) that identifies Google's (and others) user agent and intercepts
>> > calls to getSession() and either refuses to create a session (probably
>> > not a good idea) or returns a fake session that gets discarded after
>> > every request. Another option would be to set the session timeout to
>> > something like 10 seconds so the session dies relatively quickly instead
>> > of sticking around for a long time, wasting memory.
>> >
>> >> Maybe a better idea would be that someone from Apache Tomcat should push
>> >> to google with some standards tomcat implement in this respect so that
>> >> google change the algorithm and not punish with low ranking websites
>> >> powered by tomcat.
>> >
>> > This is not a "Tomcat problem": it's a problem with any site that
>> > requires sessions to maintain state on the server.
>> >
>> > I agree with Chuck: fix your webapp to tolerate Google's crawler, or
>> > suffer the consequences.
>> >
>> > Something else you can do is use a robots.txt file to prevent the
>> > crawler from hitting certain URLs. That might help.
>>
>> I'm not doing anything special, I don't think.
>> Google bots hit our site, the session count goes up a bit.
>> Google does not include jsessionid in the URLs it indexes.
>>
>> It may be that the site has been around for long enough that the Google
>> algorithms know that we have a session id that should be removed from a URL.
>>
>> It would be surprising to me if Google (et al) was not trying to remove
>> PHPSESSIONID and JSESSIONID data from URLs.
>>
>> p
>>
>> > - -chris
>> > -----BEGIN PGP SIGNATURE-----
>> > Version: GnuPG v1.4.10 (MingW32)
>> > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>> >
>> > iEYEARECAAYFAktxg08ACgkQ9CaO5/Lv0PBxDACgweTaZAglz476s7TvYo63//2a
>> > IgcAoIp0u2ZxOes8fFPuUAoP2FrHk/VN
>> > =FjsP
>> > -----END PGP SIGNATURE-----
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
>> > For additional commands, e-mail: users-help@tomcat.apache.org
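
On the suggestion quoted above to identify crawler user agents and give them short-lived sessions, the detection step might look roughly like the sketch below. The class name CrawlerAgents and the token list are assumptions made for illustration. The tokens are examples only, so check what each crawler actually sends before relying on them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: guesses from the User-Agent header whether a
// request comes from a search engine crawler.
final class CrawlerAgents {

    // Illustrative tokens only (assumption); extend or replace as needed.
    private static final List<String> TOKENS =
            Arrays.asList("googlebot", "bingbot", "slurp");

    // True if the header contains one of the known crawler tokens.
    static boolean isCrawler(String userAgent) {
        if (userAgent == null) {
            return false; // no header, treat as an ordinary client
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String token : TOKENS) {
            if (ua.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
        String browser = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US) Gecko/20100111";
        System.out.println(isCrawler(bot));     // prints true
        System.out.println(isCrawler(browser)); // prints false
    }
}
```

Inside a filter, a request whose User-Agent header matches could then have its session shortened with HttpSession.setMaxInactiveInterval(10), using the 10 second figure suggested in the thread. User-Agent strings are trivially spoofed, so this is only a memory-saving heuristic, not access control.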