hadoop-common-user mailing list archives

From edward choi <mp2...@gmail.com>
Subject Re: Question from a Desperate Java Newbie
Date Thu, 16 Dec 2010 06:14:39 GMT
I totally obey the robots.txt since I am only fetching RSS feeds :-)
I implemented my crawler with HttpClient and it is working fine.
I often get messages about "Cookie rejected", but am able to fetch news
articles anyway.
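Obeying robots.txt for a feed crawler mostly comes down to checking each URL path against the site's Disallow rules before fetching. A minimal prefix-match sketch (the rule list, paths, and the `isAllowed` helper are illustrative only; a real crawler should also handle `Allow` lines, wildcards, and per-agent sections):

```java
import java.util.List;

public class RobotsCheck {
    // Hypothetical helper: returns true if a path is permitted given the
    // Disallow prefixes parsed from robots.txt for our user-agent.
    static boolean isAllowed(String path, List<String> disallowRules) {
        for (String rule : disallowRules) {
            // An empty Disallow line means "allow everything", so skip it.
            if (!rule.isEmpty() && path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> disallow = List.of("/search", "/private/");
        System.out.println(isAllowed("/rss/world.xml", disallow));  // true
        System.out.println(isAllowed("/search?q=hadoop", disallow)); // false
    }
}
```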

I guess the default "java.net" client is the stateful client you mentioned.
Thanks for the tip!!


On Dec 16, 2010 at 2:18 AM, Steve Loughran <stevel@apache.org> wrote:

> On 10/12/10 09:08, Edward Choi wrote:
> > I was wrong. It wasn't because of the "read once free" policy. I tried
> again with plain Java and this time it didn't work.
> > I looked up google and found the Http Client you mentioned. It is the one
> provided by apache, right? I guess I will have to try that one now. Thanks!
> >
> httpclient is good. HtmlUnit has a very good client that can simulate
> a full web browser, cookies included, but that may be overkill.
> NYT's read-once policy uses cookies to verify that you are on your
> first day of not being logged in; on later days you get 302'd unless
> you delete the cookie, so stateful clients are bad here.
> What you may have been hit by is whatever robot trap they have - if you
> generate too much load and don't follow the robots.txt rules, they may
> detect this and push back.
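The stateful-client problem Steve describes can be sidestepped in the stock `java.net` stack, which only becomes stateful once a `CookieManager` is installed. Installing one with `CookiePolicy.ACCEPT_NONE` keeps every request cookie-free, so a site cannot track "first visit" state across fetches. A minimal sketch (the URI and cookie value are made up for illustration):

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;
import java.util.List;
import java.util.Map;

public class StatelessFetch {
    public static void main(String[] args) throws Exception {
        // A CookieManager that refuses all cookies: subsequent
        // HttpURLConnection requests stay stateless.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_NONE);
        CookieHandler.setDefault(manager);

        // Simulate a server response that tries to set a tracking cookie.
        URI uri = URI.create("http://news.example.com/article");
        Map<String, List<String>> responseHeaders =
                Map.of("Set-Cookie", List.of("visits=1; Path=/"));
        manager.put(uri, responseHeaders);

        // Nothing was stored, so the next request carries no Cookie header.
        List<HttpCookie> stored = manager.getCookieStore().getCookies();
        System.out.println("stored cookies: " + stored.size()); // prints 0
    }
}
```

Apache HttpClient offers an equivalent knob (a cookie policy/spec that ignores cookies), so the same idea carries over if you stay with HttpClient rather than `java.net`.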
