hc-dev mailing list archives

From Henri Yandell <flame...@gmail.com>
Subject Re: Jakarta HttpXXX Charter
Date Fri, 02 Sep 2005 09:05:33 GMT
On 9/2/05, Oleg Kalnichevski <olegk@apache.org> wrote:
> On Thu, Sep 01, 2005 at 10:30:29PM -0400, Henri Yandell wrote:
> > Never got round to adding it to Commons, robots.txt parser:
> >
> > http://www.osjava.org/norbert/ -> http://www.robotstxt.org/wc/norobots-rfc.html
> >
> > Web-spider:
> >
> > http://www.osjava.org/scraping-engine/
> >
> > HTML pseudo-scraper (probably more for Jakarta Silk/Web Components):
> >
> > http://www.osjava.org/genjava/multiprojects/gj-scrape/ (poor site at
> > the moment, it's a substring()/indexOf() parsing system instead of
> > trying to be fancy).
> >
> > Hen
> >
> 
> Henri,
> 
> I think a web spider and robots.txt parser would be a welcome addition
> to the project. If you are personally interested in porting these
> applications to use HttpClient / Http Components go ahead and add the
> web spider to the project goals and yourself to the list of initial
> committers. In my opinion voting you in to a committer status is a
> matter of formality.

The robots.txt parser currently makes a single GET request using
HttpURLConnection, so moving it to HttpClient is pretty easy (if it is
even thought necessary; adding a dependency for one method call is
usually overkill). I will go ahead and add this to the list, as it has
very little religion.
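
For illustration, a minimal sketch of what such a single GET might look
like with the JDK's HttpURLConnection alone. This is not the actual
norbert code: the class name RobotsFetch, the helper robotsUrlFor, the
five-second timeouts, and the choice to treat any non-200 response as an
empty file are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical class; not taken from the norbert sources.
class RobotsFetch {

    // Hypothetical helper: derive the robots.txt location from any URL on a site.
    // A port of -1 means "no explicit port" and is omitted from the result.
    static URL robotsUrlFor(URL site) throws IOException {
        return new URL(site.getProtocol(), site.getHost(), site.getPort(), "/robots.txt");
    }

    // Issue one GET and return the body as text.
    // Assumption: any status other than 200 is treated as "no robots.txt".
    static String fetch(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000); // assumed timeout, in milliseconds
        conn.setReadTimeout(5000);    // assumed timeout, in milliseconds
        try {
            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return "";
            }
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }
}
```

With this little networking involved, swapping in HttpClient would mostly
mean replacing the body of fetch, which is why the extra dependency may
not be worth it for the parser alone.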

The web-spider might want a bit more investigation on the community's
part. It had its guts ripped out to form a kind of container project
called oscube, so it has a dependency on that, and it might be scoped a
bit beyond what Http Components would want from a spider: cron via
Quartz, notification, database storage, etc. It already uses HttpClient
for its fetching (along with Commons Net for FTP).

http://svn.osjava.org/cgi-bin/viewcvs.cgi/trunk/scraping-engine/xdocs/manual/images/Scrapers.png?rev=1967&view=auto

So it is a bit more than the simple wget clone that might have been
envisioned.  :) The plan is to add a mini-scraping language to it,
support POP, and possibly end up with some kind of rules engine/job
language. A lot of religion for HttpClient to swallow, but it is there
if it piques interest.

Hen

---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: httpclient-dev-help@jakarta.apache.org

