commons-user mailing list archives

From Henri Yandell <flame...@gmail.com>
Subject Re: [HttpClient] Screen Scraping Components?
Date Mon, 22 Nov 2004 15:07:50 GMT
I'll keep it on list for a bit as they're both bits that I'd like to
see at Apache.

HtmlScraper, which is the main class of use in gj-scrape, is all about
pulling the desired data out of a piece of text. An XML parser, a regexp
or simple string manipulation would also do the job; I just like the
fact that HtmlScraper speaks the right language for scraping a page.
It doesn't try to parse the page itself. Parsing the whole page always
makes me worry that the surface area of the scraper is too large.

The scraping-engine on the other hand does everything but scrape the
actual page.  You extend a Parser object, add a bit of configuration
and start running it. When you extend the Parser object, you could use
regexp, xml parsing or HtmlScraper.

(Oops, I realise the confusion).

The examples in scraping-engine extend a custom version of Parser
called UrlScraper (nice of me to switch names eh?) which goes ahead
and sets you up to use an HtmlScraper by default. It makes the actual
implementation very simple; for example, User Friendly's code is:

        scraper.moveToTagWith("ALT", "Latest Strip");
        return scraper.get("IMG[SRC]");

UrlScraper assumes that the result of the parse will be a URL in
String form, which can then be configured to be stored in a file.
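
To make the "move to a tag, then pull one attribute" idea concrete, here
is a minimal sketch in plain Java. MiniScraper is a hypothetical stand-in
written only for illustration; it is not the gj-scrape HtmlScraper, and
the behaviour of its moveToTagWith and get methods is an assumption
modelled on the two calls above (case-sensitive, double-quoted
attributes only).

```java
/** Hypothetical stand-in for illustration only; NOT the gj-scrape HtmlScraper. */
class MiniScraper {
    private final String page;
    private int cursor = 0; // index of the '<' that opens the current tag

    MiniScraper(String page) {
        this.page = page;
    }

    /**
     * Finds the next occurrence of attr="value" at or after the cursor and
     * backs up to the '<' that opens the enclosing tag.
     * Returns false (cursor unchanged) if nothing matches.
     */
    boolean moveToTagWith(String attr, String value) {
        String needle = attr + "=\"" + value + "\"";
        int hit = page.indexOf(needle, cursor);
        if (hit < 0) {
            return false;
        }
        int tagStart = page.lastIndexOf('<', hit);
        if (tagStart < 0) {
            return false;
        }
        cursor = tagStart;
        return true;
    }

    /**
     * Reads a spec of the form TAG[ATTR] against the tag at the cursor.
     * Returns the attribute value, or null if the current tag is not TAG
     * or does not carry ATTR.
     */
    String get(String spec) {
        int open = spec.indexOf('[');
        int close = spec.indexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("expected TAG[ATTR], got: " + spec);
        }
        String tag = spec.substring(0, open);
        String attr = spec.substring(open + 1, close);

        int end = page.indexOf('>', cursor);
        if (end < 0) {
            return null;
        }
        // Everything from '<' up to (not including) '>' of the current tag.
        String current = page.substring(cursor, end);
        if (!current.regionMatches(true, 1, tag, 0, tag.length())) {
            return null; // tag name differs (compared ignoring case)
        }
        String key = attr + "=\"";
        int start = current.indexOf(key);
        if (start < 0) {
            return null;
        }
        start += key.length();
        int stop = current.indexOf('"', start);
        return stop < 0 ? null : current.substring(start, stop);
    }
}
```

The point of the sketch is that only the one tag of interest is ever
inspected; the rest of the page can change without breaking the scrape.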

For data, you extend AbstractParser and implement:

        public Result parse(Page page, Config cfg, Session session)
                throws ParsingException;

UrlScraper: http://svn.osjava.org/cgi-bin/viewcvs.cgi/trunk/scraping-engine/src/java/org/osjava/scraping/parser/UrlScraper.java?rev=1150&view=auto
is the only online example of this.
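
As a rough illustration of what implementing that parse method for data
might look like, here is a self-contained sketch. Every type in it (Page,
Config, Session, Result, ParsingException, AbstractParser) is a
hypothetical stand-in declared locally so the example compiles on its
own; the real scraping-engine classes will have different members. The
BalanceParser class and its "Balance:" marker are likewise invented for
the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: local stand-ins, NOT the real scraping-engine classes. */
class ParserSketch {

    // Hypothetical minimal stand-ins named after the types in the signature.
    static class Page {
        final String text;
        Page(String text) { this.text = text; }
    }
    static class Config { }
    static class Session { }
    static class Result {
        final Map<String, String> fields = new LinkedHashMap<>();
    }
    static class ParsingException extends Exception {
        ParsingException(String message) { super(message); }
    }

    static abstract class AbstractParser {
        public abstract Result parse(Page page, Config cfg, Session session)
                throws ParsingException;
    }

    /**
     * Example parser (assumed page layout): pulls the text that follows
     * "Balance:" up to the next tag, using plain string operations.
     */
    static class BalanceParser extends AbstractParser {
        @Override
        public Result parse(Page page, Config cfg, Session session)
                throws ParsingException {
            String marker = "Balance:";
            int at = page.text.indexOf(marker);
            if (at < 0) {
                throw new ParsingException("no balance marker found");
            }
            int start = at + marker.length();
            int end = page.text.indexOf('<', start);
            if (end < 0) {
                end = page.text.length(); // no following tag: take the rest
            }
            Result result = new Result();
            result.fields.put("balance", page.text.substring(start, end).trim());
            return result;
        }
    }
}
```

The body of parse could just as well use a regexp, an XML parser or an
HtmlScraper; the surrounding engine only sees the returned Result.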

Anyways. Got to take a baby to the doctor for a checkup. How did all that sound?

Hen

On Mon, 22 Nov 2004 06:46:00 -0500, Brant Hahn <brant.hahn@insightbb.com> wrote:
> Hi Henri,
> 
> Sorry for the extra email here.  I think I am also particularly interested
> in differentiating between the Page, Fetchers, Parsers, and Scrapers.  I'm
> just trying to distinguish who does what.  I think my primary confusion goes
> between HttpFetcher and HtmlScraper.  What exactly is the difference, and in
> what situations would I use one over the other?
> 
> BTW, besides some of the minor confusion I have, I feel that this is what
> I've been looking for.  I'm primarily wanting to use it to log-in to my
> financial accounts and to read my monthly statements that are online.  I
> definitely appreciate you informing me about your components.  Your help is
> greatly appreciated!
> 
> -Brant
> 
> 
> 
> -----Original Message-----
> From: Henri Yandell [mailto:flamefew@gmail.com]
> Sent: Saturday, November 20, 2004 7:06 PM
> To: Jakarta Commons Users List
> Subject: Re: [HttpClient] Screen Scraping Components?
> 
> Couple of components that might be of interest.
> 
> http://www.osjava.org/genjava/multiproject/gj-scrape/
> 
> Firstly a library for scraping a web page. It's a wrapper around
> simple string manipulation aimed to let you specify what you want from
> the page without parsing into an XML tree, or trying to use regex. The
> problem with the XML tree is that your scraper touches too much
> of the page and so is less stable. Scraping is about keeping the
> surface area you touch as small as possible, hopefully just the
> data itself.
> 
> Regexes are useful for grabbing the data once you get close enough,
> but are not the right thing to use to walk through the tags. Gj-Scrape
> is a basic API for walking through a page.
> 
> Secondly, an engine for scraping:
> 
> http://www.osjava.org/scraping-engine/
> 
> A lot of time with scrapers is wasted writing the surrounding code.
> Getting the page, setting up the config in some cron'd way, putting it
> in a db etc. Scraping-engine is everything except for the actual
> parsing of the page, which you custom create using gj-scrape and plug
> in.
> 
> It uses HttpClient for its page-grabbing, and isn't tied to scraping;
> I've a link-checker written using it as the framework. Grabbing the
> cartoon-scraping example is the best way to understand it.
> 
> Hen
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-user-help@jakarta.apache.org
> 
>


