lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Boris Aleksandrovsky <balek...@gmail.com>
Subject Re: header/footer identification and general scaping tools
Date Mon, 28 Jun 2010 20:48:29 GMT
Thanks, Sashi, I am asking more about a general library which will remove
those HTML element which are unwanted/useless for indexing. For instance, we
are using a general method to remove headers by comparing the structure of
HTML on the top-level document from the site (e.g. www.nytimes.com) and the
page being crawled (which happens to be further down in the hierarchy).
Generally the difference will be the header or the footer. Is there a
library out there which contains a collection of hacks like that?

On Mon, Jun 28, 2010 at 1:31 PM, Shashi Kant <skant@sloan.mit.edu> wrote:

> I have used TagSoup to parse the HTML and get the elements of interest.
> http://ccil.org/~cowan/XML/tagsoup/
>
>
>
> On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky
> <baleksan@gmail.com> wrote:
> > I was wondering if any of you know of any open-source solutions for
> general
> > issues which arise in web crawling - how do you remove
> > headers/footers/javascript and generally cleanup html of a web-page
> before
> > indexing? We have a first-pass solution implemented using custom code,
> but
> > this must be a problem which a lot of people face, so I am asking here.
> >
> > Thanks,
> > Boris
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message