lucene-java-user mailing list archives

From Shashi Kant <sk...@sloan.mit.edu>
Subject Re: header/footer identification and general scraping tools
Date Mon, 28 Jun 2010 20:31:23 GMT
I have used TagSoup to parse the HTML and get the elements of interest.
http://ccil.org/~cowan/XML/tagsoup/
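
A rough sketch of what "get the elements of interest" can look like, offered as an illustration and not as the code used in the setup described above. TagSoup is a SAX parser (its `org.ccil.cowan.tagsoup.Parser` class implements `XMLReader`), so the filtering logic lives in an ordinary SAX handler. The class name `BoilerplateStripper` and the list of skipped tags are hypothetical choices. To run without extra jars, the sketch drives the handler with the JDK's own SAX parser, which only accepts well-formed markup.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects page text while ignoring selected elements.
class BoilerplateStripper extends DefaultHandler {
    // Assumption: these tag names mark content that should not be indexed.
    private static final Set<String> SKIPPED =
            Set.of("script", "style", "header", "footer", "nav");

    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside a skipped element

    // Non-namespace-aware parsers leave localName empty, so fall back to qName.
    private static String tag(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (skipDepth > 0 || SKIPPED.contains(tag(localName, qName))) {
            skipDepth++; // track nesting so the matching end tag closes the skip
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        } else {
            text.append(' '); // keep words from adjacent elements apart
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    String text() {
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    // Parses markup with the supplied reader and returns the retained text.
    static String extractText(XMLReader reader, String markup) {
        try {
            BoilerplateStripper handler = new BoilerplateStripper();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(markup)));
            return handler.text();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    // Convenience overload using the JDK SAX parser (well-formed input only).
    static String extractText(String markup) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return extractText(reader, markup);
        } catch (IllegalStateException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not create parser", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><header>Site menu</header>"
                + "<p>Real content</p><footer>Copyright</footer></body></html>";
        System.out.println(extractText(page));
    }
}
```

For real crawled pages, with the TagSoup jar on the classpath, passing `new org.ccil.cowan.tagsoup.Parser()` as the `reader` argument should let the same handler cope with malformed HTML. Skipping by tag name only catches headers and footers that are marked up as such; pages that build them from generic `div` elements would need extra rules.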



On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky
<baleksan@gmail.com> wrote:
> Does anyone know of open-source solutions for the general problems that
> arise in web crawling: how do you remove headers, footers, and JavaScript,
> and generally clean up the HTML of a web page before indexing? We have a
> first-pass solution implemented in custom code, but many people must face
> this problem, so I am asking here.
>
> Thanks,
> Boris
>


