lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: header/footer identification and general scaping tools
Date Mon, 28 Jun 2010 20:56:44 GMT
Boris, you might wanna look at http://code.google.com/p/boilerpipe/

simon

On Mon, Jun 28, 2010 at 10:48 PM, Boris Aleksandrovsky
<baleksan@gmail.com> wrote:
> Thanks, Sashi, I am asking more about a general library which will remove
> those HTML element which are unwanted/useless for indexing. For instance, we
> are using a general method to remove headers by comparing the structure of
> HTML on the top-level document from the site (e.g. www.nytimes.com) and the
> page being crawled (which happens to be further down in the hierarchy).
> Generally the difference will be the header or the footer. Is there a
> library out there which contains a collection of hacks like that?
>
> On Mon, Jun 28, 2010 at 1:31 PM, Shashi Kant <skant@sloan.mit.edu> wrote:
>
>> I have used TagSoup to parse the HTML and get the elements of interest.
>> http://ccil.org/~cowan/XML/tagsoup/
>>
>>
>>
>> On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky
>> <baleksan@gmail.com> wrote:
>> > I was wondering if any of you know of any open-source solutions for
>> general
>> > issues which arise in web crawling - how do you remove
>> > headers/footers/javascript and generally cleanup html of a web-page
>> before
>> > indexing? We have a first-pass solution implemented using custom code,
>> but
>> > this must be a problem which a lot of people face, so I am asking here.
>> >
>> > Thanks,
>> > Boris
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message