lucene-general mailing list archives

From Christian Kohlschütter <kohlschuet...@L3S.de>
Subject Re: Announcement: Boilerplate removal library
Date Mon, 14 Dec 2009 14:23:35 GMT
Hi Ted,

thanks for your email, and sorry for the late reply; I had overlooked your posting.

Adding Boilerpipe to Lucene is definitely a good idea (I have been working with such a setup
for a long time now).
Integrating it into an Analyzer should be fairly simple, as Boilerpipe can return a string,
which in turn can be parsed just like any other text.
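
The extract-then-parse idea can be sketched roughly as follows. This is only an illustration in plain Java, not Boilerpipe or Lucene code: STRIP_TAGS is a hypothetical stand-in for the extraction step (a real setup would call Boilerpipe there), and tokenize is a hypothetical stand-in for whatever an Analyzer would do with the returned string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

/** Sketch: run a text extractor first, then tokenize its output like any other text. */
class ExtractThenTokenize {

    /**
     * Hypothetical stand-in for the extraction step. It only strips tags so the
     * example runs on the JDK alone; it does NOT detect boilerplate.
     */
    static final Function<String, String> STRIP_TAGS =
            html -> html.replaceAll("<[^>]*>", " ");

    /** Hypothetical stand-in for analysis: lowercase, split on non-letter/non-digit runs. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** The extractor returns a plain string; tokenization does not care where it came from. */
    static List<String> analyze(String html, Function<String, String> extractor) {
        return tokenize(extractor.apply(html));
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Boilerplate removal, made simple.</p></body></html>";
        System.out.println(analyze(html, STRIP_TAGS));
    }
}
```

Because the extractor is passed in as a function, swapping the tag stripper for a real content extractor would leave the tokenizing side unchanged.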

However, to increase recall, it would also be great to store the non-content text as well and
just apply some kind of static boost to content blocks over non-content blocks. I am not
sure whether this is currently possible with an Analyzer. What you could do, though, is store
the text in separate fields ("content"/"boilerplate") and add field-specific boosts at query
time.
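
To make the two-field idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene API; the Doc class, the term-frequency scoring and the boost values (2.0 and 0.5) are all assumptions chosen for illustration. It shows only the effect described above: a match in the "boilerplate" field still retrieves the document, but ranks it below a match in "content".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch: per-field boosts applied at query time over "content" and "boilerplate". */
class FieldBoostSketch {

    /** A document whose text has been split into named fields (hypothetical type). */
    static final class Doc {
        final String id;
        final Map<String, String> fields;

        Doc(String id, Map<String, String> fields) {
            this.id = id;
            this.fields = fields;
        }
    }

    /** Counts whitespace-separated, case-insensitive occurrences of term in text. */
    static int termFrequency(String text, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        int n = 0;
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (t.equals(needle)) {
                n++;
            }
        }
        return n;
    }

    /** Score = sum over boosted fields of (boost for the field) * (term frequency in it). */
    static double score(Doc doc, String term, Map<String, Double> boosts) {
        double s = 0.0;
        for (Map.Entry<String, Double> b : boosts.entrySet()) {
            String text = doc.fields.getOrDefault(b.getKey(), "");
            s += b.getValue() * termFrequency(text, term);
        }
        return s;
    }

    /** Returns the ids of all documents with a positive score, highest score first. */
    static List<String> search(List<Doc> docs, String term, Map<String, Double> boosts) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (score(d, term, boosts) > 0) {
                hits.add(d);
            }
        }
        hits.sort(Comparator.comparingDouble((Doc d) -> score(d, term, boosts)).reversed());
        List<String> ids = new ArrayList<>();
        for (Doc d : hits) {
            ids.add(d.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, String> article = new HashMap<>();
        article.put("content", "lucene indexing guide");
        article.put("boilerplate", "home about contact");

        Map<String, String> navOnly = new HashMap<>();
        navOnly.put("content", "weather report");
        navOnly.put("boilerplate", "lucene home about");

        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("nav-only", navOnly));
        docs.add(new Doc("article", article));

        Map<String, Double> boosts = new HashMap<>();
        boosts.put("content", 2.0);     // assumed boost values, for illustration only
        boosts.put("boilerplate", 0.5);

        System.out.println(search(docs, "lucene", boosts));
    }
}
```

Leaving "boilerplate" out of the boost map reduces this to a content-only search, so the trade-off between recall and precision can be adjusted per query without reindexing.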

Cheers,
Christian

On 04.12.2009 at 22:59, Ted Dunning wrote:

> Nice paper.  I haven't read the software yet, but I would expect it to have
> similar qualities.
> 
> Have you considered how boilerpipe might be integrated into a Lucene
> analyzer?
> 
> 2009/12/4 Christian Kohlschütter <kohlschuetter@l3s.de>
> 
>> Dear all,
>> 
>> I am happy to announce the release of Boilerpipe 1.0.
>> 
>> Boilerpipe is a Java library for boilerplate removal and fulltext
>> extraction from HTML pages.
>> It is based on my paper "Boilerplate Detection using Shallow Text Features"
>> to be presented at WSDM 2010 -- The Third ACM International Conference on
>> Web Search and Data Mining, 3-6 February 2010, New York City, NY USA.
>> 
>> The boilerpipe library provides algorithms to detect and remove the surplus
>> "clutter" (boilerplate, templates) around the main textual content of a
>> website. It already provides specific strategies for common tasks (for
>> example: news article extraction) and may also be easily extended for
>> individual problem settings. Extracting content is very fast (milliseconds),
>> just needs the input document (no global or site-level information required)
>> and is usually quite accurate.
>> 
>> You can find Boilerpipe at http://code.google.com/p/boilerpipe/
>> 
>> The code is released under the Apache 2.0 license, and you are very welcome
>> to use Boilerpipe for whatever you like. Please let me know if it helps
>> you, or if you have questions about it, difficulties with it, or ideas on
>> how to improve it.
>> 
>> Cheers,
>> Christian
>> 
>> PS: The website already provides version 1.0.1 (now includes the dependency
>> jars in the binary tarball)
>> --
>> Christian Kohlschütter
>> kohlschuetter@L3S.de
>> 
>> Forschungszentrum L3S
>> Leibniz Universität Hannover
>> 
>> http://www.L3S.de/~kohlschuetter/
>> 
>> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve

-- 
Christian Kohlschütter
kohlschuetter@L3S.de

L3S Research Center
Forschungszentrum L3S / Leibniz Universität Hannover

http://www.L3S.de/~kohlschuetter



