lucene-general mailing list archives

From Christian Kohlschütter <kohlschuet...@L3S.de>
Subject Announcement: Boilerplate removal library
Date Fri, 04 Dec 2009 20:33:55 GMT
Dear all,

I am happy to announce the release of Boilerpipe 1.0.

Boilerpipe is a Java library for boilerplate removal and fulltext extraction from HTML pages.
It is based on my paper "Boilerplate Detection using Shallow Text Features", to be presented
at WSDM 2010 -- The Third ACM International Conference on Web Search and Data Mining, 3-6
February 2010, New York City, NY, USA.

The Boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate,
templates) around the main textual content of a web page. It already provides specific strategies
for common tasks (for example, news article extraction) and can easily be extended for
individual problem settings. Extraction is very fast (milliseconds), requires only the
input document (no global or site-level information) and is usually quite accurate.
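
To give a rough idea of what "shallow text features" means, here is a toy sketch. It is not
Boilerpipe's code or API, and it is not the classifier from the paper. It assumes a page has
already been split into text blocks, and it keeps a block only if it has enough words and a low
share of linked words. The class name, the Block type and both thresholds are hypothetical,
chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ShallowFeatureSketch {

    /** One text block with the two shallow features used in this sketch. */
    static final class Block {
        final String text;
        final int words;        // total number of whitespace-separated words
        final int linkedWords;  // how many of those words sit inside link text

        Block(String text, int linkedWords) {
            this.text = text;
            String trimmed = text.trim();
            this.words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            this.linkedWords = linkedWords;
        }

        /** Fraction of the block's words that are link text (0.0 for an empty block). */
        double linkDensity() {
            return words == 0 ? 0.0 : (double) linkedWords / words;
        }
    }

    // Hypothetical thresholds, picked for the example only.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** A block counts as content if it is long enough and not dominated by links. */
    static boolean isContent(Block b) {
        return b.words >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY;
    }

    /** Joins the text of all content blocks, one block per line. */
    static String extract(List<Block> blocks) {
        StringBuilder out = new StringBuilder();
        for (Block b : blocks) {
            if (isContent(b)) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(b.text.trim());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Block> page = new ArrayList<>();
        page.add(new Block("Home News Sports Contact", 4));
        page.add(new Block("The city council voted on Tuesday to extend the tram line "
                + "to the northern suburbs by 2012.", 0));
        page.add(new Block("Copyright 2009 Example Media", 0));
        System.out.println(extract(page));
    }
}
```

Each decision in this sketch uses only the block itself, which matches the point above that no
global or site-level information is needed. The strategies in the library itself are more
elaborate than this.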

You can find Boilerpipe at http://code.google.com/p/boilerpipe/

The code is released under the Apache 2.0 license and you are very welcome to use Boilerpipe
for whatever you like. Please let me know if it helps you, or if you have questions about
it, difficulties with it or ideas on how to improve it.

Cheers,
Christian

PS: The website already provides version 1.0.1, which now includes the dependency jars in the
binary tarball.
-- 
Christian Kohlschütter
kohlschuetter@L3S.de

Forschungszentrum L3S
Leibniz Universität Hannover

http://www.L3S.de/~kohlschuetter/

