lucene-general mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Announcement: Boilerplate removal library
Date Fri, 04 Dec 2009 21:59:46 GMT
Nice paper.  I haven't looked at the software yet, but I would expect it to
be of similar quality.

Have you considered how boilerpipe might be integrated into a Lucene
analyzer?
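
One way to picture such an integration is as a pre-processing step that
runs before tokenization. The sketch below is stdlib-only Java and uses
neither the boilerpipe nor the Lucene API. The names PreAnalysisSketch,
keepContentBlocks, tokenize and analyze are hypothetical, and the word-count
threshold is a toy placeholder, not the algorithm from the paper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: drop boilerplate blocks, then tokenize what remains. */
class PreAnalysisSketch {

    /**
     * Toy stand-in for a boilerplate detector: keeps blocks that contain at
     * least minWords whitespace-separated words. This is an assumption made
     * for illustration only, NOT boilerpipe's actual method.
     */
    static List<String> keepContentBlocks(List<String> blocks, int minWords) {
        List<String> kept = new ArrayList<>();
        for (String block : blocks) {
            String trimmed = block.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int words = trimmed.split("\\s+").length;
            if (words >= minWords) {
                kept.add(trimmed);
            }
        }
        return kept;
    }

    /** Toy stand-in for an analyzer: lowercases, splits on non-letter/digit runs. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Pipeline: filter blocks first, then tokenize only the retained text. */
    static List<String> analyze(List<String> blocks, int minWords) {
        return tokenize(String.join(" ", keepContentBlocks(blocks, minWords)));
    }
}
```

In a real setup the two toy methods would presumably be replaced by a
boilerpipe extractor and a Lucene analyzer; the point here is only the
ordering, with extraction happening before any tokens are produced.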

2009/12/4 Christian Kohlschütter <kohlschuetter@l3s.de>

> Dear all,
>
> I am happy to announce the release of Boilerpipe 1.0.
>
> Boilerpipe is a Java library for boilerplate removal and fulltext
> extraction from HTML pages.
> It is based on my paper "Boilerplate Detection using Shallow Text Features"
>  to be presented at WSDM 2010 -- The Third ACM International Conference on
> Web Search and Data Mining, 3-6 February 2010, New York City, NY USA.
>
> The boilerpipe library provides algorithms to detect and remove the surplus
> "clutter" (boilerplate, templates) around the main textual content of a
> website. It already provides specific strategies for common tasks (for
> example: news article extraction) and may also be easily extended for
> individual problem settings. Extraction is very fast (milliseconds),
> needs only the input document (no global or site-level information
> required), and is usually quite accurate.
>
> You can find Boilerpipe at http://code.google.com/p/boilerpipe/
>
> The code is released under the Apache 2.0 license and you are very welcome
> to use Boilerpipe for whatever you like. Please let me know if it helps
> you, or if you have questions about it, difficulties with it, or ideas on
> how to improve it.
>
> Cheers,
> Christian
>
> PS: The website already provides version 1.0.1 (now includes the dependency
> jars in the binary tarball)
> --
> Christian Kohlschütter
> kohlschuetter@L3S.de
>
> Forschungszentrum L3S
> Leibniz Universität Hannover
>
> http://www.L3S.de/~kohlschuetter/
>
>


-- 
Ted Dunning, CTO
DeepDyve
