mahout-dev mailing list archives

From Benson Margulies <bimargul...@gmail.com>
Subject Re: Lib to pull text from web pages?
Date Thu, 11 Nov 2010 21:51:42 GMT
Tika has boilerpipe, which is not bad for web pages in general. I have
a port of readability, which is better than boilerpipe for news
articles in particular. It seems to me that I should investigate whether
Tika has room for both.

On Thu, Nov 11, 2010 at 4:04 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> I believe that this is included in Tika now (according to Ken Krugler)
>
> On Thu, Nov 11, 2010 at 12:37 PM, Isabel Drost <isabel@apache.org> wrote:
>
>> ...
>>
>> As a side note - a project with similar goals was mentioned on the Lucene
>> mailing lists a while ago: http://code.google.com/p/boilerpipe/
>>
>> Cheers,
>> Isabel
>>
>
