lucene-general mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Announcement: Boilerplate removal library
Date Mon, 14 Dec 2009 22:52:01 GMT

: working with such a setup for a long time now). Integrating it into an 
: Analyzer should be fairly simple as Boilerpipe can return a string which 
: in turn can be parsed just any other text.

Treating the boilerplate removal library as a black-box String->String 
transformation seems fairly trivial, and could easily be done by 
Java applications prior to constructing an Analyzer (i.e.: 
String->[boilerblackbox]->String->[Analyzer]->TokenStream).
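
To make that pipeline concrete, here is a minimal sketch in plain Java. 
Everything in it is a hypothetical stand-in: BOILER_BLACK_BOX is not the 
Boilerpipe API (it just drops lines with fewer than four words), and 
analyze() is not a Lucene Analyzer (it just lowercases and splits on 
non-alphanumerics). The only point is the shape of the composition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

final class PreAnalysisPipeline {

    // Hypothetical stand-in for the boilerplate remover (NOT Boilerpipe's API):
    // keeps only lines that contain at least four whitespace-separated words.
    static final UnaryOperator<String> BOILER_BLACK_BOX = text -> {
        StringBuilder kept = new StringBuilder();
        for (String line : text.split("\n")) {
            if (line.trim().split("\\s+").length >= 4) {
                kept.append(line).append('\n');
            }
        }
        return kept.toString();
    };

    // Hypothetical stand-in for an Analyzer producing a TokenStream:
    // lowercases the text and splits it on runs of non-letter, non-digit characters.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // String -> [boilerblackbox] -> String -> [Analyzer] -> tokens
    static List<String> index(String raw) {
        return analyze(BOILER_BLACK_BOX.apply(raw));
    }
}
```

Note that the analysis step only ever sees the cleaned string, so any 
offsets it reports are relative to that string and not to the original 
document.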

Where things would probably get more complicated is trying to maintain 
term position information from the original source text (for 
things like search result highlighting and whatnot), which would probably 
require doing the boilerplate removal via something like the CharFilter 
abstraction (or directly in a tokenizer).
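
As a rough illustration of the bookkeeping involved, the sketch below 
removes a set of character spans from a text and records cumulative 
offset corrections, so that an offset in the filtered text can be mapped 
back to the original. It is plain Java and only in the spirit of what a 
CharFilter does; the class name, the removedSpans input and the method 
names are all assumptions made for this example, and it is not Lucene's 
implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetMappingSketch {

    private final StringBuilder filtered = new StringBuilder();

    // Each entry is {offsetInFilteredText, cumulativeCharsRemovedBeforeIt}.
    private final List<int[]> corrections = new ArrayList<>();

    // removedSpans: sorted, non-overlapping [start, end) ranges of the
    // original text that were judged to be boilerplate (hypothetical input).
    OffsetMappingSketch(String original, int[][] removedSpans) {
        int pos = 0;
        int removed = 0;
        for (int[] span : removedSpans) {
            filtered.append(original, pos, span[0]); // keep text before the span
            removed += span[1] - span[0];            // skip the span itself
            corrections.add(new int[] {filtered.length(), removed});
            pos = span[1];
        }
        filtered.append(original, pos, original.length());
    }

    String filteredText() {
        return filtered.toString();
    }

    // Maps an offset in the filtered text back to an offset in the original
    // by adding the number of characters removed at or before that point.
    int correctOffset(int filteredOffset) {
        int diff = 0;
        for (int[] c : corrections) {
            if (c[0] > filteredOffset) {
                break;
            }
            diff = c[1];
        }
        return filteredOffset + diff;
    }
}
```

For example, with the original text "MENU Lucene rocks FOOTER" and removed 
spans {0, 5} and {17, 24}, the filtered text is "Lucene rocks", and the 
start offset of "rocks" (7 in the filtered text) maps back to 12 in the 
original. An offset that falls exactly on a removal boundary maps to the 
far side of the removed span in this sketch, so a real implementation 
would need to decide how token end offsets should be treated.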

Does the code as currently implemented maintain position 
mapping information?


-Hoss

