lucene-dev mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Sentence detection/extraction as Tokenizer?
Date Sat, 28 Nov 2009 12:30:04 GMT
Hi Otis

I've implemented sentence detection as part of my tokenizer. It does not
extract sentences, but "detects" EOS (based on several characters from the
Unicode spec). Upon detection, it returns a Token of EOS type. I then have an
EOS Filter which can be configured with the appropriate behavior for handling
that token: for example, setting posIncr to 100 on the next token, so that
phrase/fuzzy searches do not find matches across sentences. There are other
uses as well, such as highlighting.
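
The flow described above can be sketched in plain Java as follows. This is an
illustration only: it is not the code discussed in this message and it does not
use the Lucene API. The names `Tok`, `tokenize` and `eosFilter` are
hypothetical, and the EOS character set is reduced to '.', '!' and '?' for
brevity, which is an assumption rather than the set taken from the Unicode spec.

```java
import java.util.ArrayList;
import java.util.List;

class EosSketch {
    /** Hypothetical token: text, type ("WORD" or "EOS"), position increment. */
    static final class Tok {
        final String text;
        final String type;
        int posIncr = 1; // default: each token advances the position by one
        Tok(String text, String type) { this.text = text; this.type = type; }
    }

    /** Naive tokenizer: letter/digit runs become WORD tokens; '.', '!', '?' become EOS tokens. */
    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            // Any other character ends the current word, if there is one.
            if (word.length() > 0) {
                out.add(new Tok(word.toString(), "WORD"));
                word.setLength(0);
            }
            // Assumed EOS characters; a real detector would use a richer set.
            if (c == '.' || c == '!' || c == '?') {
                out.add(new Tok(String.valueOf(c), "EOS"));
            }
        }
        if (word.length() > 0) {
            out.add(new Tok(word.toString(), "WORD"));
        }
        return out;
    }

    /** Drops EOS tokens and sets posIncr to 'gap' on the first token after each one. */
    static List<Tok> eosFilter(List<Tok> in, int gap) {
        List<Tok> out = new ArrayList<>();
        boolean pendingGap = false;
        for (Tok t : in) {
            if (t.type.equals("EOS")) {
                pendingGap = true; // remember the sentence boundary, emit nothing
                continue;
            }
            if (pendingGap) {
                t.posIncr = gap;
                pendingGap = false;
            }
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        int pos = 0;
        for (Tok t : eosFilter(tokenize("Hello world. Next one"), 100)) {
            pos += t.posIncr;
            System.out.println(pos + "\t" + t.text);
        }
    }
}
```

With a gap of 100, "world" and "Next" end up 100 positions apart, so a phrase
query with a small slop cannot match across the sentence boundary. A different
filter configuration could keep the EOS token instead, for instance to help a
highlighter find sentence edges.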

So I would, personally, not think of EOS detection as a Tokenizer in and of
itself, but rather as a capability of a Tokenizer (Standard?).

Shai

On Fri, Nov 27, 2009 at 8:07 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Hello,
>
> The contrib/wordnet package contains an AnalyzerUtil class with a method
> that extracts sentences from text/String.  It is super-simplistic, so
> probably not very accurate, but I am wondering if *conceptually* it would
> make sense to have a Tokenizer that extracts sentences?  I suppose that
> means each Token would be a complete sentence.
>
> Would you say it makes sense to implement sentence detection/extraction as
> a Tokenizer?
>
> Thanks,
> Otis
