lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2498) add sentence boundary charfilter
Date Sun, 13 Jun 2010 23:29:14 GMT


Robert Muir commented on LUCENE-2498:

bq. I wonder if it would be possible/make sense to make this a tokenizer instead of a charfilter:
one token per sentence. Then token production would be in a filter stage.

Well, maybe we should reword the issue: it doesn't have to be a charfilter, or even use a special
string to mark the sentence boundaries. But I thought that, as a charfilter, it would let you
use your own tokenizer, such as StandardTokenizer, along with sentence boundaries.

bq. FWIW, I've implemented sentence boundary support by returning a special token (Type.EOF)
from the Tokenizer and created a EOSFilter which increments the posIncr attribute (set it
to 100).

This sounds similar to what we are proposing here... did you integrate this into your tokenizer?
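
For illustration, here is a minimal sketch of the position-increment idea described in the quote above. It is plain Java and deliberately does not use Lucene's TokenStream API; the class name, the "<EOS>" marker string and the gap of 100 are assumptions made for the example, not the commenter's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an end-of-sentence filter: the marker token is
// dropped, and the token that follows it gets a position increment of GAP
// instead of the usual 1.
final class PositionGapSketch {
    static final String EOS = "<EOS>"; // assumed marker, not a Lucene constant
    static final int GAP = 100;        // the "set it to 100" from the quote

    // Returns the absolute position of every token that survives the filter.
    static List<Integer> positions(List<String> tokens) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        int increment = 1;
        for (String token : tokens) {
            if (token.equals(EOS)) {
                increment = GAP; // applied to the next real token
                continue;        // the marker itself is not emitted
            }
            pos += increment;
            out.add(pos);
            increment = 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(positions(
                List.of("see", "spot", "run", EOS, "john", "shook")));
    }
}
```

With this input, "run" and "john" end up 100 positions apart, so a phrase query with a small slop would no longer match across the sentence boundary.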

bq. I've used the following to
detect EOS + \u0085 (Next Line (NEL). I'm sure though that there are other markers as well.

Well, there is nothing wrong with this approach. The advantage of using the Unicode segmentation
standard for sentences is that it handles corner cases better, since it
has a grammar.

Some examples quoted directly from the spec:

Rules SB6-8 are designed to forbid breaks within strings such as

|... the resp. leaders are ...|
|... etc.)' '(the ...|

They permit breaks in strings such as

|She said "See spot run."|John shook his head. ...|
|... etc.|它们指...|

They cannot detect cases such as "...Mr. Jones..."; more sophisticated tailoring would be
required to detect such cases.
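
To experiment with rule-based sentence breaking, one option is the JDK's java.text.BreakIterator. The sketch below uses it only as a stand-in: its rules are not guaranteed to match UAX#29 or the JFlex grammar proposed in this issue, and the class name is made up for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Splits text into sentences with the JDK's rule-based sentence iterator.
// This is an approximation for experimenting, not the proposed charfilter.
final class SentenceSplitSketch {
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Each segment includes trailing whitespace, so trim it.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("One fish. Two fish.")) {
            System.out.println("|" + s + "|");
        }
    }
}
```

Feeding it strings like the ones quoted from the spec is a quick way to see where a default grammar helps and where tailoring would still be needed.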

> add sentence boundary charfilter
> --------------------------------
>                 Key: LUCENE-2498
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Robert Muir
> From the discussion of LUCENE-2167:
> It would be nice to have a CharFilter to mark sentence boundaries.
> Such functionality would be useful for:
> * prevent phrase queries with 0 slop from matching across sentences
> * inhibiting multiword synonyms, or shingles, etc.
> For sentence boundary detection we could use JFlex's support for the Unicode Sentence_Break
> property etc,
> and the UAX#29 definition as a default grammar.
> One idea is to just mark the boundaries with a user-provided String.
> As a simple use-case, a user could then add this string to a stopfilter, and it would
> introduce a position increment.
> This would inhibit phrase queries, etc.
> A user could also use the sentence-markers to do more advanced processing downstream.
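
A rough sketch of the marker-plus-stopfilter idea from the issue description follows. It is plain Java rather than Lucene's CharFilter or StopFilter API; the class name, the "SENTBREAK" marker and the whitespace tokenization are all assumptions chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simulates: (1) writing a user-provided marker between sentences, and
// (2) a stop filter that removes the marker but leaves a hole in the positions.
final class SentenceMarkerSketch {
    // Joins already-split sentences with the marker between them.
    static String mark(List<String> sentences, String marker) {
        return String.join(" " + marker + " ", sentences);
    }

    // Whitespace-tokenizes the text and returns the positions of the tokens
    // that are kept. A removed stop word still consumes a position.
    static List<Integer> positionsAfterStop(String text, Set<String> stopWords) {
        List<Integer> positions = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return positions;
        }
        int pos = 0;
        for (String token : trimmed.split("\\s+")) {
            if (!stopWords.contains(token)) {
                positions.add(pos);
            }
            pos++;
        }
        return positions;
    }

    public static void main(String[] args) {
        String marker = "SENTBREAK"; // hypothetical user-provided string
        String marked = mark(List.of("see spot run", "john shook"), marker);
        System.out.println(marked);
        System.out.println(positionsAfterStop(marked, Set.of(marker)));
    }
}
```

Here "run" and "john" are no longer at adjacent positions once the marker is removed, which is what would keep a 0-slop phrase query from matching across the boundary.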

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
