lucene-general mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Developing experimental "more advanced" analyzers
Date Mon, 29 May 2017 17:02:55 GMT
On Mon, May 29, 2017 at 8:36 AM, Christian Becker
<christian.freisen@gmail.com> wrote:
> Hi There,
>
> I'm new to Lucene (in fact I'm interested in Elasticsearch, but in this case
> it's related to Lucene) and I want to run some experiments with some
> enhanced analyzers.
>
> Indeed I have an external linguistic component which I want to connect to
> Lucene / Elasticsearch. So before I produce a bunch of useless code, I
> want to make sure that I'm going the right way.
>
> The linguistic component needs at least a whole sentence as Input (at best
> it would be the whole text at once).
>
> So as far as I can see, I would need to create a custom Analyzer and
> override "createComponents" and "normalize".
>

There is a base class for tokenizers that want to see a whole sentence at
a time in order to divide it into words:

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/SegmentingTokenizerBase.java#L197-L201

There are two examples that use it in the test class:

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestSegmentingTokenizerBase.java#L145
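To make the idea concrete, here is a rough sketch of how such a subclass
might look. SegmentingTokenizerBase hands you one sentence at a time via
setNextSentence(), and you emit one token per incrementWord() call. The
`externalComponent` object, its analyze() method, and the Word type are
hypothetical stand-ins for your linguistic component, not Lucene API:

```java
import java.text.BreakIterator;
import java.util.Iterator;
import java.util.Locale;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.SegmentingTokenizerBase;

// Sketch only: wires a sentence-level external analyzer into Lucene.
public class LinguisticTokenizer extends SegmentingTokenizerBase {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  private Iterator<Word> words;   // Word is a hypothetical result type
  private int sentenceStart;

  public LinguisticTokenizer() {
    // The base class uses a BreakIterator to find sentence boundaries,
    // then calls setNextSentence() once per sentence.
    super(BreakIterator.getSentenceInstance(Locale.ROOT));
  }

  @Override
  protected void setNextSentence(int sentenceStart, int sentenceEnd) {
    this.sentenceStart = sentenceStart;
    // `buffer` is the protected char[] provided by the base class.
    String sentence = new String(buffer, sentenceStart, sentenceEnd - sentenceStart);
    // Hypothetical call into your external linguistic component.
    words = externalComponent.analyze(sentence).iterator();
  }

  @Override
  protected boolean incrementWord() {
    if (words == null || !words.hasNext()) {
      return false; // no more words in this sentence; base class advances
    }
    Word w = words.next();
    clearAttributes();
    termAtt.setEmpty().append(w.text());
    // `offset` is the base class's offset of the buffer within the stream;
    // word offsets from the component are assumed sentence-relative here.
    offsetAtt.setOffset(
        correctOffset(offset + sentenceStart + w.start()),
        correctOffset(offset + sentenceStart + w.end()));
    return true;
  }
}
```

The two tokenizers in TestSegmentingTokenizerBase linked above follow the
same pattern, so they are the best reference for the exact contract of
setNextSentence() and incrementWord().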
