lucene-general mailing list archives

From Christian Becker <christian.frei...@gmail.com>
Subject Re: Developing experimental "more advanced" analyzers
Date Mon, 29 May 2017 18:42:19 GMT
Sorry, I forgot to mention that my goal is to produce linguistic
annotations such as stems and possibly part-of-speech information.
Tokenization is certainly one of the things I want to do as well.

2017-05-29 19:02 GMT+02:00 Robert Muir <rcmuir@gmail.com>:

> On Mon, May 29, 2017 at 8:36 AM, Christian Becker
> <christian.freisen@gmail.com> wrote:
> > Hi There,
> >
> > I'm new to Lucene (in fact I'm interested in Elasticsearch, but in this
> > case it's related to Lucene) and I want to run some experiments with
> > enhanced analyzers.
> >
> > I have an external linguistic component which I want to connect to
> > Lucene / Elasticsearch. So before I produce a bunch of useless code, I
> > want to make sure that I'm going the right way.
> >
> > The linguistic component needs at least a whole sentence as input (at
> > best it would be the whole text at once).
> >
> > So as far as I can see I would need to create a custom Analyzer and
> > override "createComponents" and "normalize".
> >
>
> There is a base class for tokenizers that want to see
> sentences-at-a-time in order to divide into words:
>
> https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/SegmentingTokenizerBase.java#L197-L201
>
> There are two examples that use it in the test class:
>
> https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestSegmentingTokenizerBase.java#L145
>
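
To make the sentence-at-a-time idea concrete, here is a minimal sketch that
uses only java.text.BreakIterator from the JDK. It is not
SegmentingTokenizerBase itself and has no Lucene dependency; it only shows
the general pattern of splitting text into sentences and handing each whole
sentence to a component that emits tokens. The class name
SentenceAtATimeSketch, the SentenceProcessor interface (a stand-in for the
external linguistic component), and the helper names are all hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceAtATimeSketch {

    /** Hypothetical stand-in for the external linguistic component. */
    interface SentenceProcessor {
        List<String> process(String sentence);
    }

    /** Splits text into trimmed sentences using the JDK sentence BreakIterator. */
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    /** Placeholder "linguistic" step: plain word splitting, punctuation dropped. */
    static List<String> words(String sentence) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(sentence);
        List<String> words = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = sentence.substring(start, end);
            // Keep only pieces that start with a letter or digit.
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                words.add(piece);
            }
        }
        return words;
    }

    /** Feeds each complete sentence to the processor and collects its tokens in order. */
    static List<String> analyze(String text, SentenceProcessor processor) {
        List<String> tokens = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            tokens.addAll(processor.process(sentence));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Lucene indexes text. Analyzers produce tokens.";
        for (String token : analyze(text, SentenceAtATimeSketch::words)) {
            System.out.println(token);
        }
    }
}
```

In a real integration the words method would be replaced by a call into the
linguistic component, and the per-sentence loop would live inside a
tokenizer rather than a static helper.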
