lucene-general mailing list archives

From "Uwe Schindler" <>
Subject RE: Developing experimental "more advanced" analyzers
Date Tue, 30 May 2017 08:23:10 GMT

As you are using Elasticsearch, there is no need to implement an Analyzer subclass. In general,
this is almost never needed in plain Lucene either, because the CustomAnalyzer class uses a
builder pattern to construct an analyzer, much like Elasticsearch and Solr do.
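As a minimal sketch of that builder pattern (assuming lucene-core and lucene-analysis-common on the classpath; the tokenizer and filter names are just examples of the registered SPI names):

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CustomAnalyzerDemo {
  public static void main(String[] args) throws IOException {
    // Build an analysis chain from named factories -- the same
    // mechanism Solr and Elasticsearch use for their analysis config.
    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .build();

    // Run the chain over some text and print the resulting terms.
    try (TokenStream ts = analyzer.tokenStream("body", "Developing Advanced Analyzers")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}
```

The string names ("standard", "lowercase") are resolved through Lucene's SPI registry, which is why custom components need factory classes as described below.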

For your use case you need to implement a custom Tokenizer and/or several TokenFilters. In
addition, you need to create the corresponding factory classes and bundle everything as an
Elasticsearch plugin. I'd suggest asking on the Elasticsearch mailing lists about that part. After
that you can define your analyzer in the Elasticsearch mapping/index configuration.
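A factory class on the Lucene side might look roughly like the sketch below (assuming Lucene 9.x, where TokenFilterFactory lives in org.apache.lucene.analysis; the class name and the "myLinguistic" SPI name are placeholders, and the body simply delegates to LowerCaseFilter instead of a real custom filter):

```java
import java.util.Map;

import org.apache.lucene.analysis.TokenFilterFactory;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;

// Hypothetical factory; a real plugin would register this (or an
// Elasticsearch-side provider) so the filter can be referenced by
// name from the index settings.
public class MyLinguisticFilterFactory extends TokenFilterFactory {
  public static final String NAME = "myLinguistic";

  public MyLinguisticFilterFactory(Map<String, String> args) {
    super(args); // consumes common args such as luceneMatchVersion
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public TokenStream create(TokenStream input) {
    // Placeholder: a real implementation would return the custom
    // filter that talks to the external linguistic component.
    return new LowerCaseFilter(input);
  }
}
```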

The Tokenizer and TokenFilters can be implemented, e.g., in the way Robert Muir described.
The sentence handling can be done as a subclass of the segmenting tokenizer base class. Keep
in mind that many tasks can already be done with existing TokenFilters and/or Tokenizers.
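The base class in question is SegmentingTokenizerBase from lucene-analysis-common. A minimal sketch that emits each whole sentence as a single token (so a later stage can hand complete sentences to the linguistic component) could look like this; the class name and Locale.ROOT are assumptions:

```java
import java.text.BreakIterator;
import java.util.Locale;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.SegmentingTokenizerBase;

public final class SentenceTokenizer extends SegmentingTokenizerBase {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private int sentenceStart, sentenceEnd;
  private boolean hasSentence;

  public SentenceTokenizer() {
    // The base class drives a BreakIterator over a buffered window of text.
    super(BreakIterator.getSentenceInstance(Locale.ROOT));
  }

  @Override
  protected void setNextSentence(int sentenceStart, int sentenceEnd) {
    this.sentenceStart = sentenceStart;
    this.sentenceEnd = sentenceEnd;
    this.hasSentence = true;
  }

  @Override
  protected boolean incrementWord() {
    // Emit the current sentence exactly once as a single token.
    if (!hasSentence) {
      return false;
    }
    hasSentence = false;
    clearAttributes();
    termAtt.copyBuffer(buffer, sentenceStart, sentenceEnd - sentenceStart);
    offsetAtt.setOffset(correctOffset(offset + sentenceStart),
                        correctOffset(offset + sentenceEnd));
    return true;
  }
}
```

A real implementation would typically split each sentence into words inside incrementWord() (as Lucene's ThaiTokenizer does) rather than emitting the sentence whole.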

Lucene has no index support for POS tags; they are only used in the analysis chain. To get
them into the index anyway, you can use a TokenFilter as the last stage that appends the POS
tag to the term (e.g., the term "Windmill" with POS "subject" could be combined by the last
TokenFilter into the term "Windmill#subject" and indexed like that). To keep track of POS tags
during analysis (between the tokenizers and token filters), you may need to define custom attributes.
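Both ideas can be sketched together as follows. The attribute and filter names are hypothetical, and the stub tagger (which labels every token "subject") stands in for the external linguistic component; only the attribute mechanics (the interface/Impl naming convention) are standard Lucene:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeReflector;

// Hypothetical custom attribute carrying a POS tag between stages.
interface PartOfSpeechAttribute extends Attribute {
  void setPartOfSpeech(String pos);
  String getPartOfSpeech();
}

// Lucene locates the implementation via the "<Interface>Impl" naming convention.
class PartOfSpeechAttributeImpl extends AttributeImpl implements PartOfSpeechAttribute {
  private String pos;
  public void setPartOfSpeech(String pos) { this.pos = pos; }
  public String getPartOfSpeech() { return pos; }
  @Override public void clear() { pos = null; }
  @Override public void copyTo(AttributeImpl target) {
    ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
  }
  @Override public void reflectWith(AttributeReflector reflector) {
    reflector.reflect(PartOfSpeechAttribute.class, "partOfSpeech", pos);
  }
}

// Stub tagger: a real implementation would call the external linguistic
// component; here every token is simply tagged "subject".
final class StubTaggerFilter extends TokenFilter {
  private final PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);
  StubTaggerFilter(TokenStream input) { super(input); }
  @Override public boolean incrementToken() throws java.io.IOException {
    if (!input.incrementToken()) return false;
    posAtt.setPartOfSpeech("subject");
    return true;
  }
}

// Last filter in the chain: folds the POS tag into the term itself,
// producing e.g. "Windmill#subject", so it ends up in the index.
final class PosToTermFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);
  PosToTermFilter(TokenStream input) { super(input); }
  @Override public boolean incrementToken() throws java.io.IOException {
    if (!input.incrementToken()) return false;
    String pos = posAtt.getPartOfSpeech();
    if (pos != null) {
      termAtt.append('#').append(pos);
    }
    return true;
  }
}

public class PosAttributeDemo {
  public static void main(String[] args) throws Exception {
    WhitespaceTokenizer tok = new WhitespaceTokenizer();
    tok.setReader(new StringReader("Windmill"));
    try (TokenStream ts = new PosToTermFilter(new StubTaggerFilter(tok))) {
      CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}
```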

Check the UIMA analysis module for more information on how to do this.


Uwe Schindler
Achterdiek 19, D-28357 Bremen

> -----Original Message-----
> From: Christian Becker []
> Sent: Monday, May 29, 2017 2:37 PM
> To:
> Subject: Developing experimental "more advanced" analyzers
> Hi There,
> I'm new to Lucene (in fact I'm interested in Elasticsearch, but in this case
> it's related to Lucene) and I want to make some experiments with some
> enhanced analyzers.
> In fact I have an external linguistic component which I want to connect to
> Lucene / Elasticsearch. So before I produce a bunch of useless code, I
> want to make sure that I'm going the right way.
> The linguistic component needs at least a whole sentence as Input (at best
> it would be the whole text at once).
> So as far as I can see, I would need to create a custom Analyzer and
> override "createComponents" and "normalize".
> Is that correct or am I on the wrong track?
> Bests
> Chris
