lucene-dev mailing list archives

From "Olivier Favre (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3392) Combining analyzers output
Date Tue, 23 Aug 2011 12:17:29 GMT


Olivier Favre commented on LUCENE-3392:

The proposed implementation may be tightly coupled to the JVM's implementation of some classes
(StringReader, BufferedReader and FilterReader), as it relies on named private fields (respectively
"str", "in" and "in").
This can be avoided, but any Reader would then have to be fully read and stored as a String or a
char[], which can carry a large memory overhead.
Assuming the clones are read at roughly the same speed (true for word-delimiting analysis, but
not for a KeywordAnalyzer), an implementation could keep in memory only the portion that has been
read by at least one cloned reader but not yet by all of them, in order to implement a "multi read
head" reader.
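
To make the "multi read head" idea concrete, here is a minimal sketch of how such a reader might look. It is not taken from the attached patch: the names MultiHeadSource, newHead and bufferedChars are hypothetical, the sketch is not thread-safe, and it assumes all heads are created before any of them is read.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one source Reader shared by several read heads. */
final class MultiHeadSource {
    private final Reader source;
    // Characters read by at least one head but not yet by all of them.
    private final StringBuilder window = new StringBuilder();
    private long windowStart = 0;      // absolute offset of window.charAt(0)
    private boolean exhausted = false; // true once the source returned -1
    private final List<Head> heads = new ArrayList<>();

    MultiHeadSource(Reader source) { this.source = source; }

    /** Creates a new read head; assumed to be called before any reading starts. */
    Reader newHead() {
        if (windowStart != 0 || window.length() != 0 || exhausted) {
            throw new IllegalStateException("heads must be created before reading");
        }
        Head head = new Head();
        heads.add(head);
        return head;
    }

    /** Number of characters currently retained in memory. */
    int bufferedChars() { return window.length(); }

    /** Drops the prefix of the window that every remaining head has consumed. */
    private void trim() {
        if (heads.isEmpty()) return;
        long min = Long.MAX_VALUE;
        for (Head h : heads) min = Math.min(min, h.pos);
        int drop = (int) (min - windowStart);
        if (drop > 0) {
            window.delete(0, drop);
            windowStart = min;
        }
    }

    private final class Head extends Reader {
        private long pos = 0; // absolute offset of the next char for this head

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            if (len == 0) return 0;
            long windowEnd = windowStart + window.length();
            if (pos == windowEnd && !exhausted) {
                // This head is the furthest ahead: pull more from the source.
                char[] tmp = new char[len];
                int n = source.read(tmp, 0, len);
                if (n < 0) exhausted = true;
                else window.append(tmp, 0, n);
                windowEnd = windowStart + window.length();
            }
            if (pos == windowEnd) return exhausted ? -1 : 0;
            int n = (int) Math.min(len, windowEnd - pos);
            int from = (int) (pos - windowStart);
            window.getChars(from, from + n, cbuf, off);
            pos += n;
            trim();
            return n;
        }

        @Override
        public void close() throws IOException {
            // A closed head no longer holds back the window.
            heads.remove(this);
            if (heads.isEmpty()) source.close();
            else trim();
        }
    }
}
```

With this sketch the memory held at any time is the distance between the slowest and the fastest head, which is why a KeywordAnalyzer (which reads everything at once) would defeat it.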

Another approach would be to change the API so that the tokenStream and reusableTokenStream
functions take a CloneableReader interface with a "giveAClone()" function instead of a Reader.
But this involves massive refactoring (>13,000 lines) and introduces a significant API break.
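
For illustration only, such an interface could be as small as the sketch below. CloneableReader and giveAClone() are the names suggested above; StringCloneableReader is a hypothetical String-backed implementation added for the example, and none of these types exist in Lucene.

```java
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical interface, following the naming suggested in this comment. */
interface CloneableReader {
    /** Returns a fresh, independent Reader positioned at the start of the content. */
    Reader giveAClone();
}

/** Hypothetical implementation backed by an in-memory String. */
final class StringCloneableReader implements CloneableReader {
    private final String text;

    StringCloneableReader(String text) { this.text = text; }

    @Override
    public Reader giveAClone() {
        // Each clone has its own read position over the shared String.
        return new StringReader(text);
    }
}
```

Each sub-analyzer would then call giveAClone() to obtain its own Reader, at the cost of changing every method that currently accepts a plain Reader.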

The proposed implementation is the best solution I found.
Any suggestions are welcome!

> Combining analyzers output
> --------------------------
>                 Key: LUCENE-3392
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Olivier Favre
>            Priority: Minor
>              Labels: analysis
>             Fix For: 3.4
>         Attachments: ComboAnalyzer-lucene3x.patch
>   Original Estimate: 48h
>  Remaining Estimate: 48h
> It should be easy to combine the output of multiple Analyzers, or TokenStreams.
> A ComboAnalyzer and a ComboTokenStream class would take multiple instances, and multiplex
> their output, keeping a rough order of tokens like increasing position then increasing start
> offset then increasing end offset.
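
The ordering described in the issue (increasing position, then start offset, then end offset) can be sketched as a comparator. This is a simplified in-memory illustration, not the patch's ComboTokenStream: SimpleToken and TokenMerge are hypothetical names, and a real implementation would presumably multiplex the streams lazily rather than collect and sort them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a token with an absolute position and offsets. */
final class SimpleToken {
    final String term;
    final int position;
    final int startOffset;
    final int endOffset;

    SimpleToken(String term, int position, int startOffset, int endOffset) {
        this.term = term;
        this.position = position;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

final class TokenMerge {
    /** Increasing position, then increasing start offset, then increasing end offset. */
    static final Comparator<SimpleToken> ORDER =
        Comparator.comparingInt((SimpleToken t) -> t.position)
                  .thenComparingInt(t -> t.startOffset)
                  .thenComparingInt(t -> t.endOffset);

    /** Collects the tokens of all streams and sorts them with ORDER. */
    @SafeVarargs
    static List<SimpleToken> merge(List<SimpleToken>... streams) {
        List<SimpleToken> all = new ArrayList<>();
        for (List<SimpleToken> stream : streams) all.addAll(stream);
        all.sort(ORDER); // List.sort is stable, so ties keep their stream order
        return all;
    }
}
```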
