lucene-dev mailing list archives

From "Christian Moen (JIRA)" <>
Subject [jira] [Commented] (LUCENE-6216) Make it easier to modify Japanese token attributes downstream
Date Tue, 03 Feb 2015 23:57:36 GMT


Christian Moen commented on LUCENE-6216:

Thanks, Robert.

I had the same idea and I tried this out last night.  The advantage of the approach is that
we only read the buffer data for the token attributes we use, but it leaves the API slightly
awkward in my opinion since we would have both a {{setToken()}} and a {{setPartOfSpeech()}}.
 That said, this is still perhaps the best way to go, both for performance reasons and because
these APIs are very low-level and not commonly used.
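
A minimal sketch of what an attribute with both setters might look like. All class names here are hypothetical stand-ins, not the real Lucene types, and the {{reads}} counter exists only to make the lazy buffer access visible:

```java
// Hypothetical stand-in for org.apache.lucene.analysis.ja.Token; not the real class.
final class SketchToken {
    private final String partOfSpeech;
    int reads = 0; // counts simulated dictionary/buffer reads

    SketchToken(String partOfSpeech) {
        this.partOfSpeech = partOfSpeech;
    }

    String getPartOfSpeech() {
        reads++; // in the real analyzer this is where the buffer would be read
        return partOfSpeech;
    }
}

// Hypothetical attribute that offers both setToken() and setPartOfSpeech().
final class SketchPartOfSpeechAttribute {
    private SketchToken token;
    private String explicitValue;
    private boolean hasExplicitValue;

    // Called by the tokenizer: keeps lazy retrieval, clears any explicit value.
    void setToken(SketchToken token) {
        this.token = token;
        this.explicitValue = null;
        this.hasExplicitValue = false;
    }

    // Called by a downstream filter: the explicit value wins over the token.
    void setPartOfSpeech(String partOfSpeech) {
        this.explicitValue = partOfSpeech;
        this.hasExplicitValue = true;
    }

    String getPartOfSpeech() {
        if (hasExplicitValue) {
            return explicitValue;
        }
        return token == null ? null : token.getPartOfSpeech();
    }
}
```

With this shape, the token data is only read when {{getPartOfSpeech()}} is called and no downstream filter has set a value explicitly.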

For the sake of exploring an alternative idea, a different approach could be to have separate
token filters set these attributes.  The tokenizer would set a {{CharTermAttribute}}, etc.
and a {{JapaneseTokenAttribute}} (or something suitably named) that holds the {{Token}}. 
A separate {{JapanesePartOfSpeechFilter}} would be responsible for setting the {{PartOfSpeechAttribute}}
by getting the data from the {{JapaneseTokenAttribute}} using a {{getToken()}} method. We'd
still need logic similar to the above to deal with {{setPartOfSpeech()}}, etc. so I don't
think we gain anything by taking this approach, and it's a big change, too.
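
For comparison, the alternative could be sketched roughly as follows. Again, every class here is a hypothetical stand-in (including the holder attribute and the filter step), and the real Lucene {{TokenFilter}} plumbing is left out:

```java
// Hypothetical stand-in for the dictionary-backed Token.
final class DictToken {
    private final String partOfSpeech;

    DictToken(String partOfSpeech) {
        this.partOfSpeech = partOfSpeech;
    }

    String getPartOfSpeech() {
        return partOfSpeech;
    }
}

// Hypothetical JapaneseTokenAttribute: only holds the Token set by the tokenizer.
final class TokenHolderAttribute {
    private DictToken token;

    void setToken(DictToken token) {
        this.token = token;
    }

    DictToken getToken() {
        return token;
    }
}

// Hypothetical plain-value part-of-speech attribute with a direct setter.
final class PlainPartOfSpeechAttribute {
    private String partOfSpeech;

    void setPartOfSpeech(String partOfSpeech) {
        this.partOfSpeech = partOfSpeech;
    }

    String getPartOfSpeech() {
        return partOfSpeech;
    }
}

// Hypothetical stand-in for the per-token work of a JapanesePartOfSpeechFilter.
final class PartOfSpeechFilterSketch {
    private final TokenHolderAttribute tokenAtt;
    private final PlainPartOfSpeechAttribute posAtt;

    PartOfSpeechFilterSketch(TokenHolderAttribute tokenAtt, PlainPartOfSpeechAttribute posAtt) {
        this.tokenAtt = tokenAtt;
        this.posAtt = posAtt;
    }

    // Copies the part of speech from the held Token into the plain attribute.
    void processCurrentToken() {
        DictToken token = tokenAtt.getToken();
        posAtt.setPartOfSpeech(token == null ? null : token.getPartOfSpeech());
    }
}
```

Note that the filter reads the token data eagerly for every token it sees, so the sketch does not by itself preserve lazy retrieval.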

> Make it easier to modify Japanese token attributes downstream
> -------------------------------------------------------------
>                 Key: LUCENE-6216
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Christian Moen
>            Priority: Minor
> Japanese-specific token attributes such as {{PartOfSpeechAttribute}}, {{BaseFormAttribute}},
> etc. get their values from a {{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}}
> method.  This makes it cumbersome to change these token attributes later on in the analysis
> chain since the {{Token}} instances are difficult to instantiate (sort of read-only objects).
> I've run into this issue in LUCENE-3922 (JapaneseNumberFilter) where it would be appropriate
> to update token attributes to also reflect Japanese number normalization.
> I think it might be more practical to allow setting a specific value for these token
> attributes directly rather than through a {{Token}}, since it makes the APIs simpler, makes
> it easier to change attributes downstream, and makes it easier to support additional dictionaries.
> The drawback with the approach that I can think of is a performance hit, as we will miss
> out on the inherent lazy retrieval of these token attributes from the {{Token}} object (and
> the underlying dictionary/buffer).
> I'd like to do some testing to better understand the performance impact of this change.
> Happy to hear your thoughts on this.
