lucene-dev mailing list archives

From "Shai Erera (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4563) DirTaxoWriter's Codec - rely on default or use custom?
Date Mon, 19 Nov 2012 14:22:59 GMT


Shai Erera commented on LUCENE-4563:

Not much -- only the top-K facet ordinals are labeled. Also, the taxonomy reader holds a cache,
so if you typically label the same set of categories, the index is likely to be hit rarely,
if ever.

I think this might help during indexing, but I'm not sure. It may also result in a
smaller taxonomy index.
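
To illustrate the caching point above, here is a minimal sketch (not the taxonomy reader's actual code) of an LRU cache sitting in front of an ordinal-to-label lookup. The names `LabelCache`, `indexLookup`, and `indexHits` are hypothetical, and the lookup function merely stands in for reading the taxonomy index.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical model: an LRU cache in front of an ordinal -> label lookup.
class LabelCache {
    // Stands in for reading a label from the taxonomy index (assumption).
    private final IntFunction<String> indexLookup;
    private final Map<Integer, String> cache;
    // Counts how often the backing "index" had to be consulted.
    int indexHits = 0;

    LabelCache(final int capacity, IntFunction<String> indexLookup) {
        this.indexLookup = indexLookup;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                return size() > capacity;
            }
        };
    }

    String label(int ordinal) {
        String cached = cache.get(ordinal);
        if (cached != null) {
            return cached; // served from the cache, index untouched
        }
        indexHits++;
        String label = indexLookup.apply(ordinal);
        cache.put(ordinal, label);
        return label;
    }
}
```

If the same few ordinals are labeled over and over, `indexHits` stops growing once they are cached, which is why the codec used by the index matters little for labeling.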
> DirTaxoWriter's Codec - rely on default or use custom?
> ------------------------------------------------------
>                 Key: LUCENE-4563
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
> Today, DirTaxoWriter opens an IndexWriter using the default Codec. While running tests,
> I noticed that some of them sometimes take a very long time to complete. Debugging,
> I realized that they use the SimpleText codec because that's what the test framework drew at random.
> That got me thinking about whether we should really depend on the default Codec, or use a special
> codec that is better suited to the taxonomy index's unique characteristics. Basically, the
> taxonomy index has two fields:
> # One in which the category path is saved as a StringField, so each term is
> associated with exactly one document.
> # Another field with a single term, such that a category's parent is written in the position
> of that term for every document.
> Initially, I thought that we should really be using PulsingCodec. After a brief chat
> about it with Robert, he said that the Lucene41 Codec acts like pulsing for fields like that. So
> I'm thinking that we should either:
> * Hard-code to Lucene41, if it's indeed useful.
> * Write a completely new Codec that is specialized for this case. That is, Lucene41 may handle
> these cases efficiently, but its code needs to be prepared for other cases too, so
> we may be able to write something more efficient.
> I'm opening this as a placeholder. I think that we should first come up with a decent benchmark
> test in order to validate the results. The benchmark package now contains some facet-related
> stuff, so I'll check whether that's enough.
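
The two-field layout described in the issue can be modeled in memory as a rough sketch. This is not Lucene code: `TinyTaxonomy` and its methods are hypothetical, and the conventions that the root is the empty path at ordinal 0 with parent -1 and that path components are separated by '/' are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the two taxonomy "fields"; not Lucene code.
class TinyTaxonomy {
    // Field 1: full category path -> ordinal. Each path has exactly one entry,
    // mirroring "each term is associated with exactly one document".
    private final Map<String, Integer> pathToOrdinal = new HashMap<>();
    // Field 2: parents.get(ordinal) is the ordinal of that category's parent.
    private final List<Integer> parents = new ArrayList<>();

    TinyTaxonomy() {
        // Assumption of this sketch: root is the empty path, ordinal 0, parent -1.
        pathToOrdinal.put("", 0);
        parents.add(-1);
    }

    // Adds a '/'-separated path (and any missing ancestors); returns its ordinal.
    int add(String path) {
        Integer existing = pathToOrdinal.get(path);
        if (existing != null) {
            return existing;
        }
        int slash = path.lastIndexOf('/');
        int parent = add(slash < 0 ? "" : path.substring(0, slash));
        int ordinal = parents.size();
        pathToOrdinal.put(path, ordinal);
        parents.add(parent);
        return ordinal;
    }

    int parentOf(int ordinal) {
        return parents.get(ordinal);
    }

    int size() {
        return parents.size();
    }
}
```

Both structures are dense and unique-valued: every path is seen once and every ordinal has exactly one parent, which is the kind of shape a specialized codec could exploit.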
