lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
Date Fri, 03 Apr 2009 12:22:12 GMT


Michael McCandless commented on LUCENE-1582:

bq. Maybe this should be provided as a separate sub-issue (or top-level issue), because I cannot
apply patches to core. Mike, can you do this when we commit this?

It's fine to include these changes in this patch -- I can commit them all at once.

bq. But since a TokenStream instance has to be generated for every numeric value, the GC
cost is about the same for the new and old API, especially because each TokenStream creates a
LinkedHashMap internally for the attributes.

Hmm, we should do some perf tests to see how big a deal this turns out to be.  It'd be nice
to get some sort of reuse API working if performance is really hurt.  (E.g., Analyzers can provide
reusableTokenStream, keyed by thread.)  You'd presumably have to key on thread and field
name.  If you do this, then a shortcut helper method should probably be the preferred way.
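
One way to picture the reuse idea above is a cache keyed first by thread and then by field
name. The following is a minimal sketch in plain Java, not Lucene code: the class
PerThreadFieldCache and its get method are hypothetical names, and any reusable object
(here a StringBuilder in the usage note) stands in for the stream that would be cached.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper, not part of Lucene: caches one reusable object
// per (thread, field name) pair.
final class PerThreadFieldCache<T> {

    // Each thread sees its own map, so no synchronization is needed.
    private final ThreadLocal<Map<String, T>> perThread =
            ThreadLocal.withInitial(HashMap::new);

    // Creates a fresh instance the first time a (thread, field) pair is seen.
    private final Supplier<T> factory;

    PerThreadFieldCache(Supplier<T> factory) {
        this.factory = factory;
    }

    // Returns the instance cached for the calling thread and the given field,
    // creating it on first use.
    T get(String fieldName) {
        return perThread.get().computeIfAbsent(fieldName, name -> factory.get());
    }
}
```

With new PerThreadFieldCache<>(StringBuilder::new), repeated calls to get("price") on one
thread return the same instance, while another field or another thread gets its own. A
caller would still have to reset the cached object to the new value before each use, which
is the step a shortcut helper method could hide.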

bq. Just a question for the indexer people: Is it possible to add two fields with the same
field name to a document, both with a TokenStream? 

Each with a different TokenStream instance, right?  Yes, this should be fine; the tokens are
"logically" concatenated just like multi-valued String fields.

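As a toy model of that "logical" concatenation, the sketch below drains one token source
per added value, in the order the values were added, into a single sequence for the field.
It is only an illustration: plain iterators stand in for TokenStream instances, and
MultiValuedFieldSketch and tokensForField are hypothetical names, not indexer code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model only: plain iterators stand in for TokenStream instances.
final class MultiValuedFieldSketch {

    // Consumes each value's token source in the order the values were added
    // and appends the tokens to one sequence for the field.
    static List<String> tokensForField(List<Iterator<String>> streamsInAddOrder) {
        List<String> all = new ArrayList<>();
        for (Iterator<String> stream : streamsInAddOrder) {
            while (stream.hasNext()) {
                all.add(stream.next());
            }
        }
        return all;
    }
}
```

Two sources yielding a1, a2 and b1 produce the single sequence a1, a2, b1 for the field.
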
> Make TrieRange completely independent from Document/Field with TokenStream of prefix
encoded values
> ---------------------------------------------------------------------------------------------------
>                 Key: LUCENE-1582
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 2.9
>         Attachments: LUCENE-1582.patch
> TrieRange currently has the following problems:
> - To add a field that uses a trie encoding, you can manually add each term to the index
or use a helper method from TrieUtils. The helper method has the problem that it uses a fixed
field configuration.
> - TrieUtils currently creates, by default, a helper field containing the lower precision
terms to enable sorting (limitation of one term/document for sorting).
> - trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which is heavy on
GC if you index a lot of numeric values. Also, a lot of char[] to String copying is involved.
> This issue should improve this:
> - trieCodeLong/Int() returns a TokenStream. During encoding, all char[] arrays are reused
by the Token API; additional String[] arrays for the encoded result are not created. Instead, the
TokenStream enumerates the trie values.
> - Trie fields can be added to Documents during indexing using the standard API: new Field(name,TokenStream,...),
so no extra util method is needed. By using token filters, one could also add payloads and so
on, and customize everything.
> The drawback is that sorting would no longer work. To enable sorting, a (sub-)issue can
extend the FieldCache to stop iterating the terms as soon as a lower precision one is enumerated
by TermEnum. I will create a "hack" patch for TrieUtils use only, which uses an unchecked
Exception in the Parser to stop iteration. With LUCENE-831, a more generic API for this type
can be used (custom parser/iterator implementation for FieldCache). I will attach the field
cache patch (with the temporary solution, until FieldCache is reimplemented) as a separate
patch file, or maybe open another issue for it.
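
The sorting workaround in the description can be sketched as follows. This is a toy model
under stated assumptions, not the TrieUtils encoding or the FieldCache API: the term layout
(one level character followed by 16 hex digits), the class TrieSketch, and
StopFillException are all invented for illustration. It shows terms being enumerated
lazily rather than collected in a String[], and a fill loop that uses an unchecked
exception to stop at the first lower precision term. It does not model char[] reuse.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.TreeSet;

// Illustrative only: this is NOT Lucene's TrieUtils encoding or FieldCache API.
final class TrieSketch {

    // Hypothetical term layout: one char for the precision level, then 16 hex digits.
    // Level 'a' is full precision, so full-precision terms sort before all others.
    static String term(long value, int shift, int precisionStep) {
        char level = (char) ('a' + shift / precisionStep);
        return level + String.format("%016x", value >>> shift);
    }

    // Lazily enumerates the terms of one value, one per precision level,
    // instead of materializing them in a String[]. precisionStep must be positive.
    static Iterator<String> encode(final long value, final int precisionStep) {
        return new Iterator<String>() {
            private int shift = 0;

            @Override
            public boolean hasNext() {
                return shift < 64;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String t = term(value, shift, precisionStep);
                shift += precisionStep;
                return t;
            }
        };
    }

    // Unchecked exception used only as a signal to end the cache-filling loop.
    static final class StopFillException extends RuntimeException {
    }

    // Parses a full-precision term and refuses anything of lower precision.
    static long parse(String term) {
        if (term.charAt(0) != 'a') {
            throw new StopFillException();
        }
        return Long.parseUnsignedLong(term.substring(1), 16);
    }

    // Walks the terms in sorted order, as a term enumeration would, and stops
    // at the first lower-precision term.
    static List<Long> fill(TreeSet<String> sortedTerms) {
        List<Long> values = new ArrayList<>();
        try {
            for (String t : sortedTerms) {
                values.add(parse(t));
            }
        } catch (StopFillException stop) {
            // Expected: every remaining term has lower precision.
        }
        return values;
    }
}
```

Because the level character leads each term, all full-precision terms come first in sorted
order, so the loop can stop on the first exception without visiting the remaining terms.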

