lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <>
Subject RE: TrieRange
Date Sat, 07 Feb 2009 11:04:11 GMT
Hi Yonik,

I combine both mails into one:

> >> Encoding a slice per character makes the code simpler, but increases
> >> the size of the index... but perhaps not enough to worry about in
> >> practice?
> >
> > This is correct. For 2bit and 4bit there is a lot of overhead by this,
> but
> > there is no way round (any ideas how to fix this?). But 8bit is the most
> > compact one. There needs to be more testing and benchmarking.
> Separate bit slicing and String encoding.... they are independent.
> If a,b,c,d are prefix codes designating precision, and w,x,y,z are
> each 2 bits of the number, then
> ax
> bxy
> cxyz
> dxyzw
> Everything after the prefix can be encoded in a single character in each
> case.
> Lucene's prefix encoding of the index will remove some of the
> redundancy... buy only for numbers that are very packed together.

This is a really great idea, to bad, that I got it not before christmas.
This change would require an almost complete rewrite of TrieUtils and partly
of TrieRangeFilter. The idea would be to change the following:

- For encoding the unsigned longs use the same encoding as Solr's
NumberUtils does (because of the trie encoded fields, it must be a little
bit different structured, but the encoding would be identical, only the java
code different), which is even more compacted as the current 8bit encoding.
This would be a good place to merge the NumberUtils again with Lucene. This
encodings would be stored in the main field (and optionally stored). The
encoding is always used, even for 8bit, 4bit and 2bit (it's always the
same). This makes decoding of stored trie encoded fields and sorting
simplier (you only need one en-/decoder). No more need for the current auto
detect decoders.

- The additional trie encoded lower precision slices (stored in an extra
field, there is no possibility anymore to store it in same field, more to
that later) are generated not on the character string side, they are
generated before encoding. The first character still specifies the precision
(slice length). The encoded field comes after that, but the lower precision
is not generated by stripping chars from the right, it is just a binary
operation (anding with a mask) on the original unsigned long (just remove
the lower bits by setting to 0). The manipulated long is then encoded in the
same way, prefixed with the length char. An optimization might be to remove
the lower 0 bits from the string, but it would not be needed. The strings
are unique for one precision (no difference between 0-bits there or not).

- TrieRangeFilter would do the same as before but with some changes: It
would not split the range using the string encoded field, it would also use
the unsigned long, anding the bits. The only thing to change is
increment/decrementTrieCoded() from TrieUtils which then must operate on the
longs, but this can be done easily. This method is needed for fitting the
ranges against each other and exclusive ranges.

I think about it some more days and will write a patch/rewrite and test a
little bit. The tests have to pass for TestTrieRangeQuery, if they does, all
is OK :)

This change would require a complete API change (as you proposed), the only
problem is, I pointed a lot of people to the new (unreleased code). They
have to reindex their docs and change the code. So my question: Is anybody
using TrieRangeQuery at the moment?

> > The encoding of the values
> > into two different field names does the trick for the whole range query.
> > Removing the code that generates the field in exactly that way would
> remove
> > the idea behind TrieRangeFilter.
> Allowing the ability to specify the exact name of both fields (or
> specify both names as the same)?

With the change above, there is no way to use the same field name (currently
the two fields can be differentiated using the character value of the first
cahr, if <0x100 it's a prefixed field, if >=0x100 it's a raw value. But this
could be changed if also the highest precision is prefixed in index (not for
the stored field, but for the indexed ones). If I would keep the highest
precision without prefix, the fields cannot merged, but a traditional
RangeQuery using the NumberUtils encoding would work!!! (and the test case,
comparing both...).

> > The only thing that could be changed would
> > be to make the suffix on the helper field variable. There is lot of
> > optimization behind this (see TrieRangeFilter comments in the central
> > splitRange method). This depends on the order and so the names of the
> field
> > and its helper.
> Are you saying that the actual field names cause a predictable
> performance difference?  This should not be the case.

The field names could be changed, sure, the small performance optimization
is in TrieRangeFilter: The splitting of the range is done in a way, to not
seek back and forward in the Term list, just in forward direction. This is
only possible if the helper field is ordered *after* the original one. But
this can be detected by string comparison of the field names and then the
order of splitting is changed.

> > Very early versions of the trie algorithm was using a extra
> > field name for each slice, as you call it, I removed it completely later
> (no
> > helper field at all), but then trie fields could not be sorted any more.
> The
> > extra helper field is the way in the middle.
> >
> > Maybe there should be more documentation about that by a more native
> > speaker.
> I understand how it works and how one would need to configure it such
> that it be sortable if needed - but my point was really much more
> about allowing people to do things differently if needed.

Propose an API to generate the document fields, which should be simple for
beginners (like now), but flexible!

One possibility would be to generate the main field and helper fields
separate. The helper fields without special configutation (no norms, no
store, not tf). The main field is generated with all possibilities. Maybe I
split the method into two ones, one that only generates the helper field
(with any name), and the other one like now (for easy usage). If somebody
uses the first one (only helper), he needs to generate the main field with
all options (norms, tf,..) for himself (in a compatible way).

> >> For example, whether to encode values in two fields and exactly what
> >> those fields are named, seems like it should be under developer
> >> control.  Likewise, developers should be in control of creating and
> >> adding fields to documents and setting other properties like
> >> omitTerms, omitNorms, etc.
> >
> > In my opinion there is no sense in having norms or such things on trie
> > fields. They should only be queried using TrieRangeFilter/Query, for
> that
> > norms and TF are not needed (as they are numerical values, for what do
> you
> > need the norms?).
> norms: they also fold in index time boosts.

Currently norms are switched of for trie fields, this should optimize
indexing time. Why switch them on for trie fields?

> What if someone wanted to put payloads or use some other future
> indexing method on the terms?

See above.

> It's more about not forcing decisions on all developers.
> There would be no way for me to incorporate the Trie stuff into Solr
> as it stands - I'd need to develop custom code that duplicated code
> from TrieUtils because not enough flexibility is exposed.  I'm not
> adverse to doing so - just pointing out the downsides.

Wait for my next code proposal, and lets look into it. Give me some time,
this is not so simple to change, but would be effective. I do not like all
your proposals (like switching on/of norms/tf/...), but payloads may be
good. Maybe you just throw a IllegalArgumentException in Solr, when somebody
tries to modify norms/tf for numeric fields that are trie encoded.

I hope you like the new ideas. What I would not like is a fork of
TrieRangeQuery in Solr, this would make maintenance harder. Let's make it
better in Lucene Contrib (and that before 2.9, after that I cannot change
the API/encoding anymore).


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message