lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5609) Should we revisit the default numeric precision step?
Date Sun, 20 Apr 2014 16:41:15 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975183#comment-13975183
] 

Uwe Schindler commented on LUCENE-5609:
---------------------------------------

bq. Have a look at LUCENE-1470, even 2 was considered then.

That was not really usable even at that time! The improvement compared to 4 was zero; it
was even worse, because the term dictionary got larger, which had an impact in 2.x and 3.x.
At that time, I always used 8 as precisionStep for longs and ints. The same applied to
Solr; Lucene was the only one using 4 as default, and ElasticSearch cloned Lucene's defaults.

I would really prefer to use 8 for both ints and longs. The change from 8 to 16 increases
the number of terms immensely, while the index size difference between 8 and 16 is not really
a problem. My experience has also shown that, because of the way floats/doubles are encoded,
a precision step of 8 is really good for longs: in most cases the high-order bits (like the
exponent) never change, so there is exactly one term indexed for them.
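The term-count arithmetic behind this can be sketched in a few lines of plain Java (not
Lucene API): the trie encoding indexes one term per shift 0, p, 2p, ... below the value's
bit width, i.e. ceil(64/p) terms per 64-bit value.

```java
// Sketch (not Lucene API): how many trie terms a single 64-bit value
// produces for a given precisionStep -- one term per shift below 64 bits.
public class TrieTermCount {
    static int termsPerLong(int precisionStep) {
        // equivalent to ceil(64 / precisionStep)
        return (64 + precisionStep - 1) / precisionStep;
    }

    public static void main(String[] args) {
        for (int p : new int[] {2, 4, 8, 16}) {
            System.out.println("precisionStep=" + p + " -> "
                + termsPerLong(p) + " terms per value");
        }
    }
}
```

So going from 8 to 16 halves the per-value term count (8 vs. 4), while going from 4 to 2
doubles an already large one (16 vs. 32) - which is why 2 never paid off.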

With a precision step of 16 I would imagine the difference between 16 and 64 would be negligible,
too :-) The main reason for lower precision steps is indexes where the values are equally
distributed. For values clustered around some numbers, the precision step is irrelevant:
because of the way the encoding works, for larger shifts the indexed value is constant,
so you have one or two terms that hit all documents and are never used by the range query.
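The clustering effect can be demonstrated with a small sketch (plain Java, not Lucene code):
for 100 values clustered around 1000, the shifted prefixes at shift 16 all collapse to a
single term.

```java
// Sketch: for values clustered around a point, the high-shift trie
// prefixes collapse to one term, so a large precision step costs
// almost nothing for such fields.
import java.util.HashSet;
import java.util.Set;

public class ClusteredPrefixes {
    static long prefix(long value, int shift) {
        return value >>> shift;   // the higher-precision trie term
    }

    public static void main(String[] args) {
        Set<Long> atShift0 = new HashSet<>();
        Set<Long> atShift16 = new HashSet<>();
        for (long v = 1000; v < 1100; v++) {   // 100 clustered values
            atShift0.add(prefix(v, 0));
            atShift16.add(prefix(v, 16));
        }
        System.out.println(atShift0.size() + " distinct full terms, "
            + atShift16.size() + " distinct prefix terms at shift 16");
    }
}
```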

So before changing the default, I would suggest running a test with an index that has equally
distributed numbers over the full 64-bit range.

bq. I think 11 is better than 12

...because the last term is better used. The number of terms indexed is the same for 11 and
12: both need 6 terms, since 6*11=66 and 6*12=72 both cover 64 bits, while 5*12=60 is too
small. But unfortunately 11 is not a multiple of 4, so it would not be backwards compatible.
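The "last term is better used" argument can be checked with a short calculation (plain Java,
not Lucene API): with 6 terms, the topmost term covers whatever bits the lower 5 terms leave
over, and 11 leaves more useful bits there than 12 does.

```java
// Sketch: why precisionStep 11 beats 12 for 64-bit longs -- the term
// count is the same (6), but the topmost term covers more of its bits.
public class LastTermWidth {
    static int termCount(int p) {
        return (64 + p - 1) / p;              // ceil(64 / p)
    }

    static int lastTermBits(int p) {
        return 64 - (termCount(p) - 1) * p;   // bits left for the top term
    }

    public static void main(String[] args) {
        for (int p : new int[] {11, 12}) {
            System.out.println("p=" + p + ": " + termCount(p)
                + " terms, top term spans " + lastTermBits(p)
                + " of " + p + " bits");
        }
    }
}
```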

I think the main problem of this issue is that we only have *one* default. Somebody who never
does any range queries does not need the additional terms at all. That's the main problem.
Solr is better here, as it provides 2 predefined field types, but Lucene only has one - and
that is the bug.

So my proposal: provide a 2nd field type as a 2nd default with proper documentation, suggesting
it to users who only want to index numeric identifiers, or non-docvalues fields they want to
sort on.

And second, we should do LUCENE-5605 - I started on it last week, but was interrupted by
_NativeFSIndexCorrumpter_ :-)  The problem is the precisionStep altogether! We should make
it an implementation detail: when constructing a NumericRangeQuery, you should not need to
pass it. Because of this I opened LUCENE-5605, so anybody creating a NRQ/NRF should pass the
FieldType to the NRQ ctor, not an arbitrary number. Then it's ensured that people use the
same settings for indexing and querying.
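A rough sketch of that idea (all names invented for illustration - this is not the actual
Lucene or LUCENE-5605 API): the precision step lives only inside the field type, and the
query constructor takes the field type rather than a raw number.

```java
// Hypothetical sketch of the "pass the FieldType, not a number" idea.
// None of these classes exist in Lucene; they only illustrate the shape.
public class FieldTypeSketch {
    static final class NumericType {
        final int precisionStep;   // an implementation detail of the type
        NumericType(int precisionStep) { this.precisionStep = precisionStep; }
    }

    static final class RangeQuery {
        final int precisionStep;
        final long lower, upper;
        // the caller passes the field type, not an arbitrary number,
        // so indexing and querying are guaranteed to agree
        RangeQuery(NumericType type, long lower, long upper) {
            this.precisionStep = type.precisionStep;
            this.lower = lower;
            this.upper = upper;
        }
    }

    public static void main(String[] args) {
        NumericType forRanges = new NumericType(8);
        RangeQuery q = new RangeQuery(forRanges, 0L, 100L);
        System.out.println("query uses precisionStep=" + q.precisionStep);
    }
}
```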

Together with this, we should provide 2 predefined field types per data type and remove the
constant from NumericUtils completely. The 2 field types per data type might be named something
like DEFAULT_INT_FOR_RANGEQUERY_FIELDTYPE and DEFAULT_INT_OTHERWISE_FIELDTYPE (please choose
better names and javadocs). And we should make 8 the new default, which is fully backwards
compatible. And hide the precision step completely! 16 is really too large for lots of queries,
and the difference in index size is negligible, unless you have a purely numeric index (in
which case you should use an RDBMS instead of a Lucene index to query your data :-) !).
Indexing time is also, as Mike discovered, not a problem at all: if people don't reuse the
IntField instance, it's always equally slow, because the TokenStream has to be recreated for
every number. The number of terms is not the issue at all, sorry!

About ElasticSearch: unfortunately the schemaless mode of ElasticSearch always uses 4 as precStep
if it detects a numeric or date type. ES should change this, but maybe with a bit more intelligent
"guessing". E.g., if you index the "_id" field as an integer, it should automatically use an
infinite (DEFAULT_INT_OTHERWISE_TYPE) precStep - nobody would do range queries on the "_id"
field. For all standard numeric fields it should use precStep=8.

> Should we revisit the default numeric precision step?
> -----------------------------------------------------
>
>                 Key: LUCENE-5609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5609
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Michael McCandless
>             Fix For: 4.9, 5.0
>
>         Attachments: LUCENE-5609.patch
>
>
> Right now it's 4, for both 8 (long/double) and 4 byte (int/float)
> numeric fields, but this is a pretty big hit on indexing speed and
> disk usage, especially for tiny documents, because it creates many (8
> or 16) terms for each value.
> Since we originally set these defaults, a lot has changed... e.g. we
> now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict,
> a faster postings format, etc.
> Index size is important because it limits how much of the index will
> be hot (fit in the OS's IO cache).  And more apps are using Lucene for
> tiny docs where the overhead of individual fields is sizable.
> I used the Geonames corpus to run a simple benchmark (all sources are
> committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
> with these numeric fields:
>   * lat/lng (double)
>   * modified time, elevation, population (long)
>   * dem (int)
> I tested 4, 8 and 16 precision steps:
> {noformat}
> indexing:
> PrecStep        Size        IndexTime
>        4   1812.7 MB        651.4 sec
>        8   1203.0 MB        443.2 sec
>       16    894.3 MB        361.6 sec
> searching:
>      Field  PrecStep   QueryTime   TermCount
>  geoNameID         4   2872.5 ms       20306
>  geoNameID         8   2903.3 ms      104856
>  geoNameID        16   3371.9 ms     5871427
>   latitude         4   2160.1 ms       36805
>   latitude         8   2249.0 ms      240655
>   latitude        16   2725.9 ms     4649273
>   modified         4   2038.3 ms       13311
>   modified         8   2029.6 ms       58344
>   modified        16   2060.5 ms       77763
>  longitude         4   3468.5 ms       33818
>  longitude         8   3629.9 ms      214863
>  longitude        16   4060.9 ms     4532032
> {noformat}
> Index time is with 1 thread (for identical index structure).
> The query time is time to run 100 random ranges for that field,
> averaged over 20 iterations.  TermCount is the total number of terms
> the MTQ rewrote to across all 100 queries / segments, and it gets
> higher as expected as precStep gets higher, but the search time is not
> that heavily impacted ... negligible going from 4 to 8, and then some
> impact from 8 to 16.
> Maybe we should increase the int/float default precision step to 8 and
> long/double to 16?  Or both to 16?


