lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5609) Should we revisit the default numeric precision step?
Date Sun, 20 Apr 2014 17:38:15 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975204#comment-13975204 ]

Robert Muir commented on LUCENE-5609:
-------------------------------------

{quote}
I think the main problem of this issue is that we only have one default. Somebody never doing
any ranges does not need the additional terms at all. That's the main problem. Solr is better
here, as it provides 2 predefined field types, but Lucene only has one - and that is the bug.
{quote}

Well, I kind of agree, but in a different way. 

In my opinion the default numeric types (intfield, longfield, floatfield, doublefield) should
have good defaults for general-purpose use. This includes range queries: they should work
"reasonably" well out of the box. Users that don't need range queries can optimize by changing
to Infinity. Along the same lines, the defaults also don't need to be super-optimized for
"hardcore" esoteric uses of range queries. That's what defaults are: just making the right
tradeoffs for out-of-box use.

I would not be happy if these fields default to precisionStep=Infinity either, because that's
also a bad default for general-purpose use, just in the opposite direction of precisionStep=4.

I am fine with precisionStep=8 as the new default for both, but I don't think it's the best
idea. I think 16 for the 64-bit types is nice because it's easy to understand: "4 terms for
each value". Today, with precisionStep=4, it's 8 terms for each value (32-bit field) and
16 terms for each value (64-bit field).
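
The term counts above follow from simple arithmetic, which can be sketched in a few lines of
plain Java. This is only an illustration, not Lucene code: the class and method names are
made up for this example, it assumes one term is indexed per shift (0, precisionStep,
2*precisionStep, ...) below the value's bit width, and it assumes "Infinity" is passed as
Integer.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class PrecisionStepSketch {

    /** Terms per value, assuming one term per shift below valueBits: ceil(valueBits / precisionStep). */
    static int termsPerValue(int valueBits, int precisionStep) {
        if (valueBits < 1 || precisionStep < 1) {
            throw new IllegalArgumentException("valueBits and precisionStep must be >= 1");
        }
        // Ceiling division done in long so that Integer.MAX_VALUE ("Infinity") cannot overflow.
        return (int) (((long) valueBits + precisionStep - 1) / precisionStep);
    }

    /** The shifts that would each contribute one term under the same assumption. */
    static List<Integer> shifts(int valueBits, int precisionStep) {
        List<Integer> out = new ArrayList<>();
        // long loop counter avoids overflow when precisionStep is Integer.MAX_VALUE.
        for (long shift = 0; shift < valueBits; shift += precisionStep) {
            out.add((int) shift);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] steps = {4, 8, 16, Integer.MAX_VALUE};
        for (int bits : new int[] {32, 64}) {
            for (int step : steps) {
                System.out.println(bits + "-bit, precisionStep=" + step + ": "
                        + termsPerValue(bits, step) + " terms, shifts " + shifts(bits, step));
            }
        }
    }
}
```

Under these assumptions, precisionStep=4 gives 8 terms (32-bit) and 16 terms (64-bit),
precisionStep=16 on a 64-bit field gives the "4 terms for each value" case, and Infinity
gives a single term for either width.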

I also think we should be able to add new types in the future (e.g. 16-bit short and half-float)
and give them different defaults too. So, I don't understand the need for a "one-size-fits-all"
default.


> Should we revisit the default numeric precision step?
> -----------------------------------------------------
>
>                 Key: LUCENE-5609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5609
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Michael McCandless
>             Fix For: 4.9, 5.0
>
>         Attachments: LUCENE-5609.patch
>
>
> Right now it's 4, for both 8 (long/double) and 4 byte (int/float)
> numeric fields, but this is a pretty big hit on indexing speed and
> disk usage, especially for tiny documents, because it creates many (8
> or 16) terms for each value.
> Since we originally set these defaults, a lot has changed... e.g. we
> now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict,
> a faster postings format, etc.
> Index size is important because it limits how much of the index will
> be hot (fit in the OS's IO cache).  And more apps are using Lucene for
> tiny docs where the overhead of individual fields is sizable.
> I used the Geonames corpus to run a simple benchmark (all sources are
> committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
> with these numeric fields:
>   * lat/lng (double)
>   * modified time, elevation, population (long)
>   * dem (int)
> I tested 4, 8 and 16 precision steps:
> {noformat}
> indexing:
> PrecStep        Size        IndexTime
>        4   1812.7 MB        651.4 sec
>        8   1203.0 MB        443.2 sec
>       16    894.3 MB        361.6 sec
> searching:
>      Field  PrecStep   QueryTime   TermCount
>  geoNameID         4   2872.5 ms       20306
>  geoNameID         8   2903.3 ms      104856
>  geoNameID        16   3371.9 ms     5871427
>   latitude         4   2160.1 ms       36805
>   latitude         8   2249.0 ms      240655
>   latitude        16   2725.9 ms     4649273
>   modified         4   2038.3 ms       13311
>   modified         8   2029.6 ms       58344
>   modified        16   2060.5 ms       77763
>  longitude         4   3468.5 ms       33818
>  longitude         8   3629.9 ms      214863
>  longitude        16   4060.9 ms     4532032
> {noformat}
> Index time is with 1 thread (for identical index structure).
> The query time is time to run 100 random ranges for that field,
> averaged over 20 iterations.  TermCount is the total number of terms
> the MTQ rewrote to across all 100 queries / segments, and it gets
> higher as expected as precStep gets higher, but the search time is not
> that heavily impacted ... negligible going from 4 to 8, and then some
> impact from 8 to 16.
> Maybe we should increase the int/float default precision step to 8 and
> long/double to 16?  Or both to 16?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

