lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
Date Sun, 08 Nov 2009 00:04:32 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774692#action_12774692 ]

Michael McCandless commented on LUCENE-1458:
--------------------------------------------

Initial results.  Performance is catastrophically bad for the MultiTermQueries: the range query
(body:[tec TO tet]) and the prefix query (real*) lose 92-99% of their QPS!  Something silly
must be up....

JAVA:
java version "1.5.0_19"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02)
Java HotSpot(TM) Server VM (build 1.5.0_19-b02, mixed mode)


OS:
SunOS rhumba 5.11 snv_111b i86pc i386 i86pc Solaris

||Query||Deletes %||Tot hits||QPS old||QPS new||Pct change||
|body:[tec TO tet]|0.0|body:[tec TO tet]|3.06|0.23|{color:red}-92.5%{color}|
|body:[tec TO tet]|0.1|body:[tec TO tet]|2.87|0.22|{color:red}-92.3%{color}|
|body:[tec TO tet]|1.0|body:[tec TO tet]|2.85|0.22|{color:red}-92.3%{color}|
|body:[tec TO tet]|10|body:[tec TO tet]|2.83|0.23|{color:red}-91.9%{color}|
|1|0.0|1|22.15|23.87|{color:green}7.8%{color}|
|1|0.1|1|19.89|21.72|{color:green}9.2%{color}|
|1|1.0|1|19.47|21.55|{color:green}10.7%{color}|
|1|10|1|19.82|21.13|{color:green}6.6%{color}|
|2|0.0|2|23.54|25.97|{color:green}10.3%{color}|
|2|0.1|2|21.12|23.56|{color:green}11.6%{color}|
|2|1.0|2|21.37|23.27|{color:green}8.9%{color}|
|2|10|2|21.55|23.10|{color:green}7.2%{color}|
|+1 +2|0.0|+1 +2|7.13|6.97|{color:red}-2.2%{color}|
|+1 +2|0.1|+1 +2|6.40|6.77|{color:green}5.8%{color}|
|+1 +2|1.0|+1 +2|6.41|6.64|{color:green}3.6%{color}|
|+1 +2|10|+1 +2|6.65|6.98|{color:green}5.0%{color}|
|+1 -2|0.0|+1 -2|7.78|7.95|{color:green}2.2%{color}|
|+1 -2|0.1|+1 -2|7.11|7.31|{color:green}2.8%{color}|
|+1 -2|1.0|+1 -2|7.18|7.27|{color:green}1.3%{color}|
|+1 -2|10|+1 -2|7.11|7.70|{color:green}8.3%{color}|
|1 2 3 -4|0.0|1 2 3 -4|5.03|4.91|{color:red}-2.4%{color}|
|1 2 3 -4|0.1|1 2 3 -4|4.62|4.39|{color:red}-5.0%{color}|
|1 2 3 -4|1.0|1 2 3 -4|4.72|4.67|{color:red}-1.1%{color}|
|1 2 3 -4|10|1 2 3 -4|4.78|4.74|{color:red}-0.8%{color}|
|real*|0.0|real*|28.40|0.19|{color:red}-99.3%{color}|
|real*|0.1|real*|26.23|0.20|{color:red}-99.2%{color}|
|real*|1.0|real*|26.04|0.20|{color:red}-99.2%{color}|
|real*|10|real*|26.83|0.20|{color:red}-99.3%{color}|
|"world economy"|0.0|"world economy"|18.82|17.83|{color:red}-5.3%{color}|
|"world economy"|0.1|"world economy"|18.64|17.99|{color:red}-3.5%{color}|
|"world economy"|1.0|"world economy"|18.97|18.35|{color:red}-3.3%{color}|
|"world economy"|10|"world economy"|19.59|18.12|{color:red}-7.5%{color}|
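
For anyone reading the table: the "Pct change" column appears to be the relative difference between the new and old QPS. A minimal sketch of that arithmetic, assuming this formula (the class and method names here are illustrative only and are not part of the benchmark code):

```java
public class PctChange {
    // Relative change in queries per second, expressed as a percentage of the old value.
    // Assumption: this is how the "Pct change" column was derived.
    static double pctChange(double qpsOld, double qpsNew) {
        return 100.0 * (qpsNew - qpsOld) / qpsOld;
    }

    // How many times slower the new code is than the old code.
    static double slowdownFactor(double qpsOld, double qpsNew) {
        return qpsOld / qpsNew;
    }

    public static void main(String[] args) {
        // body:[tec TO tet] at 0.0% deletes: 3.06 -> 0.23 QPS
        System.out.println(pctChange(3.06, 0.23));
        // real* at 0.0% deletes: 28.40 -> 0.19 QPS
        System.out.println(pctChange(28.40, 0.19));
        System.out.println(slowdownFactor(28.40, 0.19));
    }
}
```

Under that assumption the first row works out to roughly -92.5% and the real* row to roughly -99.3%, which matches the table; in other words the prefix query is running on the order of 150 times slower.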


> Further steps towards flexible indexing
> ---------------------------------------
>
>                 Key: LUCENE-1458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
>     uses tii/tis files, but the tii only stores term & long offset
>     (not a TermInfo).  At seek points, tis encodes term & freq/prox
>     offsets absolutely instead of with deltas.  Also, tis/tii
>     are structured by field, so we don't have to record field number
>     in every term.
> .
>     On the first 1M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
>     RAM usage when loading terms dict index is significantly less
>     since we only load an array of offsets and an array of String (no
>     more TermInfo array).  It should be faster to init too.
> .
>     This part is basically done.
>   * Introduces a modular reader codec that strongly decouples terms dict
>     from docs/positions readers.  EG there is no more TermInfo used
>     when reading the new format.
> .
>     There's nice symmetry now between reading & writing in the codec
>     chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>     This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
>     terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>     This replaces TermEnum/TermDocs/TermPositions.  SegmentReader emulates the
>     old API on top of the new API to keep back-compat.
>     
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>     fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
>     old API on top of new one, switch all core/contrib users to the
>     new API.
>   * Maybe switch to AttributeSource as the base class for TermsEnum,
>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>     (not just index-file-format flexibility).  EG if someone wanted
>     to store payload at the term-doc level instead of
>     term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.
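
The seek-point idea from the first bullet of the description (offsets written absolutely at seek points, as deltas elsewhere) can be sketched roughly as below. This is only an illustration of the general technique, not code from the patch: the class name, the INTERVAL constant and the flat long[] layout are all assumptions, and the actual tis/tii file format is not reproduced here.

```java
/** Illustrative only: not Lucene's terms dict code. All names are hypothetical. */
public class SeekPointOffsets {
    // Hypothetical spacing between seek points.
    static final int INTERVAL = 4;

    /** Encode ascending offsets: absolute at every INTERVAL-th entry, delta from the previous entry otherwise. */
    static long[] encode(long[] offsets) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % INTERVAL == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    /** Recover entry `target` by starting at the nearest preceding seek point. */
    static long decodeAt(long[] encoded, int target) {
        int seekPoint = (target / INTERVAL) * INTERVAL;
        long offset = encoded[seekPoint]; // absolute value, so no earlier state is needed
        for (int i = seekPoint + 1; i <= target; i++) {
            offset += encoded[i]; // apply deltas up to the target
        }
        return offset;
    }

    public static void main(String[] args) {
        long[] offsets = {10, 25, 40, 70, 100, 130, 135, 200};
        long[] encoded = encode(offsets);
        System.out.println(decodeAt(encoded, 6)); // prints 135
    }
}
```

The point of the absolute values is that a reader can jump straight to a seek point and start decoding there, without carrying over running totals from the entries before it.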


