lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-505) MultiReader.norm() takes up too much memory: norms byte[] should be made into an Object
Date Thu, 15 Jan 2009 21:49:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664280#action_12664280 ]

Michael McCandless commented on LUCENE-505:
-------------------------------------------

bq. In my opinion the problem with large indexes is more that each SegmentReader has a cache
of the last-used norms.

I believe that when MultiReader.norms is called (as Doug & Yonik said above), the underlying
SegmentReaders do not in fact cache the norms (this is not readily obvious until you scrutinize
the code).  I.e., it's only MultiReader that caches the full array.

But I agree there would be real benefits (e.g., not creating fakeNorms) to moving away from byte[]
for norms.  I think an iterator-only API might be fine (giving us more freedom in the implementation),
though I would worry about the performance impact.
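An iterator-only API could look something like the following sketch (purely hypothetical; these class and method names are not in Lucene -- callers walk norms in doc order instead of indexing a byte[], which frees the implementation to stream from disk or mmap):

```java
import java.io.IOException;

// Hypothetical iterator-style norms API: no byte[] is ever materialized.
abstract class NormsEnum {
  /** Doc id of the current norm, or -1 before the first next(). */
  int doc = -1;

  /** Advance to the next document; returns false when exhausted. */
  abstract boolean next() throws IOException;

  /** Encoded norm byte for the current document. */
  abstract byte norm() throws IOException;
}

// An array-backed implementation, for illustration only; a disk- or
// mmap-backed one would have the same shape.
class ByteArrayNormsEnum extends NormsEnum {
  private final byte[] norms;
  ByteArrayNormsEnum(byte[] norms) { this.norms = norms; }
  boolean next() { return ++doc < norms.length; }
  byte norm() { return norms[doc]; }
}
```

The performance worry is that every norm lookup becomes a virtual call instead of an array index, which matters in a scorer's inner loop.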

Or we could add a new method to replace norms() that returns null when the field has no norms,
and then Scorers that use this API would handle the null correctly.  We could fix all core/contrib
code to use the new API...
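A minimal sketch of that contract (hypothetical names, not actual Lucene code; the decode method here is a stand-in for Similarity.decodeNorm, which in Lucene uses a lookup table):

```java
import java.util.HashMap;
import java.util.Map;

class NormsSketch {
  static final Map<String, byte[]> norms = new HashMap<>();

  /** Hypothetical replacement for norms(): null means "no norms stored". */
  static byte[] normsOrNull(String field) {
    return norms.get(field);
  }

  /** A scorer-style lookup that handles the null case correctly. */
  static float normValue(byte[] fieldNorms, int doc) {
    // Absent norms mean no length normalization: score factor 1.0f,
    // with no fakeNorms array ever allocated.
    return fieldNorms == null ? 1.0f : decode(fieldNorms[doc]);
  }

  // Illustrative stand-in for Similarity.decodeNorm.
  static float decode(byte b) {
    return (b & 0xFF) / 255.0f;
  }
}
```

The point is that the null check replaces the fakeNorms allocation: the scorer pays one branch instead of the reader paying one byte per document.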

Also note that with LUCENE-1483 we are moving to searching one segment at a time, so MultiReader.norms
should not normally be called, unless the reader doesn't expose its underlying readers.

> MultiReader.norm() takes up too much memory: norms byte[] should be made into an Object
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-505
>                 URL: https://issues.apache.org/jira/browse/LUCENE-505
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.0.0
>         Environment: Patch is against Lucene 1.9 trunk (as of Mar 1 06)
>            Reporter: Steven Tamm
>            Priority: Minor
>         Attachments: LazyNorms.patch, NormFactors.patch, NormFactors.patch, NormFactors20.patch
>
>
> MultiReader.norms() is very inefficient: it has to construct a byte array as long
as the total number of documents across every segment.  This doubles the memory required for scoring
with a MultiReader vs. a SegmentReader.  Although this array is cached, it is still an unnecessary baseline of memory.
> The problem is that the normalization factors are passed around as a byte[].  If they were
instead replaced with an object, you could perform a whole host of optimizations:
> a.  When reading, you wouldn't have to construct a "fakeNorms" array of all 1.0fs.  You
could instead return a singleton object that always returns 1.0f.
> b.  MultiReader could use an object that delegates to the NormFactors of its subreaders.
> c.  You could write an implementation that uses mmap to access the norm factors.
Or, if the index isn't long-lived, an implementation that reads directly from
the disk.
> The patch provided here replaces the use of byte[] with a new abstract class called NormFactors.
> NormFactors has two methods:
>     public abstract byte getByte(int doc) throws IOException;  // Returns the encoded norm byte for doc
>     public float getFactor(int doc) throws IOException;        // Calls Similarity.decodeNorm(getByte(doc))
> There are four implementations of this abstract class:
> 1.  NormFactors.EmptyNormFactors - replaces the fakeNorms array with a singleton that always returns 1.0f.
> 2.  NormFactors.ByteNormFactors - wraps a byte[] as a NormFactors, for backward compatibility in constructors.
> 3.  MultiNormFactors - multiplexes the NormFactors of the subreaders in MultiReader, avoiding the need to construct the gigantic norms array.
> 4.  SegmentReader.Norm - the same class as before, but it now extends NormFactors to provide the same access.
> In addition, many of the Query and Scorer classes were changed to pass around NormFactors
instead of byte[], and to call getFactor() instead of indexing the byte[].  I have kept
IndexReader.norms(String) around for backward compatibility, but marked it as deprecated.  I believe
that the use of ByteNormFactors in IndexReader.getNormFactors() will keep backward compatibility
with other IndexReader implementations, but I don't know how to test that.
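The abstraction described above could be sketched roughly as follows (class and method names follow the patch description; the bodies are illustrative reconstructions, not the actual attachment, and the decoding here stands in for Similarity.decodeNorm):

```java
import java.io.IOException;

// Sketch of the NormFactors abstraction from the patch description.
abstract class NormFactors {
  /** Returns the encoded norm byte for a document. */
  public abstract byte getByte(int doc) throws IOException;

  /** Decoded factor; stands in for Similarity.decodeNorm(getByte(doc)). */
  public float getFactor(int doc) throws IOException {
    return (getByte(doc) & 0xFF) / 255.0f; // illustrative decoding only
  }

  /** fakeNorms replacement: a singleton where every doc decodes to 1.0f. */
  public static final NormFactors EMPTY = new NormFactors() {
    public byte getByte(int doc) { return (byte) 255; }
  };

  /** ByteNormFactors-style wrapper around an existing byte[]. */
  public static NormFactors wrap(final byte[] norms) {
    return new NormFactors() {
      public byte getByte(int doc) { return norms[doc]; }
    };
  }
}
```

A MultiNormFactors along these lines would hold the subreaders' NormFactors plus their doc-base offsets and forward getByte(doc) to the right one, so the per-segment arrays are never copied into one giant array.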

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

