lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1990) Add unsigned packed int impls in oal.util
Date Mon, 25 Jan 2010 21:23:34 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804723#action_12804723 ]

Michael McCandless commented on LUCENE-1990:
--------------------------------------------

Good progress!

bq. I think Michael's generated code was meant as a temporary solution, until a handcrafted
version was available

Actually that was intended to be a fast impl... the switch should be
compiled to a direct lookup (maybe plus a conditional to catch the
"default" case even though it will never happen...ugh).  But I like
your impl with no conditional at all.  We should test both.

bq. As to whether to use int or long in the interface unsigned packed int, the only numbers
that will probably need to be long in the foreseeable future are docids.

Also the file offsets into the terms dict, possibly the offsets in RAM
into the terms dict character data (UTF8 byte[]).  Also, when we do
column stride fields, we allow storing values > int.  I think we
should stick with {{long get(index)}} for now.
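
To make the {{long get(index)}} point concrete, here is a minimal sketch of a packed store backed by a long[] that can return values wider than an int. It is not code from the attached patches: the class name PackedLongsSketch and its fields are hypothetical, and it keeps a conditional for values that span two blocks purely for readability.

```java
/** Hypothetical sketch (not from the LUCENE-1990 patches): fixed-width unsigned values in a long[]. */
final class PackedLongsSketch {
  private final long[] blocks;     // backing storage, 64 bits per block
  private final int bitsPerValue;  // width of every stored value, 1..64
  private final long mask;         // low bitsPerValue bits set

  PackedLongsSketch(int valueCount, int bitsPerValue) {
    if (valueCount < 0 || bitsPerValue < 1 || bitsPerValue > 64) {
      throw new IllegalArgumentException("bad valueCount or bitsPerValue");
    }
    this.bitsPerValue = bitsPerValue;
    this.mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
    long totalBits = (long) valueCount * bitsPerValue;
    this.blocks = new long[(int) ((totalBits + 63) >>> 6)];
  }

  /** Returns the value at index as a long, so values wider than an int survive intact. */
  long get(int index) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long value = blocks[block] >>> shift;
    int bitsInFirst = 64 - shift;
    if (bitsInFirst < bitsPerValue) {
      // The value spills into the next block; pull in its high bits.
      value |= blocks[block + 1] << bitsInFirst;
    }
    return value & mask;
  }

  /** Stores the low bitsPerValue bits of value at index. */
  void set(int index, long value) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    value &= mask;
    blocks[block] = (blocks[block] & ~(mask << shift)) | (value << shift);
    int bitsInFirst = 64 - shift;
    if (bitsInFirst < bitsPerValue) {
      // Write the remaining high bits into the low end of the next block.
      blocks[block + 1] =
          (blocks[block + 1] & ~(mask >>> bitsInFirst)) | (value >>> bitsInFirst);
    }
  }
}
```

With 33 bits per value, for example, an entry such as 5000000000 (larger than Integer.MAX_VALUE) round-trips through set and get, which an int-returning interface could not express.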

Other comments:

  * Maybe we should move all of this under oal.util.packed?
    (packedints?  ints?)

  * I think we should remove getMaxValue() from the Reader interface?

  * Why create the IMPLEMENTATION enum?  Why not simply return an
    [anonymous] instance of Writer?

  * Why not store bitsPerValue in the header instead of maxValue?  EG
    maybe my maxValue is 7000, but because I'm using directShort,
    bitsPerValue is 16.  Also, the maxValue at write time should not
    have to be known -- eg the factory API should let me ask for a
    direct short writer without declaring the maxValue I will store.

  * I wonder if we should add an optional Object
    getDirectBackingArray().  The packed/aligned impls would return
    null, but the direct byte/short/int/long impls would return their
    array.  This would allow callers to specialize upstream impls to
    do the direct array lookup without the cast-to-long (like how
    FieldComparator now has impls for byte,short,int,long).  I suspect
    for column stride fields, when sorting by an integer field, on a
    32bit arch, this would be a perf win.  But: let's wait until we
    have CSFs, and we can test whether there really is a gain here....

  * I think we shouldn't put a getWriter on every Reader
    impl... because it's a one to many mapping?  Eg the format written
    by PackedWriter can be read by direct byte/short/int/long,
    Packed32/64.

  * For starters I don't think we should make reader impls that can
    read nbits > 31 with an int[] backing array.  I think a long[]
    backing array is fine.

  * I don't think we need separate PRIORITY and BLOCK_PREFERENCE?
    Can't we have a single enum (STORAGE?) with: packed, aligned32,
    aligned64?  "Direct" is really just packed with nbits rounded up
    to 8,16,32,64.

  * Aligned32/64 is very wasteful for certain nbits... I like the idea
    of "auto" to avoid risk that caller picks a bad combination.

  * I think for starters we should not make any reader impls that do
    remapping at load time.
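
The arithmetic behind the bitsPerValue and aligned32/64 bullets can be sketched as follows. This is an illustration only, under the assumption that "aligned" means values never cross a 32- or 64-bit block boundary; the class and method names (PackedSizingSketch, bitsRequired, roundUpToDirect, wastedBitsPerBlock) are hypothetical and do not come from the patch.

```java
/** Hypothetical sizing helpers (not from the LUCENE-1990 patches). */
final class PackedSizingSketch {
  private PackedSizingSketch() {}

  /** Minimum number of bits needed to store any value in 0..maxValue. */
  static int bitsRequired(long maxValue) {
    if (maxValue < 0) {
      throw new IllegalArgumentException("maxValue must be non-negative");
    }
    return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
  }

  /** Rounds a bit width up to the next "direct" width: 8, 16, 32 or 64. */
  static int roundUpToDirect(int bitsPerValue) {
    if (bitsPerValue < 1 || bitsPerValue > 64) {
      throw new IllegalArgumentException("bitsPerValue must be in 1..64");
    }
    if (bitsPerValue <= 8) return 8;
    if (bitsPerValue <= 16) return 16;
    if (bitsPerValue <= 32) return 32;
    return 64;
  }

  /** Bits left unused in each block when values may not span block boundaries. */
  static int wastedBitsPerBlock(int blockBits, int bitsPerValue) {
    if (bitsPerValue < 1 || bitsPerValue > blockBits) {
      throw new IllegalArgumentException("bitsPerValue must be in 1..blockBits");
    }
    return blockBits % bitsPerValue;
  }

  public static void main(String[] args) {
    // maxValue 7000 needs 13 bits, but a direct short impl stores 16 bits per value.
    System.out.println(bitsRequired(7000L) + " -> " + roundUpToDirect(bitsRequired(7000L)));
    // Under the alignment assumption above, 33-bit values fit once per 64-bit block.
    System.out.println(wastedBitsPerBlock(64, 33));
  }
}
```

Under these assumptions, a maxValue of 7000 yields 13 bits while the direct short layout actually uses 16, which is why the header would be less ambiguous if it recorded bitsPerValue. Likewise, 33-bit values in an aligned64 layout leave 31 of every 64 bits unused, the kind of bad combination an "auto" choice would avoid.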


> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
>                 Key: LUCENE-1990
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1990
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1990-te20100122.patch, LUCENE-1990.patch, LUCENE-1990_PerformanceMeasurements20100104.zip
>
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl.  EG the terms dict index in
> the standard codec in LUCENE-1458 could substantially reduce its RAM
> usage.  FieldCache.StringIndex could as well.  And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs  {
>   long get(long index);
>   void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting.  If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
>   PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is a good existing impl
> that has a compatible license that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!
