lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4120) FST should use packed integer arrays
Date Mon, 11 Jun 2012 21:25:43 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293100#comment-13293100 ]

Adrien Grand commented on LUCENE-4120:
--------------------------------------

bq. It seems sort of odd to have the new .save method on ReaderImpl... can it be on Mutable/Impl
instead, or, maybe FST does its own saving or something?

My first intention was to add this method to {{Mutable}}. The problem is that {{nodeRefToAddress}}
needs to be a reader, since it may be instantiated through {{PackedInts.getReader}}, but it
may also need to be serialized because of the {{save}} method. This is why I added the
method to {{Reader}}. I can move it to {{Mutable}}, but then it would no longer
be possible to {{save}} an {{FST}} that was read from disk (maybe not a problem?). Another solution
would be to move the serialization logic to {{FST}}, but this would require exposing some
internals of the packed integer arrays in order to select the right format ({{PACKED}} or {{PACKED_SINGLE_BLOCK}},
depending on whether the reader/mutable is an instance of {{Packed64SingleBlock}}), and I would
really like to avoid this as long as possible.
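
A minimal sketch of the trade-off described above, in Java. All names here ({{SavePlacementSketch}}, its nested {{Reader}}, {{Mutable}} and {{ArrayMutable}}, and the {{readOnly}} and {{serialize}} helpers) are hypothetical stand-ins, not the Lucene classes, and the byte format is made up for illustration. It only shows why declaring {{save}} on the read-only type lets an instance that was loaded from disk be written back out.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class SavePlacementSketch {
    /** Hypothetical read-only view of an integer array. */
    interface Reader {
        long get(int index);
        int size();

        // Declaring save here (rather than on Mutable) lets read-only
        // instances be written out too. The format is made up for this sketch.
        default void save(DataOutputStream out) throws IOException {
            out.writeInt(size());
            for (int i = 0; i < size(); i++) {
                out.writeLong(get(i));
            }
        }
    }

    /** Hypothetical writable view; it inherits save from Reader. */
    interface Mutable extends Reader {
        void set(int index, long value);
    }

    /** In-memory array, standing in for one filled while building a structure. */
    static final class ArrayMutable implements Mutable {
        private final long[] values;
        ArrayMutable(int size) { values = new long[size]; }
        public long get(int index) { return values[index]; }
        public int size() { return values.length; }
        public void set(int index, long value) { values[index] = value; }
    }

    /** Read-only instance, standing in for one loaded from disk. */
    static Reader readOnly(long... values) {
        long[] copy = values.clone();
        return new Reader() {
            public long get(int index) { return copy[index]; }
            public int size() { return copy.length; }
        };
    }

    /** The caller only needs a Reader, however it was obtained. */
    static byte[] serialize(Reader reader) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            reader.save(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        ArrayMutable built = new ArrayMutable(3);
        built.set(0, 7L);
        built.set(1, 11L);
        built.set(2, 13L);
        Reader loaded = readOnly(7L, 11L, 13L);
        // Both paths produce the same bytes because save lives on Reader.
        System.out.println(java.util.Arrays.equals(serialize(built), serialize(loaded)));
    }
}
```

If {{save}} were declared on {{Mutable}} instead, the {{readOnly}} instance in this sketch could not be passed to {{serialize}}, which is the limitation mentioned above for an {{FST}} read from disk.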

bq. In all the places we now pass random.nextFloat() for acceptableOverheadRatio (to FST.pack
or MemoryPostingsFormat), shouldn't it be COMPACT .. FASTEST instead of 0.0 .. 1.0?

0..1 gives the different implementations a better chance of being selected. {{FASTEST=7}} is only
useful for {{bitsPerValue=1}}, so that a {{Direct8}} is instantiated. If we used a uniformly
distributed float between {{COMPACT=0}} and {{FASTEST=7}}, a {{Direct*}} implementation would
be used at least 6/7 of the time when {{bitsPerValue>=4}}. For example, if {{bitsPerValue=15}},
a {{Direct16}} will be instantiated if {{acceptableOverheadRatio>=1/15}} (about 0.07) and a {{Packed64}}
otherwise. A lower upper bound for {{acceptableOverheadRatio}} makes the latter case more
likely.
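
To make the arithmetic above concrete, here is a minimal sketch in Java. It is a simplified model of the selection rule as described in this comment, not the actual {{PackedInts}} code: {{OverheadRatioSketch}} and its methods are hypothetical names. It assumes that a {{Direct*}} implementation is chosen whenever the bits wasted by rounding {{bitsPerValue}} up to 8, 16, 32 or 64 do not exceed {{acceptableOverheadRatio * bitsPerValue}}, and it ignores other implementations such as {{Packed64SingleBlock}}.

```java
final class OverheadRatioSketch {
    // Bit widths of the Direct8/16/32/64 implementations named in the comment.
    private static final int[] DIRECT_WIDTHS = {8, 16, 32, 64};

    /** Smallest direct width able to hold bitsPerValue bits. */
    static int directWidth(int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 64) {
            throw new IllegalArgumentException("bitsPerValue must be in 1..64");
        }
        for (int width : DIRECT_WIDTHS) {
            if (bitsPerValue <= width) {
                return width;
            }
        }
        throw new AssertionError("unreachable");
    }

    /** Smallest overhead ratio at which the direct width becomes acceptable. */
    static double threshold(int bitsPerValue) {
        return (double) (directWidth(bitsPerValue) - bitsPerValue) / bitsPerValue;
    }

    /** Assumed rule: Direct* if the ratio covers the wasted bits, else Packed64. */
    static String pick(int bitsPerValue, double acceptableOverheadRatio) {
        return acceptableOverheadRatio >= threshold(bitsPerValue)
                ? "Direct" + directWidth(bitsPerValue)
                : "Packed64";
    }

    /** Chance of picking Direct* when the ratio is uniform in [0, upperBound). */
    static double directProbability(int bitsPerValue, double upperBound) {
        double t = threshold(bitsPerValue);
        return t >= upperBound ? 0.0 : 1.0 - t / upperBound;
    }

    public static void main(String[] args) {
        for (int bits : new int[] {1, 4, 15}) {
            System.out.printf(
                    "bitsPerValue=%d threshold=%.3f P(Direct*|0..1)=%.3f P(Direct*|0..7)=%.3f%n",
                    bits, threshold(bits),
                    directProbability(bits, 1.0), directProbability(bits, 7.0));
        }
    }
}
```

Under this model, {{bitsPerValue=4}} has a threshold of 1, so a ratio drawn from 0..7 selects {{Direct8}} 6/7 of the time, while a ratio drawn from 0..1 leaves room for the packed implementation.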

bq. [kuromoji], [getWriterByFormat], [javadocs]

Agreed, working on it.


                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

