lucene-dev mailing list archives

From "Paul Elschot (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1410) PFOR implementation
Date Sat, 08 Nov 2008 10:38:44 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645973#action_12645973 ]

Paul Elschot commented on LUCENE-1410:
--------------------------------------

I've been working on this quite irregularly. I'm trying to give the PFor class a more OO interface
and to get exception patching working at a more decent speed. In case someone else wants to
move this forward faster than it is moving now, please holler.

After rereading this, and also after reading up a bit on MonetDB performance improvement techniques,
I have a few more rants:

Taking another look at the decompression performance figures, and especially the differences
between native C++ and Java, it could become worthwhile to also implement TermQuery in native
code.

With the high decompression speeds of FOR/BITS at lower numbers of frame bits it might also
become worthwhile to compress character data, for example numbers with a low number of different
characters.
Adding a dictionary as in PDICT might help compression even further.
This was probably one of the reasons for the column storage discussed earlier; I'm now sorry
I ignored that discussion.
In the index itself, column storage is also useful. One example is the splitting of document
numbers and frequency into separate streams, another example is various offsets for seeking
in the index.
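
As a rough illustration of compressing character data with a dictionary plus fixed-width
bit packing, here is a minimal sketch. It is not code from the attached patches: the class
DictPackSketch and all of its method names are hypothetical, the dictionary is simply the
sorted set of distinct characters, and frame exceptions are left out entirely.

```java
import java.util.Arrays;

// Hypothetical sketch: names are illustrative, not from the LUCENE-1410 patches.
final class DictPackSketch {

    /** Sorted array of the distinct characters in s; a character's index is its code. */
    static char[] dictionary(String s) {
        char[] sorted = s.toCharArray();
        Arrays.sort(sorted);
        int n = 0;
        for (int i = 0; i < sorted.length; i++) {
            if (i == 0 || sorted[i] != sorted[i - 1]) {
                sorted[n++] = sorted[i];
            }
        }
        return Arrays.copyOf(sorted, n);
    }

    /** Number of frame bits needed to store codes 0..dictSize-1 (dictSize >= 1 assumed). */
    static int bitsFor(int dictSize) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(dictSize - 1));
    }

    /** Replaces every character by its dictionary code. */
    static int[] encode(String s, char[] dict) {
        int[] codes = new int[s.length()];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = Arrays.binarySearch(dict, s.charAt(i));
        }
        return codes;
    }

    /** Maps codes back to characters. */
    static String decode(int[] codes, char[] dict) {
        StringBuilder sb = new StringBuilder(codes.length);
        for (int code : codes) {
            sb.append(dict[code]);
        }
        return sb.toString();
    }

    /** Packs values, each assumed to fit in frameBits bits (1..32), into longs. */
    static long[] pack(int[] values, int frameBits) {
        long[] packed = new long[(values.length * frameBits + 63) / 64];
        long mask = (1L << frameBits) - 1;
        for (int i = 0; i < values.length; i++) {
            long v = values[i] & mask;
            int bitPos = i * frameBits;
            int word = bitPos >>> 6;
            int shift = bitPos & 63;
            packed[word] |= v << shift;
            if (shift + frameBits > 64) { // value straddles two longs
                packed[word + 1] |= v >>> (64 - shift);
            }
        }
        return packed;
    }

    /** Inverse of pack: extracts count values of frameBits bits each. */
    static int[] unpack(long[] packed, int count, int frameBits) {
        int[] values = new int[count];
        long mask = (1L << frameBits) - 1;
        for (int i = 0; i < count; i++) {
            int bitPos = i * frameBits;
            int word = bitPos >>> 6;
            int shift = bitPos & 63;
            long v = packed[word] >>> shift;
            if (shift + frameBits > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            values[i] = (int) (v & mask);
        }
        return values;
    }
}
```

With this sketch, a digit string such as "2008110810" has four distinct characters, so each
character needs 2 frame bits and the ten characters fit in a single long.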

I think it would be worthwhile to add a compressed integer array to the basic types used in
IndexInput and IndexOutput. I'm still struggling with the addition of skip info into a tree
of such compressed integer arrays (skip offsets don't seem to fit naturally into a column,
and I don't know whether the skip size should be the same as the decompressed array size).
Placement of such compressed arrays in the index should also be aware of CPU cache lines and
of VM page (disk block) boundaries.
In higher levels of a tree of such compressed arrays, frame exceptions would be best avoided
to allow direct addressing, but the leaves could use frame exceptions for better compression.
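
To make the trade-off between frame exceptions and direct addressing concrete, here is a
minimal sketch of exception patching. It is a simplified variant and not the PFor class
from the attached patches: the class PatchSketch is hypothetical, exception positions are
kept in an explicit array rather than chained through the slots, and the slots are left as
a plain int[] instead of being bit packed.

```java
import java.util.Arrays;

// Hypothetical sketch; not the PFor class from the LUCENE-1410 patches.
final class PatchSketch {
    final int frameBits;       // assumed to be in 1..31
    final int[] slots;         // low frameBits bits of each value; real code would bit pack these
    final int[] excPositions;  // indexes of values that did not fit in frameBits
    final int[] excValues;     // the full values at those indexes

    PatchSketch(int[] values, int frameBits) {
        this.frameBits = frameBits;
        int mask = (1 << frameBits) - 1;
        slots = new int[values.length];
        int[] pos = new int[values.length];
        int[] val = new int[values.length];
        int n = 0;
        for (int i = 0; i < values.length; i++) {
            slots[i] = values[i] & mask;
            if ((values[i] >>> frameBits) != 0) { // does not fit: record an exception
                pos[n] = i;
                val[n] = values[i];
                n++;
            }
        }
        excPositions = Arrays.copyOf(pos, n);
        excValues = Arrays.copyOf(val, n);
    }

    /** Copies all slots unconditionally, then patches the exceptions in a second pass. */
    int[] decode() {
        int[] out = slots.clone();
        for (int i = 0; i < excPositions.length; i++) {
            out[excPositions[i]] = excValues[i];
        }
        return out;
    }

    /** Direct addressing of a single value; only valid when the block has no exceptions. */
    int get(int index) {
        if (excPositions.length != 0) {
            throw new IllegalStateException("block has exceptions");
        }
        return slots[index];
    }
}
```

In this sketch a block without exceptions can answer get(i) from the slot alone, which is
what an upper tree level would need, while a leaf with exceptions has to be decoded and
patched as a whole before its values can be trusted.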

For terms that will occur at most once in one document, more compression is possible, so it
might be worthwhile to add these as a key. At the moment I have no idea how to enforce the
restriction of at most once, though.



> PFOR implementation
> -------------------
>
>                 Key: LUCENE-1410
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1410
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Other
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: autogen.tgz, LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, TestPFor2.java
>
>   Original Estimate: 21840h
>  Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

