lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4599) Compressed term vectors
Date Sat, 08 Dec 2012 16:09:21 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527180#comment-13527180 ]

Michael McCandless commented on LUCENE-4599:
--------------------------------------------

bq. Does it make sense to put this in an FST where the key is the term bytes and the value
is what you're doing now for the positions, offsets, and payloads in a byte array? 

That's a neat idea :)  We should [almost] just be able to use MemoryPostingsFormat, since
it already stores all postings in an FST.

bq. I think a FST would not compress as much as what LZ4 or Deflate can do? But maybe it could
speed up TermsEnum.seekCeil on large documents so it might be an interesting idea regarding
random access speed?

Likely it would not compress as well, since LZ4/Deflate can also share common infix fragments,
whereas an FST only shares prefixes and suffixes.  It'd be interesting to test ... but we should
explore this (FST-backed TermVectorsFormat) in a new issue I think ... this issue seems awesome
enough already :)
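
To make the infix point concrete, here is a rough sketch using only the JDK's Deflate implementation (java.util.zip.Deflater), no Lucene code. The term list is made up for illustration: every term shares the infix "termvector" but has a varying first and last character, which is the kind of redundancy a general-purpose compressor can exploit. The class and method names are hypothetical, and this shows only the Deflate side; it is not a measurement of FST size.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class InfixSharingSketch {

    /** Builds a made-up term list in which every term shares the infix "termvector". */
    static byte[] sampleTerms() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            char prefix = (char) ('a' + (i % 26));        // varying first character
            char suffix = (char) ('a' + ((i * 7) % 26));  // varying last character
            sb.append(prefix).append("termvector").append(suffix).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Returns the number of bytes Deflate needs for the given input. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf); // count the compressed bytes produced
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] raw = sampleTerms();
        System.out.println("raw bytes:      " + raw.length);
        System.out.println("deflated bytes: " + deflatedSize(raw));
    }
}
```

The repeated middle fragment is replaced by back-references, which is why the compressed size ends up well below the raw size even though the terms share neither a prefix nor a suffix.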

bq. Or... can we simply reference the terms by ord (an int) instead of writing each term bytes?

Using ords matching the main terms dict is a neat idea too!  It would be much more compact
... but, when reading the term vectors we'd need to resolve-by-ord against the main terms
dictionary (not all postings formats support that: it's optional, and e.g. our default
postings format doesn't), which would likely be slower than today.
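
As a toy illustration of the trade-off, the sketch below stores a document's term vector as ords into a sorted terms dictionary. A plain sorted String[] stands in for the segment's main terms dictionary; that array and all names here are hypothetical and do not correspond to Lucene classes. Writing is compact (one int per term), but every read has to go back through the dictionary.

```java
import java.util.Arrays;

public class OrdTermVectorSketch {

    // Hypothetical stand-in for a segment's main terms dictionary: sorted, unique terms.
    static final String[] TERMS_DICT = {"codec", "compress", "lucene", "term", "vector"};

    /** Encodes a document's terms as ords (positions in the sorted dictionary). */
    static int[] encode(String[] docTerms) {
        int[] ords = new int[docTerms.length];
        for (int i = 0; i < docTerms.length; i++) {
            int ord = Arrays.binarySearch(TERMS_DICT, docTerms[i]);
            if (ord < 0) {
                throw new IllegalArgumentException("term not in dictionary: " + docTerms[i]);
            }
            ords[i] = ord;
        }
        return ords;
    }

    /** Decodes ords back to terms; each lookup is a resolve-by-ord against the dictionary. */
    static String[] decode(int[] ords) {
        String[] terms = new String[ords.length];
        for (int i = 0; i < ords.length; i++) {
            terms[i] = TERMS_DICT[ords[i]];
        }
        return terms;
    }

    public static void main(String[] args) {
        int[] ords = encode(new String[] {"lucene", "term"});
        System.out.println(Arrays.toString(ords));
        System.out.println(Arrays.toString(decode(ords)));
    }
}
```

In this toy the resolve-by-ord step is an array index, so it is cheap. The concern in the comment above is that a real terms dictionary may not offer ord lookup at all, or may make it considerably more expensive than reading the term bytes inline.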

bq. Is that information available somewhere when writing/merging term vectors?

Unfortunately, no.  We only assign ords when it's time to flush the segment ... but we write
term vectors "live" as we index each document.  If we changed that, e.g. by buffering term
vectors until flush, then we could get the ords when we wrote them.
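
A minimal sketch of the buffering idea, under the assumption that all term vectors for a segment fit in memory until flush. This is plain JDK code with hypothetical names, not how Lucene's indexing chain is structured: documents are buffered as term strings, and only at flush time, once the full sorted term set is known, are ords assigned and the ord-encoded vectors produced.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class BufferedTermVectorsSketch {

    private final List<String[]> bufferedDocs = new ArrayList<>(); // term vectors held until flush
    private final TreeSet<String> segmentTerms = new TreeSet<>();  // sorted unique terms seen so far

    /** Buffers one document's terms instead of writing them out immediately. */
    void addDocument(String... terms) {
        bufferedDocs.add(terms.clone());
        for (String term : terms) {
            segmentTerms.add(term);
        }
    }

    /** At flush time the term set is final, so ords can be assigned and vectors encoded. */
    List<int[]> flush() {
        Map<String, Integer> ordOf = new HashMap<>();
        int nextOrd = 0;
        for (String term : segmentTerms) { // TreeSet iterates in sorted order
            ordOf.put(term, nextOrd++);
        }
        List<int[]> encoded = new ArrayList<>();
        for (String[] doc : bufferedDocs) {
            int[] ords = new int[doc.length];
            for (int i = 0; i < doc.length; i++) {
                ords[i] = ordOf.get(doc[i]);
            }
            encoded.add(ords);
        }
        bufferedDocs.clear();
        segmentTerms.clear();
        return encoded;
    }

    public static void main(String[] args) {
        BufferedTermVectorsSketch writer = new BufferedTermVectorsSketch();
        writer.addDocument("term", "vector");
        writer.addDocument("compress", "term");
        for (int[] ords : writer.flush()) {
            System.out.println(java.util.Arrays.toString(ords));
        }
    }
}
```

The cost of this design is memory: nothing can be written until flush, whereas writing "live" lets each document's term vectors leave the heap as soon as the document is indexed.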
                
> Compressed term vectors
> -----------------------
>
>                 Key: LUCENE-4599
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4599
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs, core/termvectors
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>
>         Attachments: LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with stored fields.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

