lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-2810) Explore Alternate Stored Field approaches for highly redundant data
Date Mon, 08 Oct 2012 09:46:04 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471473#comment-13471473 ]

Adrien Grand commented on LUCENE-2810:
--------------------------------------

[~gsingers], I think we can close this issue given that LUCENE-4226 just got committed. Are
you OK with that?
                
> Explore Alternate Stored Field approaches for highly redundant data
> -------------------------------------------------------------------
>
>                 Key: LUCENE-2810
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2810
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>
> In some cases (logs, HTML pages with boilerplate, etc.), the stored fields for documents
> contain a lot of redundant information and end up wasting a lot of space across a large
> collection of documents. For instance, simply compressing a typical log file often results
> in compression rates above 75%. We should explore mechanisms for applying compression
> across all the documents for a field (or fields) while still maintaining relatively fast
> lookup (that said, in most logging applications, fast retrieval of a given event is not
> always critical). For instance, perhaps it is possible to have a part of storage that
> contains the set of unique values for all the fields, so that the document field value
> simply contains a reference to that value (possibly as small as a few bits, depending on
> the number of unique items) instead of a full copy. Extending this, perhaps we can also
> leverage some existing compression capabilities in Java to provide this.
> It may make sense to implement this as a Directory, but it might also make sense as a
> Codec, if and when we have support for changing storage Codecs.
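
To make the "set of unique values plus a small reference" idea from the description concrete,
here is a minimal in-memory sketch in plain Java. It is only an illustration under stated
assumptions: the class DictionaryEncodedField and its methods are hypothetical names, not
Lucene API, and this is not necessarily the approach committed under LUCENE-4226. Each
distinct value is stored once, and each document keeps only an integer ordinal pointing at it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not a Lucene class.
public class DictionaryEncodedField {
    // Maps each distinct value to its ordinal (position in 'values').
    private final Map<String, Integer> ordinals = new HashMap<>();
    // The set of unique values, stored once each.
    private final List<String> values = new ArrayList<>();
    // One ordinal per document, in document order.
    private final List<Integer> refs = new ArrayList<>();

    // Stores a value for the next document and returns that document's id.
    public int add(String value) {
        Integer ord = ordinals.get(value);
        if (ord == null) {
            ord = values.size();
            ordinals.put(value, ord);
            values.add(value);
        }
        refs.add(ord);
        return refs.size() - 1;
    }

    // Resolves a document's reference back to the full value.
    public String get(int docId) {
        return values.get(refs.get(docId));
    }

    public int uniqueValueCount() {
        return values.size();
    }

    // Bits needed per reference if ordinals were bit-packed: ceil(log2(n)), at least 1.
    public int bitsPerReference() {
        int n = values.size();
        return n <= 1 ? 1 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }
}
```

With, say, five distinct log levels across millions of events, each document would need only
3 bits for its reference in a bit-packed layout. The sketch keeps ordinals in a List<Integer>
for readability rather than actually packing them.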

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

