accumulo-notifications mailing list archives

From "Adam Fuchs (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ACCUMULO-1787) support two tier compression codec configuration
Date Fri, 18 Oct 2013 15:00:54 GMT
Adam Fuchs created ACCUMULO-1787:
------------------------------------

             Summary: support two tier compression codec configuration
                 Key: ACCUMULO-1787
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1787
             Project: Accumulo
          Issue Type: Improvement
            Reporter: Adam Fuchs


With our current configuration of one compression codec per table, we can lean
towards performance with something like snappy or towards a smaller footprint with something
like gzip. With a change to the way we configure codecs, we might be able to approach the best
of both worlds. Consider the difference between the files written by major or minor
compactions over time and the files that exist at any given point in time. For a smaller footprint
on disk we care about the latter, but for total CPU usage over time we care about the former. The
two distributions are distinct because Accumulo deletes files after major compactions. If
we can tell at write time whether a file is going to be long-lived, then we can
pick the compression codec that optimizes the relevant concern.

One way to distinguish is by file size. Accumulo writes many small files and later major compacts
them away, so the distribution of written files is skewed towards smaller files, while the
distribution of files existing at any point in time is skewed towards larger files. I recommend
that for each table we support a general compression codec and a second codec for files under a
configurable size.
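
A minimal sketch of the selection rule proposed above, written as standalone Java. The
class name, method name, codec strings and the 64 MB threshold are all hypothetical and are
not existing Accumulo APIs or properties. It also assumes the writer has a size estimate
(for example, from the compaction inputs) before the file is written.

```java
/**
 * Illustrative two-tier codec chooser. All names here are hypothetical;
 * this is not part of Accumulo.
 */
public class TwoTierCodecChooser {
    private final String generalCodec;          // codec for large, long-lived files
    private final String smallFileCodec;        // codec for small, short-lived files
    private final long smallFileThresholdBytes; // files below this size use smallFileCodec

    public TwoTierCodecChooser(String generalCodec, String smallFileCodec,
                               long smallFileThresholdBytes) {
        if (smallFileThresholdBytes < 0) {
            throw new IllegalArgumentException("threshold must be non-negative");
        }
        this.generalCodec = generalCodec;
        this.smallFileCodec = smallFileCodec;
        this.smallFileThresholdBytes = smallFileThresholdBytes;
    }

    /**
     * Picks a codec from the estimated size of the file about to be written.
     * Assumption: the caller can estimate the size up front.
     */
    public String chooseCodec(long estimatedFileSizeBytes) {
        return estimatedFileSizeBytes < smallFileThresholdBytes
                ? smallFileCodec
                : generalCodec;
    }

    public static void main(String[] args) {
        // Hypothetical per-table settings: gz in general, snappy under 64 MB.
        TwoTierCodecChooser chooser =
                new TwoTierCodecChooser("gz", "snappy", 64L * 1024 * 1024);
        System.out.println(chooser.chooseCodec(2L * 1024 * 1024));        // small flush
        System.out.println(chooser.chooseCodec(4L * 1024 * 1024 * 1024)); // large compaction
    }
}
```

With a threshold of 0 the chooser always returns the general codec, which would match
today's single-codec behavior.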



--
This message was sent by Atlassian JIRA
(v6.1#6144)