accumulo-notifications mailing list archives

From "Christopher Tubbs (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ACCUMULO-1787) support two tier compression codec configuration
Date Tue, 30 Aug 2016 18:00:24 GMT

     [ https://issues.apache.org/jira/browse/ACCUMULO-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Tubbs updated ACCUMULO-1787:
----------------------------------------
    Fix Version/s: 2.0.0

> support two tier compression codec configuration
> ------------------------------------------------
>
>                 Key: ACCUMULO-1787
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1787
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Adam Fuchs
>            Assignee: Michael Miller
>             Fix For: 2.0.0
>
>         Attachments: AccumuloWatcher.java, ci_file_sizes.png, hybrid.diff, hybrid2.diff
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Given our current configuration of one compression codec per table, we have the option
of leaning towards performance with something like snappy or leaning towards a smaller footprint
with something like gzip. With a change to the way we configure codecs, we might be able to
approach the best of both worlds. Consider the difference between files that have been written
by major or minor compactions and files that exist at any given point in time. For a smaller
footprint on disk we care about the latter, but for total CPU usage over time we care about
the former. The two distributions are distinct because Accumulo deletes files after major
compactions. If we can figure out, at the time we write a file, whether it is going to be
long-lived, then we can pick the compression codec that optimizes the relevant concern.
> One way to distinguish is by file size. Accumulo writes many small files and later major
compacts those away, so the distribution of written files is skewed towards smaller files,
while the distribution of files existing at any point in time is skewed towards larger files.
I recommend that, for each table, we support a general compression codec and a second codec
for files under a configurable size.
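
The size-based selection proposed above could be sketched roughly as follows. This is
only an illustration, not code from Accumulo or from the attached patches: the class name
TwoTierCodecChooser, its fields, and the codec name strings are hypothetical, and it
assumes the writer can estimate the output file size before the file is opened.

```java
/**
 * Hypothetical sketch of two-tier codec selection. Not part of Accumulo;
 * all names here are assumptions made for illustration.
 */
final class TwoTierCodecChooser {
    private final String generalCodec;         // e.g. "gz": smaller footprint for long-lived files
    private final String smallFileCodec;       // e.g. "snappy": cheaper CPU for short-lived files
    private final long smallFileThresholdBytes; // the configurable size boundary

    TwoTierCodecChooser(String generalCodec, String smallFileCodec, long smallFileThresholdBytes) {
        if (generalCodec == null || smallFileCodec == null) {
            throw new IllegalArgumentException("codec names must not be null");
        }
        if (smallFileThresholdBytes < 0) {
            throw new IllegalArgumentException("threshold must be non-negative");
        }
        this.generalCodec = generalCodec;
        this.smallFileCodec = smallFileCodec;
        this.smallFileThresholdBytes = smallFileThresholdBytes;
    }

    /**
     * Returns the small-file codec for files strictly under the threshold,
     * and the general codec otherwise. The size is an estimate supplied by
     * the caller (an assumption; how to estimate it is out of scope here).
     */
    String codecFor(long estimatedFileSizeBytes) {
        return estimatedFileSizeBytes < smallFileThresholdBytes ? smallFileCodec : generalCodec;
    }

    public static void main(String[] args) {
        // Assumed example values: 100 MiB threshold, gz for large files, snappy for small ones.
        TwoTierCodecChooser chooser = new TwoTierCodecChooser("gz", "snappy", 100L * 1024 * 1024);
        System.out.println(chooser.codecFor(5L * 1024 * 1024));        // small minor-compaction output
        System.out.println(chooser.codecFor(2L * 1024 * 1024 * 1024)); // large major-compaction output
    }
}
```

With the threshold set to zero, every file would get the general codec, which matches
today's single-codec behavior.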



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
