Date: Mon, 21 Oct 2013 17:22:42 +0000 (UTC)
From: "Keith Turner (JIRA)"
To: notifications@accumulo.apache.org
Reply-To: jira@apache.org
Subject: [jira] [Commented] (ACCUMULO-1787) support two tier compression codec configuration

    [ https://issues.apache.org/jira/browse/ACCUMULO-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800832#comment-13800832 ]

Keith Turner commented on ACCUMULO-1787:
----------------------------------------

One option would be to give CompactionStrategy (ACCUMULO-1451) control over which compression library is used for output files. You could then write a compaction strategy that uses different compression algorithms at different times.

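To make the suggestion above concrete, here is a minimal sketch of a strategy that chooses the output codec itself. Everything in it is hypothetical: `CompactionPlan`, `CodecChoosingStrategy`, and `FullCompactionGzipStrategy` are illustrative names, not the CompactionStrategy API from ACCUMULO-1451. The rule it applies (gzip only when a compaction merges every file in the tablet) is just one assumed example of picking different algorithms at different times.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class CompactionPlan:
    """Hypothetical description of one compaction; not an Accumulo type."""
    input_files: int            # files this compaction will merge
    total_files_in_tablet: int  # files the tablet holds right now


class CodecChoosingStrategy(Protocol):
    """Hypothetical hook: the strategy names the codec for the output file."""

    def output_codec(self, plan: CompactionPlan) -> str:
        ...


class FullCompactionGzipStrategy:
    """Assumed policy: gzip for full compactions, snappy for everything else.

    The reasoning is that the output of a full compaction is likely to be
    long-lived, so disk footprint matters more than CPU cost for it.
    """

    def output_codec(self, plan: CompactionPlan) -> str:
        is_full = plan.input_files == plan.total_files_in_tablet
        return "gzip" if is_full else "snappy"


strategy = FullCompactionGzipStrategy()
partial_codec = strategy.output_codec(CompactionPlan(3, 10))
full_codec = strategy.output_codec(CompactionPlan(10, 10))
```

With this shape, the compression policy lives entirely in the strategy, so other policies could be swapped in without touching table configuration.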
> support two tier compression codec configuration
> ------------------------------------------------
>
>                 Key: ACCUMULO-1787
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1787
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Adam Fuchs
>         Attachments: ci_file_sizes.png
>
>
> Given our current configuration of one compression codec per table we have the option of leaning towards performance with something like snappy or leaning towards smaller footprint with something like gzip. With a change to the way we configure codecs we might be able to approach the best of both worlds. Consider the difference between files that have been written by major or minor compactions and files that exist at any given point in time. For better footprint on disk we care about the latter, but for total CPU usage over time we care about the former. The two distributions are distinct because Accumulo deletes files after major compactions. If we figure out whether a file is going to be long-lived at the time we write it then we can pick the compression codec that optimizes the relevant concern.
> One way to distinguish is by file size. Accumulo writes many small files and later major compacts those away, so the distribution of written files is skewed towards smaller files while the distribution of files existing at any point in time is skewed towards larger files. I recommend for each table we support a general compression codec and a second codec for files under a configurable size.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
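The two-tier proposal in the issue description can be sketched as a per-table configuration with a general codec, a small-file codec, and a size threshold. This is only an illustration under stated assumptions: the field names, the default values, and the idea of using an estimated output size (the codec has to be chosen before the file is written) are all hypothetical and are not existing Accumulo properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoTierCodecConfig:
    """Hypothetical per-table settings; names and defaults are assumptions."""
    general_codec: str = "gzip"          # large, likely long-lived files
    small_file_codec: str = "snappy"     # small, likely short-lived files
    small_file_threshold_bytes: int = 64 * 1024 * 1024


def choose_codec(config: TwoTierCodecConfig, estimated_output_bytes: int) -> str:
    """Pick the codec for a file about to be written.

    Files estimated to be under the threshold get the small-file codec;
    everything else gets the general codec.
    """
    if estimated_output_bytes < config.small_file_threshold_bytes:
        return config.small_file_codec
    return config.general_codec


config = TwoTierCodecConfig()
flush_codec = choose_codec(config, 8 * 1024 * 1024)     # small flush output
merged_codec = choose_codec(config, 2 * 1024 ** 3)      # large merged file
```

Here a small minor-compaction output would be written with the fast codec, while a multi-gigabyte major-compaction output would get the codec with the smaller footprint. How the output size is estimated (for example, from the sizes of the input files) is left open.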