From: Eric Newton
To: "user@accumulo.apache.org"
Date: Tue, 29 Oct 2013 23:05:11 -0400
Subject: Re: sum of mutation.numBytes() significantly different from rfile size

For comparison, I posted this some time ago:

http://tinyurl.com/k28bkbg

I was surprised that RFile was smaller than a gzip'd CSV file, too.

On Tue, Oct 29, 2013 at 6:35 PM, Keith Turner wrote:
> On Tue, Oct 29, 2013 at 5:50 PM, Slater, David M. wrote:
>> Hello,
>>
>> I'm seeing about an order of magnitude difference between the number of
>> bytes returned by mutation.numBytes() and the size of the rfiles on disk
>> (Accumulo 1.4.2). Note that all of my mutations are new entries, and there
>> are no combiners running.
>>
>> While I understand that there is some compression on the rfile, I would be
>> really surprised if it was 10:1.
>>
>> My entries are composed of a row ID (most of which is equivalent to the
>> previous row ID), an empty column family, a nonempty column qualifier (which
>> likely shares a lot with the previous qualifier), and an empty value. An
>> example of the rowID and column qualifier might be:
>
> In 1.4 if a field (row, col fam, etc) in key is the same as the previous,
> then it's not written again. So if the row is the same in 10 consecutive
> keys, it's only written once. Maybe this explains the difference. Scan the
> table to make sure all of the data you expect to be there is there.
>
>> (forward table)
>>
>> 0000000000000|9|fa19 IP|127.000.000.001
>> 0000000000000|9|fa19 PORT|00080
>> …
>> 0000000000000|9|fa22 IP|128.032.144.139
>> …
>> || |
>>
>> OR
>>
>> (reverse table)
>>
>> 0000000000000|IP|127.000.000.001 fa19
>> 0000000000000|IP|127.000.000.001 fd02
>> 0000000000000|IP|127.000.000.002 123
>> …
>> 0000000000000|PORT|00080 fa19
>>
>> The numBytes() method appears to return a number of bytes equal to the
>> string length of the row ID and column qualifiers, plus 26 * # of column
>> qualifiers.
>>
>> Is there something else that I'm missing, or would this possibly compress
>> by that much?
>>
>> Thanks,
>>
>> David
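
To get a rough feel for how much the repeated-field rule quoted above could
account for, here is a minimal Python sketch. It is not Accumulo code and does
not reproduce the RFile format: the function names are hypothetical, the 26-byte
per-entry overhead is only the figure reported in this thread for numBytes(),
and the model ignores whatever per-key bookkeeping and block compression the
real file adds. It compares the "raw" size of a few sorted entries with the
size left when a key field identical to the previous entry's is not written
again.

```python
# Hypothetical model, not Accumulo code: compare the raw size of sorted
# entries with the size left after skipping key fields that repeat.

# Per-entry overhead as observed in the thread for numBytes();
# an assumption for illustration, not a documented constant.
PER_ENTRY_OVERHEAD = 26


def raw_size(entries):
    """Sum of all field lengths plus the assumed per-entry overhead."""
    return sum(
        len(row) + len(cf) + len(cq) + len(value) + PER_ENTRY_OVERHEAD
        for row, cf, cq, value in entries
    )


def elided_size(entries):
    """Bytes written if a key field equal to the previous entry's is skipped."""
    total = 0
    prev = None
    for entry in entries:
        row, cf, cq, value = entry
        for i, field in enumerate((row, cf, cq)):
            # Write the field only when it differs from the previous key.
            if prev is None or prev[i] != field:
                total += len(field)
        total += len(value)  # values are always written in this model
        prev = entry
    return total


# Forward-table rows from the thread: (row, column family, qualifier, value).
entries = [
    ("0000000000000|9|fa19", "", "IP|127.000.000.001", ""),
    ("0000000000000|9|fa19", "", "PORT|00080", ""),
    ("0000000000000|9|fa22", "", "IP|128.032.144.139", ""),
]

if __name__ == "__main__":
    print("raw:", raw_size(entries), "elided:", elided_size(entries))
```

With rows that repeat across many consecutive keys, most of the row bytes drop
out of the second number before any general-purpose compression is applied,
which is the direction of the effect described above. Scanning the table, as
suggested, is still the way to confirm that no data is actually missing.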