From: "Slater, David M."
To: "user@accumulo.apache.org"
Date: Tue, 29 Oct 2013 17:50:08 -0400
Subject: sum of mutation.numBytes() significantly different from rfile size

Hello,

I'm seeing about an order of magnitude difference between the number of bytes returned by mutation.numBytes() and the size of the rfiles on disk (Accumulo 1.4.2). Note that all of my mutations are new entries, and there are no combiners running.

While I understand that there is some compression on the rfile, I would be really surprised if it was 10:1.

My entries are composed of a row ID (most of which is equivalent to the previous row ID), an empty column family, a nonempty column qualifier (which likely shares a lot with the previous qualifier), and an empty value. An example of the rowID and column qualifier might be:

(forward table)
0000000000000|9|fa19        IP|127.000.000.001
0000000000000|9|fa19        PORT|00080
...
0000000000000|9|fa22        IP|128.032.144.139
...
<timeblock>|<hash>|<uid>    <index>|<textual value>

OR

(reverse table)
0000000000000|IP|127.000.000.001    fa19
0000000000000|IP|127.000.000.001    fd02
0000000000000|IP|127.000.000.002    123
...
0000000000000|PORT|00080            fa19

The numBytes() method appears to return a number of bytes equal to the string length of the row ID and column qualifiers, plus 26 * # of column qualifiers.

Is there something else that I'm missing, or would this possibly compress by that much?

Thanks,
David

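
A rough way to sanity-check the 10:1 figure is sketched below. It is only an illustration under stated assumptions: the 26-byte per-qualifier term is taken from the observation above rather than from the Accumulo source, the data is synthetic and merely shaped like the forward-table example, and zlib over flat sorted text stands in for whatever the rfile actually does on disk. The function names are hypothetical.

```python
import zlib

# Per-qualifier overhead as observed in the message above (an observation,
# not a documented constant).
PER_QUALIFIER_OVERHEAD = 26


def estimated_num_bytes(row, qualifiers):
    """Observed rule: len(row) + sum(len(cq)) + 26 * number of qualifiers."""
    return (len(row)
            + sum(len(cq) for cq in qualifiers)
            + PER_QUALIFIER_OVERHEAD * len(qualifiers))


def synthetic_forward_table(n_uids=2000):
    """Hypothetical mutations shaped like the forward-table example."""
    mutations = []
    for uid in range(n_uids):
        row = "0" * 13 + "|9|%04x" % uid
        qualifiers = ["IP|127.000.000.%03d" % (uid % 256), "PORT|00080"]
        mutations.append((row, qualifiers))
    return mutations


def compression_ratio(mutations):
    """Ratio of raw sorted key text to its zlib-compressed size.

    This ignores the real rfile layout (blocks, index, codec settings);
    it only shows how well highly repetitive sorted keys can compress.
    """
    lines = sorted(row + "\t" + cq
                   for row, qualifiers in mutations
                   for cq in qualifiers)
    raw = "\n".join(lines).encode("ascii")
    return len(raw) / len(zlib.compress(raw))


if __name__ == "__main__":
    data = synthetic_forward_table()
    total = sum(estimated_num_bytes(row, cqs) for row, cqs in data)
    print("estimated numBytes total:", total)
    print("raw text / compressed:", round(compression_ratio(data), 1))
```

Applied to the first sample mutation (row 0000000000000|9|fa19 with the IP and PORT qualifiers), the observed rule gives 20 + 18 + 10 + 2 * 26 = 100 bytes, so more than half of the estimate comes from the per-qualifier term rather than from the key text itself. The compression ratio printed by the sketch is measured against the key text only, so the gap between the numBytes() sum and the compressed size would be larger still.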