accumulo-user mailing list archives

From "Slater, David M." <David.Sla...@jhuapl.edu>
Subject RE: sum of mutation.numBytes() significantly different from rfile size
Date Wed, 30 Oct 2013 15:47:25 GMT
Comparing the rfiles with compressed CSV files, the results do make sense now.

Thanks,
David

-----Original Message-----
From: Eric Newton [mailto:eric.newton@gmail.com] 
Sent: Tuesday, October 29, 2013 11:05 PM
To: user@accumulo.apache.org
Subject: Re: sum of mutation.numBytes() significantly different from rfile size

For comparison, I posted this some time ago:

http://tinyurl.com/k28bkbg

I was surprised that RFile was smaller than a gzip'd CSV file, too.

On Tue, Oct 29, 2013 at 6:35 PM, Keith Turner <keith@deenlo.com> wrote:
>
>
>
> On Tue, Oct 29, 2013 at 5:50 PM, Slater, David M. 
> <David.Slater@jhuapl.edu>
> wrote:
>>
>> Hello,
>>
>>
>>
>> I'm seeing about an order of magnitude difference between the number 
>> of bytes returned by mutation.numBytes() and the size of the rfiles 
>> on disk (Accumulo 1.4.2). Note that all of my mutations are new 
>> entries, and there are no combiners running.
>>
>>
>>
>> While I understand that there is some compression on the rfile, I 
>> would be really surprised if it was 10:1.
>>
>>
>>
>> My entries are composed of a row ID (most of which is equivalent to 
>> the previous row ID), an empty column family, a nonempty column 
>> qualifier (which likely shares a lot with the previous qualifier), 
>> and an empty value. An example of the rowID and column qualifier might be:
>
>
> In 1.4, if a field (row, col fam, etc.) in a key is the same as in the
> previous key, then it's not written again.  So if the row is the same in 10
> consecutive keys, it's only written once.  Maybe this explains the difference.
> Scan the table to make sure all of the data you expect to be there is there.
>
>>
>>
>>
>> (forward table)
>>
>> 0000000000000|9|fa19                 IP|127.000.000.001
>>
>> 0000000000000|9|fa19                  PORT|00080
>>
>> ...
>>
>> 0000000000000|9|fa22                  IP|128.032.144.139
>>
>> ...
>>
>> <timeblock>|<hash>|<uid>       <index>|<textual value>
>>
>>
>>
>> OR
>>
>> (reverse table)
>>
>> 0000000000000|IP|127.000.000.001         fa19
>>
>> 0000000000000|IP|127.000.000.001         fd02
>>
>> 0000000000000|IP|127.000.000.002         123
>>
>> ...
>>
>> 0000000000000|PORT|00080                      fa19
>>
>>
>>
>> The numBytes() method appears to return a number of bytes equal to 
>> the string length of the row ID and column qualifiers, plus 26 bytes 
>> per column qualifier.
>>
>>
>>
>> Is there something else that I'm missing, or would this possibly 
>> compress by that much?
>>
>>
>>
>> Thanks,
>>
>> David
>
>

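For anyone following this thread later, here is a rough Python sketch of the arithmetic discussed above. It is a toy model, not the actual RFile encoding: it only counts key bytes, ignores any per-key flag or length bytes, and assumes a field is skipped entirely when it matches the previous key (as Keith describes for 1.4). The 26-byte per-column overhead is David's observation about numBytes(), not a documented constant. All function names are made up for illustration, and the sample keys are the three forward-table entries quoted above.

```python
import zlib

# Sample keys from the forward-table example in the thread:
# (row, column family, column qualifier); the column family is empty.
ROW_A = "0" * 13 + "|9|fa19"
ROW_B = "0" * 13 + "|9|fa22"
KEYS = [
    (ROW_A, "", "IP|127.000.000.001"),
    (ROW_A, "", "PORT|00080"),
    (ROW_B, "", "IP|128.032.144.139"),
]


def observed_num_bytes(row, qualifiers, per_column_overhead=26):
    """Size of one mutation per David's observation (an assumption):
    row length + qualifier lengths + 26 bytes per column qualifier."""
    return (len(row) + sum(len(q) for q in qualifiers)
            + per_column_overhead * len(qualifiers))


def raw_key_bytes(keys):
    """Bytes needed if every field of every key is written in full."""
    return sum(len(field) for key in keys for field in key)


def elided_key_bytes(keys):
    """Bytes written when a field equal to the same field in the
    previous key is skipped (toy model of the behaviour Keith describes)."""
    total = 0
    prev = None
    for key in keys:
        for i, field in enumerate(key):
            if prev is None or field != prev[i]:
                total += len(field)
        prev = key
    return total


def zlib_bytes(keys):
    """Size of the keys after generic zlib compression, as a stand-in
    for block compression (the real codec and block layout differ)."""
    text = "\n".join("\t".join(key) for key in keys)
    return len(zlib.compress(text.encode("ascii")))


if __name__ == "__main__":
    mutations = [
        (ROW_A, ["IP|127.000.000.001", "PORT|00080"]),
        (ROW_B, ["IP|128.032.144.139"]),
    ]
    print("sum of observed numBytes:",
          sum(observed_num_bytes(r, q) for r, q in mutations))
    print("raw key bytes:           ", raw_key_bytes(KEYS))
    print("with repeated fields cut:", elided_key_bytes(KEYS))
    # Repeat the sample to mimic a larger, highly repetitive table.
    print("zlib on 200 repetitions: ", zlib_bytes(KEYS * 200))
```

Even on this tiny sample the per-column overhead counted by numBytes() is larger than the key data itself, and both field elision and generic compression shrink the repetitive keys further, which is consistent with the gap David reports.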