accumulo-user mailing list archives

From Eric Newton <eric.new...@gmail.com>
Subject Re: sum of mutation.numBytes() significantly different from rfile size
Date Wed, 30 Oct 2013 03:05:11 GMT
For comparison, I posted this some time ago:

http://tinyurl.com/k28bkbg

I was surprised that RFile was smaller than a gzip'd CSV file, too.

On Tue, Oct 29, 2013 at 6:35 PM, Keith Turner <keith@deenlo.com> wrote:
>
>
>
> On Tue, Oct 29, 2013 at 5:50 PM, Slater, David M. <David.Slater@jhuapl.edu>
> wrote:
>>
>> Hello,
>>
>>
>>
>> I’m seeing about an order of magnitude difference between the number of
>> bytes returned by mutation.numBytes() and the size of the rfiles on disk
>> (Accumulo 1.4.2). Note that all of my mutations are new entries, and there
>> are no combiners running.
>>
>>
>>
>> While I understand that there is some compression on the rfile, I would be
>> really surprised if it was 10:1.
>>
>>
>>
>> My entries are composed of a row ID (most of which is equivalent to the
>> previous row ID), an empty column family, a nonempty column qualifier (which
>> likely shares a lot with the previous qualifier), and an empty value. An
>> example of the rowID and column qualifier might be:
>
>
> In 1.4, if a field (row, col fam, etc.) in a key is the same as in the
> previous key, then it's not written again.  So if the row is the same in 10
> consecutive keys, it's only written once.  Maybe this explains the difference.
> Scan the table to make sure all of the data you expect to be there is there.
>
>>
>>
>>
>> (forward table)
>>
>> 0000000000000|9|fa19                 IP|127.000.000.001
>>
>> 0000000000000|9|fa19                  PORT|00080
>>
>> …
>>
>> 0000000000000|9|fa22                  IP|128.032.144.139
>>
>> …
>>
>> <timeblock>|<hash>|<uid>       <index>|<textual value>
>>
>>
>>
>> OR
>>
>> (reverse table)
>>
>> 0000000000000|IP|127.000.000.001         fa19
>>
>> 0000000000000|IP|127.000.000.001         fd02
>>
>> 0000000000000|IP|127.000.000.002         123
>>
>> …
>>
>> 0000000000000|PORT|00080                      fa19
>>
>>
>>
>> The numBytes() method appears to return a number of bytes equal to the
>> string length of the row ID and column qualifiers, plus 26 * # of column
>> qualifiers.
>>
>>
>>
>> Is there something else that I’m missing, or would this possibly compress
>> by that much?
>>
>>
>>
>> Thanks,
>>
>> David
>
>
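
To put rough numbers on Keith's point, here is a minimal sketch in Python using David's three forward-table entries. It is a toy model only, not the real RFile encoding: `approx_num_bytes` just applies the formula David observed for numBytes() (row length plus qualifier lengths plus 26 per qualifier), and `approx_elided_bytes` counts a key field only when it differs from the same field in the previous key. Both function names and the one-mutation-per-row grouping are assumptions made for illustration. Block compression, which the real files also get, is ignored here.

```python
# Toy model only: this is NOT the actual RFile format. It illustrates the idea
# that a key field identical to the previous key's field is not written again.

PER_QUALIFIER_OVERHEAD = 26  # per-qualifier overhead David observed in numBytes()

def approx_num_bytes(entries):
    """Approximate the summed numBytes() using David's observed formula.

    entries: list of (row, qualifier) strings. Assumes one mutation per
    distinct row, with an empty column family and an empty value.
    """
    by_row = {}
    for row, qual in entries:
        by_row.setdefault(row, []).append(qual)
    return sum(
        len(row) + sum(len(q) + PER_QUALIFIER_OVERHEAD for q in quals)
        for row, quals in by_row.items()
    )

def approx_elided_bytes(entries):
    """Count key bytes when a field equal to the previous key's field is skipped."""
    total = 0
    prev_row = prev_qual = None
    for row, qual in sorted(entries):  # keys are stored in sorted order
        if row != prev_row:
            total += len(row)
        if qual != prev_qual:
            total += len(qual)
        prev_row, prev_qual = row, qual
    return total

# The three forward-table entries from David's message.
sample = [
    ("0000000000000|9|fa19", "IP|127.000.000.001"),
    ("0000000000000|9|fa19", "PORT|00080"),
    ("0000000000000|9|fa22", "IP|128.032.144.139"),
]

if __name__ == "__main__":
    print("numBytes-style estimate:", approx_num_bytes(sample))
    print("with repeated fields elided:", approx_elided_bytes(sample))
```

Even on this tiny sample the elided count is roughly half the numBytes-style estimate, before any compression is applied. With many more entries per row and long shared prefixes that compress well, a much larger ratio seems plausible, though only scanning the table will confirm that all the data is actually there.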
