Accumulo sorts keys and then compresses them in blocks. Each block is
compressed without any information external to the block. You're going to
see different compression ratios depending on the relative entropy of the
keys inside each compressed block. For the purposes of discussion, let's
simplify this to two elements in the row ID: AB or BA.
Suppose set A is {0,1,2,3,4}, and set B is {maroon,orange,purple,yellow}.
We can make the following keys:
0maroon
0orange
0purple
0yellow
1maroon
...
and
maroon0
maroon1
maroon2
maroon3
orange0
orange1
...
Now, let's also assume that a block fits 4 keys in it. In the AB case the
first block has to represent the following:
{0,maroon,orange,purple,yellow}
In the BA case the first block has to represent the following:
{maroon,0,1,2,3}
The block in the AB case has higher relative entropy, since the B set
contains more information and the entire B set has to be represented in the
block. You can see this visually: the string representation of the
information in the AB block is twice as long as the string representation
of the information in the BA block. This is admittedly a crude example, but
hopefully it helps you see some of the elements that contribute to the
compression ratio.
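To make this concrete, here's a rough sketch that compresses each key
ordering in fixed-size blocks with zlib. This is not Accumulo's actual
RFile format, and the byte counts are illustrative only, but the BA blocks
come out smaller because each block repeats the same color word:

```python
import itertools
import zlib

a = ["0", "1", "2", "3", "4"]
b = ["maroon", "orange", "purple", "yellow"]

# Build sorted keys in both orders: AB (number first) and BA (color first).
ab_keys = sorted(x + y for x, y in itertools.product(a, b))
ba_keys = sorted(y + x for x, y in itertools.product(a, b))

def block_sizes(keys, block_len=4):
    # Compress each run of `block_len` keys independently, mimicking how
    # each block is compressed without information external to the block.
    sizes = []
    for i in range(0, len(keys), block_len):
        block = "\n".join(keys[i:i + block_len]).encode()
        sizes.append(len(zlib.compress(block)))
    return sizes

print("AB block sizes:", block_sizes(ab_keys))
print("BA block sizes:", block_sizes(ba_keys))
```

Running this, the total for the BA ordering is smaller than for AB, since
within a BA block the repeated color prefix compresses well.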
Practically speaking, the best way to estimate the size of a table is to
put in some real data and take measurements. Try to add data in such a way
that your compressed blocks will be similar to those of the full table. So,
in the AB case, sample from A and use the complete B set; in the BA case,
sample from B and use the complete A set. If you make your blocks
representative of the full table, then a linear extrapolation will give you
a pretty good estimate of the size. Doing this piecewise for each of the
types of blocks (tables, in your case) should also work.
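As a sketch of that extrapolation (again using zlib-style block
compression rather than Accumulo's exact on-disk format; the sample keys,
block length, and helper name below are made up for illustration):

```python
import zlib

def estimate_table_size(sample_keys, total_rows, block_len=1000):
    """Compress a representative key sample in fixed-size blocks and
    extrapolate linearly to the full row count."""
    data = "\n".join(sample_keys).encode()
    compressed = sum(
        len(zlib.compress(data[i:i + block_len]))
        for i in range(0, len(data), block_len))
    bytes_per_row = compressed / len(sample_keys)
    return bytes_per_row * total_rows

# AB-style sample: a slice of set A crossed with the complete B set,
# so the sample blocks resemble the full table's blocks.
sample = sorted(str(i) + color
                for i in range(100)
                for color in ("maroon", "orange", "purple", "yellow"))
print(int(estimate_table_size(sample, total_rows=2_181_193)))
```

The estimate is only as good as the sample is representative, which is why
sampling the low-entropy set and keeping the high-entropy set complete
matters.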
Hope that helps!
Adam
On Tue, Sep 8, 2015 at 9:19 AM, z11373 <z11373@outlook.com> wrote:
> I have 3 tables; all of them have the same column family name and an
> empty column qualifier.
> For the row ID, let's say each table has something like below ('' is a
> delimiter char in this context).
>
> Table1:
> ABC
>
> Table2:
> BCA
>
> Table3:
> CAB
>
> So as we can see above, all of them have pretty much similar content (and
> actually the same row ID length), and they all have the same number of
> rows (I have verified it): 2,181,193 rows.
> However, when I check their table size I found different result:
> root@dev> du -h -t Table1
> 17.70M [Table1]
> root@dev> du -h -t Table2
> 27.58M [Table2]
> root@dev> du -h -t Table3
> 32.48M [Table3]
>
> I am a bit surprised to see the different results, but I realize that
> Accumulo applies compression to the data. Looking at those table sizes,
> am I right to conclude that ABC somehow seems to have a better
> compression ratio than BCA, which apparently is better than CAB?
>
> With this fact, it makes my job a bit more difficult to give management a
> disk space estimate for storing our data in Accumulo. Earlier I was
> thinking we could guesstimate how many rows we may have in the future and
> multiply that by a factor x (and perhaps also multiply by 3 for
> replication), and that's the guesstimate I could give, but now I can't
> even figure out that 'x'. Does any of you have experience with this, and
> perhaps can share?
>
> Thanks,
> Z
>
>
>
> 
> View this message in context:
> http://apacheaccumulo.1065345.n5.nabble.com/tablesizequestionstp15079.html
> Sent from the Developers mailing list archive at Nabble.com.
>
