accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-1124) optimize index size in RFile
Date Mon, 23 May 2016 23:10:13 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297313#comment-15297313 ]

Keith Turner commented on ACCUMULO-1124:
----------------------------------------

I experimented with shortening keys in the index, and that gave some nice improvements, but
not as much as I expected.  I realized that even with those changes, bad (overly large) keys
were still being placed in the index.  I added code to keep statistics on key sizes and used
those statistics to try to select keys that were <=AVG(keySize).  I also excluded keys that
were too big (more than 3 standard deviations above the mean).  With the key shortening and
statistics changes, I was able to reduce the index size for the file in my previous comment
to that below.

{noformat}
RFile Version            : 8

Locality group           : <DEFAULT>
	Num   blocks           : 21,758
	Index level 1          : 3,048 bytes  1 blocks
	Index level 0          : 1,873,885 bytes  8 blocks
	First key              : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
	Last key               : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
	Num entries            : 24,299,468
	Column families        : [data]

Meta block     : BCFile.index
      Raw size             : 4 bytes
      Compressed size      : 12 bytes
      Compression type     : gz

Meta block     : RFile.index
      Raw size             : 3,163 bytes
      Compressed size      : 1,515 bytes
      Compression type     : gz
{noformat}
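
For readers following along, the statistics-based selection described above could be sketched roughly as follows. This is only an illustration under assumptions: the class and method names (KeySizeStats, isPreferred, isTooBig) are hypothetical and are not taken from the RFile code, and the running mean and standard deviation are computed with Welford's online algorithm, which may differ from what the actual patch does.

```java
// Hypothetical sketch: names are illustrative, not Accumulo's RFile API.
final class KeySizeStats {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the mean

    /** Record the size in bytes of one key (Welford's online update). */
    void add(int keySize) {
        count++;
        double delta = keySize - mean;
        mean += delta / count;
        m2 += delta * (keySize - mean);
    }

    /** Average key size seen so far. */
    double mean() {
        return mean;
    }

    /** Population standard deviation; 0 until at least one key is recorded. */
    double stdDev() {
        return count == 0 ? 0.0 : Math.sqrt(m2 / count);
    }

    /** Preferred index key: no larger than the average key size, i.e. <=AVG(keySize). */
    boolean isPreferred(int keySize) {
        return keySize <= mean;
    }

    /** Excluded as an index key: more than 3 standard deviations above the mean. */
    boolean isTooBig(int keySize) {
        return keySize > mean + 3 * stdDev();
    }
}
```

One way a writer might use this (again an assumption, not a description of the patch): call add() for every key appended, and when a data block reaches its target size, prefer to end the block on a key for which isPreferred() is true and avoid ending it on one for which isTooBig() is true.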

At first I thought I could make these changes in 1.6 and 1.7.  However, while working on this
I realized the key shortening change is a breaking change, in that older RFile code would not
be able to handle keys in the index that do not exist in the data.  The changes to use statistics
to choose better keys could be made in 1.6 and 1.7.
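
To illustrate why a shortened index key may not exist in the data, here is a simplified sketch of the general idea. It treats keys as raw byte strings compared in unsigned lexicographic order, which is an assumption made for brevity; real Accumulo keys have several fields (row, column, timestamp), and the names IndexKeyShortener and shorten are hypothetical, not the names used in the patch.

```java
import java.util.Arrays;

// Hypothetical sketch: keys are plain byte strings, not Accumulo Key objects.
final class IndexKeyShortener {

    /** Unsigned lexicographic comparison of two byte strings. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    /**
     * Returns a key s with lastInBlock <= s < firstInNext that is no longer
     * than lastInBlock. Falls back to lastInBlock when nothing shorter fits.
     */
    static byte[] shorten(byte[] lastInBlock, byte[] firstInNext) {
        if (compare(lastInBlock, firstInNext) >= 0) {
            throw new IllegalArgumentException("keys must be strictly increasing");
        }
        // Length of the common prefix of the two keys.
        int n = Math.min(lastInBlock.length, firstInNext.length);
        int i = 0;
        while (i < n && lastInBlock[i] == firstInNext[i]) {
            i++;
        }
        // The keys first differ at or after the final byte of lastInBlock,
        // so truncating would not make the index key any shorter.
        if (i >= lastInBlock.length - 1) {
            return lastInBlock;
        }
        // Keep the common prefix plus one byte, bumped by one so the result
        // sorts after lastInBlock. The byte cannot be 0xff here because it is
        // strictly less than the corresponding byte of firstInNext.
        byte[] candidate = Arrays.copyOf(lastInBlock, i + 1);
        candidate[i]++;
        // The candidate must still sort strictly before the next block's first key.
        return compare(candidate, firstInNext) < 0 ? candidate : lastInBlock;
    }
}
```

In this sketch, shortening the pair "row1abc" and "row3" yields "row2", a key that sorts between the two blocks but appears in neither. A reader that assumes every index key is also present in the data would mishandle such an entry, which is the compatibility concern described above.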

> optimize index size in RFile
> ----------------------------
>
>                 Key: ACCUMULO-1124
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1124
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Eric Newton
>            Assignee: Keith Turner
>             Fix For: 1.8.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I noticed HBASE-7845 and it seems like something we could do in RFile, too.
> Instead of putting the whole key in the index, you put in enough of the key to get the
> reader to the beginning of the block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
