accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-4314) Use statistics to choose better keys for RFile index
Date Tue, 31 May 2016 23:24:12 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308840#comment-15308840 ]

Keith Turner commented on ACCUMULO-4314:
----------------------------------------

I ran a test with the 1.7 changes for this issue, using the same file I used to test the
changes for ACCUMULO-1124.  The total index size went from 6.9M to 3.6M.

{noformat}
$ accumulo rfile-info /accumulo/tables/2/default_tablet/A0000005.rf
Reading file: hdfs://localhost:10000/accumulo/tables/2/default_tablet/A0000005.rf
Locality group         : <DEFAULT>
    Start block          : 0
    Num   blocks         : 20,041
    Index level 1        : 4,140 bytes  1 blocks
    Index level 0        : 3,620,079 bytes  14 blocks
    First key            : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
    Last key             : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
    Num entries          : 24,299,468
    Column families      : [data]

Meta block     : BCFile.index
      Raw size             : 4 bytes
      Compressed size      : 12 bytes
      Compression type     : gz

Meta block     : RFile.index
      Raw size             : 4,258 bytes
      Compressed size      : 2,154 bytes
      Compression type     : gz
{noformat}
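
The 3.6M figure can be cross-checked by summing the "Index level" lines in the output above. The following is a minimal sketch, assuming the report keeps the line layout shown in this sample; the function name total_index_bytes and the regular expression are illustrative and not part of Accumulo.

```python
import re

# Matches lines such as "Index level 0        : 3,620,079 bytes  14 blocks".
# The layout is taken from the sample output above and is an assumption
# about the report format, not a documented contract.
INDEX_LINE = re.compile(
    r"Index level\s+(\d+)\s*:\s*([\d,]+) bytes\s+([\d,]+) blocks"
)


def total_index_bytes(report):
    """Sum the byte counts of every 'Index level' line in the report text."""
    total = 0
    for match in INDEX_LINE.finditer(report):
        # Strip the thousands separators before converting to an integer.
        total += int(match.group(2).replace(",", ""))
    return total


if __name__ == "__main__":
    sample = (
        "    Index level 1        : 4,140 bytes  1 blocks\n"
        "    Index level 0        : 3,620,079 bytes  14 blocks\n"
    )
    print(total_index_bytes(sample))
```

For the two index levels listed above, the sum is 4,140 + 3,620,079 bytes, which is consistent with the reported 3.6M.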

> Use statistics to choose better keys for RFile index
> ----------------------------------------------------
>
>                 Key: ACCUMULO-4314
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4314
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>            Priority: Blocker
>             Fix For: 1.6.6, 1.7.2
>
>
> The commit for ACCUMULO-1124 makes two changes:
>   * Generates shorter keys, which may not exist in the data, to place in the RFile index.
>   * Uses statistics to make better choices about which keys to place in the index.  These
>     changes look for keys of average size or smaller and exclude large keys (keys more
>     than 3 standard deviations above the average).
> The change to generate shorter keys cannot be made in 1.7.X and 1.6.X because it would
> generate RFiles that may not work properly with older 1.6 and 1.7 versions.  However, the
> changes that use statistics to pick better keys could be made in 1.6 and 1.7.
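
The statistics-based selection described above can be sketched roughly as follows. This is an illustration of the idea only, not the code committed to Accumulo: the names KeyLengthStats, is_large_key, and choose_index_keys, and the block_size and max_block_size parameters, are all hypothetical. As a simplification, block size is measured in key bytes only, values are ignored, and the final partial block is not closed.

```python
import math


class KeyLengthStats:
    """Running mean and standard deviation of key lengths (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, length):
        self.n += 1
        delta = length - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (length - self.mean)

    def stddev(self):
        # Population standard deviation of the lengths seen so far.
        return math.sqrt(self._m2 / self.n) if self.n > 0 else 0.0


def is_large_key(key, stats):
    """True when the key is more than 3 standard deviations above the mean."""
    return len(key) > stats.mean + 3 * stats.stddev()


def choose_index_keys(keys, block_size, max_block_size):
    """Pick index keys from sorted keys, preferring short ones.

    Hypothetical policy: once a block reaches block_size, close it at the
    first key whose length is at or below the running average. If the block
    grows past max_block_size, accept any key that is not an outlier.
    Outliers (more than 3 standard deviations) are never placed in the index.
    """
    stats = KeyLengthStats()
    index = []
    current = 0  # bytes accumulated in the block being written
    for key in keys:
        stats.add(len(key))
        current += len(key)
        if current < block_size:
            continue
        if is_large_key(key, stats):
            continue  # keep writing; wait for a smaller key
        if len(key) <= stats.mean or current >= max_block_size:
            index.append(key)
            current = 0
    return index


if __name__ == "__main__":
    short = [f"row{i:07d}" for i in range(40)]
    giant = "row0000029" + "x" * 990  # one unusually long key
    keys = short[:30] + [giant] + short[30:]
    for k in choose_index_keys(keys, block_size=40, max_block_size=80):
        print(k)
```

In this toy run the 1,000-character key arrives while a block is open, is skipped as an outlier, and the block closes on the next short key instead, so the long key never reaches the index.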



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
