accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-1124) optimize index size in RFile
Date Tue, 24 May 2016 17:29:13 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298561#comment-15298561 ]

Keith Turner commented on ACCUMULO-1124:
----------------------------------------

I pushed a commit to the PR that adds a {{--keyStats}} option to rfile-info.  Below is the
output of running this command on the original file.  You can see that the 6 largest keys all
ended up in the index.  Also, the average key size in the index is over twice that of the data.
 

{noformat}
$ accumulo rfile-info --keyStats ~/A000rxoi.rf 
Reading file: file:/home/fluo/A000rxoi.rf
RFile Version            : 7

Locality group           : notify
	Start block            : 0
	Num   blocks           : 0
	Index level 0          : 0 bytes  1 blocks
	First key              : null
	Last key               : null
	Num entries            : 0
	Column families        : [ntfy]
Locality group           : <DEFAULT>
	Start block            : 0
	Num   blocks           : 21,818
	Index level 3          : 120,581 bytes  1 blocks
	Index level 2          : 451,008 bytes  2 blocks
	Index level 1          : 714,687 bytes  3 blocks
	Index level 0          : 6,915,137 bytes  25 blocks
	First key              : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
	Last key               : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
	Num entries            : 24,299,468
	Column families        : [data]

Meta block     : BCFile.index
      Raw size             : 4 bytes
      Compressed size      : 12 bytes
      Compression type     : gz

Meta block     : RFile.index
      Raw size             : 120,754 bytes
      Compressed size      : 21,719 bytes
      Compression type     : gz


Statistics for keys in data :
	Up to size      count      %-age
	         10 :   10768926  26.51%
	        100 :   13471699  70.82%
	       1000 :      58725   2.56%
	      10000 :        112   0.07%
	     100000 :          6   0.04%
	    1000000 :          0   0.00%
	   10000000 :          0   0.00%
	  100000000 :          0   0.00%
	 1000000000 :          0   0.00%
	10000000000 :          0   0.00%

	min:      31.00 max: 330,380.00 avg:     122.99 stddev:     157.51

Statistics for keys in index :
	Up to size      count      %-age
	         10 :       6192   7.67%
	        100 :      15024  49.96%
	       1000 :        578  13.21%
	      10000 :         18   8.83%
	     100000 :          6  20.33%
	    1000000 :          0   0.00%
	   10000000 :          0   0.00%
	  100000000 :          0   0.00%
	 1000000000 :          0   0.00%
	10000000000 :          0   0.00%

	min:      36.00 max: 330,380.00 avg:     281.73 stddev:   3,901.56
$
{noformat}

Below is the output of running this command on a file compacted using the code in the PR.
None of the largest keys are in the index, and the average key size in the index is less than
half of what's in the data.

{noformat}
$ accumulo rfile-info --keyStats /accumulo/tables/2/default_tablet/A0000005.rf
Reading file: hdfs://localhost:10000/accumulo/tables/2/default_tablet/A0000005.rf
RFile Version            : 8

Locality group           : <DEFAULT>
	Num   blocks           : 21,758
	Index level 1          : 3,048 bytes  1 blocks
	Index level 0          : 1,873,885 bytes  8 blocks
	First key              : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
	Last key               : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
	Num entries            : 24,299,468
	Column families        : [data]

Meta block     : BCFile.index
      Raw size             : 4 bytes
      Compressed size      : 12 bytes
      Compression type     : gz

Meta block     : RFile.index
      Raw size             : 3,163 bytes
      Compressed size      : 1,515 bytes
      Compression type     : gz


Statistics for keys in data :
	Up to size      count      %-age
	         10 :   10768926  26.51%
	        100 :   13471699  70.82%
	       1000 :      58725   2.56%
	      10000 :        112   0.07%
	     100000 :          6   0.04%
	    1000000 :          0   0.00%
	   10000000 :          0   0.00%
	  100000000 :          0   0.00%
	 1000000000 :          0   0.00%
	10000000000 :          0   0.00%

	min:      31.00 max: 330,380.00 avg:     122.99 stddev:     157.51

Statistics for keys in index :
	Up to size      count      %-age
	         10 :      18153  68.40%
	        100 :       3602  31.43%
	       1000 :          1   0.17%
	      10000 :          0   0.00%
	     100000 :          0   0.00%
	    1000000 :          0   0.00%
	   10000000 :          0   0.00%
	  100000000 :          0   0.00%
	 1000000000 :          0   0.00%
	10000000000 :          0   0.00%

	min:       9.00 max:   2,134.00 avg:      58.49 stddev:      36.23
$
{noformat}
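
To make the comparison between the two outputs above concrete, here is a small sketch that recomputes the ratios from the printed figures.  The numbers are copied from the {{rfile-info}} output for the default locality group; the variable and function names are only illustrative, and summing the per-level index byte counts to get a total index size is my own assumption about how to compare the files.

```python
# Figures copied from the two rfile-info outputs above (default locality group).
DATA_AVG_KEY = 122.99  # average key size in the data, identical in both files

before = {
    "index_avg_key": 281.73,
    "index_level_bytes": [120_581, 451_008, 714_687, 6_915_137],
}
after = {
    "index_avg_key": 58.49,
    "index_level_bytes": [3_048, 1_873_885],
}


def summarize(stats):
    """Return (index-to-data average key size ratio, summed index level bytes)."""
    ratio = stats["index_avg_key"] / DATA_AVG_KEY
    total = sum(stats["index_level_bytes"])  # assumption: levels are simply added
    return ratio, total


ratio_before, bytes_before = summarize(before)
ratio_after, bytes_after = summarize(after)

print(f"avg index key / avg data key: {ratio_before:.2f} -> {ratio_after:.2f}")
print(f"summed index level bytes:     {bytes_before:,} -> {bytes_after:,}")
print(f"index bytes remaining:        {bytes_after / bytes_before:.1%}")
```

The first ratio is above 2 for the original file and below 0.5 for the compacted file, which matches the "over twice" and "less than half" observations.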

> optimize index size in RFile
> ----------------------------
>
>                 Key: ACCUMULO-1124
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1124
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Eric Newton
>            Assignee: Keith Turner
>             Fix For: 1.8.0
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> I noticed HBASE-7845 and it seems like something we could do in RFile, too.
> Instead of putting the whole key in the index, you put in enough of the key to get the
> reader to the beginning of the block.
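
The idea in the description above can be sketched as follows.  This is only an illustration of the general "shortest separator" technique, not the code in the PR: it assumes keys are flat byte strings compared lexicographically (real Accumulo keys also have column, visibility, and timestamp parts), and {{shortest_separator}} and {{find_block}} are hypothetical names.

```python
from bisect import bisect_right


def shortest_separator(prev: bytes, nxt: bytes) -> bytes:
    """Shortest prefix s of nxt such that prev < s <= nxt.

    prev is the last key of one block, nxt the first key of the next block.
    """
    if not prev < nxt:
        raise ValueError("keys must be strictly increasing")
    # Length of the common prefix of the two keys.
    i = 0
    limit = min(len(prev), len(nxt))
    while i < limit and prev[i] == nxt[i]:
        i += 1
    # One byte past the common prefix is enough to sort after prev.
    return nxt[: i + 1]


def find_block(separators: list, key: bytes) -> int:
    """Index of the block that may contain key, given one separator per block."""
    return bisect_right(separators, key) - 1


# Three hypothetical data blocks, each a sorted run of keys.
blocks = [
    [b"apple", b"apricot"],
    [b"banana:a-very-long-suffix", b"cherry"],
    [b"damson", b"date"],
]

# The first block gets the empty separator; later blocks get a shortened key.
separators = [b""] + [
    shortest_separator(blocks[n - 1][-1], blocks[n][0])
    for n in range(1, len(blocks))
]

print(separators)
print(find_block(separators, b"cherry"))
```

Here the long key {{banana:a-very-long-suffix}} never enters the index; a single byte is enough to route a reader to the start of its block.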



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
