accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-1124) optimize index size in RFile
Date Mon, 16 May 2016 20:31:13 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285234#comment-15285234 ]

Keith Turner commented on ACCUMULO-1124:
----------------------------------------

I ran Fluo's Webindex example on EC2 for a long period and then inspected some of the
resulting RFiles.  Some of them had larger indexes than I expected.  I suspect that making
the change described in this ticket would reduce the index size.

Below is the rfile-info output for an RFile whose keys contain URLs from web pages.  I am
going to experiment with generating shorter keys in the index for this file.  This file was
generated using 64K data blocks and 256K index blocks.

{noformat}
[centos@leader1 ~]$ accumulo rfile-info  --histogram /accumulo/tables/7/t-0003uq7/A000rxoi.rf
2016-05-16 16:48:38,914 [rfile.PrintInfo] WARN : Attempting to find file across filesystems. Consider providing URI instead of path
Reading file: hdfs://leader1:10000/accumulo/tables/7/t-0003uq7/A000rxoi.rf
Locality group         : notify
	Start block          : 0
	Num   blocks         : 0
	Index level 0        : 0 bytes  1 blocks
	First key            : null
	Last key             : null
	Num entries          : 0
	Column families      : [ntfy]
Locality group         : <DEFAULT>
	Start block          : 0
	Num   blocks         : 21,818
	Index level 3        : 120,581 bytes  1 blocks
	Index level 2        : 451,008 bytes  2 blocks
	Index level 1        : 714,687 bytes  3 blocks
	Index level 0        : 6,915,137 bytes  25 blocks
	First key            : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
	Last key             : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
	Num entries          : 24,299,468
	Column families      : [data]

Meta block     : BCFile.index
      Raw size             : 4 bytes
      Compressed size      : 12 bytes
      Compression type     : gz

Meta block     : RFile.index
      Raw size             : 120,754 bytes
      Compressed size      : 21,719 bytes
      Compression type     : gz


Up to size      count      %-age
         10 :    9292962  22.56%
        100 :   14947371  74.88%
       1000 :      59017   2.45%
      10000 :        112   0.07%
     100000 :          6   0.04%
    1000000 :          0   0.00%
   10000000 :          0   0.00%
  100000000 :          0   0.00%
 1000000000 :          0   0.00%
10000000000 :          0   0.00%
{noformat}
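
For a rough sense of scale, the arithmetic below divides the level 0 index size by the number of data blocks in the <DEFAULT> locality group. It is only a sketch: it assumes level 0 holds one index entry per data block, and each entry also carries some per-block metadata besides the key, so the result is an upper bound on the average key bytes rather than an exact key length. All figures are copied from the rfile-info output above.

```python
# Figures copied from the rfile-info output for the <DEFAULT> locality group.
num_data_blocks = 21_818
num_entries = 24_299_468
index_level_bytes = {3: 120_581, 2: 451_008, 1: 714_687, 0: 6_915_137}

# Total uncompressed index size across all levels.
total_index_bytes = sum(index_level_bytes.values())

# Assumption: level 0 has one index entry per data block, so this is the
# average size of a level 0 entry (key plus per-block metadata).
avg_level0_entry_bytes = index_level_bytes[0] / num_data_blocks

# Average number of key/value entries stored in each data block.
avg_entries_per_block = num_entries / num_data_blocks

print(f"total index bytes       : {total_index_bytes:,}")
print(f"avg level 0 entry bytes : {avg_level0_entry_bytes:.1f}")
print(f"avg entries per block   : {avg_entries_per_block:.1f}")
```

Under that assumption, an average level 0 entry comes to roughly 317 bytes, which is the number shorter index keys would be expected to bring down.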

> optimize index size in RFile
> ----------------------------
>
>                 Key: ACCUMULO-1124
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1124
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Eric Newton
>            Assignee: Keith Turner
>
> I noticed HBASE-7845 and it seems like something we could do in RFile, too.
> Instead of putting the whole key in the index, you put in enough of the key to get the
> reader to the beginning of the block.
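
The idea in the description can be sketched as follows. This is a minimal illustration, not the RFile or HBase implementation: it treats keys as flat byte strings (real Accumulo keys have row, column, and timestamp parts), and it assumes an index convention in which any separator `s` with `prev_last < s <= next_first` is enough to route a seek to the right block. The function names and the sample keys are made up for the example.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Return the number of leading bytes that a and b share."""
    n = 0
    limit = min(len(a), len(b))
    while n < limit and a[n] == b[n]:
        n += 1
    return n


def short_index_key(prev_last: bytes, next_first: bytes) -> bytes:
    """Return a short key s such that prev_last < s <= next_first.

    prev_last is the last key of the previous block and next_first is the
    first key of the next block. The result is the shared prefix plus the
    first byte at which next_first differs from prev_last.
    """
    if not prev_last < next_first:
        raise ValueError("keys must be strictly increasing")
    n = common_prefix_len(prev_last, next_first)
    # next_first cannot be a prefix of prev_last (it sorts after it), so
    # it has at least n + 1 bytes and the slice below is well defined.
    return next_first[: n + 1]


# Hypothetical keys, loosely modelled on the reversed-domain URLs above.
prev_last = b"com.example>.www>o>/about"
next_first = b"com.facebook>.www>s>/dialog/feed"

separator = short_index_key(prev_last, next_first)
print(separator, len(separator), "bytes instead of", len(next_first))
```

With long URL-based keys that diverge early, a separator like this can be much shorter than either full key, which is where the index size savings would come from.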



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)