accumulo-dev mailing list archives

From "Keith Turner (Issue Comment Edited) (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (ACCUMULO-348) Adding splits to table via the shell with addsplits is very slow when adding a lot of split points
Date Fri, 30 Mar 2012 20:27:29 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242631#comment-13242631 ]

Keith Turner edited comment on ACCUMULO-348 at 3/30/12 8:27 PM:
----------------------------------------------------------------

I put together a workaround for 1.3.5 and 1.4.0 and posted it on GitHub.  It adds lots of
splits to a table much faster.

  https://github.com/keith-turner/Accumulo-Parallel-Splitter
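
As a rough sketch of the approach (the class and helper below are made up for illustration
and are not the code from the repo above; it assumes the 1.4 client API, where
TableOperations.addSplits() takes a SortedSet of split points): first add coarse boundary
splits sequentially, then fill in the rest in parallel, one task per coarse range, so the
work is spread over many tablets instead of repeatedly splitting the last one.

{noformat}
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class ParallelSplitter {
  // Hypothetical helper.  Phase 1 adds coarse boundary splits sequentially so
  // the table already has roughly numThreads tablets; phase 2 fills in the
  // remaining splits in parallel, one task per coarse range, so the threads
  // are not all hammering the same (last) tablet.
  static void addSplitsInParallel(final Connector conn, final String table,
      SortedSet<Text> splits, int numThreads) throws Exception {
    List<Text> all = new ArrayList<Text>(splits);
    int chunk = Math.max(1, all.size() / numThreads);

    // Phase 1: every chunk-th split point becomes a coarse boundary.
    SortedSet<Text> coarse = new TreeSet<Text>();
    for (int i = chunk; i < all.size(); i += chunk)
      coarse.add(all.get(i));
    conn.tableOperations().addSplits(table, coarse);

    // Phase 2: add the splits within each coarse range concurrently.
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    List<Callable<Void>> tasks = new ArrayList<Callable<Void>>();
    for (int i = 0; i < all.size(); i += chunk) {
      final SortedSet<Text> sub =
          new TreeSet<Text>(all.subList(i, Math.min(i + chunk, all.size())));
      sub.removeAll(coarse); // the coarse boundary points were already added
      tasks.add(new Callable<Void>() {
        public Void call() throws Exception {
          conn.tableOperations().addSplits(table, sub);
          return null;
        }
      });
    }
    // Blocks until all tasks complete; per-task errors stay in the returned
    // futures and checking them is omitted here for brevity.
    pool.invokeAll(tasks);
    pool.shutdown();
  }
}
{noformat}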

While testing this I discovered more about why adding lots of splits is slow, and found another
workaround.  While adding 99,999 splits to a table with the addsplits command in the shell, I
noticed on the monitor page that the rate seemed to be slowing down.  I used jstack to look
at the process adding the split points and noticed the stack traces were always in metadata
lookups.  After a split, the client has to refresh its tablet location cache by looking in
the metadata table.  I went to the tablet server and saw that metadata lookups were taking
more than a quarter second.

{noformat}
30 17:36:09,458 [tabletserver.TabletServer] DEBUG: MultiScanSess xxx.xxx.xxx.3:42412 4 entries in 0.29 secs (lookup_time:0.29 secs tablets:1 ranges:1)
{noformat}

I thought about why this was happening, and it occurred to me that the code was always splitting
the last tablet.  This meant the same columns in the metadata table were being updated over
and over, and therefore accumulated lots of versions.  These versions were all kept in memory
and suppressed by the versioning iterator.  About 60k tablets had been added at this point.
I knew that flushing the metadata table would get rid of all of these versions.  Below is the
minor compaction caused by flushing the metadata table.  It read 1.4M entries and wrote 724K,
so it dropped almost 700K key/values.  Some of the dropped data may have been deleted tables
from previous experiments; some of it was old versions of key/values for the last tablet.

{noformat}
30 17:36:09,698 [tabletserver.Compactor] DEBUG: Compaction !0;~;p\\;3c7 1,394,754 read | 724,252 written | 581,874 entries/sec |  2.397 secs
{noformat}

After the flush, metadata lookups by the client doing the splits were much faster and the rate
of adding splits shot up.

{noformat}
30 17:36:09,773 [tabletserver.TabletServer] DEBUG: MultiScanSess xxx.xxx.xxx.3:42412 4 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
{noformat}

So another workaround is to periodically flush the metadata table when adding lots of splits;
see the example below.
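
For example, from the shell (this assumes 1.3/1.4, where the metadata table is named !METADATA):

{noformat}
root@instance> flush -t !METADATA
{noformat}

The flush triggers a minor compaction of the metadata table, which applies the versioning
iterator and drops the suppressed versions, as seen above.  The same thing can be done from
client code with Connector.tableOperations().flush().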


                
> Adding splits to table via the shell with addsplits is very slow when adding a lot of split points
> --------------------------------------------------------------------------------------------------
>
>                 Key: ACCUMULO-348
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-348
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.3.5
>            Reporter: Dave Marion
>            Priority: Minor
>             Fix For: 1.5.0
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
