hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-1938) Make in-memory table scanning faster
Date Mon, 16 Nov 2009 21:00:40 GMT

    [ https://issues.apache.org/jira/browse/HBASE-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778529#action_12778529
] 

stack commented on HBASE-1938:
------------------------------

Profiling in-memory scanning of MemStore:

+ All time is in cacheNextRow, as you'd expect.  41% of CPU is spent in SortedSet#first, 24% in
making an Iterator, and 22% in the isEmpty test (which calls the #first method).  Each
of these methods ends up in KVComparator.compare, so all our scanning time goes to compares.
A lot of time is spent making up ints and longs out of bytes, e.g. in getKeyLength and getRowLength.
I can see that some of these constructions -- e.g. getKeyLength -- happen multiple times
in a single scan for a single KV (imagine the cost with multiple concurrent scans).  This would seem
to argue that we cache the constructed lengths, but there'd be an associated memory cost...
maybe do it for just a few of these lengths?  Key length?  For example, calculating the key length
once on construction would seem to make scanning nearly 30% faster in a simple test.
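
A minimal sketch of the caching idea above, not the actual KV change from the patch. It assumes a cell whose backing array starts with a 4-byte big-endian key length; the class CachedLengthCell and its methods are hypothetical names used only for illustration. The uncached getter rebuilds the int from bytes on every call, while the cached one pays once at construction at the cost of one extra int field per cell.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a KeyValue-like cell; not the real HBase class.
final class CachedLengthCell {
    private final byte[] buf;
    private final int offset;
    // Decoded once at construction; costs one extra int field per cell.
    private final int keyLength;

    CachedLengthCell(byte[] buf, int offset) {
        this.buf = buf;
        this.offset = offset;
        this.keyLength = decodeInt(buf, offset);
    }

    // Assembles a big-endian int from four bytes. Without caching, this
    // work is repeated on every call that needs the length.
    static int decodeInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24)
             | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8)
             |  (b[off + 3] & 0xff);
    }

    // Recomputes the length from the backing bytes each time.
    int getKeyLengthUncached() {
        return decodeInt(buf, offset);
    }

    // Returns the value computed in the constructor.
    int getKeyLength() {
        return keyLength;
    }

    // Illustration helper: lays out [4-byte key length][key bytes].
    static CachedLengthCell of(byte[] key) {
        ByteBuffer bb = ByteBuffer.allocate(4 + key.length);
        bb.putInt(key.length).put(key);
        return new CachedLengthCell(bb.array(), 0);
    }
}
```

Whether the extra field is worth it depends on how many cells are live at once, which is the memory cost raised above.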

Without caching of KeyLength:

Loaded
Scan: 2406
Scan: 1685
Scan: 1656
Scan: 1655
Scan: 1646
Scan: 1647
Scan: 1646


With caching of KeyLength:
Loaded
Scan: 1970
Scan: 1282
Scan: 1292
Scan: 1252
Scan: 1273
Scan: 1272
Scan: 1284
Scan: 1220
..
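
As a rough check on the "near 30%" figure, the listed timings can be averaged. This is only a sketch: the message does not state the units, and dropping the first run of each series as warm-up is an assumption made here. The class name ScanTimings is hypothetical.

```java
import java.util.Arrays;

final class ScanTimings {
    // Figures copied from the two series above, first run of each dropped
    // as warm-up (an assumption; the message does not say to do this).
    static final double[] WITHOUT_CACHE = {1685, 1656, 1655, 1646, 1647, 1646};
    static final double[] WITH_CACHE = {1282, 1292, 1252, 1273, 1272, 1284, 1220};

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    // Percentage of scan time saved by caching.
    static double timeReductionPercent() {
        return 100.0 * (1.0 - mean(WITH_CACHE) / mean(WITHOUT_CACHE));
    }

    // Percentage more scans completed per unit time with caching.
    static double throughputGainPercent() {
        return 100.0 * (mean(WITHOUT_CACHE) / mean(WITH_CACHE) - 1.0);
    }

    public static void main(String[] args) {
        System.out.printf("time reduction: %.1f%%%n", timeReductionPercent());
        System.out.printf("throughput gain: %.1f%%%n", throughputGainPercent());
    }
}
```

Under these assumptions the averages work out to roughly 23% less time per scan, or equivalently about 31% more scans per unit time, so "near 30% faster" matches the throughput reading.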

Let me attach patches that have the amended test and the change I made to KV.

+ The "reputable lads" mentioned above think our getting a tailSet for each cache of row content
is wasteful, and that we should be able to do better -- say, take out an iterator once and keep it
for the life of the Scanner.  On snapshot, we'd have to poke all outstanding Scanners to readjust
themselves.  Looking at the numbers, though, actually taking a tailSet is surprisingly inexpensive;
the tests for isEmpty and the creation of an Iterator each time are the bulk of the CPU.  Let me play with
changing the MemStoreScanner implementation to be just a set.
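
The two scanning styles discussed above can be sketched side by side. This is an illustration over a plain ConcurrentSkipListSet, not the MemStoreScanner code; the class ScanSketch and both method names are hypothetical. It also ignores the snapshot problem mentioned above: a long-lived iterator would need to be re-created when the underlying set is swapped.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.concurrent.ConcurrentSkipListSet;

final class ScanSketch {
    // Per-step approach: take a fresh tailSet view, test isEmpty, then
    // call first. Each of those calls has to locate the first element of
    // the view again, which goes through the set's ordering.
    static <T> List<T> scanWithTailSets(NavigableSet<T> set) {
        List<T> out = new ArrayList<>();
        if (set.isEmpty()) {
            return out;
        }
        T current = set.first();
        while (true) {
            out.add(current);
            SortedSet<T> tail = set.tailSet(current, false); // strictly after current
            if (tail.isEmpty()) {
                break;
            }
            current = tail.first();
        }
        return out;
    }

    // Single-iterator approach: position once, then just advance.
    static <T> List<T> scanWithOneIterator(NavigableSet<T> set) {
        List<T> out = new ArrayList<>();
        Iterator<T> it = set.iterator();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> rows = new ConcurrentSkipListSet<>();
        rows.add("row-b");
        rows.add("row-a");
        rows.add("row-c");
        System.out.println(scanWithTailSets(rows));
        System.out.println(scanWithOneIterator(rows));
    }
}
```

Both methods visit the same elements in the same order; the difference is how many lookups into the sorted structure each step costs.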

> Make in-memory table scanning faster
> ------------------------------------
>
>                 Key: HBASE-1938
>                 URL: https://issues.apache.org/jira/browse/HBASE-1938
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: test.patch
>
>
> This issue is about profiling hbase to see if I can make hbase scans run faster when
> all is up in memory.  Talking to some users, they are seeing about 1/4 million rows a second.
> It should be able to go faster than this (Scanning an array of objects, they can do about
> 4-5x this).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

