hbase-issues mailing list archives

From "Phabricator (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5987) HFileBlockIndex improvement
Date Wed, 16 May 2012 18:01:15 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276941#comment-13276941 ]

Phabricator commented on HBASE-5987:
------------------------------------

mbautin has commented on the revision "[jira][89-fb] [HBASE-5987] HFileBlockIndex improvement".

  Looks good! A few minor comments inline. Also, please submit the diff with lint (using "arc
diff --preview" instead of "arc diff --only").

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/HConstants.java:545 Please add a comment that the
actual value is irrelevant because this is always compared by reference.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:437-440 This documentation
is still confusing. Is i "the ith position", or is the actual key "the ith position"? I would
say i is the "position" and the returned key is the "key at the ith position".
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:413 Clarify the meaning
of "is equal", i.e. that it must be exactly the same object, not just an equal byte array.
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:63 This is unnecessary
(we don't use compression by default).
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:77 It is not "schemMetricSnapshot",
it is "schemaMetricSnapshot" ("schem" is not a word).
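
The two comments about comparison by reference can be illustrated with a minimal sketch. The class and constant names below are hypothetical and are not taken from the patch; the sketch only shows why the content of such a sentinel array does not matter when it is always checked with ==.

```java
import java.util.Arrays;

/** Hypothetical sketch, not code from the patch. */
final class SentinelKeySketch {
    // Assumed sentinel: its content is irrelevant because callers compare by reference.
    static final byte[] NO_NEXT_KEY = new byte[0];

    /** True only for the sentinel object itself, not for any equal-looking array. */
    static boolean isSentinel(byte[] key) {
        return key == NO_NEXT_KEY;
    }

    /** Content comparison, shown for contrast with the identity check above. */
    static boolean hasSameBytes(byte[] key) {
        return Arrays.equals(key, NO_NEXT_KEY);
    }
}
```

Here isSentinel(new byte[0]) is false even though hasSameBytes(new byte[0]) is true, which is the distinction the review asks the comments to spell out.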

REVISION DETAIL
  https://reviews.facebook.net/D3237

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu

                
> HFileBlockIndex improvement
> ---------------------------
>
>                 Key: HBASE-5987
>                 URL: https://issues.apache.org/jira/browse/HBASE-5987
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Liyin Tang
>            Assignee: Liyin Tang
>         Attachments: D3237.1.patch, D3237.2.patch, screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we found a performance problem: reads become quite slow when multiple requests
are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been addressed
in HBASE-5898. 
> Another issue is that the HFileScanner keeps asking the HFileBlockIndex for the
data block location of each target key value during the scan (reSeekTo), even when
the target key value is already in the current data block. This makes certain
index blocks very hot, especially during a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to look ahead by one more block index entry so that the HFileScanner
knows the start key value of the next data block. If the target key value for the scan (reSeekTo)
is "smaller" than the start kv of the next data block, the target key value is very
likely in the current data block (if it is not in the current data block, then the start kv
of the next data block should be returned. +Indexing on the start key has some defects here+),
and the scanner shall NOT query the HFileBlockIndex in this case. Conversely, if the target key
value is "bigger", then it shall query the HFileBlockIndex. This improvement should help
reduce the hotness of the HFileBlockIndex and avoid some unnecessary IdLock contention or index
block cache lookups.
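
The lookahead check described above can be sketched roughly as follows. This is a simplified illustration, not the patch itself: the class, its fields, and the counter are hypothetical, keys are compared as plain unsigned bytes (real HBase keys use a KeyValue comparator), and reseeks are assumed to move forward only. It tracks block numbers only and does not model the case where the target falls in the gap after the current block's last key.

```java
/** Hypothetical sketch of skipping the block index on reseek; not HBase code. */
final class LookaheadScannerSketch {
    // Start key of each data block in ascending order (stand-in for the block index).
    private final byte[][] blockStartKeys;
    private int currentBlock = 0;
    int indexLookups = 0; // counts simulated block index queries

    LookaheadScannerSketch(byte[][] blockStartKeys) {
        this.blockStartKeys = blockStartKeys;
    }

    /** Lexicographic comparison of unsigned bytes (an assumption of this sketch). */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    /** Forward-only reseek: returns the block the scanner ends up in. */
    int reseekTo(byte[] target) {
        byte[] nextStart = currentBlock + 1 < blockStartKeys.length
                ? blockStartKeys[currentBlock + 1] : null;
        // Target is "smaller" than the next block's start key: stay put, no index query.
        if (nextStart == null || compare(target, nextStart) < 0) {
            return currentBlock;
        }
        // Target is at or beyond the next block's start key: consult the index.
        indexLookups++;
        while (currentBlock + 1 < blockStartKeys.length
                && compare(blockStartKeys[currentBlock + 1], target) <= 0) {
            currentBlock++;
        }
        return currentBlock;
    }
}
```

With start keys "a", "m", "t", reseeking to "c" stays in block 0 without touching the index, while reseeking to "n" triggers one index lookup and moves to block 1.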
> Second, we propose to push this idea a little further: the HFileBlockIndex shall
index on the last key value of each data block instead of the start key value.
The motivation is to solve the HBASE-4443 issue (avoid seeking to the "previous" block when the key
you are interested in is the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of data
block N, there is no way to be sure whether the target key value is in data block N or N-1, so the
scanner has to seek from data block N-1. However, if the block index is based on the last key value
of each data block and the target key value is between the last key value of data block N-1
and that of data block N, then the target key value can only be in data block N. 
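
As a rough illustration of the example above, an index keyed on last keys lets a seek pick the first block whose last key is not smaller than the target. The class and method names are hypothetical, and keys are again compared as plain unsigned bytes rather than with HBase's KeyValue comparator.

```java
/** Hypothetical sketch of a block lookup over last keys; not HBase code. */
final class LastKeyIndexSketch {
    /** Lexicographic comparison of unsigned bytes (an assumption of this sketch). */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    /**
     * Returns the first block whose last key is >= target, or -1 if the target
     * is past the last key of the final block. blockLastKeys must be ascending.
     */
    static int blockForTarget(byte[][] blockLastKeys, byte[] target) {
        int lo = 0;
        int hi = blockLastKeys.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(blockLastKeys[mid], target) < 0) {
                lo = mid + 1; // whole block lies before the target
            } else {
                hi = mid;     // this block may hold the target
            }
        }
        return lo < blockLastKeys.length ? lo : -1;
    }
}
```

With last keys "f", "p", "z", a target of "g" lies between the last key of block 0 and the last key of block 1, so the lookup returns block 1 directly and never needs to start from block 0.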
> As long as HBase only supports forward scans, it makes more sense to index on the last
key value than on the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
