hadoop-common-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2161) getRow() is orders of magnitudes slower than get(), even on rows with one column
Date Wed, 07 Nov 2007 03:45:51 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540661 ]

stack commented on HADOOP-2161:
-------------------------------

Actually, I misspoke.  getFull scanning memory and all on-disk files is not 'wrong' -- though
it is slow.  Here's why. 

Columns can be added willy-nilly.  There is no need for an ALTER TABLE-like statement to add
a column, as there is in a traditional RDBMS -- as long as the column belongs to an existing
column family (that is, its name is prefixed with an extant column family).

And there is no accounting anywhere in hbase of all the columns made in any particular family.
Since there is no list of all columns to consult, the only way hbase can be sure it has found
every column mention is to scan all the data.  This is the main difference between get and
getFull.  Because you give get a list of columns to fetch, it knows when it's done.  Not so
with getFull.
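
To make the difference concrete, here is a toy sketch in Java.  It is not the hbase code: the
class RowScanSketch, its methods and the "info:" column names are all made up for illustration,
and one row's data is modelled as a list of maps (memcache first, then on-disk files, newest
to oldest).  It only shows why a named-column read can stop early while a full-row read cannot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model (hypothetical, not hbase code): one row spread over several stores. */
class RowScanSketch {
    /** Number of stores touched by the most recent call; kept only for illustration. */
    static int storesVisited;

    /** Named-column read: stops as soon as every requested column has been found. */
    static Map<String, String> get(List<Map<String, String>> stores, Set<String> wanted) {
        Map<String, String> result = new LinkedHashMap<>();
        storesVisited = 0;
        for (Map<String, String> store : stores) {          // newest store first
            if (result.keySet().containsAll(wanted)) {
                break;                                       // caller's list is satisfied
            }
            storesVisited++;
            for (String column : wanted) {
                if (store.containsKey(column)) {
                    result.putIfAbsent(column, store.get(column)); // newest value wins
                }
            }
        }
        return result;
    }

    /** Full-row read: with no list of columns to consult, every store must be scanned. */
    static Map<String, String> getFull(List<Map<String, String>> stores) {
        Map<String, String> result = new LinkedHashMap<>();
        storesVisited = 0;
        for (Map<String, String> store : stores) {
            storesVisited++;
            for (Map.Entry<String, String> cell : store.entrySet()) {
                result.putIfAbsent(cell.getKey(), cell.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> stores = new ArrayList<>();
        stores.add(Collections.singletonMap("info:a", "v2")); // memcache
        stores.add(Collections.singletonMap("info:a", "v1")); // older on-disk file
        stores.add(Collections.singletonMap("info:b", "x"));  // oldest on-disk file

        get(stores, Collections.singleton("info:a"));
        System.out.println("get visited " + storesVisited + " store(s)");
        getFull(stores);
        System.out.println("getFull visited " + storesVisited + " store(s)");
    }
}
```

In this model get touches only the first store, while getFull has to touch all three even
when the row holds a single column, because a column it has not seen yet could be sitting in
any older file.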

Is it important to you that this run faster, Clint?  If so, there may be some things we can
do, like keeping an integer count of unique column names.  getFull would know that once it
had hit the count of all column names, it could return.  (Keeping a list of all column names
would probably not be viable, since in some schemas it might grow without bound.)
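
A rough sketch of how such a count might let a full-row read return early, again in Java and
again not hbase code.  The class CountedFullScanSketch and the uniqueColumnCount parameter are
hypothetical, and the sketch assumes the count applies to the columns of the row being read and
is kept accurate as columns are added; neither point is settled above, and deletes are ignored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (hypothetical, not hbase code): full-row read with a column-count hint. */
class CountedFullScanSketch {
    /** Number of stores touched by the most recent call; kept only for illustration. */
    static int storesVisited;

    /**
     * Scans stores newest to oldest and returns once uniqueColumnCount distinct
     * columns have been collected. Assumes the count is accurate for this row.
     */
    static Map<String, String> getFull(List<Map<String, String>> stores, int uniqueColumnCount) {
        Map<String, String> result = new LinkedHashMap<>();
        storesVisited = 0;
        for (Map<String, String> store : stores) {
            if (result.size() >= uniqueColumnCount) {
                break;                                   // every known column already seen
            }
            storesVisited++;
            for (Map.Entry<String, String> cell : store.entrySet()) {
                result.putIfAbsent(cell.getKey(), cell.getValue()); // newest value wins
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> stores = new ArrayList<>();
        stores.add(Collections.singletonMap("info:a", "new")); // memcache
        stores.add(Collections.singletonMap("info:a", "old")); // on-disk file
        stores.add(Collections.<String, String>emptyMap());    // on-disk file without this row

        getFull(stores, 1);
        System.out.println("visited " + storesVisited + " store(s) with a count of 1");
    }
}
```

With a count of 1 the read stops after the first store.  If the count is larger than the
number of columns the row actually has, the read falls back to scanning every store, so the
hint can only help and never changes the result in this model.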

> getRow() is orders of magnitudes slower than get(), even on rows with one column
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-2161
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2161
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>    Affects Versions: 0.16.0
>         Environment: latest from trunk
>            Reporter: Clint Morgan
>         Attachments: PerformanceEvaluation-patch.txt
>
>
> HTable.getRow(Text) is several orders of magnitude slower than
> HTable.get(Text, Text), even on rows with a single column.
> This problem can be observed by the attached patch of
> PerformanceEvaluation.java which changes SequentialRead to use getRow,
> and prints out the time for each read. 
> The test can then be run with:
> bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialRead 1
> On my laptop, the original test (using get()) produces reads on the order of 5-20
> milliseconds. Using getRow(), the reads take 50-2000 ms. 
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

