hbase-user mailing list archives

From Alex Baranau <alex.barano...@gmail.com>
Subject Re: TableInputFormat vs. a map of table regions (data locality)
Date Thu, 18 Nov 2010 07:02:09 GMT
What are the benefits you are looking for with the first option?
With TableInputFormat it'll start as many map tasks as you have regions and
data processing will benefit from data locality. From javadoc (
http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
):

"Reading from hbase, the TableInputFormat asks hbase for the list of regions
and makes a map-per-region or mapred.map.tasks maps, whichever is
smaller[...]. Maps will run on the adjacent TaskTracker if you are running a
TaskTracker and RegionServer per node."
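
To make the quoted behaviour a bit more concrete, here is a rough, self-contained sketch in plain Java (no HBase dependency) of the map-per-region idea: each region becomes one split that carries the region's row range and the host serving it, and that host is what the scheduler can use as a locality hint. Region, Split and planSplits are hypothetical names made up for illustration; they are not the HBase API. The sketch also ignores the mapred.map.tasks cap mentioned in the javadoc.

```java
import java.util.ArrayList;
import java.util.List;

public class RegionSplitSketch {

    /** Hypothetical stand-in for a table region: a row range plus the host serving it. */
    static final class Region {
        final String startRow;
        final String endRow;
        final String host;

        Region(String startRow, String endRow, String host) {
            this.startRow = startRow;
            this.endRow = endRow;
            this.host = host;
        }
    }

    /** Hypothetical stand-in for an input split: the range one map task scans and where it prefers to run. */
    static final class Split {
        final String startRow;
        final String endRow;
        final String preferredHost;

        Split(String startRow, String endRow, String preferredHost) {
            this.startRow = startRow;
            this.endRow = endRow;
            this.preferredHost = preferredHost;
        }
    }

    /** One split per region; the region's host is passed along as the locality hint. */
    static List<Split> planSplits(List<Region> regions) {
        List<Split> splits = new ArrayList<>();
        for (Region r : regions) {
            splits.add(new Split(r.startRow, r.endRow, r.host));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up two-region table; an empty string stands for an open-ended boundary.
        List<Region> regions = new ArrayList<>();
        regions.add(new Region("", "m", "node1"));
        regions.add(new Region("m", "", "node2"));

        for (Split s : planSplits(regions)) {
            System.out.println("[" + s.startRow + ", " + s.endRow + ") prefers " + s.preferredHost);
        }
    }
}
```

With approach [1] you would have to carry that host information into the splits yourself; otherwise the scheduler has nothing to go on when placing the map tasks.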

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Thu, Nov 18, 2010 at 6:30 AM, Saptarshi Guha <saptarshi.guha@gmail.com> wrote:

> Hello,
>
> I'm fairly new to HBase and would appreciate your comments.
>
> [1] One way to compute across an HBase dataset would be to run as many
> maps as regions and, in each map, run a scan across the region's row limits
> (within the map method). This approach does not use TableInputFormat. In the
> reduce (if needed), write directly to the table (using put).
>
>
> [2] In the *second* approach I could use the TableInputFormat and
> TableOutputFormat.
>
> My hypotheses:
>
> H1: As for TableOutputFormat, I think both approaches are equivalent
> performance-wise. Correct me if I'm wrong.
>
> H2: As for TableInputFormat vs. approach [1]: a quick glance through the
> TableSplit source reveals location information. At first blush, I can imagine
> that in approach [1] I scan from row_start to row_end while all of that data
> resides on a computer different from the compute node on which the split is
> being run. Since TableInputFormat (approach [2]) uses region information, my
> guess (not sure at all) is that Hadoop MapReduce will assign the computation
> to the node where the region lies, so when the scan is issued the queries
> will be issued against local data, achieving data locality. So it makes sense
> to take advantage of (at the least) the TableSplit information.
>
> Are my hypotheses correct?
>
> Thanks
> Joy
>
