hadoop-common-user mailing list archives

From "Leon Mergen" <l...@solatis.com>
Subject Re: Hadoop and retrieving data from HDFS
Date Thu, 24 Apr 2008 18:41:15 GMT
Hello Peeyush,

On Thu, Apr 24, 2008 at 8:12 PM, Peeyush Bishnoi <peeyushb@yahoo-inc.com> wrote:

> Yes you can very well store your data in Tabular Format into Hbase by
> applying the Map-Reduce job on Access logs which has been stored on HDFS.
> So while you initially copy the data in HDFS, your data blocks will be
> created which will be stored on Datanode. After processing of data, it
> will be stored in Hbase HRegion. So your unprocessed data on HDFS and
> processed data in Hbase will be distributed across machines.

Ah yes, I also understood this from reading the BigTable paper and the HBase
architecture docs: HBase uses regions of about 256 MB in size, which are
stored on top of HDFS.

But now I am wondering: after the data has been stored inside HBase, is it
possible to process it without moving it to a different machine? Say I want
to run data mining over around 100 TB of data; if all of that data had to be
moved around the cluster before it could be processed, that would be quite
inefficient. Wouldn't it be better to process those log files on the servers
where they are physically stored, and, perhaps, run multiple MapReduce jobs
against the same data in parallel by making use of the replication?

Or is this a bad idea? I've always understood that moving the processing to
the servers where the data is stored is cheaper than moving the data to the
servers where it can be processed.
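To make the "move computation, not data" intuition concrete, here is a
back-of-the-envelope sketch. The 100 TB figure comes from the question above;
the link bandwidth and job size are purely illustrative assumptions, not
measurements from any real cluster:

```python
# Rough comparison: shipping 100 TB of data across the cluster versus
# shipping a small job binary to the nodes where the data already lives.
# Bandwidth and job-size figures are illustrative assumptions.

DATA_BYTES = 100 * 10**12          # 100 TB of logs (figure from the question)
JOB_BYTES = 50 * 10**6             # assumed ~50 MB of job jar + config
LINK_BYTES_PER_SEC = 125 * 10**6   # assumed 1 Gbit/s link ~= 125 MB/s

def transfer_seconds(num_bytes, bytes_per_sec=LINK_BYTES_PER_SEC):
    """Time to push num_bytes over a single link, ignoring all overhead."""
    return num_bytes / bytes_per_sec

move_data = transfer_seconds(DATA_BYTES)   # ~800,000 s, i.e. over 9 days
move_code = transfer_seconds(JOB_BYTES)    # well under a second

print(f"moving the data: ~{move_data / 86400:.1f} days")
print(f"moving the job:  ~{move_code:.1f} seconds")
```

Even granting that a real cluster moves data over many links in parallel, the
gap of several orders of magnitude is why MapReduce schedules map tasks on
(or near) the DataNodes that hold the relevant blocks.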


Leon Mergen
