hbase-user mailing list archives

From Cristofer Weber <cristofer.we...@neogrid.com>
Subject RE: Bulk Import & Data Locality
Date Wed, 18 Jul 2012 21:39:32 GMT
Hi Alex,

I ran one of our bulk import jobs with a partial payload, without running a major compaction
afterwards, and you are right: some HDFS blocks are on a different datanode.
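
To put a number on an observation like this, here is a minimal sketch, not something from the
actual job. It assumes you have already collected, for each HFile block of a region, the list
of datanodes holding a replica (for example from the output of
`hadoop fsck <path> -files -blocks -locations`). The host names and the helper
`locality_index` are hypothetical, chosen for illustration only.

```python
def locality_index(block_hosts, region_server):
    """Return the fraction of blocks that have a replica on region_server.

    block_hosts: list of lists, one list of replica host names per block.
    region_server: host name of the node serving the region.
    """
    if not block_hosts:
        return 0.0
    local = sum(1 for hosts in block_hosts if region_server in hosts)
    return local / len(block_hosts)


# Hypothetical replica locations for four HFile blocks of one region.
blocks = [
    ["dn1", "dn2", "dn3"],
    ["dn2", "dn4", "dn5"],
    ["dn1", "dn3", "dn5"],
    ["dn3", "dn4", "dn5"],
]

# Two of the four blocks have a replica on dn1.
print(locality_index(blocks, "dn1"))
```

A value of 1.0 would mean the region server can read every block locally; anything lower
means some reads go over the network until the files are rewritten.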

-----Original Message-----
From: Alex Baranau [mailto:alex.baranov.v@gmail.com]
Sent: Wednesday, July 18, 2012 12:46
To: hbase-user@hadoop.apache.org; mapreduce-user@hadoop.apache.org; hdfs-user@hadoop.apache.org
Subject: Bulk Import & Data Locality


As far as I understand, the Bulk Import functionality does not take Data Locality into
account. The MR job creates as many reducer tasks as there are regions to write into, but it
does not "advise" on which nodes to run these tasks. As a result, the Reducer task that writes
the HFiles of some region may not be physically located on the same node as the RS that serves
that region. Given the way HDFS writes data, there will (likely) be one full replica of the
blocks of this region's HFiles on the node where the Reducer task ran, and the other replicas
(if replication > 1) will be distributed randomly over the cluster. Thus the RS, while serving
data of that region, will (most likely) not read local data (data will be transferred from
other datanodes). I.e. data locality will be broken.

Is this correct?
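
The reasoning above can be illustrated with a small simulation. This is only a sketch under a
simplified placement model: the first replica of every block goes to the writer's node and the
remaining replicas go to distinct, randomly chosen other nodes. Rack awareness is ignored, and
the node names, cluster size and function names are all hypothetical.

```python
import random


def place_block(writer, nodes, replication, rng):
    """Simplified placement: first replica on the writer, the rest on random other nodes."""
    others = [n for n in nodes if n != writer]
    return [writer] + rng.sample(others, replication - 1)


def simulate_locality(nodes, region_server, reducer_node, n_blocks, replication, rng):
    """Fraction of simulated blocks that end up with a replica on region_server."""
    local = 0
    for _ in range(n_blocks):
        if region_server in place_block(reducer_node, nodes, replication, rng):
            local += 1
    return local / n_blocks


rng = random.Random(0)                      # fixed seed so runs are repeatable
nodes = [f"dn{i}" for i in range(1, 11)]    # hypothetical 10-node cluster

# Reducer happens to run on the same node as the region server.
same = simulate_locality(nodes, "dn1", "dn1", 1000, 3, rng)
# Reducer runs on some other node.
diff = simulate_locality(nodes, "dn1", "dn2", 1000, 3, rng)
print(same, diff)
```

Under this model every block is local when the reducer shares a node with the RS, and only
about (replication - 1) / (nodes - 1) of the blocks are local otherwise, which is 2/9 for
the numbers used here.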

If yes, I guess it would help if we could tell the MR framework where (on which nodes) to
launch certain Reducer tasks. I believe this is not possible with MR1; please correct me if
I'm wrong. Perhaps this is possible with MR2?

I assume there's also no way to give the NameNode a "hint" about where to place the blocks of
a new file, right?

Thank you,
Alex Baranau
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
