incubator-hcatalog-user mailing list archives

From Ajay Singh <singh.ajay1...@gmail.com>
Subject Hadoop streaming with HCatalog
Date Wed, 24 Jul 2013 12:34:38 GMT
Hi all,

I have been playing around with HCatalog for the last few days. I love it.
Great work!!!

Is the HCatalog team planning to enhance hadoop-streaming to use HCatalog
in the near future? Can we expect it this year?

I looked at the hadoop-streaming code and was wondering how one would
enhance hadoop-streaming to consume from / write to HCatalog-managed
tables.

Hadoop-streaming launches a map-reduce job. The MapTask of this job manages
the communication with the external non-Java mapper (through stdin and
stdout). The ReduceTask does the same with the non-Java reducer. I doubt
anything needs to be changed here. What needs to change is the input to the
MapTask and the output from the ReduceTask.
So if we modify the MapTask/ReduceTask to read from / write to an HCatalog
table, that should do it, right? Since HCatalog already supports M/R, this
should just be a matter of rewriting the streaming job using HCatInputFormat,
HCatOutputFormat, HCatRecord, etc. I noticed that hadoop-streaming uses the old
map-reduce API (org.apache.hadoop.mapred). Do we need to move to the new
map-reduce API (org.apache.hadoop.mapreduce) to use HCatalog?
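
To make the stdin/stdout contract concrete, here is a minimal sketch of what the external non-Java mapper side might look like in Python. The word-count logic and the names map_lines and run are purely illustrative assumptions and are not part of hadoop-streaming or HCatalog. The sketch assumes the default streaming convention of one record per line, with key and value separated by a tab.

```python
import sys
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Emit one "word<TAB>1" record per whitespace-separated word.

    Hypothetical mapper logic, chosen only to illustrate the record format.
    """
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def run(stdin=sys.stdin, stdout=sys.stdout) -> None:
    # The streaming MapTask feeds input records to this process on stdin,
    # one per line, and reads key<TAB>value records back from stdout.
    for record in map_lines(stdin):
        stdout.write(record + "\n")


if __name__ == "__main__":
    run()
```

The script itself never sees where its lines came from, which is why the change would seem to belong on the Java side: if the input were an HCatalog table, presumably something there would still have to turn each record into a line of text before handing it to the script.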

Thanks
Ajay
