hadoop-user mailing list archives

From Ranjithkumar Gampa <granji...@gmail.com>
Subject context.write() Vs FSDataOutputStream.writeBytes()
Date Fri, 28 Sep 2012 17:57:28 GMT
Hi,

We are calling FSDataOutputStream.writeBytes() from our map/reduce tasks to
write directly to the Hive table path instead of using context.write(). This
is working fine, and so far we have had no problems with the approach.
We keep the file names distinct by appending the taskAttemptId to them, and
we set speculative execution to 'false' so that two map/reduce attempts
won't work on the same data and write inconsistent data to HDFS. We chose
this approach for the reasons below; please let us know if it has any
disadvantages.
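
For illustration, the naming scheme might look like the sketch below. It is
only a sketch: the helper outputFileName() and the sample attempt IDs are
hypothetical, not taken from our job. In a real task the ID would come from
context.getTaskAttemptID().

```java
public class AttemptFileName {
    // Hypothetical helper: builds a per-attempt file name by appending
    // the task attempt ID to a base name.
    static String outputFileName(String baseName, String taskAttemptId) {
        return baseName + "-" + taskAttemptId;
    }

    public static void main(String[] args) {
        // Made-up IDs for two attempts of the same map task.
        String first = outputFileName("part", "attempt_201209281757_0001_m_000003_0");
        String retry = outputFileName("part", "attempt_201209281757_0001_m_000003_1");
        System.out.println(first);
        System.out.println(retry);
    }
}
```

Because the attempt number is part of the ID, a retried attempt of the same
task writes to a different file name than the attempt before it.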

1) To avoid having to clean up the _SUCCESS and _logs files that the job
creates in the reducer/mapper output directory, which Hive may not like.
2) To write some records directly from the mappers when they don't need to
participate in the reducer logic, which saves some sort and shuffle work.
We are exploring the Multi Output format, but I think the point above would
still need to be taken care of.
3) Our data contains some special characters, and we do String manipulation
on it using the 'ISO-8859-1' encoding. Writing it through the Text class in
context.write() does not preserve these characters, because Text uses UTF-8
encoding by default.
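
To make point 3 concrete, here is a minimal sketch using only the JDK (no
Hadoop classes). FSDataOutputStream extends java.io.DataOutputStream, so
writeBytes() writes one byte per char and discards the high eight bits,
which for chars up to U+00FF matches ISO-8859-1. The class and method names
below are made up for the example, and the UTF-8 path only stands in for
what Text does; it is not the Text class itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {
    // One byte per char, high eight bits dropped (DataOutputStream.writeBytes).
    static byte[] viaWriteBytes(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeBytes(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // UTF-8 encoding, standing in for how Text stores a String.
    static byte[] viaUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // What a reader sees if it decodes the given bytes as ISO-8859-1.
    static String readAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // "cafe" with an acute accent on the e
        System.out.println(viaWriteBytes(s).length); // 4 bytes, last one 0xE9
        System.out.println(viaUtf8(s).length);       // 5 bytes, e-acute is 0xC3 0xA9
        // Round trip is clean only when writer and reader agree on the charset.
        System.out.println(readAsLatin1(viaWriteBytes(s)).equals(s)); // true
        System.out.println(readAsLatin1(viaUtf8(s)).equals(s));       // false
    }
}
```

If this is what is happening in our job, the characters are not lost by
UTF-8 itself; they only look wrong when the bytes are later decoded with a
different charset than the one used to write them.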

Please let me know if my understanding is incorrect or if there are other
ways of handling the three points above. I am happy to hear and learn. Our
project uses a mix of Hadoop MR and Hive.

Thanks in advance.

Regards,
Ranjith
