hadoop-mapreduce-user mailing list archives

From Piyush Kansal <piyush.kan...@gmail.com>
Subject Query regarding Hadoop version 0.20.203
Date Wed, 14 Mar 2012 21:44:21 GMT
Hi,

Since MultipleOutputs is not supported in version 0.20.203, when I use a
Partitioner class the key-value pairs belonging to partition 1 may end up in
file part-r-00000 or part-r-00002. To handle this, I am currently
*prefixing all the records* in a file with a "*partition number*". So, let's
say 4 files get created on HDFS as follows:

part-r-00000: let's say it contains all records for partition 2
part-r-00001: let's say it contains all records for partition 1
part-r-00002: let's say it contains all records for partition 3
part-r-00003: let's say it contains all records for partition 0

Now, I am creating a new command to append all these files into a single
file on the local file system in "*increasing order of partition
number*". While doing this, I have to remove the partition number from all
the records. I can do this by reading all the files line by line, using
substring to extract the required data, and writing it to the output file.
But this approach will take too much time, as this functionality is
intended to run on very large files (GBs in size).
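
For concreteness, the line-by-line approach described above might look roughly
like the sketch below. It is only an illustration under several assumptions
that are not stated in this message: the part files have already been copied
from HDFS to a local directory, each record is laid out as
"<partition number><TAB><record>" (the tab separator is hypothetical), and
every record in a file carries the same partition number. The class and method
names are made up for the example. Only the first line of each file is read to
decide the order, and the data is then streamed once through buffered readers
and writers.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: merges local copies of part-r-* files in partition order.
class PartitionMerger {

    // Assumed separator between the partition number and the record itself.
    static final char SEP = '\t';

    // Reads only the first line of a part file to learn its partition number.
    // Returns -1 for an empty file.
    static int partitionOf(Path file) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String first = in.readLine();
            if (first == null) {
                return -1;
            }
            int sep = first.indexOf(SEP);
            if (sep < 0) {
                throw new IOException("No partition prefix in " + file);
            }
            return Integer.parseInt(first.substring(0, sep));
        }
    }

    // Writes all records to 'output' in increasing order of partition number,
    // dropping the "<partition number><SEP>" prefix from every line.
    static void merge(Path inputDir, Path output) throws IOException {
        // TreeMap keeps the files sorted by partition number.
        Map<Integer, Path> byPartition = new TreeMap<>();
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(inputDir, "part-r-*")) {
            for (Path part : parts) {
                int partition = partitionOf(part);
                if (partition >= 0) {
                    byPartition.put(partition, part);
                }
            }
        }
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path part : byPartition.values()) {
                try (BufferedReader in = Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        // A line without the separator is written unchanged.
                        int start = line.indexOf(SEP) + 1;
                        out.write(line, start, line.length() - start);
                        out.write('\n');
                    }
                }
            }
        }
    }
}
```

A call such as PartitionMerger.merge(Paths.get("parts"), Paths.get("merged.txt"))
would then produce the single output file. Each byte is read and written once,
so the run time is probably dominated by disk throughput rather than by the
substring step itself.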

So, can you please suggest an alternative way to implement this
functionality so that it completes in the minimum possible time?

-- 
Regards,
Piyush Kansal
