hadoop-hdfs-user mailing list archives

From Ari Cooperman <aricooper...@gmail.com>
Subject Multiple file join in map/reduce
Date Tue, 22 Sep 2009 01:12:34 GMT
Sorry if this question is common; I looked through the docs, code, and mail
archives but did not find anything that answered these questions.

Say I have three files A, B, and C. Each file holds a set of records I want to
parse, and the records are already aligned by position across files: the
second record in A corresponds to the second record in B, which corresponds to
the second record in C. However, the record lengths differ between files, so
the file sizes and block counts differ as well. Depending on the job run, I
sometimes want to read one, two, or all three files. What I would like is for
corresponding records from each file to end up on the same host, so that
access is always local. Ideally, then, the block size would be different for
each file, so that the first block of A holds the same number of records as
the first block of B, and so on. So my questions are:

1) I notice that when creating a file I can specify a block size, which, if
the records are fixed size, would let me manually produce equal record counts
per block. But is this just a hint to the system? Will it be honored, or could
a different block size be used under certain conditions?

2) Even if I can split the files so that corresponding blocks hold the same
records, is there a way to ensure that corresponding blocks across files are
placed on the same node? If so, is there a way to keep them from being
separated if the system rebalances data blocks?
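To make the setup behind question 1 concrete, here is the arithmetic I have in mind, assuming fixed-size records. The record lengths and the records-per-block count below are made-up illustrative numbers, not from my actual data:

```java
// Sketch: derive a per-file block size so that block k of every file
// covers the same record range, assuming fixed-size records.
// The record lengths (64, 128, 256 bytes) and the records-per-block
// count are made-up values for illustration only.
public class BlockSizeSketch {
    static long blockSizeFor(long recordLength, long recordsPerBlock) {
        return recordLength * recordsPerBlock;
    }

    public static void main(String[] args) {
        long recordsPerBlock = 1_000_000L;        // same record count per block in every file
        long[] recordLengths = {64L, 128L, 256L}; // files A, B, C

        for (int i = 0; i < recordLengths.length; i++) {
            long blockSize = blockSizeFor(recordLengths[i], recordsPerBlock);
            // This size would then be passed as the blockSize argument of
            // FileSystem.create(path, overwrite, bufferSize, replication, blockSize).
            System.out.println("file " + (char) ('A' + i) + ": blockSize=" + blockSize);
        }
    }
}
```

With these numbers, block k of A, B, and C would each hold records k*1,000,000 through (k+1)*1,000,000 - 1, which is the alignment I am after.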

Thanks for any help.
