hadoop-common-user mailing list archives

From Arun C Murthy <ar...@yahoo-inc.com>
Subject Enhancement to TextInputFormat?
Date Thu, 06 Jul 2006 09:18:58 GMT
Hi,

   Here's a scenario I have faced a couple of times recently:

   <scenario>

     I have a list of URIs (either http:// URLs or a list of dfs files) that 
serve as input to a Map-Reduce task. Each map gets one URI, fetches the 
data it points to (through the dfs APIs or over http, as the case may be) 
and then manipulates that data.

   </scenario>


   In essence it's a simple TextInputFormat where each 'line' represents 
not the actual 'data' to manipulate in the map, but an 'indirection' to 
that data.

   Do you guys think it makes sense to provide this as part of the MR 
framework itself? That is, extend TextInputFormat into (say) 
URIInputFormat, so that the MR framework 'fetches' the data pointed to by 
each URI and then provides a 'stream' (as the 'key') to the map function. 
The 'fetcher'/'reader' would be configurable, with reasonable defaults 
provided in the framework, e.g. for dfs://, http:// etc.
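
   To make the 'configurable fetcher' idea a bit more concrete, here is a 
rough sketch in plain Java with no Hadoop classes involved. Everything in 
it is an assumption for illustration only: the names UriFetcherSketch, 
Fetcher, register and open are made up, and a real URIInputFormat would 
also need a dfs:// default and would have to fit the framework's 
InputFormat/RecordReader interfaces.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: maps a URI scheme to the code that opens it. */
public class UriFetcherSketch {

    /** Hypothetical plug-in point: opens the data a URI points to. */
    public interface Fetcher {
        InputStream open(URI uri) throws IOException;
    }

    // scheme (lower case) -> fetcher
    private final Map<String, Fetcher> fetchers = new HashMap<>();

    public UriFetcherSketch() {
        // Assumed defaults: rely on the JDK's built-in URL handlers.
        // A dfs:// default would need the DFS client and is left out here.
        fetchers.put("http", uri -> uri.toURL().openStream());
        fetchers.put("file", uri -> uri.toURL().openStream());
    }

    /** Registers (or overrides) the fetcher used for a scheme. */
    public void register(String scheme, Fetcher fetcher) {
        fetchers.put(scheme.toLowerCase(Locale.ROOT), fetcher);
    }

    /** Treats one input 'line' as a URI and returns a stream over its data. */
    public InputStream open(String line) throws IOException {
        URI uri = URI.create(line.trim());
        String scheme = uri.getScheme();
        Fetcher fetcher =
            (scheme == null) ? null : fetchers.get(scheme.toLowerCase(Locale.ROOT));
        if (fetcher == null) {
            throw new IOException("No fetcher registered for: " + line);
        }
        return fetcher.open(uri);
    }
}
```

   With something like this, the record reader would call open() on each 
line of the URI list and hand the resulting stream to the map, and a user 
with an unusual scheme would only have to register() one more fetcher.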

   Admittedly this isn't very hard to do as-is today; however, it would 
definitely ease the user's job. All he needs to do is provide a simple 
text file with a list of URIs, and he then gets a readable stream in his 
map. This reduces the amount of 'code' he has to write and improves his 
experience.

   Thoughts?

   If there is sufficient interest/utility I will go ahead and spec this 
in more detail and create a jira issue.

thanks,
Arun





