hadoop-common-user mailing list archives

From Devaraj Das <d...@yahoo-inc.com>
Subject Re: Can mapper get access to filename being processed?
Date Mon, 08 Dec 2008 05:05:09 GMT



On 12/8/08 8:23 AM, "Andy Sautins" <andy.sautins@returnpath.net> wrote:

> 
>   Thanks.  map.input.file is exactly what I need.
> 
>   One more question.  Is there a way to ignore a file in an input path?
> Say, for example, the data in Hadoop is stored in a directory
> structure /<date>/<machine>.txt.  If on Dec 1, 2008 I have a file
> from machines a and b, I would have the following directory
> structure:
> 
>    /20081201/a.txt
>    /20081201/b.txt
> 
>    What I'd like to do is have a job that, depending on the
> configuration, would either process all files or only the files
> for a given machine (say a, but not b).
> 
Have a look at the APIs in FileInputFormat. There are a couple of APIs for
specifying input paths and path filters.
You might need to subclass the specific input format (like TextInputFormat)
and override the configure(JobConf) method to store the configured machine
names whose files you want to process (for example). That would be a field
in your derived InputFormat. You would also need to define a PathFilter
that looks at this list of machine names and returns true or false from
its accept() method.
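
As a rough illustration of that idea (a sketch only, not tested code; it
also makes the filter itself configurable instead of subclassing
TextInputFormat), a PathFilter could pick the machine list up from the job
configuration. The property name "machine.names" and the class names below
are invented for the example:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class MachinePathFilter implements PathFilter, Configurable {
  private Configuration conf;
  private Set<String> machines = new HashSet<String>();

  public void setConf(Configuration conf) {
    this.conf = conf;
    // Hypothetical job property holding a comma-separated machine list, e.g. "a,b"
    String names = (conf == null) ? null : conf.get("machine.names");
    if (names != null && names.length() > 0) {
      machines.addAll(Arrays.asList(names.split(",")));
    }
  }

  public Configuration getConf() {
    return conf;
  }

  public boolean accept(Path path) {
    String name = path.getName();
    // Accept anything that isn't a <machine>.txt file (e.g. the date
    // directories), since the filter is also applied while listing them.
    if (machines.isEmpty() || !name.endsWith(".txt")) {
      return true;
    }
    String machine = name.substring(0, name.length() - ".txt".length());
    return machines.contains(machine);
  }
}

And in the job setup (again just a sketch; MyJob stands in for your driver
class):

JobConf job = new JobConf(MyJob.class);
job.set("machine.names", "a");   // process files from machine a only
FileInputFormat.setInputPathFilter(job, MachinePathFilter.class);
FileInputFormat.setInputPaths(job, new Path("/20081201"));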

>    Is that possible, or am I trying to use Hadoop in a way it isn't
> intended to be used?  I looked briefly at MultipleInputs, which seems
> to handle different input paths, but not a single input path in
> different ways depending on the filename.
> 
>    Thanks again.
> 
>    Andy
> 
> -----Original Message-----
> From: Devaraj Das [mailto:ddas@yahoo-inc.com]
> Sent: Sunday, December 07, 2008 12:11 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Can mapper get access to filename being processed?
> 
> 
> 
> 
> On 12/7/08 11:32 PM, "Andy Sautins" <andy.sautins@returnpath.net> wrote:
> 
>>  
>> 
>>    I'm having trouble finding a way to do what I want, so I'm wondering
>> if I'm just not looking in the right place or if I'm thinking about the
>> problem in the wrong way.  Any insight would be appreciated.
>> 
>>  
>> 
>>    Let's say I have a directory of files that contains a combination of
>> different file types.  The MapReduce job needs to process all files in
>> the directory but generates different key/value pairs depending on the
>> file being processed.  What I'd like to do is use the filename to
>> identify the file type being processed and use that information in the
>> map job.  It seems like what I'd want is for the map job to have access
>> to the filename of the input file split being processed.  I haven't been
>> able to find out whether that is available to a derived class of
>> MapReduceBase.
>> 
>> 
> That's map.input.file, available in the map via the JobConf. The mapper
> class has to override the implementation of configure in MapReduceBase
> and get the filename via JobConf.get("map.input.file"). Store that in a
> field of your mapper class. You can then inspect it in your map method.
> 
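
To make that concrete, here is a rough sketch with the old mapred API (the
class name, the output keys, and the "a.txt" check are invented for the
example):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FilenameAwareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String inputFile;

  public void configure(JobConf job) {
    // Full path of the file backing the current split,
    // e.g. hdfs://namenode/20081201/a.txt
    inputFile = job.get("map.input.file");
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Branch on the filename to decide how to treat the record.
    if (inputFile != null && inputFile.endsWith("a.txt")) {
      output.collect(new Text("machine-a"), value);
    } else {
      output.collect(new Text("other"), value);
    }
  }
}
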
>> 
>>    Does what I'm trying to do make sense or is there a better way of
>> processing a job like the one I'm describing?
>> 
>> 
> Look at the MultipleInputs class (in the mapred.lib package). That could
> prove useful.
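
For reference, a sketch of how MultipleInputs is usually wired up in the
old mapred API (this is only a fragment from a job driver; the paths and
the TypeAMapper/TypeBMapper classes are placeholders, not real classes):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

JobConf conf = new JobConf(MyDriver.class);
// Each input path is bound to its own InputFormat and Mapper class.
MultipleInputs.addInputPath(conf, new Path("/logs/typeA"),
    TextInputFormat.class, TypeAMapper.class);
MultipleInputs.addInputPath(conf, new Path("/logs/typeB"),
    TextInputFormat.class, TypeBMapper.class);

As Andy points out above, though, this keys off distinct input paths rather
than filenames within a single path.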
>> 
>>    Thank you
>> 
>>    Andy


