crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Retrieving Input File Name with MRPipeline
Date Mon, 22 Jun 2015 19:59:33 GMT
The InputSplit on the MapContext implements the InputSupplier interface,
which allows you to get the underlying FileSplit that the map task is
processing. So you have to do a bunch of casting, but you can get at it.
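
A minimal sketch of the unwrap-and-cast pattern described above, not a tested recipe. The types below (InputSplit, FileSplit, WrappedSplit) are hypothetical stand-ins written so the example compiles on its own. In a real job they would be Hadoop's InputSplit and FileSplit plus whatever wrapper split the map task actually receives, and the split would be read from the map context rather than built by hand. The helper name inputFileName is also an assumption.

```java
import java.util.function.Supplier;

public class InputFileNameSketch {

    // Stand-in for Hadoop's InputSplit (assumption, not the real class).
    static abstract class InputSplit {}

    // Stand-in for Hadoop's FileSplit: a split backed by one file path.
    static class FileSplit extends InputSplit {
        private final String path;
        FileSplit(String path) { this.path = path; }
        String getPath() { return path; }
    }

    // Stand-in for the wrapper split handed to the map task. It exposes
    // the underlying split through a supplier-style get() method.
    static class WrappedSplit extends InputSplit implements Supplier<InputSplit> {
        private final InputSplit delegate;
        WrappedSplit(InputSplit delegate) { this.delegate = delegate; }
        @Override public InputSplit get() { return delegate; }
    }

    // Peels off supplier wrappers, then returns the file path if the
    // innermost split is file-backed; returns null otherwise.
    static String inputFileName(InputSplit split) {
        InputSplit current = split;
        while (current instanceof Supplier) {
            Object inner = ((Supplier<?>) current).get();
            if (inner == current || !(inner instanceof InputSplit)) {
                return null; // nothing further to unwrap
            }
            current = (InputSplit) inner;
        }
        return (current instanceof FileSplit)
                ? ((FileSplit) current).getPath()
                : null;
    }

    public static void main(String[] args) {
        InputSplit split = new WrappedSplit(new FileSplit("/data/in/part-00000"));
        System.out.println(inputFileName(split)); // prints /data/in/part-00000
    }
}
```

Inside a MapFn the same helper would be fed the split obtained from the context after casting the context to its map-side type. Doing the lookup once in initialize() rather than on every map() call would avoid repeating the casts per record.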

On Monday, June 22, 2015, David Ortiz <dpo5003@gmail.com> wrote:

> Gave it a shot in the following MapFn, but it seems to always return null.
>
> new MapFn<String, Pair<String, String>>() {
>
>    private static final long serialVersionUID = 1L;
>    int min = minColumns;
>    int max = maxColumns;
>
>    @Override
>    public Pair<String, String> map(String input) {
>       //int columns = StringUtils.countMatches(input, "\t") + 1;
>       int columns = input.split("\t").length;
>       if (columns >= min && columns <= max) {
>          StringBuilder output = new StringBuilder(input);
>          output.append('\t');
>          String loc = this.getContext().getConfiguration().get(TaskInputOutputContext.MAP_INPUT_FILE);
>          output.append(loc);
>          return new Pair<>(output.toString(), null);
>       } else {
>          return new Pair<>(null, input);
>       }
>    }
>
> }
>
>
> Also tried setting crunch.disable.combine.file to true, figuring that combine
> files might mess with it.  No dice.  Does anything look suspect in that snippet?
>
>
> Thanks,
>
>     Dave
>
>
> On Mon, Jun 22, 2015 at 2:41 PM Micah Whitacre <mkwhitacre@gmail.com> wrote:
>
>> The DoFn should give you access to the TaskInputOutputContext[1] which
>> should contain that information.  I believe the context then should hold
>> the file as a config like "MAP_INPUT_FILE".  I haven't really tested
>> this out so definitely verify.
>>
>>
>> [1] -
>> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/TaskInputOutputContext.html
>>
>> On Mon, Jun 22, 2015 at 1:28 PM, David Ortiz <dpo5003@gmail.com> wrote:
>>
>>> Hello,
>>>
>>>       Is there a way in my crunch pipeline that I can retrieve the file
>>> name of the input file for my MapFn?  This function is definitely applied
>>> as a Mapper, so I think it should be possible, just having some difficulty
>>> working through the exact method of doing so.
>>>
>>> Thanks,
>>>       Dave
>>>
>>
>>

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
