crunch-user mailing list archives

From David Ortiz <dpo5...@gmail.com>
Subject Re: Retrieving Input File Name with MRPipeline
Date Tue, 23 Jun 2015 00:44:56 GMT
That did it.  Thanks, Josh.

On Mon, Jun 22, 2015 at 3:59 PM Josh Wills <jwills@cloudera.com> wrote:

> The InputSplit on the MapContext implements the InputSupplier interface,
> which allows you to get the underlying FileSplit that the map task is
> processing. So you have to do a bunch of casting, but you can get at it.
>
> On Monday, June 22, 2015, David Ortiz <dpo5003@gmail.com> wrote:
>
>> Gave it a shot in the following MapFn, but it seems to always return null.
>>
>> new MapFn<String, Pair<String, String>>() {
>>
>>    private static final long serialVersionUID = 1L;
>>    int min = minColumns;
>>    int max = maxColumns;
>>
>>    @Override
>>    public Pair<String, String> map(String input) {
>>       //int columns = StringUtils.countMatches(input, "\t") + 1;
>>       int columns = input.split("\t").length;
>>       if (columns >= min && columns <= max) {
>>          StringBuilder output = new StringBuilder(input);
>>          output.append('\t');
>>          String loc = this.getContext().getConfiguration().get(TaskInputOutputContext.MAP_INPUT_FILE);
>>          output.append(loc);
>>          return new Pair<>(output.toString(), null);
>>       } else {
>>          return new Pair<>(null, input);
>>       }
>>    }
>>
>> }
>>
>>
>> Also tried setting crunch.disable.combine.file to true figuring that
>> combine files might mess with it.  No dice.  Does anything look suspect
>> in that snippet?
>>
>>
>> Thanks,
>>
>>     Dave
>>
>>
>> On Mon, Jun 22, 2015 at 2:41 PM Micah Whitacre <mkwhitacre@gmail.com>
>> wrote:
>>
>>> The DoFn should give you access to the TaskInputOutputContext[1] which
>>> should contain that information.  I believe the context then should hold
>>> the file as a config like "MAP_INPUT_FILE".  I haven't really tested
>>> this out so definitely verify.
>>>
>>>
>>> [1] -
>>> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/TaskInputOutputContext.html
>>>
>>> On Mon, Jun 22, 2015 at 1:28 PM, David Ortiz <dpo5003@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>>       Is there a way in my crunch pipeline that I can retrieve the file
>>>> name of the input file for my MapFn?  This function is definitely applied
>>>> as a Mapper, so I think it should be possible, just having some difficulty
>>>> working through the exact method of doing so.
>>>>
>>>> Thanks,
>>>>       Dave
>>>>
>>>
>>>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
>
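
For anyone landing on this thread later, here is a minimal sketch of the
unwrap-and-cast pattern Josh describes. It is not the Crunch or Hadoop API:
Split, FileBackedSplit, WrapperSplit and inputPathOf are hypothetical
stand-ins so the example runs on a plain JDK, and java.util.function.Supplier
stands in for the supplier interface mentioned above. In a real MapFn the
starting point would presumably be the input split taken from the MapContext,
and the final cast would target Hadoop's FileSplit, whose getPath() returns
the file being read.

```java
import java.util.function.Supplier;

public class InputFileNameSketch {

    // Hypothetical stand-in for a generic input split.
    interface Split {}

    // Hypothetical stand-in for a file-backed split that knows its path.
    static final class FileBackedSplit implements Split {
        private final String path;
        FileBackedSplit(String path) { this.path = path; }
        String getPath() { return path; }
    }

    // Hypothetical stand-in for a wrapper split that hides the real split
    // behind a supplier-style accessor.
    static final class WrapperSplit implements Split, Supplier<Split> {
        private final Split delegate;
        WrapperSplit(Split delegate) { this.delegate = delegate; }
        @Override public Split get() { return delegate; }
    }

    // Unwraps any supplier-style wrappers, then casts to the file-backed
    // split. Returns null when no file path can be recovered.
    static String inputPathOf(Split split) {
        Split current = split;
        while (current instanceof Supplier) {
            Object inner = ((Supplier<?>) current).get();
            if (!(inner instanceof Split) || inner == current) {
                return null; // nothing usable behind the wrapper
            }
            current = (Split) inner;
        }
        if (current instanceof FileBackedSplit) {
            return ((FileBackedSplit) current).getPath();
        }
        return null; // not a file-based input
    }

    public static void main(String[] args) {
        Split wrapped = new WrapperSplit(new FileBackedSplit("/data/in/part-00000.tsv"));
        System.out.println(inputPathOf(wrapped));
        System.out.println(inputPathOf(new Split() {}));
    }
}
```

The instanceof checks keep the lookup from throwing when the input is not
file-based; the function returns null in that case, so the caller has to
decide what to append instead.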
