Thanks much, Daniel, for the response. The solution looks good, though I wonder
whether it might miss terms in A that contain spaces, because STRSPLIT would
break each term in B into single words before doing the equi-join.
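
To make the concern concrete, here is a minimal sketch in plain Python (not
Pig) of one possible workaround: instead of splitting each row of B into single
words, emit every run of consecutive words up to the longest term in A, and
match on those. The function names and the sample row 'big XYZ GH box' are
made up for illustration. Note that this matches on whole words only, so unlike
matches '.*abc.*' it would not pair 'abc' with 'abcd'.

```python
def word_ngrams(text, max_words):
    """Return every run of 1..max_words consecutive words in text, lowercased."""
    words = text.lower().split()
    grams = set()
    for start in range(len(words)):
        last = min(start + max_words, len(words))
        for end in range(start + 1, last + 1):
            grams.add(" ".join(words[start:end]))
    return grams


def find_variants(terms_a, rows_b):
    """Pair each row of B with every term of A it contains as whole words."""
    # Normalise A: lowercase and collapse runs of whitespace.
    terms = {" ".join(t.lower().split()) for t in terms_a}
    terms.discard("")
    # Longest term in A (in words) bounds the n-gram length needed for B.
    max_words = max((len(t.split()) for t in terms), default=0)
    matches = []
    for row in rows_b:
        for gram in word_ngrams(row, max_words):
            if gram in terms:  # set lookup plays the role of the equi-join
                matches.append((row, gram))
    return sorted(matches)


# Samples from the original question, plus one hypothetical multi-word row.
sample_a = ["abc", "xyz gh", "zzz yy", "red"]
sample_b = ["red car", "red ferrari", "abc", "abcd", "xyz ghis", "big XYZ GH box"]
variants = find_variants(sample_a, sample_b)
for row, term in variants:
    print(row, "<-", term)
```

In a Pig script the n-gram step would presumably need a UDF, and it multiplies
the number of join keys generated from B by roughly the maximum term length in
words, so this is only a sketch of the idea, not a tuned solution.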
Regards
Arun A K
Graduate Student
Department of Computer Science
Indiana University, Bloomington
On Thu, Dec 2, 2010 at 11:17 AM, Daniel Dai <jianyong@yahooinc.com> wrote:
> Can you convert it into an equi-join problem? That's the case MapReduce can
> handle efficiently. I'm not sure whether it addresses your problem, but here
> is a sample script.
>
> a = load 'A' as (a0:chararray);
> b = foreach a generate LOWER(a0) as b0;
>
> c = load 'B' as (c0:chararray);
> d = foreach c generate LOWER(c0) as d0;
> e = foreach d generate d0, flatten(STRSPLIT(d0)) as e0;
>
> f = join b by b0, e by e0;
> g = foreach f generate d0, b0;
> dump g;
>
> Daniel
>
>
> Arun A K wrote:
>
>> Hello
>>
>> I have this problem to solve using Pig.
>>
>> *Input*
>> 1. Relation A which has only one field of type chararray. Sample of A
>> follows:
>> *abc*
>> *xyz gh*
>> *zzz yy*
>> *red*
>>
>> Approximate number of rows in A = 10,000
>>
>> 2. Relation B which has only one field of type chararray. Sample of B
>> follows:
>> *red car*
>> *red ferrari*
>> *abc*
>> *abcd*
>> *xyz ghis*
>>
>> Approximate number of rows in B = 1 billion
>>
>> *Problem to be solved* I need to find all case-insensitive variants of each
>> term in relation A existing in relation B. For example: Term 'red' from A
>> would have variants 'red car' and 'red ferrari' in B.
>>
>> I was able to get variants of one term in A from B using the matches
>> operator, i.e. matches '.*red.*'. How should I go about creating a complete
>> solution for this problem? Should I use a UDF or go for native MapReduce?
>> I am a bit confused about how to proceed. I would really appreciate any help
>> on this.
>>
>> Thanks much.
>>
>> Regards
>> Arun A K
