pig-user mailing list archives

From Daniel Dai <jiany...@yahoo-inc.com>
Subject Re: Find variants of a term in relation A from a field in relation B
Date Thu, 02 Dec 2010 19:17:51 GMT
Can you convert it into an equi-join problem? That is the case MapReduce
can handle efficiently. I am not sure whether it addresses your problem,
but here is a sample script.

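-- Terms from A, lower-cased for case-insensitive matching.
a = load 'A' as (a0:chararray);
b = foreach a generate LOWER(a0) as b0;

-- Rows from B, lower-cased, then one output row per word.
-- TOKENIZE returns a bag, so flatten yields rows; STRSPLIT returns a
-- tuple, which flatten would expand into columns instead.
c = load 'B' as (c0:chararray);
d = foreach c generate LOWER(c0) as d0;
e = foreach d generate d0, flatten(TOKENIZE(d0)) as e0;

-- Equi-join each word of B against the terms of A.
f = join b by b0, e by e0;
g = foreach f generate d0, b0;
dump g;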
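
Since A has only about 10,000 rows and B about 1 billion, one variation
that may be worth trying is a fragment-replicate join, which keeps the
small relation in memory and avoids shuffling B through a reduce phase.
The following is only a sketch under assumptions: it reuses the relation
names and paths from the script above, it assumes the lower-cased terms
of A fit in memory on each map task, and depending on the Pig release the
'replicated' keyword may need double quotes instead of single quotes.

```pig
-- Sketch only: names and paths follow the script above.
a = load 'A' as (a0:chararray);
b = foreach a generate LOWER(a0) as b0;

c = load 'B' as (c0:chararray);
d = foreach c generate LOWER(c0) as d0;
-- TOKENIZE returns a bag of words; flatten turns it into one row per word.
e = foreach d generate d0, flatten(TOKENIZE(d0)) as e0;

-- The large relation is listed first; the small relation is listed last
-- and is the one held in memory (assumption: it fits).
f = join e by e0, b by b0 using 'replicated';
g = foreach f generate d0, b0;
dump g;
```

Like any word-level equi-join, this pairs 'red' with 'red car' and
'red ferrari', but it does not pair 'abc' with 'abcd' or find the
two-word term 'xyz gh' inside 'xyz ghis'. Substring and multi-word
matches would need a different approach.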


Arun A K wrote:
> Hello
> I have this problem to solve using Pig.
> *Input*
> 1. Relation A which has only one field of type chararray. Sample of A
> follows:
> *abc*
> *xyz gh*
> *zzz yy*
> *red*
> Approximate number of rows in A = 10,000
> 2. Relation B which has only one field of type chararray. Sample of B
> follows:
> *red car*
> *red ferrari*
> *abc*
> *abcd*
> *xyz ghis*
> Approximate number of rows in B = 1 billion
> *Problem to be solved* I need to find all case-insensitive variants of each
> term in relation A existing in relation B. For example: Term 'red' from A
> would have variants 'red car' and 'red ferrari' in B.
> I was able to get the variants of one term in A from B using the matches
> operator, i.e. matches '.*red.*'. How should I go about creating a complete
> solution for this problem? Should I use a UDF or go for native MapReduce? I
> am a bit confused about how to proceed. I would really appreciate any help.
> Thanks much.
> Regards
> Arun A K
