pig-user mailing list archives

From Arun A K <arnkri...@gmail.com>
Subject Find variants of a term in relation A from a field in relation B
Date Thu, 02 Dec 2010 18:53:23 GMT
Hello

I have this problem to solve using Pig.

*Input*
1. Relation A which has only one field of type chararray. Sample of A
follows:
*abc*
*xyz gh*
*zzz yy*
*red*

Approximate number of rows in A = 10,000

2. Relation B which has only one field of type chararray. Sample of B
follows:
*red car*
*red ferrari*
*abc*
*abcd*
*xyz ghis*

Approximate number of rows in B = 1 billion

*Problem to be solved* I need to find all case-insensitive variants of each
term in relation A that exist in relation B. For example, the term 'red' from A
would have the variants 'red car' and 'red ferrari' in B.

I was able to get the variants of a single term in A from B using the matches
operator, i.e. matches '.*red.*'. How should I go about building a complete
solution for this problem? Should I use a UDF or go for native MapReduce? I am
a bit confused about how to proceed and would really appreciate any help.
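
To make the matching rule concrete, here is a minimal Python sketch of the logic I have in mind. It is not Pig code, and the function name find_variants is only illustrative. It assumes a "variant" is any row of B that contains the term as a substring, ignoring case (which is what matches '.*term.*' would give), and it relies on A being small enough to hold in memory while B is streamed.

```python
from typing import Iterable, Iterator, List, Tuple


def find_variants(terms: Iterable[str], rows: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (term, row) for every row that contains the term, ignoring case."""
    # Assumption: A is small (about 10,000 rows), so keep it in memory,
    # lowercasing each term once up front.
    lowered: List[Tuple[str, str]] = [(t, t.lower()) for t in terms]
    # B is large (about 1 billion rows), so process it one row at a time.
    for row in rows:
        row_lower = row.lower()
        for term, term_lower in lowered:
            # Plain substring test; avoids regex escaping issues in the terms.
            if term_lower in row_lower:
                yield term, row


# Sample data from the two relations above.
A = ["abc", "xyz gh", "zzz yy", "red"]
B = ["red car", "red ferrari", "abc", "abcd", "xyz ghis"]

pairs = list(find_variants(A, B))
for term, variant in pairs:
    print(term, "->", variant)
```

Under this rule an exact match such as 'abc' also counts as a variant of 'abc'. Each row of B is compared against every term in A, so the cost grows with the product of the two sizes.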

Thanks much.

Regards
Arun A K
