Thanks Zach for introducing me to Aho Corasick algorithm. Let me see if a
UDF can help here.
Regards
Arun A K
Graduate Student
Department of Computer Science
Indiana University, Bloomington
On Thu, Dec 2, 2010 at 11:01 AM, Zach Bailey <zach.bailey@dataclip.com>wrote:
>
> I would recommend writing a UDF based on the Aho Corasick algorithm [1]
> which is seeded with relation a and then applied to relation b. This should
> let you do a singlepass search on each "row" in B, reducing your
> algorithmic complexity from O(n^2) to O(n).
>
>
> I'd love to know if anyone else has a better way of doing this, because
> it's a problem I'm also working on :)
>
> Zach
>
>
>
> [1]
> http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
>
> On Thursday, December 2, 2010 at 1:53 PM, Arun A K wrote:
>
> > Hello
> >
> > I have this problem to solve using Pig.
> >
> > *Input*
> > 1. Relation A which has only one field of type chararray. Sample of A
> > follows:
> > *abc*
> > *xyz gh*
> > *zzz yy*
> > *red*
> >
> > Approximate numbers of rows in A = 10,000
> >
> > 2. Relation B which has only one field of type chararray. Sample of B
> > follows:
> > *red car*
> > *red ferrari*
> > *abc*
> > *abcd*
> > *xyz ghis*
> >
> > Approximate numbers of rows in B = 1 billion
> >
> > *Problem to be solved* I need to find all caseinsensitive variants of
> each
> > term in relation A existing in relation B. For example: Term 'red' from A
> > would have variants 'red car' and 'red ferrari' in B.
> >
> > I was able to get variants of one term in A from B using matches operator
> > i.e. matches '.*red.*' How to go about creating a complete solution for
> this
> > problem? Should I use a UDF or go for native Map Reduce? Am a bit
> confused
> > on how to proceed on this. I would really appreciate any help on this.
> >
> > Thanks much.
> >
> > Regards
> > Arun A K
> >
> >
> >
> >
>
>
>
>
