hadoop-common-user mailing list archives

From James Pettyjohn <jam...@scientology.net>
Subject Fuzzy matching
Date Fri, 29 Apr 2011 06:50:11 GMT


First time writing in. 

I am in the middle of setting up a small cluster for testing (using
Cloudera's CDH 3 on Fedora 14), and I am trying to work out a good model
for a use case that just came up.

I have around 6 million active records in a contacts database, plus
additional millions of historical address records for those contacts. I
just received 60+ thousand new records that are not yet correlated to any
of these, and I need to fuzzy match them against both the active and the
historical records.

It starts there, but I will also need to do the same thing with the
database against itself for de-duplication. The data is primarily in
Oracle (with the supplemental records in CSV files).
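
For illustration only, one common shape for this kind of batch job is
"blocking" (grouping records by a cheap key so that only records in the
same group are compared) followed by a string-similarity score. The sketch
below is a minimal single-machine version under stated assumptions: the
field names (id, name, zip), the blocking key, and the 0.85 threshold are
all hypothetical and are not taken from the actual contacts schema or from
any particular library's method.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def block_key(record):
    """Hypothetical blocking key: ZIP code plus first letter of the name."""
    name = normalize(record["name"])
    return (record["zip"], name[:1])


def similarity(a, b):
    """Ratio in [0, 1] from difflib; 1.0 means identical strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_match(new_records, existing_records, threshold=0.85):
    """Return (new_id, existing_id, score) for pairs at or above threshold."""
    # Group existing records by block so each new record is compared
    # only against records that share its block key.
    blocks = defaultdict(list)
    for rec in existing_records:
        blocks[block_key(rec)].append(rec)

    matches = []
    for new in new_records:
        for old in blocks.get(block_key(new), []):
            score = similarity(new["name"], old["name"])
            if score >= threshold:
                matches.append((new["id"], old["id"], score))
    return matches


# Made-up sample data, for illustration only.
existing = [
    {"id": 1, "name": "Jonathan Smith", "zip": "90028"},
    {"id": 2, "name": "Maria Garcia", "zip": "33755"},
]
incoming = [
    {"id": "n1", "name": "Jonathon  Smith", "zip": "90028"},
    {"id": "n2", "name": "Peter Jones", "zip": "90028"},
]
matches = fuzzy_match(incoming, existing)
```

The same structure could plausibly carry over to a MapReduce job, with the
block key as the map output key and the pairwise comparison done in the
reducer, and de-duplication would amount to running the existing records
against themselves. A blocking key this strict will miss pairs whose ZIP
code or first letter differs, so the choice of key is a trade-off between
recall and the number of comparisons.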

I saw the Booz Allen Hamilton presentation on fuzzy matching, but I don't
see any distribution of that implementation. In any case, I don't need
real-time matching right now; I need batch.

Mahout might be the way to go, but I think I'm re-inventing at least a
wheel or two.

Any comments appreciated. 
