mahout-user mailing list archives

From Soren Flexner <sflex...@gmail.com>
Subject Re: Fuzzy matching
Date Mon, 02 May 2011 15:59:14 GMT
  James,

  If you have names along with those addresses, the top-level (blocking)
approach I would take would be to run through the whole set and create a
group of tags using, say, last name + house number + zip.  Then you could
write an MR job where you make those tags the key, and your 'people' will
automatically be sorted together before they reach the reducers.  Getting
much more mileage than blocking alone invariably means fairly sophisticated
feature-vector creation and machine learning.
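
  A minimal sketch of that tag idea in plain Java (the field layout and the
in-memory grouping are illustrative; in the actual MR job the shuffle does
this grouping for free once the tag is the map output key):

    import java.util.*;

    public class BlockingDemo {
        // Blocking tag: last name + house number + zip.
        static String blockingKey(String lastName, String street, String zip) {
            String houseNumber = street.trim().split("\\s+")[0];  // leading house number
            return (lastName.trim() + "|" + houseNumber + "|" + zip.trim()).toLowerCase();
        }

        public static void main(String[] args) {
            String[][] records = {
                {"Flexner", "1100 Franklin St N",     "90210"},
                {"Flexner", "1100 N Franklin Street", "90210"},
                {"Smith",   "42 Elm Ave",             "10001"},
            };
            // Group records by tag; this stands in for the MR shuffle.
            Map<String, List<String[]>> blocks = new HashMap<>();
            for (String[] r : records)
                blocks.computeIfAbsent(blockingKey(r[0], r[1], r[2]),
                                       k -> new ArrayList<>()).add(r);
            // Each block is the candidate set a reducer would see.
            blocks.forEach((tag, group) ->
                System.out.println(tag + " -> " + group.size() + " candidate(s)"));
        }
    }

Note that the two Franklin variants land in the same block even though the
street strings differ, which is exactly what blocking buys you.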

  The problem with using any other part of the address (or for that matter
the first name) is normalization.

  For example, you'll often see incorrect street direction characters at the
beginning (or end) of streets (N. Franklin when it's really Franklin N.,
etc.).

  Sometimes streets have multiple names.

  A lot of times, apartment or building letters get mixed into the beginning
or end of street names (e.g. 1100 A Franklin St N).

  In the city field, you'll see Santa Monica, when it's really LA, or the
other way around.

  For names, you'll see Bill instead of William, etc.

  These problems are quite significant in user-generated data.
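
  For illustration, normalization for a couple of the cases above might start
like this (the nickname table and directional list are tiny samples, nowhere
near the lookup tables you'd actually need):

    import java.util.*;

    public class Normalize {
        static final Map<String, String> NICKNAMES =
            Map.of("bill", "william", "bob", "robert", "liz", "elizabeth");
        static final Set<String> DIRECTIONALS =
            Set.of("n", "s", "e", "w", "n.", "s.", "e.", "w.");

        static String firstName(String name) {
            String n = name.trim().toLowerCase();
            return NICKNAMES.getOrDefault(n, n);
        }

        // Move a leading directional to the end so "N. Franklin" and
        // "Franklin N" normalize to the same string.
        static String street(String s) {
            List<String> tokens =
                new ArrayList<>(Arrays.asList(s.trim().toLowerCase().split("\\s+")));
            if (tokens.size() > 1 && DIRECTIONALS.contains(tokens.get(0)))
                tokens.add(tokens.remove(0).replace(".", ""));
            return String.join(" ", tokens);
        }

        public static void main(String[] args) {
            System.out.println(firstName("Bill"));      // william
            System.out.println(street("N. Franklin"));  // franklin n
            System.out.println(street("Franklin N"));   // franklin n
        }
    }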

  Make the decision early on as to whether you want to get 80-90% (just use
blocking) or whether you need better precision/recall than that.  If you
decide to go whole hog, that last 10-20% is a monster, so be ready for that.

  -s

On Fri, Apr 29, 2011 at 3:30 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> If these are street addresses, brute force will probably get you 99% of the
> way.
>
> The traditional dedupe for addresses is to take the number plus the first
> letter of the street name within a city (for smaller cities) or zip code
> region (possibly zip-4, possibly zip-5).  Sort by this extracted key and do
> detailed matching on the matches.
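
  Concretely, that extracted key might look like the following (illustrative
Java; the street-name parsing is deliberately naive and assumes the house
number leads the string):

    public class TraditionalKey {
        // Key: zip region + house number + first letter of the street name.
        static String dedupeKey(String street, String zip) {
            String[] tokens = street.trim().toLowerCase().split("\\s+");
            String number = tokens[0];
            char letter = '?';
            for (int i = 1; i < tokens.length; i++) {
                if (!tokens[i].matches("\\d+")) {  // first non-numeric token
                    letter = tokens[i].charAt(0);
                    break;
                }
            }
            return zip.trim() + "|" + number + "|" + letter;
        }

        public static void main(String[] args) {
            // Both spellings collide on 90210|1100|f, so they get compared in detail.
            System.out.println(dedupeKey("1100 Franklin St N",   "90210"));
            System.out.println(dedupeKey("1100 Franklyn Street", "90210"));
        }
    }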
>
> From there, standard edit distance with an ever-growing pile of heuristics
> works pretty well.  I have at times built fancy models to mark duplicates,
> but the supply of truly weird special cases that defy modeling seems
> infinite, so that is a pretty unrewarding activity.  You can get the error
> rate down to a few percent or less pretty quickly using a cycle of sort,
> apply heuristics, inspect manually, build new heuristics, lather, rinse and
> repeat.
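
  The detailed matching inside each key group can start from plain Levenshtein
edit distance, with the heuristics layered on top of a score like this (a
minimal two-row sketch):

    public class EditDistance {
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                    curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
                }
                int[] t = prev; prev = curr; curr = t;  // reuse rows
            }
            return prev[b.length()];
        }

        public static void main(String[] args) {
            // Prints the edit distance between two address variants.
            System.out.println(levenshtein("1100 franklin st", "1100 franklyn st"));
        }
    }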
>
> This process is made to order for map-reduce, but your data is actually not
> that large for the sorting part, even in a conventional database.  Your data
> is large enough, and the map-reduce will be fast enough, that using Hadoop
> may work out well.  Using Hive or Pig would make that much easier, especially
> when you start injecting your infinite supply of heuristic adjustments.
>
> If you really want to do a full everything-to-everything fuzzy compare, try
> extracting short substrings of 2-4 characters and compute a fuzzy match
> score based on substring counts.  Since the address-to-substring relation is
> a sparse one, you can do this computation using map-reduce without the full
> n^2 cost.  You will need to throw away the most common substrings to avoid
> quadratic explosion, but that doesn't hurt accuracy.
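
  A small in-memory sketch of that substring scheme (3-character grams and the
frequency cutoff are illustrative; the MR version would emit each gram as a
map key and pair up record ids in the reducer):

    import java.util.*;

    public class NgramMatch {
        static final int N = 3, MAX_DF = 100;  // gram length and commonness cutoff

        static Set<String> ngrams(String s) {
            Set<String> grams = new HashSet<>();
            String t = s.toLowerCase().replaceAll("\\s+", " ");
            for (int i = 0; i + N <= t.length(); i++) grams.add(t.substring(i, i + N));
            return grams;
        }

        public static void main(String[] args) {
            String[] addrs = {"1100 franklin st n", "1100 n franklin street", "42 elm ave"};
            // Inverted index: n-gram -> record ids (the sparse relation).
            Map<String, List<Integer>> index = new HashMap<>();
            for (int id = 0; id < addrs.length; id++)
                for (String g : ngrams(addrs[id]))
                    index.computeIfAbsent(g, k -> new ArrayList<>()).add(id);
            // Count shared grams per pair, skipping overly common grams so the
            // pair generation stays far from n^2.
            Map<String, Integer> shared = new HashMap<>();
            for (List<Integer> ids : index.values()) {
                if (ids.size() > MAX_DF) continue;  // too common; drop it
                for (int i = 0; i < ids.size(); i++)
                    for (int j = i + 1; j < ids.size(); j++)
                        shared.merge(ids.get(i) + "," + ids.get(j), 1, Integer::sum);
            }
            shared.forEach((pair, n) ->
                System.out.println("records " + pair + " share " + n + " grams"));
        }
    }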
>
> If you meant something else, please holler!
>
> On Fri, Apr 29, 2011 at 12:50 PM, James Pettyjohn <jamesp@scientology.net>
> wrote:
>
> >
> > Hey,
> >
> > First time writing in.
> >
> > I have around 6 million active records in a contacts database, plus
> > additional millions of historical address records for those records.  I
> > just got 60+ thousand new records, not correlated to these, which I need
> > to fuzzy match against both the active and historical records.
> >
> > I will need to do the same thing with the database against itself for
> > de-duplication later.  The data is primarily in Oracle (with the
> > supplement in CSVs).
> >
> > I saw the Booz Allen Hamilton presentation on fuzzy matching, but I don't
> > see any distributions for that implementation.  At the same time I don't
> > need real-time queries - just batch processing at the moment.
> >
> > I thought Mahout might fit the bill. Any comments on approach
> > would be appreciated.
> >
> > Best, James
>
