mahout-user mailing list archives

From mBria <>
Subject Some guidance for this noob - "Metadata Matching Engine"
Date Thu, 10 May 2012 21:57:13 GMT
Hi everyone,

This may be a bit long, and I apologize up front.  I'm new to Mahout (and
machine learning in general), and haven't actually built anything with it
beyond the MiA book's examples.

I'm looking for a little nudge/guidance on where to direct my next level of
research/experimentation for a real-world problem.

Basically, I need "document matching" support.  Context laundry-list:
- "doc" is a somewhat sparse document with a set of 10-15 fields: text of
varying length (usually phrases) plus numerical fields.
- it's sparse in that not all fields will be valued for all docs
- docs are almost always "logical duplicates" of a few other docs (say, 2-5
on average);  we'll call a set of "dup docs" a "cluster"
- there are millions of docs (and thus many thousands of "clusters")
- although they are logical duplicates, the field values may be similar, but
are often not identical (degree of "similarity" will vary non-trivially)
- I've got an "example" document set (millions) already clustered (manually) 
in production
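
To make "field-level similarity on sparse docs" concrete, here is a rough
sketch of what I have in mind.  Everything in it is my own assumption for
illustration: the field names, the word-overlap (Jaccard) measure for text,
the relative-difference measure for numbers, and the plain average across
fields.  None of it comes from Mahout.

```python
# Hypothetical sketch: docs are plain dicts, missing fields are simply absent.

def tokens(text):
    """Lowercase word set for a text field."""
    return set(text.lower().split())

def field_similarity(a, b):
    """Similarity in [0, 1] for one field value pair."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        biggest = max(abs(a), abs(b))
        if biggest == 0:
            return 1.0
        # Relative difference, capped so the result stays in [0, 1].
        return 1.0 - min(abs(a - b) / biggest, 1.0)
    ta, tb = tokens(str(a)), tokens(str(b))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)  # Jaccard overlap of words

def doc_similarity(doc_a, doc_b):
    """Average field similarity over the fields valued in BOTH docs."""
    shared = [f for f in doc_a
              if f in doc_b and doc_a[f] is not None and doc_b[f] is not None]
    if not shared:
        return 0.0
    return sum(field_similarity(doc_a[f], doc_b[f]) for f in shared) / len(shared)

a = {"title": "Blue Widget Deluxe", "maker": "Acme Corp", "weight": 10.0}
b = {"title": "Deluxe Blue Widget", "weight": 9.5, "color": "blue"}
print(doc_similarity(a, b))  # roughly 0.975: same title words, close weights
```

Only "title" and "weight" are compared here, since those are the only
fields both docs have values for.  In practice I'd expect to want per-field
weights and smarter text measures, but this is the general shape.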

So, what I want to build is a system that can take NEW documents and give
automated insight into which of the existing clusters each document belongs
to, or an indication that it belongs to none.

Initially, I saw this mostly as a "*CLASSIFICATION* problem":
- I've got an immense /training set/ already
- I want to "classify" new stuff based on smart /field-level similarity/
- I want to pick one "class" (i.e., cluster) the doc belongs to

The problem with this (maybe?) is that I'm gathering that classification
really works best for BINARY classes ("you go here, or you go there").  In
my case there are thousands of classes (clusters), and it may even be that
the given doc doesn't really fit any of them well (in which case it should
become a new cluster of one).  To a lesser degree, I'd also like to know
that, if I wanted, I could get the system to return a small set of clusters
the new doc may fit well with, each with a "score".
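
As a rough illustration of the behavior I'm after (scored candidate
clusters, or "none of the above"), here is a brute-force nearest-neighbor
sketch.  It is only an assumption about how this could look, not a Mahout
API: the function names, the word-overlap similarity, `k` and the 0.5
threshold are all made up for the example.

```python
# Hypothetical sketch: score a new doc against every existing cluster.

def doc_tokens(doc):
    """All lowercase words from every valued field of a doc (a plain dict)."""
    words = set()
    for value in doc.values():
        if value is not None:
            words.update(str(value).lower().split())
    return words

def similarity(doc_a, doc_b):
    """Jaccard word overlap between two docs, in [0, 1]."""
    ta, tb = doc_tokens(doc_a), doc_tokens(doc_b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def rank_clusters(new_doc, clusters, k=3, threshold=0.5):
    """Return up to k (cluster_id, score) pairs with score >= threshold.

    A cluster's score is the similarity of its best-matching member.
    An empty result means no cluster fits: start a new cluster of one.
    """
    scored = []
    for cluster_id, docs in clusters.items():
        best = max((similarity(new_doc, d) for d in docs), default=0.0)
        scored.append((cluster_id, best))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(cid, score) for cid, score in scored[:k] if score >= threshold]

clusters = {
    "c1": [{"title": "blue widget deluxe"}, {"title": "widget deluxe blue large"}],
    "c2": [{"title": "red gadget"}],
}
print(rank_clusters({"title": "deluxe blue widget"}, clusters))  # [('c1', 1.0)]
print(rank_clusters({"title": "green gizmo"}, clusters))         # []
```

With millions of docs a loop over everything like this would presumably be
too slow, so I assume some kind of index or candidate pre-filter would be
needed in front of it.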

Looking at this then from a "*CLUSTERING* problem" angle:
- yes, I want docs "clustered" based on the similarity of their field values
- but I've already got the existing millions of docs clustered, and I just
want to funnel new docs into those clusters

So, while "clustering docs" is definitely the end result of the system, I
don't really think this is an obvious "clustering problem" from the
ML/Mahout POV.  At least not a standard one.

Looking at this from a "*RECOMMENDATION* problem" angle:
- I can kinda think of each existing cluster as containing docs "related"
to the other docs in that cluster
- Then I could say this new doc is like another existing doc, which
"associates" to these other docs (in the cluster), therefore this new one
associates to those other ones (and belongs in the cluster)

But, beyond this being a real stretch and probably silly (useless), the big
missing aspect is the ability to leverage doc field similarity.  It's
advanced field value similarity which really drives the "match".  So, I
don't think Recommenders help much here.

My gut is telling me I want some hybrid of clustering and classification,
but I'm not sure.

So, my head is still running full-speed trying to see this in various ways
and work out what I can use from Mahout in my system, but before I got too
far down my own rabbit holes I wanted to Ask The Experts.

Again, sorry for the novel!

Any ideas, references to things to look at, anything at all that you think
might be helpful would be great.  Not looking for anyone to "hand me the
solution", but polling for guidance.

Thanks much!
