mahout-user mailing list archives

From David Stuart
Subject Re: Machine Learning Question
Date Wed, 17 Feb 2010 21:51:26 GMT
Hi Ted,

Yes, that helps. I think it's going to take me a while to get my head around your
suggestions, and as I start building a proof of concept I'll probably have
quite a few more questions, but this gives me a good starting point.

I will try to give feedback about my experiences too.



On 17 Feb 2010, at 20:22, Ted Dunning wrote:

> I think I understand your question.  To make sure, here it is in my terms:
> - you have documents with tag tokens in the fid field
> - you have a bunch of rules for defining which documents appear where in
> your hierarchy.  These rules are defined as Lucene queries.
> - when you get a new document, it is slow to run every one of these queries
> against the new document.
> - you would like to run these queries very quickly in order to update your
> hierarchy quickly and to provide author feedback.  Using ML would be a
> spiffy way to do this and might provide hints for updating your hierarchy
> rules.
> My first suggestion for you would be to consider building a one document
> index for the author feedback situation.  Running all of your rules against
> that index should be pretty darned fast.  That doesn't help with some of the
> other issues and might be hard to do with solr, but it would be easy with
> raw Lucene.  You should be able to run several thousands of rules per second
> this way.
> That doesn't answer the question you asked, though.  The answer there is
> yes.  Definitely.  There are a number of machine learning approaches that
> could reverse engineer your rules to give you new rules that could be
> evaluated very quickly.  Some learning techniques and some configurations
> would likely not give you precise accuracy, but some would likely give you
> perfect replication.  Random forest will probably give you accurate results
> as would logistic regression (referred to as SGD in Mahout), especially if
> you use interaction variables (that depend on the presence of tag
> combinations).  You will probably need to do a topological sort because it
> is common for hierarchical structures to have rules that exclude a node from
> a child if it appears in the parent (or vice versa).  Thus, you would want
> to evaluate rules in dependency order and augment the document with any
> category assignments as you go down the rule list.
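
The dependency-ordered evaluation described above can be sketched roughly as follows. This is only an illustration, not Mahout or Lucene code: the category names, the `deps` lists, the tag id 900 and the helper functions are all hypothetical, and only the Fruit rule is modelled on the query quoted further down. It assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical rules. Each test sees the document's tag set plus the
# categories already assigned earlier in the dependency order.
def is_fruit(tags, cats):
    # Mirrors "(fid:123 or fid:675) AND -fid:(324 OR 678)" from the thread.
    return bool(tags & {123, 675}) and not (tags & {324, 678})

def is_round(tags, cats):
    # Tag 900 is an invented example tag.
    return "fruit" in cats and 900 in tags

def is_star(tags, cats):
    # Example of a rule that excludes documents already placed elsewhere.
    return "fruit" in cats and "round" not in cats

RULES = {
    "fruit": {"deps": [], "test": is_fruit},
    "round": {"deps": ["fruit"], "test": is_round},
    "star": {"deps": ["fruit", "round"], "test": is_star},
}

def classify(tags):
    """Evaluate rules in dependency order, augmenting the assignments as we go."""
    graph = {name: rule["deps"] for name, rule in RULES.items()}
    assigned = set()
    for name in TopologicalSorter(graph).static_order():
        if RULES[name]["test"](tags, assigned):
            assigned.add(name)
    return assigned
```

Because each rule only reads assignments made by rules it depends on, the same loop would work if the hand-written tests were replaced by trained classifiers.
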
> Operationally, you would need to do some coding and not all of the pieces
> you need are fully baked yet.  The first step is vectorization of your tag
> list for many documents.  Robin has recently checked in some good code for
> that and Drew has a more elaborate document model right behind that.  You
> can also vectorize directly from a Lucene index which is probably very
> convenient for you.  That gives you training data.
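
As a rough picture of what vectorizing a tag list might look like, here is a minimal sketch that is independent of the Mahout code mentioned above. It assumes binary features, one per tag plus one per pair of co-occurring tags (a simple form of the interaction variables mentioned earlier); the function names are invented for illustration.

```python
from itertools import combinations

def build_vocabulary(docs, with_pairs=True):
    """Map each tag, and optionally each co-occurring tag pair, to a column index."""
    features = set()
    for tags in docs:
        features.update((t,) for t in tags)
        if with_pairs:
            features.update(combinations(sorted(tags), 2))
    # Single tags first, then pairs, each group in sorted order.
    ordered = sorted(features, key=lambda f: (len(f), f))
    return {feature: index for index, feature in enumerate(ordered)}

def vectorize(tags, vocab):
    """Return a 0/1 vector; features not seen when building vocab are ignored."""
    vec = [0] * len(vocab)
    present = [(t,) for t in tags] + list(combinations(sorted(tags), 2))
    for feature in present:
        index = vocab.get(feature)
        if index is not None:
            vec[index] = 1
    return vec
```

Pair features grow quadratically with the number of tags per document, so a real setup would probably restrict them to combinations that actually occur in the rules.
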
> Training the classifiers will take a bit since you need to train pretty much
> one classifier per category (unless you know that a document can have only
> one category).  That shouldn't be hard, however, and with lots of examples
> the training should converge to perfect performance pretty quickly.  The
> command line form for running training is evolving a bit right now and your
> feedback would be invaluable.
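
To make "one classifier per category" concrete, here is a small pure-Python sketch of logistic regression trained by SGD, one model per category. It does not use Mahout's API; the function names, learning rate and epoch count are assumptions chosen for a toy example.

```python
import math

def train_logistic_sgd(rows, labels, epochs=500, lr=0.5):
    """Plain SGD on the logistic loss; returns (weights, bias)."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def train_per_category(rows, labels_by_category):
    """Train one independent binary classifier per category."""
    return {cat: train_logistic_sgd(rows, ys)
            for cat, ys in labels_by_category.items()}

def predict(models, x):
    """Return every category whose classifier scores x above zero."""
    return {cat for cat, (w, b) in models.items()
            if b + sum(wi * xi for wi, xi in zip(w, x)) > 0}
```

Since the labels come from deterministic rules, the training data is noise-free, which is why replicating the rules exactly is plausible when the features can express them.
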
> Deploying the classifiers should not be too difficult, but you would be in
> slightly new territory there since I don't think that many (any) people have
> deployed Mahout-trained classifiers in anger just yet.
> Does this help?
> On Wed, Feb 17, 2010 at 1:23 AM, David Stuart wrote:
>> Hi All,
>> I think this question is appropriate for the Mahout mailing list, but if not,
>> any pointers in the right direction or advice would be welcome.
>> We have a taxonomy-based navigation system where items in the navigation
>> tree are made up of tag-based queries (instead of natural language words)
>> which are matched against content items tagged in a similar way.
>> So we have a taxonomy tree with queries:
>> Id      Label
>> 001     Fruit (fid:123 or fid:675) AND -fid:(324 OR 678) ...
>> 002       Round
>> 003         Apple
>> 004         Orange
>> 006       Star
>> 007         Star fruit
>> ....
>> Content pool
>> "Interesting article on fruit" -> tagged with (123, 234, 675)
>> "The mighty orange!" -> tagged with (123, 324, 678)
>> Hopefully you get the picture.
>> Now we bake these queries into our Solr index, so instead of running the Fruit
>> query we have precomputed it and just search for items in the index that have id
>> 001. The reasons for doing this are not really important, but we have written
>> an indexer for the purpose. Also, content items are multi-surfacing, so an item
>> could appear at 001, 004 and 007.
>> Although the indexer is OK at doing this pre-bake job, it's not very fast, and
>> as the content and tree grow it gets slower.
>> NOW for the actual Question!!!
>> Is there an ML model that can quickly classify/identify where a new (or
>> retagged) piece of content fits onto the tree? Oh, the queries on the leaf
>> nodes can change (less often), so a quick process to reclassify what is in
>> scope for that leaf would be useful.
>> The reason I want this is that it would be great to give real-time feedback to
>> an author applying tags to a document about where it fits in the site.
>> Once I get this working I would love to add suggested tags or weighting
>> based on content items with contextual similarity.
>> I think it was Grant who was talking about a Solr external field that
>> could be used to hook this together, or maybe I am mistaken.
>> Hope this makes sense.
>> Thanks for your help/advice in advance.
>> Regards,
>> Dave
> -- 
> Ted Dunning, CTO
> DeepDyve
