From: David Stuart
To: mahout-user@lucene.apache.org
Subject: Re: Machine Learning Question
Date: Wed, 17 Feb 2010 21:51:26 +0000
Message-Id: <578A8704-181C-4C81-A2AE-D380B6371C32@progressivealliance.co.uk>
References: <90DFF304-6349-45A9-A704-5268324D3DA2@progressivealliance.co.uk>

Hi Ted,

Yes, that helps. I think it's going to take me a while to get my head
around your suggestions, and as I start building a proof of concept I'll
probably have quite a few more questions, but this gives me a good
starting point.

I will try to give feedback about my experiences to

Regards,

Dave

On 17 Feb 2010, at 20:22, Ted Dunning wrote:

> I think I understand your question. To make sure, here it is in my terms:
>
> - you have documents with tag tokens in the fid field
>
> - you have a bunch of rules for defining which documents appear where in
> your hierarchy. These rules are defined as Lucene queries.
>
> - when you get a new document, it is slow to run every one of these queries
> against the new document.
>
> - you would like to run these queries very quickly in order to update your
> hierarchy quickly and to provide author feedback. Using ML would be a
> spiffy way to do this and might provide hints for updating your hierarchy
> rules.
>
>
> My first suggestion for you would be to consider building a one-document
> index for the author feedback situation. Running all of your rules against
> that index should be pretty darned fast. That doesn't help with some of the
> other issues and might be hard to do with Solr, but it would be easy with
> raw Lucene. You should be able to run several thousands of rules per second
> this way.
>
> That doesn't answer the question you asked, though. The answer there is
> yes. Definitely. There are a number of machine learning approaches that
> could reverse engineer your rules to give you new rules that could be
> evaluated very quickly. Some learning techniques and some configurations
> would likely not give you precise accuracy, but some would likely give you
> perfect replication.
> Random forest will probably give you accurate results
> as would logistic regression (referred to as SGD in Mahout), especially if
> you use interaction variables (that depend on the presence of tag
> combinations). You will probably need to do a topological sort because it
> is common for hierarchical structures to have rules that exclude a node from
> a child if it appears in the parent (or vice versa). Thus, you would want
> to evaluate rules in dependency order and augment the document with any
> category assignments as you go down the rule list.
>
> Operationally, you would need to do some coding and not all of the pieces
> you need are fully baked yet. The first step is vectorization of your tag
> list for many documents. Robin has recently checked in some good code for
> that and Drew has a more elaborate document model right behind that. You
> can also vectorize directly from a Lucene index which is probably very
> convenient for you. That gives you training data.
>
> Training the classifiers will take a bit since you need to train pretty much
> one classifier per category (unless you know that a document can have only
> one category). That shouldn't be hard, however, and with lots of examples
> the training should converge to perfect performance pretty quickly. The
> command line form for running training is evolving a bit right now and your
> feedback would be invaluable.
>
> Deploying the classifiers should not be too difficult, but you would be in
> slightly new territory there since I don't think that many (any) people have
> deployed Mahout-trained classifiers in anger just yet.
>
> Does this help?
>
>
> On Wed, Feb 17, 2010 at 1:23 AM, David Stuart <
> david.stuart@progressivealliance.co.uk> wrote:
>
>> Hi All,
>>
>> I think this question is appropriate for the Mahout mailing list but if not
>> any pointers in the right direction or advice would be welcomed.
>>
>> We have a taxonomy-based navigation system where items in the navigation
>> tree are made up of tag-based queries (instead of natural language words)
>> which are matched against content items tagged in a similar way.
>>
>> So we have a taxonomy tree with queries:
>>
>> Id   Label
>> 001  Fruit  (fid:123 OR fid:675) AND -fid:(324 OR 678) ...
>> 002  Round
>> 003  Apple
>> 004  Orange
>> 006  Star
>> 007  Star fruit
>> ....
>>
>> Content pool:
>>
>> "Interesting article on fruit" -> tagged with (123, 234, 675)
>> "The mighty orange!" -> tagged with (123, 324, 678)
>>
>> Hopefully you get the picture.
>>
>> Now we bake these queries into our Solr index, so instead of doing the Fruit
>> query we have pre-done it and just search for items in the index that have id
>> 001. The reasons for doing this are not really important, but we have written
>> an indexer for the purpose. Also, content items are multi-surfacing, so an item
>> could appear at 001, 004 and 007.
>>
>> Although the indexer is OK at doing this pre-bake job, it's not very fast and
>> as the content and tree grow it gets slower.
>>
>> NOW for the actual Question!!!
>>
>> Is there an ML model that can quickly classify/identify where a new (or
>> retagged) piece of content fits onto the tree? Oh, the queries on the leaf
>> nodes can change (less often), so a quick process to reclassify what is in
>> scope for that leaf would be useful.
>> The reason I want this is because it would be great to have real-time feedback
>> to an author applying tags to a document of where it fits in the site.
>>
>> Once I get this working I would love to add suggested tags or weighting
>> based on content items with contextual similarity.
>> I think it was Grant that was talking about a Solr external field that
>> could be used to hook this together, or maybe I am mistaken.
>>
>> Hope this makes sense.
>>
>> Thanks for your help/advice in advance.
>>
>> Regards,
>>
>> Dave
>>
>
> -- 
> Ted Dunning, CTO
> DeepDyve
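
For anyone following the thread, here is a rough sketch of the "evaluate rules in dependency order and augment the document with category assignments" idea from Ted's reply. It is plain Python, not Mahout or Lucene code. The RULES layout, the classify function and the Orange rule (including its dependency on Fruit) are made up for illustration; only the Fruit query and the two tagged articles come from the original question.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rule table. Each category lists the categories whose result it
# needs ("depends_on") and a predicate over the document's tags plus the
# categories assigned so far.
RULES = {
    "001": {  # Fruit: (fid:123 OR fid:675) AND -fid:(324 OR 678)
        "depends_on": [],
        "matches": lambda tags, assigned: bool(tags & {123, 675})
        and not (tags & {324, 678}),
    },
    "004": {  # Orange (invented rule): tag 324, excluded if already under Fruit
        "depends_on": ["001"],
        "matches": lambda tags, assigned: 324 in tags and "001" not in assigned,
    },
}


def classify(tags, rules=RULES):
    """Return the set of category ids that a document's tags fall under."""
    tags = set(tags)
    # Map each category to its prerequisites; static_order() yields
    # prerequisites before the categories that depend on them.
    graph = {cat: rule["depends_on"] for cat, rule in rules.items()}
    assigned = set()
    for cat in TopologicalSorter(graph).static_order():
        if rules[cat]["matches"](tags, assigned):
            assigned.add(cat)  # later rules can see this assignment
    return assigned


print(classify({123, 234, 675}))  # "Interesting article on fruit"
print(classify({123, 324, 678}))  # "The mighty orange!"
```

With real rules the predicates would presumably be the Lucene queries themselves (for example run against a one-document index), and the dependency lists would have to be derived from which category ids each query refers to.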