Mailing-List: contact mahout-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mahout-user@lucene.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Apple Message framework v1077)
Subject: Re: Machine Learning Question
From: David Stuart <david.stuart@progressivealliance.co.uk>
In-Reply-To: <c7d45fc71002171222u18eb6441i1957b4f315306abb@mail.gmail.com>
Date: Wed, 17 Feb 2010 21:51:26 +0000
Content-Transfer-Encoding: quoted-printable
Message-Id: <578A8704-181C-4C81-A2AE-D380B6371C32@progressivealliance.co.uk>
References: <90DFF304-6349-45A9-A704-5268324D3DA2@progressivealliance.co.uk>
 <c7d45fc71002171222u18eb6441i1957b4f315306abb@mail.gmail.com>
To: mahout-user@lucene.apache.org

Hi Ted,

Yes, that helps. I think it's going to take me a while to get my head around your
suggestions, and as I start building a proof of concept I'll probably have
quite a few more questions, but this gives me a good starting point.

I will try to give feedback about my experiences too.

Regards,

Dave

On 17 Feb 2010, at 20:22, Ted Dunning wrote:

> I think I understand your question.  To make sure, here it is in my terms:
>
> - you have documents with tag tokens in the fid field
>
> - you have a bunch of rules for defining which documents appear where in
> your hierarchy.  These rules are defined as Lucene queries.
>
> - when you get a new document, it is slow to run every one of these queries
> against the new document.
>
> - you would like to run these queries very quickly in order to update your
> hierarchy quickly and to provide author feedback.  Using ML would be a
> spiffy way to do this and might provide hints for updating your hierarchy
> rules.
>
> My first suggestion for you would be to consider building a one-document
> index for the author feedback situation.  Running all of your rules against
> that index should be pretty darned fast.  That doesn't help with some of the
> other issues and might be hard to do with Solr, but it would be easy with
> raw Lucene.  You should be able to run several thousand rules per second
> this way.
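
To make the single-document idea concrete, here is a minimal sketch in plain
Python rather than Lucene. It assumes each rule can be reduced to an "any of
these tags" set and a "none of these tags" set, which fits the 001 Fruit query
from the original message. TagRule and matching_nodes are hypothetical names
used for illustration only; they are not part of Lucene or Solr.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagRule:
    """Hypothetical stand-in for one Lucene query over the fid field."""
    node_id: str
    any_of: frozenset    # at least one of these tags must be present
    none_of: frozenset   # none of these tags may be present

    def matches(self, tags):
        return bool(self.any_of & tags) and not (self.none_of & tags)


def matching_nodes(rules, tags):
    """Return the ids of every rule that matches one document's tag set."""
    tags = frozenset(tags)
    return [rule.node_id for rule in rules if rule.matches(tags)]


# The 001 Fruit query from the thread: (fid:123 OR fid:675) AND -fid:(324 OR 678)
rules = [TagRule("001", frozenset({123, 675}), frozenset({324, 678}))]

print(matching_nodes(rules, {123, 234, 675}))  # ['001']
print(matching_nodes(rules, {123, 324, 678}))  # []
```

With raw Lucene the same loop would run each stored query against an index
that holds only the new document.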
>
> That doesn't answer the question you asked, though.  The answer there is
> yes.  Definitely.  There are a number of machine learning approaches that
> could reverse engineer your rules to give you new rules that could be
> evaluated very quickly.  Some learning techniques and some configurations
> would likely not give you precise accuracy, but some would likely give you
> perfect replication.  Random forest will probably give you accurate results,
> as would logistic regression (referred to as SGD in Mahout), especially if
> you use interaction variables (that depend on the presence of tag
> combinations).  You will probably need to do a topological sort because it
> is common for hierarchical structures to have rules that exclude a node from
> a child if it appears in the parent (or vice versa).  Thus, you would want
> to evaluate rules in dependency order and augment the document with any
> category assignments as you go down the rule list.
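
The dependency-order evaluation described above can be sketched as follows,
using the standard library's graphlib (Python 3.9+). The dependencies and
predicates for nodes 002 and 006 are invented for illustration, since the
thread only gives the query for 001; the point is only that each rule sees the
categories assigned by the rules it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: node id -> (nodes it depends on, predicate).
# Each predicate receives the document's tags and the categories assigned so far.
RULES = {
    "001": ((), lambda tags, cats: bool({123, 675} & tags)),
    # Invented: 002 requires its parent 001 plus tag 234.
    "002": (("001",), lambda tags, cats: "001" in cats and 234 in tags),
    # Invented: 006 requires 001 but is excluded when 002 was assigned.
    "006": (("001", "002"), lambda tags, cats: "001" in cats and "002" not in cats),
}


def evaluation_order(rules):
    """Topologically sort the nodes so dependencies are evaluated first."""
    graph = {node: deps for node, (deps, _) in rules.items()}
    return list(TopologicalSorter(graph).static_order())


def classify(rules, tags):
    """Evaluate rules in dependency order, augmenting the assignments as we go."""
    tags = frozenset(tags)
    assigned = set()
    for node in evaluation_order(rules):
        _, predicate = rules[node]
        if predicate(tags, assigned):
            assigned.add(node)
    return assigned


print(sorted(classify(RULES, {123, 234})))  # ['001', '002']
print(sorted(classify(RULES, {675})))       # ['001', '006']
```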
>
> Operationally, you would need to do some coding and not all of the pieces
> you need are fully baked yet.  The first step is vectorization of your tag
> list for many documents.  Robin has recently checked in some good code for
> that and Drew has a more elaborate document model right behind that.  You
> can also vectorize directly from a Lucene index which is probably very
> convenient for you.  That gives you training data.
>
> Training the classifiers will take a bit since you need to train pretty much
> one classifier per category (unless you know that a document can have only
> one category).  That shouldn't be hard, however, and with lots of examples
> the training should converge to perfect performance pretty quickly.  The
> command line form for running training is evolving a bit right now and your
> feedback would be invaluable.
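
As a rough illustration of the vectorize-then-train steps above, here is a
self-contained sketch in plain Python. It is not Mahout's API: the function
names, the learning rate and the epoch count are all assumptions. It trains
one binary logistic regression with SGD for a single category (001), labelling
every tag combination with the existing rule, and then checks how closely the
model reproduces that rule. One such model would be trained per category.

```python
import itertools
import math
import random

VOCAB = [123, 234, 324, 675, 678]  # tag ids that appear in this thread


def vectorize(tags):
    """Binary presence vector over the fixed tag vocabulary."""
    return [1.0 if tag in tags else 0.0 for tag in VOCAB]


def fruit_rule(tags):
    """The 001 query from the thread: (123 OR 675) AND NOT (324 OR 678)."""
    return bool({123, 675} & tags) and not ({324, 678} & tags)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(xs, ys, epochs=500, rate=0.5, seed=0):
    """Plain SGD on the logistic loss; returns (weights, bias)."""
    rng = random.Random(seed)
    weights, bias = [0.0] * len(xs[0]), 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = bias + sum(w * x for w, x in zip(weights, xs[i]))
            error = sigmoid(z) - ys[i]  # gradient of the loss w.r.t. z
            bias -= rate * error
            weights = [w - rate * error * x for w, x in zip(weights, xs[i])]
    return weights, bias


def predict(model, tags):
    weights, bias = model
    return bias + sum(w * x for w, x in zip(weights, vectorize(tags))) > 0.0


# Training data: every combination of the known tags, labelled by the rule.
docs = [frozenset(t for t, keep in zip(VOCAB, mask) if keep)
        for mask in itertools.product([False, True], repeat=len(VOCAB))]
model = train([vectorize(d) for d in docs],
              [1.0 if fruit_rule(d) else 0.0 for d in docs])

agreement = sum(predict(model, d) == fruit_rule(d) for d in docs) / len(docs)
print(f"agreement with the rule: {agreement:.2%}")
```

This particular rule happens to be linearly separable over the raw tag
features. Rules that are not would need the interaction variables mentioned
earlier, or a tree-based learner such as random forest.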
>
> Deploying the classifiers should not be too difficult, but you would be in
> slightly new territory there since I don't think that many (any) people have
> deployed Mahout-trained classifiers in anger just yet.
>
> Does this help?
>
>
> On Wed, Feb 17, 2010 at 1:23 AM, David Stuart <
> david.stuart@progressivealliance.co.uk> wrote:
>
>> Hi All,
>>
>> I think this question is appropriate for the Mahout mailing list, but if not,
>> any pointers in the right direction or advice would be welcome.
>>
>> We have a taxonomy-based navigation system where items in the navigation
>> tree are made up of tag-based queries (instead of natural language words)
>> which are matched against content items tagged in a similar way.
>>
>> So we have a taxonomy tree with queries:
>>
>> Id     Label
>> 001    Fruit (fid:123 OR fid:675) AND -fid:(324 OR 678) ...
>> 002        Round
>> 003            Apple
>> 004            Orange
>> 006        Star
>> 007            Star fruit
>> ....
>>
>> Content pool
>>
>> "Interesting article on fruit" -> tagged with (123, 234, 675)
>> "The mighty orange!" -> tagged with (123, 324, 678)
>>
>> Hopefully you get the picture.
>>
>> Now we bake these queries into our Solr index, so instead of running the
>> Fruit query we have pre-computed it and just search for items in the index
>> that have id 001. The reasons for doing this are not really important, but
>> we have written an indexer for the purpose. Also, content items are
>> multi-surfacing, so an item could appear at 001, 004 and 007.
>>
>> Although the indexer is OK at doing this pre-bake job, it's not very fast,
>> and as the content and tree grow it gets slower.
>>
>> NOW for the actual question!
>>
>> Is there an ML model that can quickly classify/identify where a new (or
>> retagged) piece of content fits onto the tree? Also, the queries on the leaf
>> nodes can change (less often), so a quick process to reclassify what is in
>> scope for that leaf would be useful.
>> The reason I want this is that it would be great to have realtime feedback
>> for an author applying tags to a document, showing where it fits in the site.
>>
>> Once I get this working I would love to add suggested tags or weighting
>> based on content items with contextual similarity.
>> I think it was Grant who was talking about a Solr external field that
>> could be used to hook this together, or maybe I am mistaken.
>>
>> Hope this makes sense.
>>
>> Thanks for your help/advice in advance.
>>
>> Regards,
>>
>> Dave
>>
>>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve