mahout-user mailing list archives

From SAMIK CHAKRABORTY <sam...@gmail.com>
Subject Re: Tags generation?
Date Tue, 07 Aug 2012 12:37:05 GMT
Hi All,

We have developed an auto tagging system for our micro-blogging platform.
Here is what we have done:

The purpose of the system is to extract tags from an article automatically
when someone posts a link on our micro-blogging site. The goal was to let
users follow a tag instead of, or in addition to, a person. To do this we
wrote some custom code on top of Mahout, UIMA, OpenNLP, etc.

If you are interested in seeing how it works, take a look at:
http://www.scoopspot.com/

One more thing: we also created a robot that visits some well-known web
sites such as ReadWriteWeb, Hacker News and TechCrunch, fetches their
articles, and publishes them to our micro-blog. Because tag following is
already in place, those articles reach us without any extra work.
That's very cool (to us at least). You can see the output of the robot at
this location:

http://news.scoopspot.com/

I thought this might be an example of what Mahout can do, and it relates to
this thread, so I felt like sharing it with you.

Sorry if it looks off-topic.

Regards,
Samik

On Tue, Aug 7, 2012 at 6:49 AM, Lance Norskog <goksron@gmail.com> wrote:

> I used the OpenNLP Parts-Of-Speech tool to label all words as 'noun',
> 'verb', etc. I removed all words that were not nouns or verbs. In my
> use case, this is a total win. In other cases, maybe not: Twitter has
> a quite varied non-grammar.
>
> On Sun, Aug 5, 2012 at 10:11 AM, Pat Ferrel <pat@farfetchers.com> wrote:
> > The way back from stem to tag is interesting from the standpoint of
> > making tags human readable. I had assumed a lookup but this seems much more
> > satisfying and flexible. In order to keep frequencies it will take
> > something like a dictionary creation step in the analyzer. This in turn
> > seems to imply a join so a whole new map reduce job--maybe not completely
> > trivial?
> >
> > It seems that NLP can be used in two very different ways here: first as
> > a filter (keep only nouns and verbs?), second to differentiate semantics
> > (can:verb, can:noun). One method is a dimensionality reduction technique;
> > the other increases dimensions but can lead to orthogonal dimensions from
> > the same term. I suppose both could be used together, as the above example
> > indicates.
> >
> > It sounds like you are using it to filter (only?). Can you explain what
> > you mean by:
> > "One thing came through- parts-of-speech selection for nouns and verbs
> > helped 5-10% in every combination of regularizers."
> >
> >
> > On Aug 3, 2012, at 6:31 PM, Lance Norskog <goksron@gmail.com> wrote:
> >
> > Thanks everyone- I hadn't considered the stem/synonym problem. I have
> > code for regularizing a doc/term matrix with tf, binary, log and
> > augmented norm for the cells and idf, gfidf, entropy, normal (term
> > vector) and probabilistic inverse. Running any of these, and then SVD,
> > on a Reuters article may take 10-20 ms. This uses a sentence/term
> > matrix for document summarization. After doing all of this, I realized
> > that maybe just the regularized matrix was good enough.
> >
> > One thing came through- parts-of-speech selection for nouns and verbs
> > helped 5-10% in every combination of regularizers. All across the
> > board. If you want good tags, select your parts of speech!
> >
> > On Fri, Aug 3, 2012 at 1:08 PM, Dawid Weiss
> > <dawid.weiss@cs.put.poznan.pl> wrote:
> >> I know, I know. :) Just wanted to mention that it could lead to funny
> >> results, that's all. There are lots of ways of doing proper form
> >> disambiguation, including shallow tagging, which then allows you to
> >> retrieve correct base forms for lemmas, not stems. Stemming is
> >> typically good enough (and fast) so your advice was 100% fine.
> >>
> >> Dawid
> >>
> >> On Fri, Aug 3, 2012 at 9:31 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>> This is definitely just the first step.  Similar goofs happen with
> >>> inappropriate stemming.  For instance, AIDS should not stem to aid.
> >>>
> >>> A reasonable way to find and classify exceptional cases is to look at
> >>> cooccurrence statistics.  The contexts of original forms can be
> >>> examined to find cases where there is a clear semantic mismatch between
> >>> the original and the set of all forms that stem to the same form.
> >>>
> >>> But just picking the most common that is present in the document is a
> >>> pretty good step for all that it produces some oddities.  The results
> >>> are much better than showing a user the stemmed forms.
> >>>
> >>> On Fri, Aug 3, 2012 at 1:05 PM, Dawid Weiss
> >>> <dawid.weiss@cs.put.poznan.pl> wrote:
> >>>
> >>>>> Unstemming is pretty simple.  Just build an unstemming dictionary
> >>>>> based on seeing what word forms have led to a stemmed form.  Include
> >>>>> frequencies.
> >>>>
> >>>> This can lead to very funny (or not, depends how you look at it)
> >>>> mistakes when different lemmas stem to the same token. How frequent
> >>>> and important this phenomenon is varies from language to language (and
> >>>> can be calculated a priori).
> >>>>
> >>>> Dawid
> >>>>
> >
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
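
For anyone following along, here is a minimal Python sketch of the unstemming
dictionary discussed in the quoted thread: record which surface forms led to
each stem, with frequencies, then map a stem back to the most common form,
preferring forms that occur in the document being tagged. This is an
illustration only and not code from Mahout or from anyone in the thread. The
names toy_stem, build_unstem_dict and unstem are hypothetical, and toy_stem is
a crude stand-in for a real stemmer.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase, strip a few suffixes."""
    w = word.lower()
    for suffix in ("ing", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def build_unstem_dict(docs, stem=toy_stem):
    """Map each stem to a Counter of the surface forms that produced it."""
    table = defaultdict(Counter)
    for tokens in docs:
        for tok in tokens:
            table[stem(tok)][tok] += 1
    return table


def unstem(stem_token, table, doc_tokens=None, stem=toy_stem):
    """Return the most frequent surface form for a stem.

    If doc_tokens is given, only forms present in that document are
    considered, falling back to the corpus-wide counts when none match.
    """
    forms = table.get(stem_token)
    if not forms:
        return stem_token  # unknown stem: nothing better to show
    if doc_tokens is not None:
        present = {t for t in doc_tokens if stem(t) == stem_token}
        in_doc = {f: c for f, c in forms.items() if f in present}
        if in_doc:
            forms = in_doc
    # Break frequency ties deterministically by the form itself.
    return max(forms.items(), key=lambda kv: (kv[1], kv[0]))[0]


corpus = [
    ["clustering", "clusters", "clustering"],
    ["cluster", "clustering"],
    ["AIDS", "aid", "aid"],
]
table = build_unstem_dict(corpus)
print(unstem("cluster", table))                         # clustering
print(unstem("aid", table))                             # aid
print(unstem("aid", table, doc_tokens=["AIDS", "research"]))  # AIDS
```

The last two calls show the collision raised in the thread: "AIDS" and "aid"
share a stem, so the corpus-wide winner can be wrong for a given document,
while restricting to forms present in the document avoids that particular
mistake.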

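
The two uses of part-of-speech tags mentioned in the quoted thread (filtering
to nouns and verbs, and qualifying a term with its tag as in can:verb versus
can:noun) can be sketched the same way. This does not call OpenNLP; it assumes
the tokens have already been tagged by some tagger that emits Penn
Treebank-style tags (NN*, VB*), and the example sentence below is tagged by
hand. The function names are hypothetical.

```python
# Coarse classes keyed by the first two letters of a Penn Treebank-style tag
# (assumption: the tagger in use emits tags of this style).
COARSE = {"NN": "noun", "VB": "verb"}


def coarse_pos(tag):
    """Return 'noun', 'verb', or None for any other part of speech."""
    return COARSE.get(tag[:2])


def filter_terms(tagged):
    """Use 1: keep only nouns and verbs (reduces the number of terms)."""
    return [tok.lower() for tok, tag in tagged if coarse_pos(tag)]


def qualify_terms(tagged):
    """Use 2: keep nouns and verbs and attach the class, so the same
    spelling with different roles becomes two distinct terms."""
    return [
        f"{tok.lower()}:{coarse_pos(tag)}"
        for tok, tag in tagged
        if coarse_pos(tag)
    ]


# Hand-tagged example sentence, not tagger output.
sentence = [
    ("They", "PRP"), ("can", "VBP"), ("the", "DT"), ("fish", "NN"),
    ("in", "IN"), ("a", "DT"), ("can", "NN"),
]
print(filter_terms(sentence))   # ['can', 'fish', 'can']
print(qualify_terms(sentence))  # ['can:verb', 'fish:noun', 'can:noun']
```

With plain filtering the two occurrences of "can" collapse into one term;
with qualification they stay separate, which is the increase in dimensions
described in the thread.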