mahout-dev mailing list archives

From "Robin Anil" <robin.a...@gmail.com>
Subject Re: Regarding Google Summer of Code Lucene Mahout Project
Date Tue, 25 Mar 2008 13:55:21 GMT
Hi Isabel,

On Tue, Mar 25, 2008 at 2:52 AM, Isabel Drost <apache_mahout@isabel-drost.de>
wrote:
>
> On Monday 24 March 2008, Robin Anil wrote:
>
> > The Complement-Naive-Bayes-Classifier(coded up for this project) then run on
> > the retrieved document to do post processing.
>
> The ideas presented in the slides look pretty interesting to me. Could you
> please provide some pointers to information in the Complement Naive Bayes
> Classifier? What were the reasons you chose this classifier?
>
Before going into Complement Naive Bayes, a few words about text
classification in general. Given a good amount of training data, as is
usually the case with text, Naive Bayes surprisingly performs better than
most other supervised learners. The reason, as I see it, is that Naive Bayes
class margins are so bluntly defined that overfitting is rare. This is also
the reason why, given the proper features, Naive Bayes doesn't measure up to
other methods. So you may say Naive Bayes is a good classifier for textual
data.

Complement Naive Bayes does the reverse. Instead of calculating which class
fits the document best, it calculates which complement class fits the
document least. It also removes the bias introduced by the prior probability
term in the NB equation. You may be interested in the paper that discusses
this in more detail:
<http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf>. My BaseClassifier
implementation reproduces the work there. The different classifiers
(SpamDetection, Subjectivity, Polarity) all inherit from the base classifier,
but the feature selection module is overridden for each of them.

As you can see, all of them except Polarity (classes are Pos, Neg, Neutral)
are binary classifiers, where CNB is exactly the same as NB (just a negative
sign difference). But other things, like normalization, made a lot of
difference in removing false positives and biased classes.
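
To make the complement idea above concrete, here is a minimal sketch in
Python. It is not the BaseClassifier mentioned above; the function names
(train_cnb, classify_cnb), the smoothing parameter alpha, and the toy data in
the usage note are all assumptions made for illustration. It estimates
smoothed term weights from every class except c, normalizes the weights per
class, and picks the class whose complement fits the document least. No prior
term is used.

```python
import math
from collections import Counter


def train_cnb(docs, labels, alpha=1.0):
    """Return normalized complement log-weights: weights[class][term]."""
    vocab = sorted({term for doc in docs for term in doc})
    classes = sorted(set(labels))

    # Raw term counts per class.
    counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        counts[label].update(doc)

    weights = {}
    for c in classes:
        # Complement counts: term occurrences in every class except c.
        comp = Counter()
        for other in classes:
            if other != c:
                comp.update(counts[other])
        total = sum(comp.values()) + alpha * len(vocab)
        log_w = {t: math.log((comp[t] + alpha) / total) for t in vocab}
        # Weight normalization, so that no class dominates through scale.
        norm = sum(abs(v) for v in log_w.values())
        weights[c] = {t: v / norm for t, v in log_w.items()}
    return weights


def classify_cnb(weights, doc):
    """Pick the class whose complement fits the document least."""
    tf = Counter(doc)

    def complement_fit(c):
        # Terms never seen in training contribute nothing.
        return sum(n * weights[c].get(t, 0.0) for t, n in tf.items())

    return min(weights, key=complement_fit)
```

With a toy training set such as docs = [["good", "great", "good"],
["bad", "awful"]] and labels = ["pos", "neg"], calling
classify_cnb(train_cnb(docs, labels), ["good", "great"]) selects "pos",
because the complement of "pos" (the "neg" documents) explains those terms
worst.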

>
>
> > If its possible to have the classifier run along with Lucene and
> > spit out sentences and add them to a field in real-time, It would
> > essentially enable this system to be online and allow for real-time
> > queries.
>
> So what you are hoping for is a system that can crawl and answer queries at
> the same time, integrating more and more information as it becomes available,
> right?
>
Yes and no.
Yes, because the system needs to go through the index, get the documents,
process the sentences, and get all opinions, not necessarily only those about
the target.
No, because the queries aren't fixed. If you disregard the TREC queries, say a
person is sitting there asking for opinions about a target. He may type
"Nokia 6600" or "My left hand". Now I would have to go through the DB, find
everything that talks about Nokia and the other target, and do the
post-processing if it has not been done yet. Another reason is that ranking
the results becomes a problem. How do I say which of the 1000 results gives
the better opinion: the doc that talks more about the target, or the one that
has more opinions about the target? Neither; we need to rank them based on
the output of the classification algorithms.

This is where I see the use of Mahout. Say we have the core Lucene
architecture modded with Mahout. If I can give the results of a Mahout
classifier to Lucene for its ranking function, based on subjectivity,
polarity, etc., then not only will it become easy to implement good IR
systems for research, it can also give rise to some really funky use cases
for complex production IR systems.
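
As a rough illustration of feeding classifier output into ranking, the sketch
below re-ranks search hits by blending a retrieval score with a subjectivity
score. It does not use the Lucene or Mahout APIs; the Hit class, its fields,
the rerank function, and the linear blend with opinion_weight are all
hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    retrieval_score: float  # relevance from the search engine, assumed in [0, 1]
    subjectivity: float     # classifier confidence that the text is opinionated, in [0, 1]


def rerank(hits, opinion_weight=0.5):
    """Sort hits by a linear blend of retrieval and classifier scores."""
    def combined(hit):
        return ((1.0 - opinion_weight) * hit.retrieval_score
                + opinion_weight * hit.subjectivity)

    # Highest combined score first.
    return sorted(hits, key=combined, reverse=True)
```

With opinion_weight=0 the order is the plain retrieval order; raising it
moves opinionated documents up, which is the behaviour an opinion search
would want. A polarity score could be blended in the same way.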
>
> > I would gladly answer any queries except results
>
> Hmm, so for this competition there is no sample dataset available to test the
> performance of the algorithms against? Sounds like there is no way to
> determine which of two competing solutions is better except making two
> submissions...
>
Well, throughout the year, competing researchers each give one or two queries
and hand-made results, which are compiled and tested against each other.

> Isabel
>
>
> --
> The ideal voice for radio may be defined as showing no substance, no sex, no
> owner, and a message of importance for every housewife.         -- Harry V. Wade
>
>
>
>  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>  /,`.-'`'    -.  ;-;;,_
>  |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_) (fL)  IM:  <xmpp://MaineC.@spaceboyz.net>
>



-- 
Robin Anil
4th Year Dual Degree Student
Department of Computer Science & Engineering
IIT Kharagpur

--------------------------------------------------------------------------------------------
techdigger.wordpress.com
A discursive take on the world around us

www.minekey.com
You Might Like This

www.ithink.com
Express Yourself
