hbase-user mailing list archives

From Andrew Look <al...@shopzilla.com>
Subject Re: Generating FAQ's from Stack Overflow?
Date Tue, 15 Mar 2011 03:05:44 GMT
Interesting, faqcluster.com does appear to be a successful application of Mahout. Categorizing
by question+all replies seems like a smart approach.

Do you think that choosing the best answer based on the cosine similarity of included terms
is the best way to go?  Is choosing a single answer even the best approach? It seems that
in many cases, a coherent answer to a question emerges after a number of people have replied
to the question at hand.

For instance, at http://faqcluster.com/question-521113443 :
    * Q: Is there a SVM classifier implemented in Mahout?
    * A: See also o.a.m.classifier.sgd.TrainNewsGroups

While in the source conversation, a number of useful pieces of information (even additional
questions) are divulged in between the question and faqcluster’s chosen answer: http://lucene.472066.n3.nabble.com/SVM-classifier-td2028948.html
    * Q1: Is there a SVM classifier implemented in Mahout?
    * A1: No. But the SGD classifier should have similar characteristics. There is also a
rough draft of an SVM implementation available as a patch.
    * Q2: where can I know more about SGD classifier? mahout wiki did not help :(
    * A2i: https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression
Sorry about that.  What queries did you use?
    * A2ii: See also o.a.m.classifier.sgd.TrainNewsGroups

Looking at this, I think that an interesting approach (to extracting the most useful information
from a thread) would be to take the original question and all replies, and form an adjacency
matrix of the terms shared by each pair of messages:

Match(Q1, A1) = (SVM, classifier)
Match(Q1, Q2) = (classifier, mahout)
Match(A1, Q2) = (SGD, classifier)

This way coherent responses could be chained together in order to aggregate more useful information,
while people replying on tangents or spamming would tend to get left out.
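
A minimal sketch of the shared-term matching and chaining described above, in plain Python rather than Mahout. The tokenizer, the tiny stopword list, the function names, and the "off-topic" message are all assumptions made for illustration; only Q1, A1 and Q2 come from the thread quoted earlier.

```python
import re
from itertools import combinations

# Hypothetical, deliberately tiny stopword list; a real one would be much larger.
STOPWORDS = {
    "is", "there", "a", "an", "the", "in", "of", "as", "no", "but", "also",
    "should", "have", "where", "can", "i", "know", "more", "about", "did", "not",
}

def terms(text):
    """Lowercase a message and return its set of non-stopword terms."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def shared_terms(thread):
    """Map each pair of message ids to the terms both messages contain."""
    sets = {mid: terms(text) for mid, text in thread.items()}
    return {(a, b): sets[a] & sets[b] for a, b in combinations(thread, 2)}

def chain_from(start, matches, min_shared=1):
    """Collect messages reachable from `start` via pairs sharing enough terms."""
    reached, frontier = {start}, [start]
    while frontier:
        current = frontier.pop()
        for (a, b), shared in matches.items():
            if current in (a, b) and len(shared) >= min_shared:
                other = b if current == a else a
                if other not in reached:
                    reached.add(other)
                    frontier.append(other)
    return reached

thread = {
    "Q1": "Is there a SVM classifier implemented in Mahout?",
    "A1": "No. But the SGD classifier should have similar characteristics. "
          "There is also a rough draft of an SVM implementation available as a patch.",
    "Q2": "where can I know more about SGD classifier? mahout wiki did not help :(",
    "off-topic": "Unsubscribe me please",  # made-up tangent, not from the thread
}
matches = shared_terms(thread)
coherent = chain_from("Q1", matches)
```

With this input, the Q1/A1, Q1/Q2 and A1/Q2 entries of `matches` hold the same term pairs as the three Match lines above, and `coherent` contains Q1, A1 and Q2 but not the tangent. Raising `min_shared` would make the chaining stricter.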

Thoughts on how mahout might help create such an adjacency matrix? Obviously cosine similarity
would still form the distances between each reply in a given thread, but it seems like having
some way of weighting each term’s specificity would help too – i.e. SGD or SVM are more
specific than classifier, and classifier is more specific than Mahout since we’re looking
at the mahout mailing list...
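
One common way to weight term specificity is inverse document frequency (IDF), which I am swapping in here as an assumption; it is not something faqcluster.com or Mahout is confirmed to do in this thread. The sketch below is plain Python, and the four-message corpus is made up purely to show the effect: a term that appears in every message (like "mahout" on the Mahout list) gets zero weight, while rarer terms count for more in the cosine similarity.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split a message into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def idf_weights(corpus):
    """IDF per term: log(N / number of messages containing the term)."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(tokenize(doc)))
    return {t: math.log(n / d) for t, d in df.items()}

def tfidf(text, idf):
    """Term frequency times IDF; terms unseen in the corpus get weight 0 (a simplification)."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(text)).items()}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Made-up stand-in for a mailing list archive, for illustration only.
corpus = [
    "mahout svm classifier",
    "mahout sgd classifier",
    "mahout clustering kmeans",
    "mahout recommender",
]
idf = idf_weights(corpus)
vectors = [tfidf(doc, idf) for doc in corpus]
sim_classifiers = cosine(vectors[0], vectors[1])  # share "classifier" and "mahout"
sim_unrelated = cosine(vectors[0], vectors[3])    # share only "mahout"
```

Here "svm" and "sgd" outweigh "classifier", which in turn outweighs "mahout", so two messages that only have "mahout" in common end up with zero similarity.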

- Andrew

On 3/14/11 6:39 PM, "Ted Dunning" <tdunning@maprtech.com> wrote:

I found it.  The student in question was named Stefan Henß.  See here for details: http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3C4D660038.2000807@gmail.com%3E

The results were quite surprisingly good for how simple the techniques used are.

On Wed, Mar 2, 2011 at 12:39 PM, Ted Dunning <tdunning@maprtech.com> wrote:
I have looked but can't find the postings by a student who recently posted about their FAQ
extraction program.  The results were pretty good in terms of precision and the extracted
answers were very nice.  The methods used were quite simple.

Does anybody else remember this interchange?  Did it not occur here?  Did I imagine it?

On Wed, Mar 2, 2011 at 12:30 PM, Andrew Look <alook@shopzilla.com> wrote:
Is there any easy way to export this data from sematext / stack overflow?
Or is web crawling/scraping the way to go here?

This is a good use case for Mahout, I've been looking for a problem to play
around on mahout with :)

On 3/2/11 1:05 AM, "Friso van Vollenhoven" <fvanvollenhoven@xebia.com> wrote:

> You could try using Apache Mahout to at least cluster the messages into groups
> of similar ones based on text features. That should be doable. Given the
> groups, you could manually extract questions (the clusters with most threads
> could be the most frequently asked). Also, if you manage to get this to work
> nicely, it could be a nice tool for other projects as well. Would be a fun
> exercise anyways...
> I am starting to toy with Mahout for another pet project. Once I get more
> comfortable with it, I might be able to take this on (not a promise).
> I think automatic question extraction is a quite ambitious goal.
> Friso
> On 1 mrt 2011, at 19:12, Stack wrote:
>> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> <otis_gospodnetic@yahoo.com> wrote:
>>>> Do you have  something in mind?  Could we be making better use of the
>>>> sematext  summaries?
>>> Hm... we already index HBase and other Digests on search-hadoop.com
>>> I was thinking more along the lines of mining the ML archives and doing
>>> automatic Q&A extraction.
>>> I don't know how difficult it would be.  Maybe the input would be too noisy
>>> (people don't ask proper questions, answers are not full sentences, quote
>>> characters prefixing lines from old messages add a layer of complexity...),
>>> but that's what I thought you might have meant.
>> That'd be a nice addition to the docs.  Our FAQ is in need of
>> updating.  This would be a nice undertaking if someone was up for
>> taking it on.
>> St.Ack
