mahout-user mailing list archives

From "Stefan Henß"
Subject Re: Automatically extracted Mahout FAQs
Date Mon, 07 Mar 2011 10:52:04 GMT

Thanks for your answer, and sorry that it took me so long to reply.

The quoting problem definitely requires a more advanced approach.
Currently I'm just using regular expressions, which works fine for
forums, but email clients are too diverse in how they quote for me to be
sure of capturing them all. I guess I'll give n-grams a try as soon as
possible.
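
A minimal sketch of the n-gram idea, not the code I am running: a line
counts as quoted when most of its character trigrams already occur in
earlier messages of the same thread. The function names, the trigram
size and the 0.8 threshold are assumptions chosen for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of text, ignoring case and extra whitespace."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def quoted_fraction(line, previous_ngrams, n=3):
    """Fraction of the line's n-grams that already occur in earlier messages."""
    grams = char_ngrams(line, n)
    if not grams:
        return 0.0  # blank or very short lines are never treated as quotes
    return len(grams & previous_ngrams) / len(grams)


def strip_quoted_lines(message, previous_messages, n=3, threshold=0.8):
    """Drop lines that mostly repeat text from earlier messages in the thread.

    The threshold is a hypothetical value and would need tuning on real data.
    """
    previous_ngrams = set()
    for prev in previous_messages:
        previous_ngrams |= char_ngrams(prev, n)

    kept = []
    for line in message.splitlines():
        # Remove a leading quote marker if present; detection does not rely on it.
        candidate = line.lstrip("> ")
        if quoted_fraction(candidate, previous_ngrams, n) < threshold:
            kept.append(line)
    return "\n".join(kept)
```

Because the comparison is against the text of earlier messages rather
than against quote markers, this should also catch clients that quote
without any ">" prefix.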

To address the general question of precision and recall on my "Apache
dataset", here is some data that may be interesting for you and others.
A roughly equal number of threads was included for each list, for a
total of ~2500 threads.

### List / Precision / Recall
Spamassassin Users / 0.9838 / 0.9529
Mahout User / 1 / 0.8964
Jackrabbit Users / 0.9522 / 0.8746
Couchdb User / 0.9413 / 0.9084
Myfaces Users / 0.9102 / 0.8849
Tomcat Users / 0.9302 / 0.8116
Maven Users / 0.8742 / 0.8825
Ant User / 0.8585 / 0.8922

These are the results for the mining phase (LDA) itself. After the
question filtering the precision will be a little bit higher, while the
recall drops (of course). I guess this selection is too restrictive
(though other datasets benefit from it), which is why you find only a
few questions (only 100 out of approx. 400 selected for Mahout).
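
For reference, a small sketch of how precision and recall figures like
the ones above are computed, assuming the standard definitions. The
counts in the usage note are invented for illustration and are not taken
from my dataset.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Standard precision and recall from raw counts.

    precision = TP / (TP + FP): share of selected items that are correct
    recall    = TP / (TP + FN): share of correct items that were selected
    """
    selected = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / selected if selected else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall
```

With the made-up counts precision_recall(90, 10, 30), 90 of 100 selected
items are correct and 90 of 120 relevant items are found.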

The selection of questions/answers is based on two assumptions. A "real"
measure of formatting is not included yet; I only hope this is implied
to some extent by the following:
- Overly long messages are likely to be irrelevant for a FAQ for several
reasons (but for evaluating the system I "allow" them to be up to 600
characters long).
- An expert answer is expected to have a firm command of the domain's
language, so if the distance between the answer and the FAQ's model is
too high the answer is dropped (and so is the question if no answer is
left).
- (Measuring the distance is also applied in question selection.)
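
A rough sketch of the two selection assumptions above, not the actual
implementation. It assumes each message and each FAQ is represented by an
LDA topic distribution and uses the Jensen-Shannon divergence as the
distance; the distance measure, the 0.3 threshold and picking the
closest remaining answer are all assumptions made for illustration. Only
the 600-character limit comes from the description above.

```python
import math

MAX_CHARS = 600      # length limit mentioned above
MAX_DISTANCE = 0.3   # hypothetical threshold, would need tuning


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions (0 to 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def passes_filter(text, topics, faq_topics):
    """Keep a message only if it is short enough and close to the FAQ's model."""
    return (len(text) <= MAX_CHARS
            and js_divergence(topics, faq_topics) <= MAX_DISTANCE)


def select_pair(question, answers, faq_topics):
    """Return (question_text, answer_text), or None if the pair is dropped.

    question is a (text, topics) tuple; answers is a list of such tuples.
    """
    q_text, q_topics = question
    if not passes_filter(q_text, q_topics, faq_topics):
        return None
    candidates = [
        (js_divergence(topics, faq_topics), text)
        for text, topics in answers
        if passes_filter(text, topics, faq_topics)
    ]
    if not candidates:
        return None  # no answer left, so the question is dropped as well
    return q_text, min(candidates)[1]
```

A lower threshold trades recall for precision, which matches the drop in
recall after filtering.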

And finally, regarding the continuation:

This very much depends on the feedback the project receives. As it is 
still in development I haven't really made it public so far. Also a 
paper will be released soon. So if people are interested this could be a 
candidate for open source.

Best regards,


On 23.02.2011 18:34, Ted Dunning wrote:
> This is very nice work!
> If you have achieved this level of accuracy without direct editing, 
> then this is very impressive.  In reading through the Mahout and Math 
> questions, I noted a few issues with quoting and a few complete 
> failures, but the good answers were very good.  I think that the 
> quoting issues could be improved by looking at the degree of string 
> matching relative to the previous items in the thread.  Small n-grams 
> are very effective for this and avoid the need for full edit distance 
> calculations.  For the failed cases, even a small amount of community 
> feedback would suffice to knock out the bad answers.  I think that the 
> favorable ratio of high quality answers to low quality answers is 
> definitely high enough to make it worth looking at.  If the ratio were 
> reversed, I think users would not find it worth the time to look.
> I do note that there are a very small number of questions that have 
> been answered compared to the number that I have seen go by on the 
> mailing list.  Is that because you are being very cautious about 
> keeping precision high?
> Finally, some questions:
> a) do you use any sort of measure to determine how well written the 
> questions and answers are?
> b) is this a dead-end school project or do you plan to continue with it?
> On Tue, Feb 22, 2011 at 9:15 PM, Stefan Henß wrote:
>     Hi everybody,
>     I'm currently doing research for my bachelor thesis on how to
>     automatically extract FAQs from unstructured data.
>     For this I've built a system automatically performing the following:
>     - Load thousands of conversations from forums and mailing lists
>     (don't mind the categories there).
>     - Build categorization solely based on the conversation's texts
>     (by clustering).
>     - Pick the best modelled categories as basis for one FAQ each.
>     - For each question (first entry in a conversation) find the best
>     reply from its answers.
>     - Select the most relevant and well formatted
>     question/answer-pairs for each FAQ.
>     Most of the steps almost completely rely on the data from the
>     categorization step which is obtained using the latent Dirichlet
>     allocation model.
>     For the evaluation part I'd like to ask you for having a look at
>     one or two FAQs and maybe give some comments on how far the
>     questions matched the FAQ's title, how relevant they were etc.
>     Here's the direct link to the Mahout FAQs:
>     (There are some other interesting FAQs as well at
>     Thanks for your help
>     Stefan
