couchdb-user mailing list archives

From "Stefan Henß" <>
Subject Re: Automatically extracted CouchDB FAQs
Date Mon, 07 Mar 2011 11:45:45 GMT
Hi Eli,

The subtitle is definitely misleading. It is only meant to give an idea 
of the topics contained in the FAQ, not of what the FAQ is limited to :-)

I do remove generic English terms before the clustering, but not mailing 
list-specific terms. In fact, those are the ones I'm trying to find :-) 
To validate that the clustering works properly, I use threads from 
several different mailing lists (currently 8) as the data set, without 
assigning any labels to them. Those common words are therefore my best 
hint for "rebuilding" the original mailing lists.
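
To illustrate the validation idea, here is a minimal sketch of one way to score how well the unlabeled clusters "rebuild" the original lists. It uses cluster purity, which is an assumed metric for illustration and not necessarily the one used in the thesis; the function name `cluster_purity` and the example list names are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, source_lists):
    """Fraction of threads that belong to the majority source list of their cluster.

    cluster_ids  -- cluster assigned to each thread by the (unsupervised) clustering
    source_lists -- mailing list each thread originally came from (used only for scoring)
    """
    if len(cluster_ids) != len(source_lists):
        raise ValueError("need one source list per clustered thread")
    if not cluster_ids:
        return 0.0

    # Count, per cluster, how many threads came from each original list.
    by_cluster = defaultdict(Counter)
    for cid, src in zip(cluster_ids, source_lists):
        by_cluster[cid][src] += 1

    # A thread counts as "correct" if it comes from its cluster's dominant list.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical example: 5 threads, 2 clusters, 2 original lists.
purity = cluster_purity(
    [0, 0, 0, 1, 1],
    ["couchdb", "couchdb", "lucene", "lucene", "lucene"],
)
print(purity)  # 0.8
```

A purity of 1.0 would mean every cluster contains threads from exactly one original list.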

But I still agree with your point. After the first clusters are found 
(hopefully including a 100% precise CouchDB FAQ), I run the mining 
algorithm again on the set of threads in each cluster to generate the 
second-level categorization. At that point I should definitely remove 
words that are too generic for the cluster, as they can only distort the 
further analysis. Thanks for pointing this out.
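
As a rough sketch of that second-level cleanup, the snippet below drops terms that appear in more than a given fraction of a cluster's threads before the cluster is mined again. The document-frequency criterion, the 0.5 threshold, and the function names are assumptions chosen for illustration, not the actual implementation.

```python
from collections import Counter


def cluster_generic_terms(threads, max_doc_fraction=0.5):
    """Return terms that occur in more than max_doc_fraction of the threads.

    threads -- list of token lists, one per thread in a single cluster
    """
    doc_freq = Counter()
    for tokens in threads:
        doc_freq.update(set(tokens))  # count each term once per thread
    limit = max_doc_fraction * len(threads)
    return {term for term, df in doc_freq.items() if df > limit}


def strip_generic_terms(threads, max_doc_fraction=0.5):
    """Remove the cluster's own generic terms from every thread."""
    generic = cluster_generic_terms(threads, max_doc_fraction)
    return [[t for t in tokens if t not in generic] for tokens in threads]


# Hypothetical first-level "couchdb" cluster, already tokenized.
couchdb_cluster = [
    ["couchdb", "replication", "conflict"],
    ["couchdb", "view", "reduce"],
    ["couchdb", "rails", "gem"],
    ["couchdb", "replication", "filter"],
]
print(cluster_generic_terms(couchdb_cluster))  # {'couchdb'}
print(strip_generic_terms(couchdb_cluster)[0])  # ['replication', 'conflict']
```

Because the generic-term set is computed per cluster, a word like "couchdb" stays available for the first-level clustering across lists and is only discarded once it no longer discriminates.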

Best regards,

On 23.02.2011 21:10, Eli Stevens (Gmail) wrote:
> Interesting project.  :)
> I didn't get a very strong sense of correlation between the topic
> categories and the questions in them.  For example,
> "Questions & Answers about Couchdb, Couch, Replication, Databases and Database."
> Had the following question:
> "I'm looking for a recommendation for ruby gem that will enable me to
> use couchdb from rails. I'd like to have couch documents be modeled by
> ActiveRecord."
> This didn't have any mention of replication (or databases), so I can
> only guess that it was clustering on "couch" or "couchdb".
> Do you do any screening of common terms from the clustering?  I'd
> imagine that if you looked at the user@couchdb mailing list, you could
> find a list of very common terms (like couch, couchdb, database, etc.)
> and discard or ignore those when trying to cluster the messages (in
> the same way that words like "the" and "and" shouldn't be used).
> Basically, a per-mailing-list set of generic terms.
> The questions and answers themselves seemed to be a nice, readable "I
> have X problem" "here is an answer" pair, so that was cool.  :)
> HTH,
> Eli
> On Tue, Feb 22, 2011 at 8:24 PM, Stefan Henß
> <>  wrote:
>> Hi everybody,
>> I'm currently doing research for my bachelor thesis on how to automatically
>> extract FAQs from unstructured data.
>> For this I've built a system automatically performing the following:
>> - Load thousands of conversations from forums and mailing lists (don't mind
>> the categories there).
>> - Build categorization solely based on the conversation's texts (by
>> clustering).
>> - Pick the best modelled categories as basis for one FAQ each.
>> - For each question (first entry in a conversation) find the best reply from
>> its answers.
>> - Select the most relevant and well formatted question/answer-pairs for each
>> FAQ.
>> For the evaluation part I'd like to ask you for having a look at one or two
>> FAQs and maybe give some comments on how far the questions matched the FAQ's
>> title, how relevant they were etc.
>> Here's the direct link to the CouchDB FAQs:
>> And here a quite good example in my opinion:
>> (There are some other interesting FAQs as well at
>> Thanks for your help
>> Stefan
