From: "Eli Stevens (Gmail)"
To: user@couchdb.apache.org
Cc: Stefan Henß
Date: Wed, 23 Feb 2011 12:10:18 -0800
Subject: Re: Automatically extracted CouchDB FAQs

Interesting project. :)

I didn't get a very strong sense of correlation between the topic categories and the questions in them. For example, this category:

http://faqcluster.com/couchdb-replication-couch-databases-database
"Questions & Answers about Couchdb, Couch, Replication, Databases and Database."

contained the following question:

http://faqcluster.com/question1996757514
"I'm looking for a recommendation for ruby gem that will enable me to use couchdb from rails. I'd like to have couch documents be modeled by ActiveRecord."

The question doesn't mention replication (or databases), so I can only guess that it was clustered on "couch" or "couchdb".

Do you screen out common terms before clustering? I'd imagine that if you looked at the user@couchdb mailing list, you could find a list of very common terms (like couch, couchdb, database, etc.) and discard or ignore those when clustering the messages, in the same way that words like "the" and "and" shouldn't be used. Basically, a per-mailing-list set of generic terms.
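To make the suggestion concrete, here is a minimal sketch of a per-list stop-word filter. It is only an illustration: I don't know how your pipeline tokenizes or clusters, so the function names, the sample messages, and the 50% document-frequency cutoff are all my own assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def corpus_stop_words(messages, max_doc_fraction=0.5):
    """Return terms that appear in more than max_doc_fraction of the messages.

    The cutoff is a hypothetical default; it would need tuning per list.
    """
    doc_freq = Counter()
    for message in messages:
        # Count each term once per message (document frequency).
        doc_freq.update(set(tokenize(message)))
    threshold = max_doc_fraction * len(messages)
    return {term for term, count in doc_freq.items() if count > threshold}


def strip_generic_terms(message, stop_words):
    """Tokenize a message and drop the list-specific generic terms."""
    return [term for term in tokenize(message) if term not in stop_words]


# Made-up sample messages standing in for a mailing-list corpus.
messages = [
    "How do I set up couchdb replication between two servers?",
    "Which ruby gem lets me use couchdb from rails?",
    "My couchdb view is slow to rebuild after a bulk insert",
    "Does couchdb replication handle conflicts automatically?",
]

stop_words = corpus_stop_words(messages)
filtered = [strip_generic_terms(m, stop_words) for m in messages]
```

In this toy corpus only "couchdb" crosses the cutoff, so "replication" survives as a distinguishing term. The filtered token lists would then be fed to the clustering step in place of the raw text, alongside the usual general-English stop words.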

The questions and answers themselves seemed to be a nice, readable "I have X problem" / "here is an answer" pair, so that was cool. :)

HTH,
Eli

On Tue, Feb 22, 2011 at 8:24 PM, Stefan Henß wrote:
> Hi everybody,
>
> I'm currently doing research for my bachelor thesis on how to automatically
> extract FAQs from unstructured data.
>
> For this I've built a system automatically performing the following:
> - Load thousands of conversations from forums and mailing lists (don't mind
>   the categories there).
> - Build categorization solely based on the conversation's texts (by
>   clustering).
> - Pick the best modelled categories as basis for one FAQ each.
> - For each question (first entry in a conversation) find the best reply from
>   its answers.
> - Select the most relevant and well formatted question/answer-pairs for each
>   FAQ.
>
> For the evaluation part I'd like to ask you for having a look at one or two
> FAQs and maybe give some comments on how far the questions matched the FAQ's
> title, how relevant they were etc.
>
> Here's the direct link to the CouchDB FAQs:
> http://faqcluster.com/couchdb-view-document-doc-couch
>
> And here a quite good example in my opinion:
> http://faqcluster.com/question1516894006
>
> (There are some other interesting FAQs as well at http://faqcluster.com/)
>
> Thanks for your help
>
> Stefan