Date: Mon, 14 Jul 2014 15:37:10 -0500
Subject: Re: CVB: Incorrect mapping between p(topic | term) and p(doc | topic) dump files
From: Mohammed Omer
To: user@mahout.apache.org

All - to help illustrate the issue, I've put together my mahout cvb script
and some truncated output files, with real data, here for your review:

https://gist.github.com/momer/3ddaaa0c291a91d25709

Not sure if this is frowned upon, but to get some eyes on this issue sooner,
I'll donate $200 to the Apache foundation if we can figure this out by the
end of the week, and $100 if we can figure it out by the end of next week!

Thank you,
Mo

On Sun, Jul 13, 2014 at 1:06 PM, Mohammed Omer wrote:

> All - I'm having the same issue as mentioned at
> http://comments.gmane.org/gmane.comp.apache.mahout.user/18889 on Mahout
> 0.9. My CVB clusters describe my corpus well; however, the mapping file
> generated by mahout's `rowid` seems to be way off.
>
> For example, there's a very obvious cluster which has keywords like "beer,
> stout, pale" - the only cluster to contain these keywords. In my vectordump
> for the p(term | topic) this cluster is at line 217. The vector dump is
> generated by:
>
> echo `date` ": Dumping the p(term | topic) vectors to local filesystem..."
> $mahout_bin/mahout vectordump -i results/cvb_results/to_out \
>   --dictionary results/seq2sparse_results/dictionary.file-0 \
>   --vectorSize $NUM_KEYWORDS -sort results/cvb_results/to_out \
>   -o $OUTPUT_DIR/$PTOPIC_TERM_FILE -dt sequencefile
>
> And while the p(doc | topic) dump groups all of the documents which
> contain the words "beer, stout, pale" together, it puts them into cluster
> number 8. The dump is created via:
>
> echo `date` ": Dumping the p(doc | topic) vectors to local filesystem..."
> $mahout_bin/mahout vectordump -i results/cvb_results/do_out \
>   -sort results/cvb_results/do_out \
>   -o $OUTPUT_DIR/$PDOC_TOPIC_FILE -p true -c csv -n true -u true
>
> I.e., a line of the p(doc | topic) dump looks like:
>
> 123 0.001,...,0.60,...
>
> where 123 maps to a document about "beer, stout, pale" and where 0.60 is
> the 9th comma-separated value -- thus belonging to cluster id #8
> (zero-indexed).
>
> However, if we look at the p(term | topic) file dumped earlier, cluster
> id #8 has nothing to do with this document.
>
> Additionally, I wrote a script to review all of the documents belonging to
> any given cluster, and all of the documents in cluster #8 actually map to
> the p(term | topic) entry described by cluster #217. That is to say, these
> are the only documents containing the ngrams / keywords that cluster #217
> shows as describing it.
>
> I can't figure out where the gap is: is it in the rowid docIndex/matrix I
> have? I've tried dumping the above two files without sorting, as I figured
> that might be rearranging the order of the cluster probabilities in the
> p(doc | topic) dump, but I believe that turned out inconclusive.
>
> I would love any ideas - I've been stumped on this for a little while now.
>
> Thank you,
>
> Mo
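
For anyone who wants to reproduce the check described in the quoted message, here is a minimal sketch of how one might read the CSV dump and find each document's strongest cluster. It is not the script from the gist. The function names `dominant_topic` and `group_by_topic` are hypothetical, and the sketch assumes each data line is a document key, then whitespace, then comma-separated probabilities, as in the `123 0.001,...,0.60,...` example. Skipping lines that start with `#` is also an assumption, in case the dump contains comment lines.

```python
from collections import defaultdict


def dominant_topic(line):
    """Return (doc_key, zero_based_topic_index, probability) for one dump line.

    Assumes the layout 'key<whitespace>p0,p1,p2,...' (an assumption based on
    the example line in the message, not on the Mahout source).
    """
    key, values = line.strip().split(None, 1)
    probs = [float(v) for v in values.split(",")]
    # Index of the largest probability; ties resolve to the lowest index.
    best = max(range(len(probs)), key=probs.__getitem__)
    return key, best, probs[best]


def group_by_topic(lines):
    """Map each zero-based topic index to the document keys it dominates."""
    groups = defaultdict(list)
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and (assumed) comment lines.
        if not stripped or stripped.startswith("#"):
            continue
        key, topic, _ = dominant_topic(stripped)
        groups[topic].append(key)
    return dict(groups)
```

Called on the lines of the document dump, `group_by_topic` gives the list of document keys per cluster index, which can then be compared by hand against the corresponding entry of the term dump.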