Subject: svn commit: r887363 - in /websites/staging/mahout/trunk/content: ./ users/basics/creating-vectors-from-text.html
Date: Wed, 20 Nov 2013 15:39:43 -0000
To: commits@mahout.apache.org
From: buildbot@apache.org
Reply-To: dev@mahout.apache.org
Message-Id: <20131120153943.85FE02388A56@eris.apache.org>

Author: buildbot
Date: Wed Nov 20 15:39:43 2013
New Revision: 887363

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/ (props changed)
    websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Nov 20 15:39:43 2013
@@ -1 +1 @@
-1543844
+1543845

Modified: websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html (original)
+++ websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html Wed Nov 20 15:39:43 2013
@@ -415,22 +415,26 @@ option. Examples of running the Driver

Generating an output file from a Lucene Index

- $MAHOUT_HOME/bin/mahout lucene.vector \
+ $MAHOUT_HOME/bin/mahout lucene.vector
+
--output --field --dictOut > <--norm {INF|integer >= 0}>
-<--idField >
+
+ <--idField >

Create 50 Vectors from an Index

- $MAHOUT_HOME/bin/mahout lucene.vector --dir
-/wikipedia/solr/data/index --field body \
- --dictOut /solr/wikipedia/dict.txt --output
-/solr/wikipedia/out.txt --max 50
+
+ $MAHOUT_HOME/bin/mahout lucene.vector --dir /wikipedia/solr/data/index --field body
+
+ --dictOut /solr/wikipedia/dict.txt
+
+ --output /solr/wikipedia/out.txt --max 50
+

This uses the index specified by --dir and the body field in it and writes
@@ -439,13 +443,16 @@ outputs 50 vectors. If you don't specif
the index are output.


Normalize 50 Vectors from a Lucene Index using the [L_2 Norm|http://en.wikipedia.org/wiki/Lp_space]

-
$MAHOUT_HOME/bin/mahout lucene.vector --dir
-
+
+
+ $MAHOUT_HOME/bin/mahout lucene.vector --dir /wikipedia/solr/data/index --field body
+ --dictOut /solr/wikipedia/dict.txt
+
+ --output /solr/wikipedia/out.txt --max 50 --norm 2
+
+
-

/wikipedia/solr/data/index --field body \
- --dictOut /solr/wikipedia/dict.txt --output
-/solr/wikipedia/out.txt --max 50 --norm 2
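
For readers unfamiliar with the `--norm 2` option used in the command above, the following is a minimal sketch of what L_2 normalization does to a single vector: every component is divided by the vector's Euclidean length. The class and method names (`L2NormSketch`, `normalizeL2`) are hypothetical and exist only for illustration; this is not Mahout's implementation, and how Mahout handles edge cases such as all-zero vectors is not stated in this page.

```java
import java.util.Arrays;

// Illustrative only: not part of Mahout. Names are hypothetical.
final class L2NormSketch {

    /** Returns a copy of v scaled to unit Euclidean (L_2) length. */
    static double[] normalizeL2(double[] v) {
        double sumSquares = 0.0;
        for (double x : v) {
            sumSquares += x * x;
        }
        double norm = Math.sqrt(sumSquares);
        if (norm == 0.0) {
            // Assumption for this sketch: an all-zero vector is returned unchanged.
            return v.clone();
        }
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical term counts for one document.
        double[] termCounts = {3.0, 4.0, 0.0};
        // The Euclidean length is 5, so this prints [0.6, 0.8, 0.0]
        System.out.println(Arrays.toString(normalizeL2(termCounts)));
    }
}
```

Normalizing this way keeps long documents from dominating distance computations purely because they contain more terms.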

From Directory of Text documents

Mahout has utilities to generate Vectors from a directory of text
@@ -464,11 +471,16 @@ the document id generated is /document.txt

From the examples directory run

- $MAHOUT_HOME/bin/mahout seqdirectory \
- --input --output \
- <-c {UTF-8|cp1252|ascii...}> \
- <-chunk 64> \
+
+ $MAHOUT_HOME/bin/mahout seqdirectory
+ --input --output
+
+ <-c {UTF-8|cp1252|ascii...}>
+
+ <-chunk 64>
+
<-prefix >
+

@@ -477,19 +489,33 @@ PARENT>/document.txt

From the sequence file generated from the above step run the following to generate vectors.

- $MAHOUT_HOME/bin/mahout seq2sparse \
- -i -o \
- <-wt {tf|tfidf}> \
- <-chunk 100> \
+ $MAHOUT_HOME/bin/mahout seq2sparse
+
+ -i
+
+ -o
+
+ <-wt {tf|tfidf}>
+
+ <-chunk 100>
+
<-a
-org.apache.lucene.analysis.standard.StandardAnalyzer> \
- <--minSupport 2> \
- <--minDF 1> \
- <--maxDFPercent 99> \
+
+
+
+
+org.apache.lucene.analysis.standard.StandardAnalyzer>
+
+ <--minSupport 2>
+
+ <--minDF 1>
+
+ <--maxDFPercent 99>
+
<--norm {INF|integer >= 0}>"
- <-seq {false|true required for running some
-algorithms(LDA,Lanczos)}>"
+
+ <-seq {false|true required for running some algorithms(LDA,Lanczos)}>"
+
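
The pruning options in the seq2sparse command above (`--minSupport`, `--minDF`, `--maxDFPercent`) can be illustrated with a small sketch. It assumes that `--minSupport` is a minimum total term count, `--minDF` a minimum number of documents containing the term, and `--maxDFPercent` a maximum share of documents containing the term, each applied as an inclusive bound. Those exact comparison rules are an assumption of this sketch, and the class and method names (`PruningSketch`, `keptTerms`) are hypothetical; this is not Mahout's code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: not part of Mahout. Names and thresholds semantics are assumptions.
final class PruningSketch {

    /** Returns the terms that survive all three frequency filters, in sorted order. */
    static Set<String> keptTerms(List<List<String>> docs, int minSupport, int minDF, int maxDFPercent) {
        Map<String, Integer> totalCount = new HashMap<>(); // occurrences across the corpus
        Map<String, Integer> docFreq = new HashMap<>();    // number of documents containing the term
        for (List<String> doc : docs) {
            for (String term : doc) {
                totalCount.merge(term, 1, Integer::sum);
            }
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : totalCount.entrySet()) {
            int df = docFreq.get(e.getKey());
            boolean frequentEnough = e.getValue() >= minSupport;
            boolean inEnoughDocs = df >= minDF;
            // Integer form of df / numDocs <= maxDFPercent / 100.
            boolean notTooCommon = 100L * df <= (long) maxDFPercent * docs.size();
            if (frequentEnough && inEnoughDocs && notTooCommon) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical pre-tokenized documents.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("the", "cat", "sat"),
            Arrays.asList("the", "dog", "sat", "sat"),
            Arrays.asList("the", "cat", "ran"),
            Arrays.asList("the", "bird"));
        // "the" appears in 100% of documents and is dropped; single-occurrence terms are dropped.
        System.out.println(keptTerms(docs, 2, 1, 99)); // prints [cat, sat]
    }
}
```

With the values shown in the command (2, 1, 99), a term in every document is treated as too common to be a useful feature, while a term seen only once is treated as too rare.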

--minSupport is the min frequency for the word to be considered as a
@@ -511,9 +537,10 @@ format. Probably the easiest way to go w
Iterable (called VectorIterable in the example below) and then reuse the existing VectorWriter classes:

- VectorWriter vectorWriter = SequenceFile.createWriter(filesystem,
-configuration, outfile, LongWritable.class, SparseVector.class);
+ VectorWriter vectorWriter = SequenceFile.createWriter(filesystem, configuration, outfile, LongWritable.class, SparseVector.class);
+
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
+
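
The pattern in the snippet above, wrapping a custom data source in an `Iterable` and handing it to a writer that consumes up to a maximum number of vectors, can be sketched without any Mahout or Hadoop dependency. Everything below is a stand-in: `RowVectorIterable`, the comma-separated row format, and the `write` helper are hypothetical and only mirror the shape of the `vectorWriter.write(iterable, maxDocs)` call; they are not Mahout's `VectorWriter` API.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative only: plain-Java stand-ins, not Mahout or Hadoop classes.
final class VectorIterableSketch {

    /** Hypothetical adapter: parses each comma-separated row into a vector on demand. */
    static final class RowVectorIterable implements Iterable<double[]> {
        private final List<String> rows;

        RowVectorIterable(List<String> rows) {
            this.rows = rows;
        }

        @Override
        public Iterator<double[]> iterator() {
            final Iterator<String> it = rows.iterator();
            return new Iterator<double[]>() {
                @Override
                public boolean hasNext() {
                    return it.hasNext();
                }

                @Override
                public double[] next() {
                    String[] parts = it.next().split(",");
                    double[] v = new double[parts.length];
                    for (int i = 0; i < parts.length; i++) {
                        v[i] = Double.parseDouble(parts[i].trim());
                    }
                    return v;
                }
            };
        }
    }

    /** Stand-in writer: appends at most maxDocs vectors to the sink and returns how many were written. */
    static long write(Iterable<double[]> vectors, long maxDocs, StringBuilder sink) {
        long written = 0;
        for (double[] v : vectors) {
            if (written >= maxDocs) {
                break;
            }
            sink.append(written).append('\t').append(Arrays.toString(v)).append('\n');
            written++;
        }
        return written;
    }

    public static void main(String[] args) {
        // Hypothetical custom data: one document vector per row.
        List<String> rows = Arrays.asList("1, 0, 2", "0, 3, 0", "4, 4, 4");
        StringBuilder sink = new StringBuilder();
        long numDocs = write(new RowVectorIterable(rows), Long.MAX_VALUE, sink);
        System.out.print(sink);
        System.out.println("numDocs = " + numDocs);
    }
}
```

Because vectors are produced lazily by the iterator, the source never has to be held in memory as a full list of vectors, which is the main reason to prefer an `Iterable` adapter over building a collection first.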