Subject: svn commit: r887363 - in /websites/staging/mahout/trunk/content: ./ users/basics/creating-vectors-from-text.html
Date: Wed, 20 Nov 2013 15:39:43 -0000
To: commits@mahout.apache.org
From: buildbot@apache.org
X-Mailer: svnmailer-1.0.9
Message-Id: <20131120153943.85FE02388A56@eris.apache.org>
X-Virus-Checked: Checked by ClamAV on apache.org
Author: buildbot
Date: Wed Nov 20 15:39:43 2013
New Revision: 887363
Log:
Staging update by buildbot for mahout
Modified:
websites/staging/mahout/trunk/content/ (props changed)
websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html
Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Nov 20 15:39:43 2013
@@ -1 +1 @@
-1543844
+1543845
Modified: websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html (original)
+++ websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html Wed Nov 20 15:39:43 2013
@@ -415,22 +415,26 @@ option. Examples of running the Driver
Generating an output file from a Lucene Index
- $MAHOUT_HOME/bin/mahout lucene.vector \
+ $MAHOUT_HOME/bin/mahout lucene.vector
+
--output --field --dictOut > <--norm {INF|integer >= 0}>
-<--idField >
+
+ <--idField >
Create 50 Vectors from an Index
- $MAHOUT_HOME/bin/mahout lucene.vector --dir
-/wikipedia/solr/data/index --field body \
- --dictOut /solr/wikipedia/dict.txt --output
-/solr/wikipedia/out.txt --max 50
+
+ $MAHOUT_HOME/bin/mahout lucene.vector --dir /wikipedia/solr/data/index --field body
+
+ --dictOut /solr/wikipedia/dict.txt
+
+ --output /solr/wikipedia/out.txt --max 50
+
This uses the index specified by --dir and the body field in it and writes
@@ -439,13 +443,16 @@ outputs 50 vectors. If you don't specif
the index are output.
Normalize 50 Vectors from a Lucene Index using the [L_2 Norm|http://en.wikipedia.org/wiki/Lp_space]
-$MAHOUT_HOME/bin/mahout lucene.vector --dir
-
+
+
+ $MAHOUT_HOME/bin/mahout lucene.vector --dir /wikipedia/solr/data/index --field body
+ --dictOut /solr/wikipedia/dict.txt
+
+ --output /solr/wikipedia/out.txt --max 50 --norm 2
+
+
-/wikipedia/solr/data/index --field body \
- --dictOut /solr/wikipedia/dict.txt --output
-/solr/wikipedia/out.txt --max 50 --norm 2
From Directory of Text documents
Mahout has utilities to generate Vectors from a directory of text
@@ -464,11 +471,16 @@ the document id generated is /document.txt
From the examples directory run
- $MAHOUT_HOME/bin/mahout seqdirectory \
- --input --output
@@ -477,19 +489,33 @@ PARENT>/document.txt
From the sequence file generated from the above step run the following to
generate vectors.
- $MAHOUT_HOME/bin/mahout seq2sparse \
- -i -o \
- <-wt {tf|tfidf}> \
- <-chunk 100> \
+ $MAHOUT_HOME/bin/mahout seq2sparse
+
+ -i
+
+ -o
+
+ <-wt {tf|tfidf}>
+
+ <-chunk 100>
+
<-a
-org.apache.lucene.analysis.standard.StandardAnalyzer> \
- <--minSupport 2> \
- <--minDF 1> \
- <--maxDFPercent 99> \
+
+
+
+
+org.apache.lucene.analysis.standard.StandardAnalyzer>
+
+ <--minSupport 2>
+
+ <--minDF 1>
+
+ <--maxDFPercent 99>
+
<--norm {INF|integer >= 0}>"
- <-seq {false|true required for running some
-algorithms(LDA,Lanczos)}>"
+
+ <-seq {false|true required for running some algorithms(LDA,Lanczos)}>"
+
--minSupport is the min frequency for the word to be considered as a
@@ -511,9 +537,10 @@ format. Probably the easiest way to go w
Iterable (called VectorIterable in the example below) and then
reuse the existing VectorWriter classes:
- VectorWriter vectorWriter = SequenceFile.createWriter(filesystem,
-configuration, outfile, LongWritable.class, SparseVector.class);
+ VectorWriter vectorWriter = SequenceFile.createWriter(filesystem, configuration, outfile, LongWritable.class, SparseVector.class);
+
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
+