Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: Creating Vectors from Text (http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text)

Edited by Grant Ingersoll:
---------------------------------------------------------------------
+*Mahout_0.2*+

{toc:style=disc|indent=20px}

h1. Introduction

For clustering documents it is usually necessary to convert the raw text into vectors that can then be consumed by the clustering [Algorithms]. These approaches are described below.

h1. From Lucene

Mahout has utilities that allow one to easily produce Mahout Vector representations from a Lucene (and Solr, since they are the same) index. For this, we assume you know how to build a Lucene/Solr index. For those who don't, it is probably easiest to get up and running using [Solr|http://lucene.apache.org/solr], as it can ingest things like PDFs, XML, Office documents, etc. and create a Lucene index. For those wanting to use just Lucene, see the Lucene [website|http://lucene.apache.org/java] or check out _Lucene In Action_ by Erik Hatcher, Otis Gospodnetic and Mike McCandless.
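One detail worth calling out if you index with raw Lucene: the Mahout Driver reads per-document term frequency vectors from the index, so the field you intend to vectorize needs to be indexed with term vectors enabled. Below is a minimal indexing sketch under that assumption; Lucene 2.9-era APIs are used, and the "body" field name and index path are examples only, chosen to match the commands further down.

{code}
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Minimal sketch: build a tiny index whose "body" field stores term vectors,
// which is what the Mahout vectorization Driver reads back.
public class BuildIndex {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("/wikipedia/solr/data/index")),
        new StandardAnalyzer(Version.LUCENE_29),
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    // Field.TermVector.YES is the key detail: without stored term vectors
    // there are no per-document term frequencies for the Driver to read.
    doc.add(new Field("body", "some document text goes here",
        Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
    writer.addDocument(doc);
    writer.close();
  }
}
{code}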
To get started, make sure you get a fresh copy of Mahout from [SVN|http://cwiki.apache.org/MAHOUT/buildingmahout.html] and are comfortable building it. Mahout defines interfaces and implementations for efficiently iterating over a data source (only Lucene is supported currently, but this should be extensible to databases, Solr, etc.) and producing a Mahout Vector file and a term dictionary, which can then be used for clustering. The main code for driving this is the Driver program located in the org.apache.mahout.utils.vectors package. The Driver program offers several input options, which can be displayed by specifying the --help option. Examples of running the Driver are included below:

h2. Generating an output file from a Lucene Index

{noformat}
$MAHOUT_HOME/bin/mahout lucenevector \
  --dir <PATH TO LUCENE INDEX> --output <PATH TO OUTPUT LOCATION> \
  --field <NAME OF FIELD IN INDEX> --dictOut <PATH TO DICTIONARY OUTPUT> \
  <--max <MAX NUMBER OF VECTORS TO OUTPUT>> <--norm {INF|integer >= 0}> <--idField <NAME OF ID FIELD>>
{noformat}

h3. Create 50 Vectors from an Index

{noformat}
$MAHOUT_HOME/bin/mahout lucenevector --dir /wikipedia/solr/data/index --field body \
  --dictOut /solr/wikipedia/dict.txt --output /solr/wikipedia/out.txt --max 50
{noformat}

This uses the index specified by --dir and the body field in it, writes the vectors to the output directory and the dictionary to dict.txt, and outputs only 50 vectors. If you don't specify --max, all of the documents in the index are output.

h3. Normalize 50 Vectors from a Lucene Index using the [L_2 Norm|http://en.wikipedia.org/wiki/Lp_space]

{noformat}
$MAHOUT_HOME/bin/mahout lucenevector --dir /wikipedia/solr/data/index --field body \
  --dictOut /solr/wikipedia/dict.txt --output /solr/wikipedia/out.txt --max 50 --norm 2
{noformat}

h1. From Directory of Text documents

Mahout has utilities to generate Vectors from a directory of text documents. Before creating the vectors, you need to convert the documents to SequenceFile format. SequenceFile is a Hadoop class which allows us to write arbitrary key/value pairs into it. The DocumentVectorizer requires the key to be a Text containing a unique document id, and the value to be the Text content of the document in UTF-8 format.

You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.

h2. Converting directory of documents to SequenceFile format

Mahout has a nifty utility which reads a directory path, including its sub-directories, and creates the SequenceFile in a chunked manner for us. The document id generated is <PREFIX><RELATIVE PATH OF DOCUMENT>/document.txt.

From the examples directory, run:

{noformat}
$MAHOUT_HOME/bin/mahout seqdirectory \
  --input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
  <-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
  <-chunk <MAX SIZE OF EACH CHUNK IN MB> 64> \
  <-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
{noformat}

h2. Creating Vectors from SequenceFile

+*Mahout_0.3*+

From the sequence file generated in the above step, run the following to generate vectors:

{noformat}
$MAHOUT_HOME/bin/mahout seq2sparse \
  -i <PATH TO THE SEQUENCEFILES> -o <OUTPUT DIRECTORY> \
  <-wt {tf|tfidf}> \
  <-chunk 100> \
  <-a org.apache.lucene.analysis.standard.StandardAnalyzer> \
  <--minSupport 2> \
  <--minDF 1> \
  <--maxDFPercent 99> \
  <--norm {INF|integer >= 0}> \
  <-seq {false|true, required for running some algorithms (LDA, Lanczos)}>
{noformat}

* --minSupport is the minimum frequency a word must have to be considered as a feature.
* --minDF is the minimum number of documents a word needs to appear in.
* --maxDFPercent is the maximum allowed value of (document frequency of a word / total number of documents) for a word to be considered a good feature. This helps remove high-frequency features like stop words.
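To sanity-check what seq2sparse produced, you can dump a few of the generated vectors. The sketch below is hypothetical: it assumes the 0.2-era API in which Mahout vectors implement Writable themselves (as in the SequenceFile example at the end of this page), with a Text document id as the key; adjust the key/value classes to whatever your Mahout version actually writes.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.matrix.SparseVector;

// Hypothetical sketch: print the document id and vector for each entry in a
// vector SequenceFile produced by seq2sparse.
public class DumpVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0]: one of the part files under the seq2sparse output directory
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text key = new Text();
    SparseVector value = new SparseVector();
    while (reader.next(key, value)) {
      System.out.println(key + " => " + value.asFormatString());
    }
    reader.close();
  }
}
{code}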
h1. Background

* http://www.lucidimagination.com/search/document/3d8310376b6cdf6b/centroid_calculations_with_sparse_vectors#86a54dae9052d68c
* http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering

h1. From a Database

+*TODO:*+

h1. Other

h2. Converting existing vectors to Mahout's format

If you are in the happy position of already owning a document (as in: texts, images, or whatever items you wish to treat) processing pipeline, the question arises of how to convert its vectors into the Mahout vector format. Probably the easiest way is to implement your own Iterable<Vector> (called VectorIterable in the example below) and then reuse the existing VectorWriter classes:

{code}
// SequenceFile.createWriter returns a SequenceFile.Writer, which is wrapped
// in a SequenceFileVectorWriter to obtain a VectorWriter.
SequenceFile.Writer seqWriter = SequenceFile.createWriter(filesystem, configuration, outfile,
    LongWritable.class, SparseVector.class);
VectorWriter vectorWriter = new SequenceFileVectorWriter(seqWriter);
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
{code}
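For concreteness, here is one possible shape for such a VectorIterable. Everything in it is hypothetical: it assumes your pipeline produces double[] feature arrays, and the constructor signature is only an illustration (the snippet above would call whatever constructor your implementation provides).

{code}
import java.util.Iterator;

import org.apache.mahout.matrix.SparseVector;
import org.apache.mahout.matrix.Vector;

// Hypothetical adapter: wraps an existing pipeline's double[] feature arrays
// as Mahout Vectors so they can be fed to a VectorWriter.
public class VectorIterable implements Iterable<Vector> {

  private final Iterable<double[]> features; // your pipeline's output
  private final int cardinality;             // dimension of the vector space

  public VectorIterable(Iterable<double[]> features, int cardinality) {
    this.features = features;
    this.cardinality = cardinality;
  }

  public Iterator<Vector> iterator() {
    final Iterator<double[]> delegate = features.iterator();
    return new Iterator<Vector>() {
      public boolean hasNext() {
        return delegate.hasNext();
      }

      public Vector next() {
        double[] values = delegate.next();
        Vector vector = new SparseVector(cardinality);
        for (int i = 0; i < values.length; i++) {
          if (values[i] != 0.0) {
            vector.set(i, values[i]); // only store the non-zero entries
          }
        }
        return vector;
      }

      public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }
}
{code}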