mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Lucene Mahout > Creating Vectors from Text
Date Sat, 01 Aug 2009 14:40:01 GMT
Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: Creating Vectors from Text (http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text)


Edited by Grant Ingersoll:
---------------------------------------------------------------------
+*Mahout_0.2*+

h1. Introduction

For clustering documents it is usually necessary to convert the raw text into vectors that
can then be consumed by the clustering [Algorithms].  These approaches are described below.

h1. From Lucene

Mahout has utilities that allow one to easily produce Mahout Vector representations from a
Lucene (and Solr, since Solr uses the same underlying index format) index.

For this, we assume you know how to build a Lucene/Solr index.  For those who don't, it is
probably easiest to get up and running using [Solr|http://lucene.apache.org/solr] as it can
ingest things like PDFs, XML, Office, etc. and create a Lucene index.  For those wanting to
use just Lucene, see the Lucene [website|http://lucene.apache.org/java] or check out _Lucene
In Action_ by Erik Hatcher, Otis Gospodnetic and Mike McCandless.

To get started, make sure you have a fresh copy of Mahout from [SVN|http://cwiki.apache.org/MAHOUT/buildingmahout.html]
and are comfortable building it.  You will also need to [apply the patch|http://cwiki.apache.org/MAHOUT/howtocontribute.html]
on MAHOUT-126.  This patch creates a "utils" module in Mahout, at the same level as the Core
module, that defines utilities for working with Mahout.  In this case, it defines interfaces
and implementations for efficiently iterating over a data source (only Lucene is supported
currently, but it should be extensible to databases, Solr, etc.) and producing a Mahout Vector
file and term dictionary which can then be used for clustering.  The main code for driving
this is the Driver program located in the org.apache.mahout.utils.vectors package.  The Driver
program offers several input options, which can be displayed by specifying the --help option.
Examples of running the Driver are included below:

h2. Generating an output file from a Lucene Index

{noformat}
java -cp <CLASSPATH> org.apache.mahout.utils.vectors.lucene.Driver --dir <PATH TO
DIRECTORY CONTAINING LUCENE INDEX> --output <PATH TO OUTPUT LOCATION> --field <NAME
OF FIELD IN INDEX> --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO> <--max <Number
of vectors to output>> <--norm {INF|integer >= 0}> <--idField <Name of
the idField in the Lucene index>>
{noformat}

h3. Create 50 Vectors from an Index 
{noformat}
org.apache.mahout.utils.vectors.lucene.Driver --dir <PATH>/wikipedia/solr/data/index
--field body --dictOut <PATH>/solr/wikipedia/dict.txt --output <PATH>/solr/wikipedia/out.txt
--max 50
{noformat}
This uses the index specified by --dir, reading the body field, and writes the vectors to
the output location and the dictionary to dict.txt.  Only 50 vectors are output.  If you
don't specify --max, all of the documents in the index are output.
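To make the relationship between the vector output and the dictionary file concrete, here is a small stand-alone sketch.  This is purely illustrative (it is not Mahout's actual implementation, and the class name is hypothetical): a term dictionary assigns each distinct term an integer index, and each document becomes a sparse vector of (term index, frequency) entries keyed by that dictionary.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only, not Mahout's actual classes: a term dictionary
// maps each distinct term to an integer index; a document's sparse vector
// maps those indices to term frequencies.
public class DictionarySketch {
    private final Map<String, Integer> dictionary = new LinkedHashMap<String, Integer>();

    // Returns the dictionary index for a term, assigning a new one if unseen.
    int indexOf(String term) {
        Integer idx = dictionary.get(term);
        if (idx == null) {
            idx = dictionary.size();
            dictionary.put(term, idx);
        }
        return idx;
    }

    // Converts a tokenized document into sparse (index -> frequency) pairs.
    Map<Integer, Integer> vectorize(List<String> tokens) {
        Map<Integer, Integer> vector = new LinkedHashMap<Integer, Integer>();
        for (String token : tokens) {
            int idx = indexOf(token);
            Integer freq = vector.get(idx);
            vector.put(idx, freq == null ? 1 : freq + 1);
        }
        return vector;
    }

    public static void main(String[] args) {
        DictionarySketch sketch = new DictionarySketch();
        List<String> doc = java.util.Arrays.asList("mahout", "vector", "mahout");
        // "mahout" gets index 0 (frequency 2), "vector" gets index 1 (frequency 1)
        System.out.println(sketch.vectorize(doc)); // prints {0=2, 1=1}
    }
}
```

The dictionary file written by --dictOut plays the role of the term-to-index map here, so that the integer indices in the vector file can be translated back to terms after clustering.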

h3. Normalize 50 Vectors from an Index using the [L_2 Norm|http://en.wikipedia.org/wiki/Lp_space]
{noformat}
org.apache.mahout.utils.vectors.lucene.Driver --dir <PATH>/wikipedia/solr/data/index
--field body --dictOut <PATH>/solr/wikipedia/dict.txt --output <PATH>/solr/wikipedia/out.txt
--max 50 --norm 2
{noformat}
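To illustrate what the --norm option does mathematically, here is a minimal sketch of p-norm normalization for p >= 1 and the infinity norm.  This is my own illustration of the underlying math, not Mahout's code, and it does not cover the --norm 0 case:

```java
// Illustration of Lp normalization (not Mahout's actual code): each element
// is divided by the vector's p-norm. For p = Infinity the norm is the
// maximum absolute value of any element. The p = 0 case is not handled here.
public class NormSketch {
    static double[] normalize(double[] v, double p) {
        double norm;
        if (Double.isInfinite(p)) {
            // Infinity norm: largest absolute value.
            norm = 0.0;
            for (double x : v) norm = Math.max(norm, Math.abs(x));
        } else {
            // p-norm: (sum of |x|^p)^(1/p).
            double sum = 0.0;
            for (double x : v) sum += Math.pow(Math.abs(x), p);
            norm = Math.pow(sum, 1.0 / p);
        }
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    public static void main(String[] args) {
        double[] v = {3.0, 4.0};
        // The L2 norm of (3, 4) is 5, so the normalized vector is (0.6, 0.8).
        double[] l2 = normalize(v, 2.0);
        System.out.println(l2[0] + " " + l2[1]); // prints 0.6 0.8
    }
}
```

With --norm 2, every output vector has unit Euclidean length, which is a common preparation step before distance-based clustering.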

h2. Background

* http://www.lucidimagination.com/search/document/3d8310376b6cdf6b/centroid_calculations_with_sparse_vectors#86a54dae9052d68c
* http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering

h1. From a Database

+*TODO:*+

h1. Other

+*TODO:*+

