mahout-user mailing list archives

From Grant Ingersoll <>
Subject Re: vector generation
Date Tue, 24 Nov 2009 16:32:11 GMT

On Nov 24, 2009, at 10:32 AM, Patterson, Josh wrote:

> While reading through the wiki and article material on mahout, I noticed
> that there was a pre-generation step where vectors were being generated
> from either text with Lucene or ARFF with
>; Looking at the k-means
> driver and mapper ( I noticed that the mapper is
> taking a key and then a Vector (point) as input.
> Would it be smart or practical to make a special record reader for your
> file format that read your data in as vectors directly and emitted
> vectors to the mapper in order to skip the pre-generation step? Just
> curious about that, maybe I'm missing something there, or vectorization
> would be cumbersome in that position, etc.

It probably would be useful. No one has taken that step yet.
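
To make the idea concrete, here is a minimal sketch of the parsing core that such a record reader could wrap. It is not Mahout or Hadoop code: the class name SensorLineVectorizer, the KeyedVector holder, and the line format (an id followed by comma-separated readings) are all assumptions for illustration, and a plain double[] stands in for a Mahout Vector.

```java
import java.util.Arrays;

/** Hypothetical helper: turns one line of sensor readings into a key and a dense vector. */
final class SensorLineVectorizer {

    /** A record shaped like the mapper input described above: a key plus a point. */
    static final class KeyedVector {
        final String key;
        final double[] values;

        KeyedVector(String key, double[] values) {
            this.key = key;
            this.values = values;
        }
    }

    /**
     * Parses a line such as "sensor-7,0.5,1.25,3.0".
     * Assumed format: the id comes first, then comma-separated numeric readings.
     */
    static KeyedVector parse(String line) {
        String[] parts = line.trim().split(",");
        if (parts.length < 2) {
            throw new IllegalArgumentException(
                "expected an id and at least one reading: " + line);
        }
        double[] values = new double[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            values[i - 1] = Double.parseDouble(parts[i].trim());
        }
        return new KeyedVector(parts[0].trim(), values);
    }

    public static void main(String[] args) {
        KeyedVector kv = parse("sensor-7, 0.5, 1.25, 3.0");
        System.out.println(kv.key + " -> " + Arrays.toString(kv.values));
    }
}
```

In a real job, a custom record reader would call something like parse() for each input record and hand the key and vector straight to the mapper, which is what would let you skip the pre-generation step.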

> Also, in Grant's article on Mahout he includes the vectorized 2.5 GB
> file from Wikipedia that is in the correct format via Lucene to work
> with a Mahout clustering algorithm; Is there a smaller (sub 100 meg)
> version of this that I could play around with? I'm working with basic
> building blocks right now and figuring out the facets of vectorization
> with respect to Mahout so we can learn the base case  (lucene vectors)
> and then move on to our specific case (sensor time series data).

Here's what I did:
Using Solr, create an index, and make sure you turn on term vectors for the appropriate fields.
Then point the Lucene Driver at the index and create the vectors.

You could even do this using the Solr tutorial (solr/example), which would give you an index
of about 20 docs.
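
For anyone new to vectorization, the following sketch shows roughly what turning per-document terms into vectors means. It is only an illustration, not the Lucene Driver's implementation: the class TermVectorSketch and its methods are hypothetical, the documents are assumed to be already tokenized, and the weights are raw term counts (a real driver may apply a different weighting).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of mapping tokenized documents to sparse term vectors. */
final class TermVectorSketch {

    /** Assigns each distinct term an integer index, in order of first appearance. */
    static Map<String, Integer> buildDictionary(String[][] docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String[] doc : docs) {
            for (String term : doc) {
                dict.putIfAbsent(term, dict.size());
            }
        }
        return dict;
    }

    /** Builds a sparse vector: dictionary index -> raw term count in this document. */
    static Map<Integer, Double> vectorize(String[] doc, Map<String, Integer> dict) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String term : doc) {
            Integer index = dict.get(term);
            if (index != null) {
                // Terms missing from the dictionary are skipped.
                vector.merge(index, 1.0, Double::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"solr", "index", "solr"},
            {"index", "vectors"}
        };
        Map<String, Integer> dict = buildDictionary(docs);
        for (String[] doc : docs) {
            System.out.println(vectorize(doc, dict));
        }
    }
}
```

Each document ends up as a sparse map from term index to weight, which is the kind of point a clustering algorithm can consume.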

Here's the schema.xml I used (or at least the relevant field definitions):
    <field name="docid" type="string" indexed="true" stored="true" required="true"/>
    <field name="file" type="string" indexed="true" stored="true"/>

    <field name="doctitle" type="text" indexed="true" stored="true" multiValued="true"/>
    <field name="body" type="text" indexed="true" stored="true" multiValued="true"/>
    <field name="docdate" type="date" indexed="true" stored="true" multiValued="false"/>

    <field name="titleBody" type="text" indexed="true" stored="false" multiValued="true"/>

    <field name="spell" type="text_spell" indexed="true" stored="false" multiValued="true"/>
    <!-- Here, default is used to create a "timestamp" field indicating
         when each document was indexed. -->
    <field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>

I also used the EnwikiDocMaker from Lucene's contrib/benchmark plus a simple SolrJ wrapper.