Subject: Re: vector generation
From: Grant Ingersoll
Date: Tue, 24 Nov 2009 11:32:11 -0500
To: mahout-user@lucene.apache.org

On Nov 24, 2009, at 10:32 AM, Patterson, Josh wrote:

> While reading through the wiki and article material on mahout, I noticed
> that there was a pre-generation step where vectors were being generated from
> either text with Lucene or ARFF with
> org.apache.mahout.utils.vectorsarff.driver.java; Looking at the k-means
> driver and mapper (KMeansMapper.java) I noticed that the mapper is
> taking a key and then a Vector (point) as input.
>
> Would it be smart or practical to make a special record reader for your
> file format that read your data in as vectors directly and emitted
> vectors to the mapper in order to skip the pre-generation step? Just
> curious about that, maybe I'm missing something there, or vectorization
> would be cumbersome in that position, etc.

Probably would be useful. No one has taken the steps yet.

> Also, in Grant's article on Mahout he includes the vectorized 2.5 GB
> file from Wikipedia that is in the correct format via Lucene to work
> with a Mahout clustering algorithm; Is there a smaller (sub 100 meg)
> version of this that I could play around with? I'm working with basic
> building blocks right now and figuring out the facets of vectorization
> with respect to Mahout so we can learn the base case (lucene vectors)
> and then move on to our specific case (sensor time series data).

Here's what I did:

Using Solr, create an index, make sure you turn on term vectors for the appropriate fields. Point the Lucene Driver at the index and create the vectors.

You could do this even using the Solr tutorial (solr/example) which would give you an index of about 20 docs.

Here's the schema.xml I used (or, at least the relevant field definitions):

I also used the EnwikiDocMaker from Lucene's contrib/benchmark plus a simple SolrJ wrapper.
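
For anyone trying to picture what the "create the vectors" step amounts to, here is a minimal plain-Python sketch of the general idea: each distinct term gets an integer index, and each document becomes a sparse map from term index to term frequency. This is only an illustration under stated assumptions. It does not call Lucene, Solr, or Mahout, the function names `build_dictionary` and `to_sparse_vector` are hypothetical, the sample documents are made up, and raw term frequency is an assumed weighting (the actual driver may weight terms differently).

```python
from collections import Counter


def build_dictionary(docs):
    """Assign each distinct term an integer index (sorted so the result is deterministic)."""
    terms = sorted({term for doc in docs for term in doc})
    return {term: index for index, term in enumerate(terms)}


def to_sparse_vector(doc, dictionary):
    """Turn one document's tokens into a sparse {term_index: term_frequency} map."""
    counts = Counter(doc)
    # Terms missing from the dictionary are skipped rather than raising an error.
    return {dictionary[term]: tf for term, tf in counts.items() if term in dictionary}


# Hypothetical tokenized documents, standing in for the per-field term vectors
# that an index would store once term vectors are turned on.
docs = [
    ["solr", "index", "term", "vector", "term"],
    ["mahout", "vector", "cluster"],
]

dictionary = build_dictionary(docs)
vectors = [to_sparse_vector(doc, dictionary) for doc in docs]

for vector in vectors:
    print(vector)
```

A sparse map is used because most documents contain only a small fraction of the vocabulary, so storing only the non-zero entries keeps each vector small.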