From: Shashikant Kore <shashikant@gmail.com>
Date: Wed, 29 Apr 2009 22:57:46 +0530
Subject: Re: Failure to run Clustering example
To: mahout-user@lucene.apache.org

Hi Jeff,

The JDK problem occurs while running the Synthetic Control Data example
from http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html

The other query was about how to convert text files to Mahout Vectors.
Let's say I have text files of Wikipedia pages and now want to create
clusters out of them. How do I get the Mahout vectors from the Lucene
index? Can you point me to some theory behind it, from which I can
write the conversion code?

Thanks,

--shashi

On Wed, Apr 29, 2009 at 10:50 PM, Jeff Eastman wrote:
> Hi Shashi,
>
> That does sound like a JDK version problem. Most jobs require an initial
> step to get the input into the correct vector format to use the clustering
> code. The
> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java
> calls an InputDriver that does that for the syntheticcontrol examples. You
> would need to do something similar to massage your data into Mahout Vector
> format before you can run the clustering job of your choosing.
>
> Jeff
>
> Shashikant Kore wrote:
>>
>> Thanks for the response, Grant.
>>
>> Upgrading Hadoop didn't really help. Now, I am not able to launch even
>> the Namenode, JobTracker, ... as I am getting the same error. I suspect
>> a version conflict somewhere, as there are two JDK versions on the box. I
>> will try it out on another box which has only JDK 6.
>>
>> From the documentation of clustering, it is not clear how to get the
>> vectors from text (or HTML) files. I suppose you can get TF-IDF
>> values by indexing this content with Lucene. How does one proceed from
>> there? Any pointers on that are appreciated.
>>
>> --shashi
>>
>> On Tue, Apr 28, 2009 at 8:40 PM, Grant Ingersoll wrote:
>>
>>> On Apr 28, 2009, at 6:01 AM, Shashikant Kore wrote:
>>>
>>>> Hi,
>>>>
>>>> Initially, I got the version number error at the beginning. I found
>>>> that the JDK version was 1.5. It has been upgraded to 1.6. Now
>>>> JAVA_HOME points to /usr/java/jdk1.6.0_13/ and I am using Hadoop
>>>> 0.18.3.
>>>>
>>>> 1. What could possibly be wrong? I checked the Hadoop script. The
>>>> value of JAVA_HOME is correct (i.e. 1.6). Is it possible that somehow
>>>> it is still using 1.5?
>>>
>>> I'm going to guess the issue is that you need Hadoop 0.19.
>>>
>>>> 2. The last step of the clustering tutorial says "Get the data out of
>>>> HDFS and have a look." Can you please point me to the documentation of
>>>> Hadoop about how to read this data?
>>>
>>> http://hadoop.apache.org/core/docs/current/quickstart.html towards the
>>> bottom. It shows some of the commands you can use with HDFS: -get,
>>> -cat, etc.
>>>
>>> -Grant

--
Co-founder, Discrete Log Technologies
http://www.bandhan.com/
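The conversion discussed above — turning per-document term counts (which Lucene can supply if term vectors were stored at indexing time) into TF-IDF weights that you would then copy into a Mahout Vector — can be sketched roughly as follows. This is a simplified illustration only: the class `TfIdfSketch` and its method are hypothetical, not part of Mahout's or Lucene's API, and it uses the classic tf * log(N/df) weighting rather than whatever scheme the InputDriver Jeff mentions actually applies.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: compute TF-IDF weights for one document.
// In a real pipeline the counts would come from the Lucene index and the
// resulting weights would be set into a Mahout vector, one dictionary
// index per distinct term across the whole corpus.
public class TfIdfSketch {

    /**
     * @param tf      term -> count of that term in this document
     * @param df      term -> number of documents containing the term
     * @param numDocs total number of documents in the corpus
     * @return term -> tf * log(numDocs / df) weight
     */
    public static Map<String, Double> tfidf(Map<String, Integer> tf,
                                            Map<String, Integer> df,
                                            int numDocs) {
        Map<String, Double> weights = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer docFreq = df.get(e.getKey());
            // Guard against a missing document-frequency entry.
            int n = (docFreq == null || docFreq == 0) ? 1 : docFreq;
            double w = e.getValue() * Math.log((double) numDocs / n);
            weights.put(e.getKey(), w);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Toy corpus of 2 documents; this document mentions "wiki" twice.
        Map<String, Integer> tf = new HashMap<String, Integer>();
        tf.put("wiki", 2);
        tf.put("page", 1);
        Map<String, Integer> df = new HashMap<String, Integer>();
        df.put("wiki", 1);  // appears in 1 of 2 docs
        df.put("page", 2);  // appears in both docs, so its weight is 0
        System.out.println(tfidf(tf, df, 2));
    }
}
```

A term appearing in every document (like "page" above) gets weight 0, which is the point of the IDF factor: corpus-wide terms carry no clustering signal, so the resulting Mahout vector stays sparse.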