From: "Sushil Bajracharya (JIRA)"
To: mahout-dev@lucene.apache.org
Reply-To: mahout-dev@lucene.apache.org
Date: Wed, 28 Oct 2009 08:17:59 +0000 (UTC)
Subject: [jira] Updated: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents
Message-ID: <1568308598.1256717879617.JavaMail.jira@brutus>
In-Reply-To: <337322268.1256717759394.JavaMail.jira@brutus>

[
https://issues.apache.org/jira/browse/MAHOUT-191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushil Bajracharya updated MAHOUT-191:
--------------------------------------

    Status: Patch Available  (was: Open)

The problem seems to be that not every document in my index has the field I am using to get term vectors from. I made the following changes to make this work, but I am not sure that is the right way to fix it. I wanted to get this working so I could run LDA topic modeling on the output from the Driver.

Index: utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java
===================================================================
--- utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java	(revision 830343)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java	(working copy)
@@ -42,7 +42,7 @@
         break;
       }
       //point.write(dataOut);
-      writer.append(new LongWritable(recNum++), point);
+      if(point!=null) writer.append(new LongWritable(recNum++), point);
     }
 
     return recNum;
Index: utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java
===================================================================
--- utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java	(revision 830343)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java	(working copy)
@@ -104,6 +104,10 @@
       try {
         indexReader.getTermFreqVector(doc, field, mapper);
         result = mapper.getVector();
+
+        if (result == null)
+          return null;
+
         if (idField != null) {
           String id = indexReader.document(doc, idFieldSelector).get(idField);
           result.setName(id);

> NPE while creating term vectors with an index on a field that does not exist in all the documents
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-191
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-191
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.3
>         Environment: mac, snow leopard, eclipse galileo, jdk 6
>            Reporter: Sushil Bajracharya
>
> (based on the message from here: http://www.nabble.com/Creating-Vectors-from-Text-tt24298643.html#a26090263)
> I checked out Mahout from trunk, tried to create term frequency vectors from a Lucene index, and ran into this:
> 09/10/27 17:36:10 INFO lucene.Driver: Output File: /Users/shoeseal/DATA/luc2tvec.out
> 09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.NullPointerException
> 	at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
> 	at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
> 	at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
> 	at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)
> I am running this from Eclipse (Snow Leopard with JDK 6), on an index that has a field with stored term vectors.
> My input parameters for Driver are:
> --dir /smallidx/ --output /luc2tvec.out --idField id_field
> --field field_with_TV --dictOut /luc2tvec.dict --max 50 --weight tf
> Luke shows the following info on the fields I am using:
> id_field is indexed, stored, omit norms
> field_with_TV is indexed, tokenized, stored, term vector

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
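
For readers following the patch in this message: its two hunks amount to "the iterator may yield null for a document that lacks the field, and the writer skips nulls without consuming a record number". The hunk offsets suggest, though the message does not confirm, that line 109 in the stack trace is the result.setName(id) call on a null vector. Below is a minimal, dependency-free sketch of that skip-null writer loop. The class name, method name, the List used as a sink, and the maxDocs parameter (standing in for --max) are all hypothetical illustrations, not Mahout or Lucene APIs.

```java
import java.util.List;

// Hypothetical stand-in for the vector-writing loop; none of these names are Mahout APIs.
final class NullSkippingWriterSketch {

    private NullSkippingWriterSketch() {}

    /**
     * Appends every non-null element of points to sink, stopping once maxDocs
     * records have been written, and returns the number of records written.
     * A null element models a document that has no term vector for the field.
     */
    static <T> long writeSkippingNulls(Iterable<T> points, List<T> sink, long maxDocs) {
        long recNum = 0;
        for (T point : points) {
            if (recNum >= maxDocs) {
                break;
            }
            if (point == null) {
                // Skip without incrementing recNum, so record keys stay contiguous.
                continue;
            }
            sink.add(point);
            recNum++;
        }
        return recNum;
    }
}
```

Calling writeSkippingNulls on the elements "a", null, "b" with a large maxDocs would leave "a" and "b" in the sink and return 2. One design consequence worth noting: because skipped documents do not consume a record number, the record key no longer matches the document's position in the index, so a consumer that needs that mapping would have to rely on the vector's name (the idField value) instead.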