mahout-dev mailing list archives

From "Sushil Bajracharya (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents
Date Wed, 28 Oct 2009 08:17:59 GMT

     [ https://issues.apache.org/jira/browse/MAHOUT-191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushil Bajracharya updated MAHOUT-191:
--------------------------------------

    Status: Patch Available  (was: Open)

The problem appears to be that not all the documents in my index have the field
that I am using to get term vectors from. I made the following changes to make this work,
but I am not sure if that's the right way. I wanted to get this working so that I can run
the LDA topic modeling using the output from the Driver.

Index: utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java
===================================================================
--- utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java (revision 830343)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java (working copy)
@@ -42,7 +42,7 @@
         break;
       }
       //point.write(dataOut);
-      writer.append(new LongWritable(recNum++), point);
+      if(point!=null) writer.append(new LongWritable(recNum++), point);
 
     }
     return recNum;
Index: utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java
===================================================================
--- utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java (revision 830343)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java (working copy)
@@ -104,6 +104,10 @@
       try {
         indexReader.getTermFreqVector(doc, field, mapper);
         result = mapper.getVector();
+        
+        if (result == null)
+         return null;
+        
         if (idField != null) {
           String id = indexReader.document(doc, idFieldSelector).get(idField);
           result.setName(id);
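
The idea behind the two changes above can be sketched in isolation. The following is a minimal, self-contained illustration only and does not use the Mahout or Lucene APIs: the class NullVectorSkipSketch and the methods termVectorFor and write are hypothetical stand-ins, and documents are modeled as plain maps. It assumes that a document lacking the requested field produces a null vector, and it shows the producer returning null early while the writer skips null entries instead of dereferencing them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NullVectorSkipSketch {

    /**
     * Hypothetical stand-in for the per-document term vector lookup.
     * Returns null when the document does not have the requested field.
     */
    static Map<String, Integer> termVectorFor(Map<String, String> doc, String field) {
        String text = doc.get(field);
        if (text == null) {
            return null; // nothing to build a vector from, so return early
        }
        Map<String, Integer> termFreq = new HashMap<>();
        for (String term : text.split("\\s+")) {
            termFreq.merge(term, 1, Integer::sum);
        }
        return termFreq;
    }

    /**
     * Hypothetical stand-in for the writer loop: appends only non-null
     * vectors to out and returns the number of records written.
     */
    static long write(List<Map<String, String>> docs, String field,
                      List<Map<String, Integer>> out) {
        long recNum = 0;
        for (Map<String, String> doc : docs) {
            Map<String, Integer> point = termVectorFor(doc, field);
            if (point == null) {
                continue; // skip documents without the field
            }
            out.add(point);
            recNum++;
        }
        return recNum;
    }

    public static void main(String[] args) {
        Map<String, String> withField = new HashMap<>();
        withField.put("id_field", "1");
        withField.put("field_with_TV", "a b a");

        Map<String, String> withoutField = new HashMap<>();
        withoutField.put("id_field", "2");

        List<Map<String, Integer>> out = new ArrayList<>();
        long written = write(Arrays.asList(withField, withoutField), "field_with_TV", out);
        System.out.println(written + " vector(s) written");
    }
}
```

Note that in this sketch the record counter only advances for documents that actually produce a vector, so record numbers stay contiguous but no longer line up with document positions in the input.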

> NPE while creating term vectors with an index on a field that does not exist in all the documents
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-191
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-191
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.3
>         Environment: mac, snow leopard, eclipse galileo, jdk 6
>            Reporter: Sushil Bajracharya
>
> (based on the message from here: http://www.nabble.com/Creating-Vectors-from-Text-tt24298643.html#a26090263)
> I checked out Mahout from trunk, tried to create term frequency vectors from a Lucene index, and ran into this:
> 09/10/27 17:36:10 INFO lucene.Driver: Output File: /Users/shoeseal/DATA/luc2tvec.out
> 09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.NullPointerException
>         at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
>         at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
>         at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
>         at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)
> I am running this from Eclipse (Snow Leopard with JDK 6), on an index that has a field with stored term vectors.
> my input parameters for Driver are:
> --dir <path>/smallidx/ --output <path>/luc2tvec.out --idField id_field
>  --field field_with_TV --dictOut <path>/luc2tvec.dict --max 50  --weight tf
> Luke shows the following info on the fields I am using:
>  id_field is indexed, stored, omit norms
>  field_with_TV is indexed, tokenized, stored, term vector 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

