mahout-user mailing list archives

From: Sebastian Schelter <ssc.o...@googlemail.com>
Subject: Re: Generating a Document Similarity Matrix
Date: Mon, 28 Jun 2010 20:15:43 GMT
Hi Kris,

Unfortunately I'm not familiar with the VectorDumper code (and a quick
look didn't help either), so I can't help you with the OutOfMemoryError.

It's possible that only 37,952 results are found for an input of
500,000 vectors; it really depends on the actual data. If you're sure
that there should be more results, you could provide me with a sample
input file and I'll try to find out why there aren't more.

I wrote a small class for you that dumps the output file of the job to
the console (I tested it with the output of my unit tests); maybe that
can help us find the source of the problem.

-sebastian

import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.TasteHadoopUtils;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.math.VectorWritable;

/**
 * Dumps a SequenceFile of IntWritable/VectorWritable pairs (such as the
 * output of RowSimilarityJob) to the console, one row per line.
 */
public class MatrixReader extends AbstractJob {

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new MatrixReader(), args);
  }

  @Override
  public int run(String[] args) throws Exception {

    addInputOption();

    Map<String,String> parsedArgs = parseArguments(args);
    if (parsedArgs == null) {
      return -1;
    }

    Configuration conf = getConf();
    FileSystem fs = FileSystem.get(conf);

    // take the first part-* file found in the input directory
    Path vectorFile =
        fs.listStatus(getInputPath(), TasteHadoopUtils.PARTS_FILTER)[0].getPath();

    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, vectorFile, conf);
      IntWritable key = new IntWritable();
      VectorWritable value = new VectorWritable();

      // print each row as "rowId: column,value;column,value;..."
      while (reader.next(key, value)) {
        int row = key.get();
        System.out.print(row + ": ");
        Iterator<Element> elementsIterator = value.get().iterateNonZero();
        String separator = "";
        while (elementsIterator.hasNext()) {
          Element element = elementsIterator.next();
          System.out.print(separator + element.index() + "," + element.get());
          separator = ";";
        }
        System.out.print("\n");
      }
    } finally {
      if (reader != null) {
        reader.close();
      }
    }
    return 0;
  }
}
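
Since it extends AbstractJob, it only needs an --input option pointing at the
directory RowSimilarityJob wrote its results to, and every row comes out as
"rowId: column,value;column,value;...". As a rough sketch (the small driver
class below is just illustrative, and the path is taken from your vectordump
call further down), you could invoke it like this:

import org.apache.hadoop.util.ToolRunner;

public class MatrixReaderDemo {

  public static void main(String[] args) throws Exception {
    // Point --input at the RowSimilarityJob output directory
    // (placeholder path, adjust to wherever your job wrote its results).
    ToolRunner.run(new MatrixReader(),
        new String[] { "--input", "/home/kris/similarRows" });
  }
}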

On 28.06.2010 17:18, Kris Jack wrote:
> Hi,
>
> I am now using the version of
> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian
> wrote and that has been added to the trunk.  Thanks again for that!  I can
> generate an output file that should contain a list of documents with their
> top 100 most similar documents.  I am having problems, however, in
> converting the output file into a readable format using Mahout's vectordump:
>
> $ ./mahout vectordump --seqFile similarRows --output results.out --printKey
> no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
> Input Path: /home/kris/similarRows
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
>     at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
>     at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)
>
> What is this doing that takes up so much memory?  A file is produced with
> 37,952 readable rows but I'm expecting more like 500,000 results, since I
> have this number of documents.  Should I be using something else to read the
> output file of the RowSimilarityJob?
>
> Thanks,
> Kris
>
>
>
> 2010/6/18 Sebastian Schelter <ssc.open@googlemail.com>
>
>   
>> Hi Kris,
>>
>> Maybe you want to give the patch from
>> https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not
>> tested it with larger data yet, but I would be happy to get some
>> feedback on it, and maybe it helps you with your use case.
>>
>> -sebastian
>>
>> On 18.06.2010 18:46, Kris Jack wrote:
>>     
>>> Thanks Ted,
>>>
>>> I got that working.  Unfortunately, the matrix multiplication job is
>>> taking far longer than I hoped.  With just over 10 million documents,
>>> 10 mappers and 10 reducers, I can't get it to complete the job in under
>>> 48 hours.
>>>
>>> Perhaps you have an idea for speeding it up?  I have already been quite
>>> ruthless with making the vectors sparse.  I did not include terms that
>>> appeared in over 1% of the corpus and only kept terms that appeared at
>>> least 50 times.  Is it normal that the matrix multiplication map-reduce
>>> task should take so long to process with this quantity of data and
>>> resources available, or do you think that my system is not configured
>>> properly?
>>>
>>> Thanks,
>>> Kris
>>>
>>>
>>>
>>> 2010/6/15 Ted Dunning <ted.dunning@gmail.com>
>>>
>>>
>>>       
>>>> Thresholds are generally dangerous.  It is usually preferable to specify
>>>> the sparseness you want (1%, 0.2%, whatever), sort the results in
>>>> descending score order using Hadoop's built-in capabilities, and just
>>>> drop the rest.
>>>>
>>>> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <mrkrisjack@gmail.com> wrote:
>>>>
>>>>>  I was wondering if there was an
>>>>> interesting way to do this with the current Mahout code, such as
>>>>> requesting that the Vector accumulator returns only elements that have
>>>>> values greater than a given threshold, sorting the vector by value
>>>>> rather than key, or something else?

