Hi Kris,
I'm glad I could help you and it's really cool that you are testing my
patches on real data. I'm looking forward to hearing more!
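
In case it helps with matching the rows back to your original Lucene ids
(mentioned below): if you kept a mapping from matrix row index to document
id around when you built the input matrix, translating the output is just a
lookup. This is only a rough, untested sketch; the SequenceFile mapping file
("docIndex") and the class name are hypothetical, adjust them to however you
created your vectors:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DocIndexLookup {

  /** loads a (row index -> Lucene document id) mapping from a SequenceFile */
  public static Map<Integer,String> loadIndex(Path docIndexPath, Configuration conf)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Map<Integer,String> rowToDocId = new HashMap<Integer,String>();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, docIndexPath, conf);
    try {
      IntWritable row = new IntWritable();
      Text docId = new Text();
      while (reader.next(row, docId)) {
        rowToDocId.put(row.get(), docId.toString());
      }
    } finally {
      reader.close();
    }
    return rowToDocId;
  }
}

Each key printed by the MatrixReader below (and each element index inside a
row) could then be translated through that map.
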
-sebastian
On 29.06.2010 11:25, Kris Jack wrote:
> Hi Sebastian,
>
> You really are very kind! I have taken your code and run it to print out
> the contents of the output file. There are indeed only 37,952 results so
> that gives me more confidence in the vector dumper. I'm not sure why there
> was a memory problem though, seeing as it seems to have output the results
> correctly. Now I just have to match them up with my original lucene ids and
> see how it is performing. I'll keep you posted with the results.
>
> Thanks,
> Kris
>
>
>
> 2010/6/28 Sebastian Schelter <ssc.open@googlemail.com>
>
>
>> Hi Kris,
>>
>> Unfortunately I'm not familiar with the VectorDumper code (and a quick
>> look didn't help either), so I can't help you with the OutOfMemoryError.
>>
>> It is possible that only 37,952 results are found for an input of
>> 500,000 vectors; it really depends on the actual data. If you're sure
>> that there should be more results, you could provide me with a sample
>> input file and I'll try to find out why there aren't more results.
>>
>> I wrote a small class for you that dumps the output file of the job to
>> the console (I tested it with the output of my unit tests); maybe that
>> can help us find the source of the problem.
>>
>> -sebastian
>>
>> import java.util.Iterator;
>> import java.util.Map;
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.io.SequenceFile;
>> import org.apache.hadoop.util.ToolRunner;
>> import org.apache.mahout.cf.taste.hadoop.TasteHadoopUtils;
>> import org.apache.mahout.common.AbstractJob;
>> import org.apache.mahout.math.Vector.Element;
>> import org.apache.mahout.math.VectorWritable;
>>
>> /** dumps the output of RowSimilarityJob (row -> similarity vector) to the console */
>> public class MatrixReader extends AbstractJob {
>>
>>   public static void main(String[] args) throws Exception {
>>     ToolRunner.run(new MatrixReader(), args);
>>   }
>>
>>   @Override
>>   public int run(String[] args) throws Exception {
>>
>>     addInputOption();
>>
>>     Map<String,String> parsedArgs = parseArguments(args);
>>     if (parsedArgs == null) {
>>       return -1;
>>     }
>>
>>     Configuration conf = getConf();
>>     FileSystem fs = FileSystem.get(conf);
>>
>>     // take the first part-* file from the job's output directory
>>     Path vectorFile = fs.listStatus(getInputPath(),
>>         TasteHadoopUtils.PARTS_FILTER)[0].getPath();
>>
>>     SequenceFile.Reader reader = null;
>>     try {
>>       reader = new SequenceFile.Reader(fs, vectorFile, conf);
>>       IntWritable key = new IntWritable();
>>       VectorWritable value = new VectorWritable();
>>
>>       // prints one line per row: "row: col,score;col,score;..."
>>       while (reader.next(key, value)) {
>>         int row = key.get();
>>         System.out.print(row + ": ");
>>         Iterator<Element> elementsIterator = value.get().iterateNonZero();
>>         String separator = "";
>>         while (elementsIterator.hasNext()) {
>>           Element element = elementsIterator.next();
>>           System.out.print(separator + element.index() + "," + element.get());
>>           separator = ";";
>>         }
>>         System.out.print("\n");
>>       }
>>     } finally {
>>       if (reader != null) {
>>         reader.close();
>>       }
>>     }
>>     return 0;
>>   }
>> }
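>>
>> It just streams one key/value pair at a time and reuses the Writable
>> instances, so it should get by with very little heap even for large
>> files. You can run it like any other Hadoop job, for example something
>> along the lines of "hadoop jar your-job.jar MatrixReader --input
>> /path/to/similarRows" (the --input option comes from addInputOption()).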
>>
>> On 28.06.2010 17:18, Kris Jack wrote:
>>
>>> Hi,
>>>
>>> I am now using the version of
>>> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian
>>> has written and which has been added to the trunk. Thanks again for that!
>>> I can generate an output file that should contain a list of documents
>>> with their top 100 most similar documents. I am having problems, however,
>>> in converting the output file into a readable format using mahout's
>>> vectordump:
>>>
>>> $ ./mahout vectordump --seqFile similarRows --output results.out --printKey
>>>
>>> no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
>>> Input Path: /home/kris/similarRows
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
>>>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
>>>     at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
>>>     at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)
>>>
>>> What is this doing that takes up so much memory? A file is produced with
>>> 37,952 readable rows but I'm expecting more like 500,000 results, since I
>>> have this number of documents. Should I be using something else to read
>>> the output file of the RowSimilarityJob?
>>>
>>> Thanks,
>>> Kris
>>>
>>>
>>>
>>> 2010/6/18 Sebastian Schelter <ssc.open@googlemail.com>
>>>
>>>
>>>
>>>> Hi Kris,
>>>>
>>>> maybe you want to give the patch from
>>>> https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not
>>>> tested it with larger data yet, but I would be happy to get some
>>>> feedback on it and maybe it helps you with your use case.
>>>>
>>>> -sebastian
>>>>
>>>> On 18.06.2010 18:46, Kris Jack wrote:
>>>>
>>>>
>>>>> Thanks Ted,
>>>>>
>>>>> I got that working. Unfortunately, the matrix multiplication job is
>>>>> taking far longer than I hoped. With just over 10 million documents,
>>>>> 10 mappers and 10 reducers, I can't get it to complete the job in
>>>>> under 48 hours.
>>>>>
>>>>> Perhaps you have an idea for speeding it up? I have already been quite
>>>>> ruthless with making the vectors sparse. I did not include terms that
>>>>> appeared in over 1% of the corpus and only kept terms that appeared at
>>>>> least 50 times. Is it normal that the matrix multiplication map reduce
>>>>> task should take so long to process with this quantity of data and
>>>>> resources available or do you think that my system is not configured
>>>>> properly?
>>>>>
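>>>>> (Concretely, the pruning boils down to keeping a term only if
>>>>> 50 <= df(term) <= 0.01 * numDocs. A rough sketch of how I apply it per
>>>>> document vector; the docFreqs map and the names are made up here, they
>>>>> are not something from Mahout:)
>>>>>
>>>>> import java.util.Iterator;
>>>>> import java.util.Map;
>>>>>
>>>>> import org.apache.mahout.math.RandomAccessSparseVector;
>>>>> import org.apache.mahout.math.Vector;
>>>>> import org.apache.mahout.math.Vector.Element;
>>>>>
>>>>> public class VectorPruner {
>>>>>
>>>>>   /**
>>>>>    * keeps a term only if its document frequency is at least minDf and
>>>>>    * at most maxDfPercent percent of the corpus
>>>>>    */
>>>>>   public static Vector prune(Vector termFreqs, Map<Integer,Integer> docFreqs,
>>>>>       int numDocs, int minDf, double maxDfPercent) {
>>>>>     Vector pruned = new RandomAccessSparseVector(termFreqs.size());
>>>>>     Iterator<Element> it = termFreqs.iterateNonZero();
>>>>>     while (it.hasNext()) {
>>>>>       Element e = it.next();
>>>>>       Integer df = docFreqs.get(e.index());
>>>>>       boolean frequentEnough = df != null && df >= minDf;
>>>>>       boolean rareEnough = df != null && df <= numDocs * (maxDfPercent / 100.0);
>>>>>       if (frequentEnough && rareEnough) {
>>>>>         pruned.setQuick(e.index(), e.get());
>>>>>       }
>>>>>     }
>>>>>     return pruned;
>>>>>   }
>>>>> }
>>>>>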
>>>>> Thanks,
>>>>> Kris
>>>>>
>>>>>
>>>>>
>>>>> 2010/6/15 Ted Dunning <ted.dunning@gmail.com>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Thresholds are generally dangerous. It is usually preferable to
>>>>>> specify the sparseness you want (1%, 0.2%, whatever), sort the
>>>>>> results in descending score order using Hadoop's builtin capabilities
>>>>>> and just drop the rest.
>>>>>>
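>>>>>> Something like this per-row helper is roughly what I mean (just an
>>>>>> untested sketch against the Mahout Vector API; the class and method
>>>>>> names are made up): keep the k best scores in a small min-heap and
>>>>>> drop everything else, instead of cutting at a fixed score threshold.
>>>>>>
>>>>>> import java.util.Comparator;
>>>>>> import java.util.Iterator;
>>>>>> import java.util.PriorityQueue;
>>>>>>
>>>>>> import org.apache.mahout.math.RandomAccessSparseVector;
>>>>>> import org.apache.mahout.math.Vector;
>>>>>> import org.apache.mahout.math.Vector.Element;
>>>>>>
>>>>>> public class TopKSimilarities {
>>>>>>
>>>>>>   /** keeps only the k highest-scoring non-zero entries of a similarity row */
>>>>>>   public static Vector topK(Vector similarities, int k) {
>>>>>>     // min-heap on the score, so the weakest of the current top k sits on top
>>>>>>     PriorityQueue<double[]> heap = new PriorityQueue<double[]>(k,
>>>>>>         new Comparator<double[]>() {
>>>>>>           @Override
>>>>>>           public int compare(double[] a, double[] b) {
>>>>>>             return Double.compare(a[1], b[1]);
>>>>>>           }
>>>>>>         });
>>>>>>
>>>>>>     Iterator<Element> it = similarities.iterateNonZero();
>>>>>>     while (it.hasNext()) {
>>>>>>       Element e = it.next();
>>>>>>       if (heap.size() < k) {
>>>>>>         heap.offer(new double[] { e.index(), e.get() });
>>>>>>       } else if (e.get() > heap.peek()[1]) {
>>>>>>         heap.poll();
>>>>>>         heap.offer(new double[] { e.index(), e.get() });
>>>>>>       }
>>>>>>     }
>>>>>>
>>>>>>     Vector result = new RandomAccessSparseVector(similarities.size());
>>>>>>     for (double[] entry : heap) {
>>>>>>       result.setQuick((int) entry[0], entry[1]);
>>>>>>     }
>>>>>>     return result;
>>>>>>   }
>>>>>> }
>>>>>>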
>>>>>> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <mrkrisjack@gmail.com> wrote:
>>>>>>
>>>>>>> I was wondering if there was an interesting way to do this with the
>>>>>>> current mahout code such as requesting that the Vector accumulator
>>>>>>> returns only elements that have values greater than a given
>>>>>>> threshold, sorting the vector by value rather than key, or
>>>>>>> something else?