mahout-user mailing list archives

From Jay Vyas <jayunit...@gmail.com>
Subject Re: Use Naïve Bayes on a large CSV
Date Thu, 20 Feb 2014 12:56:34 GMT
This relates to a previous question of mine: does Mahout have a concept of adapters that let
us read CSV-style data, with filters, to create the exact format its various inputs expect
(e.g. the recommender's three-column format)? If not, is it worth a JIRA?
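
To make the idea concrete, here is a minimal sketch of such an adapter in plain Python, run as a preprocessing step outside Mahout. It assumes the CSV has a header row and that the three-column format means user ID, item ID, preference value. The function names, column names, and sample data are made up for illustration and are not part of any Mahout API.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, Tuple

Row = Dict[str, str]


def csv_to_triples(
    lines: Iterable[str],
    user_col: str,
    item_col: str,
    pref_col: str,
    keep: Callable[[Row], bool] = lambda row: True,
) -> Iterator[Tuple[str, str, float]]:
    """Stream (user, item, preference) triples from CSV text with a header row."""
    for row in csv.DictReader(lines):
        if not keep(row):
            continue  # rejected by the caller-supplied filter
        try:
            triple = (row[user_col], row[item_col], float(row[pref_col]))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing column or a non-numeric preference
        yield triple


def write_triples(triples: Iterable[Tuple[str, str, float]], out) -> int:
    """Write triples as 'user,item,preference' lines and return how many were written."""
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for user, item, pref in triples:
        writer.writerow([user, item, pref])
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical input: only rows with country == "FR" are kept.
    sample = io.StringIO(
        "customer,product,rating,country\n"
        "1,101,4.5,BE\n"
        "2,102,n/a,FR\n"
        "3,103,2.0,FR\n"
    )
    out = io.StringIO()
    write_triples(
        csv_to_triples(sample, "customer", "product", "rating",
                       keep=lambda r: r["country"] == "FR"),
        out,
    )
    print(out.getvalue(), end="")
```

Because the rows are streamed through a generator, the whole file never has to sit in memory, which matters for the large-CSV case discussed in this thread.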


> On Feb 20, 2014, at 7:50 AM, Kevin Moulart <kevinmoulart@gmail.com> wrote:
> 
> Hi, and thanks!
> 
> What about the command line? Is there a way to do that using the existing
> command-line tools?
> 
> 
> 
> 
> 2014-02-20 12:02 GMT+01:00 Suneel Marthi <suneel_marthi@yahoo.com>:
> 
>> To convert input CSV to vectors, you can either:
>> 
>> a) Use CSVIterator
>> b) Use InputDriver
>> 
>> Either of the above should generate vectors from input CSV that could then
>> be fed into Mahout classifier/clustering jobs.
>> 
>> 
>> 
>> 
>> 
>> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <
>> kevinmoulart@gmail.com> wrote:
>> 
>> Hi, I'm trying to apply a Naive Bayes classifier to a large CSV file from
>> the command line.
>> 
>> I know I have to feed the classifier a sequence file, so I tried to put my
>> CSV into one using the seqdirectory command, but even with a really small
>> CSV (less than 100 MB) I instantly get a java.lang.OutOfMemoryError (Java
>> heap space):
>> 
>>> mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>> Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
>>> MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
>>> 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/cacf/Echant/testSeq], --keyPrefix=[], --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
>>> 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>     at java.util.Arrays.copyOf(Arrays.java:2367)
>>>     at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
>>>     at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
>>>     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
>>>     at java.lang.StringBuilder.append(StringBuilder.java:132)
>>>     at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
>>>     at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
>>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
>>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
>>>     at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>     at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
>>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
>>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>> 
>> 
>> Do you have an idea or a simple way to use Naive Bayes on my large CSV?
>> 
>> Thanks in advance!
>> --
>> Kévin Moulart
>> GSM France : +33 7 81 06 10 10
>> GSM Belgique : +32 473 85 23 85
>> Téléphone fixe : +32 2 771 88 45
> 
