mahout-dev mailing list archives

From Olivier Grisel <olivier.gri...@ensta.org>
Subject Re: Random thought: line separators
Date Mon, 18 Jan 2010 13:58:46 GMT
2010/1/18 Robin Anil <robin.anil@gmail.com>:
> It's this kind of thing that forced the move to sequence files instead of
> TextKeyValueInput format and other text-based/CSV-based formats. I kind of
> regret the decision to go with a tab-separated format for BayesClassifier,
> which I wrote 2 years ago. I will be modifying it to use sparse vectors
> or sequence files, whichever fits.
>
> My thought is that this kind of functionality should only be used by the
> format converters that convert to and from sequence files, and that when
> storing to sequence files we should just enforce the \n rule for line breaks.

By the way, I tried to run the Bayesian classifier's feature
extractor on the following Wikipedia chunk:

s3://enwiki-pages-articles/enwiki-20090810-pages-articles/chunk-0001.xml

And I got an EOFException in Hadoop-related classes (no Mahout classes
in the stack trace). I wonder if this is related to the line separator
issue, or maybe to the Java serialization used in that step.

The feature extractor works on all the other chunks I tried, though. All
those chunks were extracted on a Linux machine.
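
One way to test the line separator hypothesis is to count the separator styles in the failing chunk and compare them with a chunk that works. The following is only a minimal sketch, assuming the chunk has been copied to a local file and uses an ASCII-compatible encoding such as UTF-8. The class name LineSeparators and its methods are hypothetical helpers, not part of Mahout or Hadoop.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic helper, not part of Mahout or Hadoop.
final class LineSeparators {

    /** Returns {CRLF count, lone LF count, lone CR count} for the stream. */
    static long[] count(InputStream in) throws IOException {
        long crlf = 0, lf = 0, cr = 0;
        boolean prevCr = false; // true if the previous byte was '\r'
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                if (prevCr) crlf++; else lf++;
                prevCr = false;
            } else {
                if (prevCr) cr++; // the pending '\r' was not followed by '\n'
                prevCr = (b == '\r');
            }
        }
        if (prevCr) cr++; // trailing lone '\r' at end of input
        return new long[] {crlf, lf, cr};
    }

    /** Convenience overload for in-memory data. */
    static long[] count(byte[] data) {
        try {
            return count(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rewrites CRLF and lone CR as '\n', i.e. the "\n rule" for line breaks. */
    static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: LineSeparators <local-file>");
            return;
        }
        try (InputStream in =
                 new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            long[] c = count(in);
            System.out.println("CRLF=" + c[0] + " LF=" + c[1] + " CR=" + c[2]);
        }
    }
}
```

If the failing chunk reports CRLF or lone CR counts while the working chunks report only LF, that would point toward the line separator explanation. If all chunks look the same, the serialization step becomes the more likely suspect.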

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
