mahout-dev mailing list archives

From Olivier Grisel <>
Subject Re: Random thought: line separators
Date Mon, 18 Jan 2010 13:58:46 GMT
2010/1/18 Robin Anil <>:
> It's this kind of thing that forced the move to sequence files instead of
> TextKeyValueInput format and other text-based / CSV-based formats. I'm kind of
> regretting the decision to go with a tab-separated format for BayesClassifier,
> which I wrote 2 years ago. I will be modifying this to use sparse vectors
> or sequence files, whichever fits.
> My thought is that this kind of functionality should only be used by the
> format converters that convert to and from sequence files, and when
> storing to sequence files we should just enforce the \n rule for line breaks.

By the way, I tried to run the Bayesian classifier's feature
extractor on the following Wikipedia chunk:


And I got an EOFException in Hadoop-related classes (no Mahout classes
in the stack trace). I wonder if this is related to the line separator
issue, or maybe to the Java serialization used in that step.

The feature extractor works on all the other chunks I tried, though. All
of those chunks were extracted on a Linux machine.
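
One quick way to test the line separator hypothesis would be to count which
separators actually occur in the failing chunk and compare with a chunk that
works. Below is a minimal sketch in plain Java, with no Hadoop or Mahout
dependencies. The class name LineSeparatorCount and its count method are
hypothetical, and it assumes the chunk is UTF-8 (or another encoding where the
bytes 0x0D and 0x0A only ever mean CR and LF).

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Mahout or Hadoop.
public class LineSeparatorCount {

    /** Returns {lf, crlf, cr}: bare "\n", "\r\n" pairs, and bare "\r". */
    public static long[] count(InputStream in) throws IOException {
        long lf = 0, crlf = 0, cr = 0;
        boolean prevCr = false; // true if the previous byte was '\r'
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                // '\n' directly after '\r' is one CRLF pair, otherwise a bare LF.
                if (prevCr) crlf++; else lf++;
                prevCr = false;
            } else {
                // The pending '\r' was not followed by '\n', so it is a bare CR.
                if (prevCr) cr++;
                prevCr = (b == '\r');
            }
        }
        if (prevCr) cr++; // the stream ended on a bare '\r'
        return new long[] {lf, crlf, cr};
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the chunk to inspect.
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            long[] c = count(in);
            System.out.printf("LF=%d CRLF=%d CR=%d%n", c[0], c[1], c[2]);
        }
    }
}
```

If the failing chunk reports a non-zero CRLF or CR count while the working
chunks report only LF, that would point towards line separators. Identical
counts would make the serialization step the more likely suspect.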

Olivier -
