mahout-user mailing list archives

From Charly Lizarralde <charly.lizarra...@gmail.com>
Subject Re: Converting one large text file with multiple documents to SequenceFile format
Date Tue, 30 Oct 2012 21:07:58 GMT
I had the exact same issue. I tried the seqdirectory command with a
different filter class, but it did not work. There seems to be a bug in the
mahout-0.6 code.

I ended up writing a custom MapReduce program that does just that.
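
For illustration only, here is a minimal sketch of the per-line step such a program needs. It is not the original code, and it assumes the "docId {tab} document text" layout suggested in the quoted message below. The class and method names (LineToRecord, parseLine, parseAll) and the "line-N" fallback key are hypothetical. It uses only the Java standard library; in a real Hadoop job, each resulting pair would presumably be written as a Text key and Text value, for example in the mapper or through a SequenceFile writer.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class LineToRecord {

    // Split one input line into (docId, text) at the first tab.
    // Assumption: lines look like "docId<TAB>document text".
    // If there is no ID before a tab, fall back to a synthetic key
    // built from the line number and keep the whole line as the text.
    static Map.Entry<String, String> parseLine(String line, long lineNumber) {
        int tab = line.indexOf('\t');
        if (tab > 0) {
            return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
        }
        return new SimpleEntry<>("line-" + lineNumber, line);
    }

    // Turn every non-blank line into one record; blank lines are skipped
    // but still counted, so fallback keys match the original line positions.
    static List<Map.Entry<String, String>> parseAll(List<String> lines) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        long lineNumber = 0;
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                records.add(parseLine(line, lineNumber));
            }
            lineNumber++;
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "docId1\tfirst document text",
                "docId2\tsecond document text",
                "a line without an ID");
        for (Map.Entry<String, String> record : parseAll(lines)) {
            // In a Hadoop job, this is where the pair would be emitted
            // as the key and value of one SequenceFile record.
            System.out.println(record.getKey() + " => " + record.getValue());
        }
    }
}
```

Keeping the document ID as the key means each line stays addressable as its own document downstream, which is the property the one-file-per-document layout provided for seqdirectory.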

Greetings!
Charly

On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <levar1@hotmail.com> wrote:

>
> I have done a lot of searching on the web for this, but I've found
> nothing, even though I feel like it has to be somewhat common. I have used
> Mahout's 'seqdirectory' command to convert a folder containing text files
> (each file is a separate document) in the past. But in this case there are
> so many documents (in the 100,000s) that I have one very large text file in
> which each line is a document. How can I convert this large file to
> SequenceFile format so that Mahout understands that each line should be
> considered a separate document? Would it be better if the file was
> structured like so:
>
> docId1 {tab} document text
> docId2 {tab} document text
> docId3 {tab} document text
> ...
>
> Thank you very much for any help.
> Nick
>
