mahout-user mailing list archives

From Diego Ceccarelli <diego.ceccare...@gmail.com>
Subject Re: Converting one large text file with multiple documents to SequenceFile format
Date Thu, 01 Nov 2012 00:07:29 GMT
Hi Nick,
I had exactly the same problem ;)
I wrote a simple command line utility to create a sequence
file where each line of the input document is an entry
(the key is the line number).

https://dl.dropbox.com/u/4663256/tmp/lda-helper.jar

java -cp lda-helper.jar it.cnr.isti.hpc.lda.cli.LinesToSequenceFileCLI -input tweets -output tweets.seq
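
For anyone who wants to see the idea without downloading the jar, here is a minimal sketch of the record-building step only. It assumes each input line becomes one (key, value) entry: a line in the "docId {tab} document text" layout is keyed by its docId, and any other line is keyed by its line number. The names LineRecords, toRecord and toRecords are hypothetical and are not taken from lda-helper.jar or Mahout. Writing the pairs out as a real SequenceFile would still need Hadoop (for example SequenceFile.Writer with Text keys and values); that part is left out so the sketch runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of lda-helper.jar or Mahout.
final class LineRecords {

    // Turns one input line into a {key, value} pair.
    // "docId<TAB>text" becomes {docId, text}. A line with no tab, or with an
    // empty id before the tab, is keyed by its line number instead.
    static String[] toRecord(long lineNumber, String line) {
        int tab = line.indexOf('\t');
        if (tab > 0) {
            // Split on the first tab only, so tabs inside the text are kept.
            return new String[] { line.substring(0, tab), line.substring(tab + 1) };
        }
        return new String[] { Long.toString(lineNumber), line };
    }

    // Converts all lines, skipping empty ones. Line numbers start at 0 and
    // count skipped lines too, so keys match positions in the input file.
    static List<String[]> toRecords(List<String> lines) {
        List<String[]> records = new ArrayList<>();
        long lineNumber = 0;
        for (String line : lines) {
            if (!line.isEmpty()) {
                records.add(toRecord(lineNumber, line));
            }
            lineNumber++;
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("doc1\tfirst document", "a line without an id");
        for (String[] record : toRecords(lines)) {
            System.out.println(record[0] + " => " + record[1]);
        }
    }
}
```

Splitting on the first tab only is a deliberate choice in this sketch: it keeps the whole document text as the value even when the text itself contains tabs.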

enjoy ;)
Diego

On Wed, Oct 31, 2012 at 9:30 PM, Charly Lizarralde
<charly.lizarralde@gmail.com> wrote:
> I don't think you need that. Just a simple mapper.
>
> static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
>
>     @Override
>     protected void map(LongWritable key, Text value, Context context)
>             throws IOException, InterruptedException {
>         // Expect "docId<TAB>document text"; lines without a tab are skipped.
>         // Note: any text after a second tab is dropped.
>         String[] fields = value.toString().split("\t");
>         if (fields.length >= 2) {
>             context.write(new Text(fields[0]), new Text(fields[1]));
>         }
>     }
> }
>
> and then run a simple job..
>
>         Job text2SequenceFileJob = this.prepareJob(this.getInputPath(),
>                 this.getOutputPath(), TextInputFormat.class, IdentityMapper.class,
>                 Text.class, Text.class, SequenceFileOutputFormat.class);
>
>         text2SequenceFileJob.setOutputKeyClass(Text.class);
>         text2SequenceFileJob.setOutputValueClass(Text.class);
>         text2SequenceFileJob.setNumReduceTasks(0);
>
>         text2SequenceFileJob.waitForCompletion(true);
>
> Cheers!
> Charly
>
> On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <levar1@hotmail.com> wrote:
>
>>
>> Yeah, I've looked at filter classes, but nothing worked.  I guess I'll do
>> something similar and continuously save each line into a file and then run
>> seqdirectory.  The running time won't look good, but at least it should
>> work.  Thanks for the response.
>>
>> Nick
>>
>> > From: charly.lizarralde@gmail.com
>> > Date: Tue, 30 Oct 2012 18:07:58 -0300
>> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
>> > To: user@mahout.apache.org
>> >
>> > I had the exact same issue. I tried to use the seqdirectory command with
>> > a different filter class, but it did not work. It seems there's a bug in
>> > the mahout-0.6 code.
>> >
>> > I ended up writing a custom map-reduce program that performs just that.
>> >
>> > Greetings!
>> > Charly
>> >
>> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <levar1@hotmail.com> wrote:
>> >
>> > >
>> > > I have done a lot of searching on the web for this, but I've found
>> > > nothing, even though I feel like it has to be somewhat common. I have
>> > > used Mahout's 'seqdirectory' command to convert a folder containing text
>> > > files (each file is a separate document) in the past. But in this case
>> > > there are so many documents (in the 100,000s) that I have one very large
>> > > text file in which each line is a document. How can I convert this large
>> > > file to SequenceFile format so that Mahout understands that each line
>> > > should be considered a separate document?  Would it be better if the
>> > > file was structured like so:
>> > >
>> > > docId1 {tab} document text
>> > > docId2 {tab} document text
>> > > docId3 {tab} document text
>> > > ...
>> > >
>> > > Thank you very much for any help.
>> > > Nick
>> > >
>>
>>



-- 
Computers are useless. They can only give you answers.
(Pablo Picasso)
_______________
Diego Ceccarelli
High Performance Computing Laboratory
Information Science and Technologies Institute (ISTI)
Italian National Research Council (CNR)
Via Moruzzi, 1
56124 - Pisa - Italy

Phone: +39 050 315 3055
Fax: +39 050 315 2040
________________________________________
