From: Diego Ceccarelli
Date: Thu, 1 Nov 2012 01:07:29 +0100
Subject: Re: Converting one large text file with multiple documents to SequenceFile format
To: user@mahout.apache.org

Hey Nick,

I had exactly the same problem ;)
I wrote a simple command-line utility that creates a sequence file in which
each line of the input document is an entry (the key is the line number).

https://dl.dropbox.com/u/4663256/tmp/lda-helper.jar

java -cp lda-helper.jar it.cnr.isti.hpc.lda.cli.LinesToSequenceFileCLI -input tweets -output tweets.seq

Enjoy ;)
Diego

On Wed, Oct 31, 2012 at 9:30 PM, Charly Lizarralde wrote:
> I don't think you need that. Just a simple mapper.
>
> // Splits each input line on tabs and emits (first field, second field).
> static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
>
>     @Override
>     protected void map(LongWritable key, Text value, Context context)
>             throws IOException, InterruptedException {
>         String[] fields = value.toString().split("\t");
>         // Lines without a tab are silently dropped.
>         if (fields.length >= 2) {
>             context.write(new Text(fields[0]), new Text(fields[1]));
>         }
>     }
> }
>
> and then run a simple job:
>
> Job text2SequenceFileJob = this.prepareJob(this.getInputPath(),
>     this.getOutputPath(), TextInputFormat.class, IdentityMapper.class,
>     Text.class, Text.class, SequenceFileOutputFormat.class);
>
> text2SequenceFileJob.setOutputKeyClass(Text.class);
> text2SequenceFileJob.setOutputValueClass(Text.class);
> text2SequenceFileJob.setNumReduceTasks(0);
>
> text2SequenceFileJob.waitForCompletion(true);
>
> Cheers!
> Charly
>
> On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward wrote:
>
>> Yeah, I've looked at filter classes, but nothing worked. I guess I'll do
>> something similar and continuously save each line into a file and then
>> run seqdirectory. The running time won't look good, but at least it
>> should work. Thanks for the response.
>>
>> Nick
>>
>> > From: charly.lizarralde@gmail.com
>> > Date: Tue, 30 Oct 2012 18:07:58 -0300
>> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
>> > To: user@mahout.apache.org
>> >
>> > I had the exact same issue and I tried to use the seqdirectory command
>> > with a different filter class, but it did not work. It seems there's a
>> > bug in the mahout-0.6 code.
>> >
>> > I ended up writing a custom map-reduce program that performs just that.
>> >
>> > Greetings!
>> > Charly
>> >
>> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward wrote:
>> >
>> > > I have done a lot of searching on the web for this, but I've found
>> > > nothing, even though I feel like it has to be somewhat common. I have
>> > > used Mahout's 'seqdirectory' command to convert a folder containing
>> > > text files (each file is a separate document) in the past. But in
>> > > this case there are so many documents (in the 100,000s) that I have
>> > > one very large text file in which each line is a document. How can I
>> > > convert this large file to SequenceFile format so that Mahout
>> > > understands that each line should be considered a separate document?
>> > > Would it be better if the file was structured like so....
>> > >
>> > > docId1 {tab} document text
>> > > docId2 {tab} document text
>> > > docId3 {tab} document text
>> > > ...
>> > >
>> > > Thank you very much for any help.
>> > > Nick

-- 
Computers are useless. They can only give you answers.
(Pablo Picasso)
_______________
Diego Ceccarelli
High Performance Computing Laboratory
Information Science and Technologies Institute (ISTI)
Italian National Research Council (CNR)
Via Moruzzi, 1
56124 - Pisa - Italy

Phone: +39 050 315 3055
Fax: +39 050 315 2040
________________________________________
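
P.S. For anyone who wants to check the tab-splitting step of the mapper quoted above without a Hadoop cluster, here is a minimal plain-Java sketch of just that step. It is not Mahout or Hadoop code: the class name TabSeparatedLines and the method toRecords are hypothetical names made up for illustration. It also makes one deliberate change from the quoted mapper, which is an assumption about what you probably want: it splits only on the first tab (split limit 2), so tabs inside the document text are kept rather than cut off.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Mahout or Hadoop.
final class TabSeparatedLines {

    // Turns "docId<TAB>document text" lines into (docId, text) pairs.
    static List<Map.Entry<String, String>> toRecords(List<String> lines) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        for (String line : lines) {
            // Limit 2: split on the first tab only, so later tabs stay in the text.
            String[] fields = line.split("\t", 2);
            // Skip lines that have no tab or an empty document id.
            if (fields.length == 2 && !fields[0].isEmpty()) {
                records.add(new SimpleImmutableEntry<>(fields[0], fields[1]));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "doc1\tfirst document text",
                "no tab on this line",
                "doc2\ttext with\ta tab inside");
        for (Map.Entry<String, String> record : toRecords(lines)) {
            System.out.println(record.getKey() + " -> " + record.getValue());
        }
    }
}
```

In a real job, each (docId, text) pair would be what the mapper passes to context.write as two Text objects. The line without a tab is skipped here, which matches the fields.length check in the quoted mapper.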