From: Charly Lizarralde
Date: Tue, 30 Oct 2012 18:07:58 -0300
Subject: Re: Converting one large text file with multiple documents to SequenceFile format
To: user@mahout.apache.org

I had the exact same issue. I tried to use the seqdirectory command with a
different filter class, but it did not work. There seems to be a bug in the
mahout-0.6 code. I ended up writing a custom map-reduce program that does
just that.

Greetings!
Charly

On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward wrote:

> I have done a lot of searching on the web for this, but I've found
> nothing, even though I feel like it has to be somewhat common. I have used
> Mahout's 'seqdirectory' command to convert a folder containing text files
> (each file is a separate document) in the past. But in this case there are
> so many documents (in the 100,000s) that I have one very large text file in
> which each line is a document. How can I convert this large file to
> SequenceFile format so that Mahout understands that each line should be
> considered a separate document? Would it be better if the file was
> structured like so:
>
> docId1 {tab} document text
> docId2 {tab} document text
> docId3 {tab} document text
> ...
>
> Thank you very much for any help.
> Nick
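
For anyone taking the same route, here is a minimal sketch of the per-line step such a custom program needs: turning each line of the large file into a (docId, text) pair. It is not the program mentioned above. The class name `LineDocumentSplitter`, the method names, and the `line-N` fallback key are all hypothetical, and the tab-separated layout is the one proposed in the quoted question. It uses only the Java standard library; writing the resulting pairs out (for example with Hadoop's `SequenceFile.Writer`, using `Text` keys and `Text` values) is left out and would have to be added.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class LineDocumentSplitter {

    // Splits one input line into (docId, text) at the first tab.
    // Assumption: lines look like "docId<TAB>document text". If a line has
    // no tab, a synthetic key "line-<N>" is used (hypothetical convention).
    static Map.Entry<String, String> toKeyValue(String line, long lineNumber) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleEntry<>("line-" + lineNumber, line.trim());
        }
        String docId = line.substring(0, tab).trim();
        String text = line.substring(tab + 1).trim();
        return new SimpleEntry<>(docId, text);
    }

    // Treats every non-blank line as one document; line numbers are 1-based
    // and count blank lines too, so fallback keys match the input file.
    static List<Map.Entry<String, String>> split(List<String> lines) {
        List<Map.Entry<String, String>> docs = new ArrayList<>();
        long lineNumber = 0;
        for (String line : lines) {
            lineNumber++;
            if (line.trim().isEmpty()) {
                continue; // skip empty lines instead of emitting empty documents
            }
            docs.add(toKeyValue(line, lineNumber));
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "docId1\tfirst document text",
                "docId2\tsecond document text");
        for (Map.Entry<String, String> doc : split(lines)) {
            // In a real job, each pair would be appended to a SequenceFile here.
            System.out.println(doc.getKey() + " => " + doc.getValue());
        }
    }
}
```

In a map-reduce version, the body of `toKeyValue` would sit inside the mapper, with one call per input line. Keeping the document ID as the key means it stays attached to the document in whatever is produced downstream.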