From: Nick Cen
To: core-user@hadoop.apache.org
Date: Wed, 6 May 2009 09:06:22 +0800
Subject: Re: multi-line records and file splits
In-Reply-To: <9D634D5D-E49C-4BFA-993F-4778AB786D2D@indiana.edu>

I think your SDFInputFormat should extend MultiFileInputFormat instead of
TextInputFormat. MultiFileInputFormat does not split a file into chunks, so
each file reaches the record reader whole.

2009/5/6 Rajarshi Guha

> Hi, I have implemented a subclass of RecordReader to handle a plain text
> file format where a record is multi-line and of variable length.
> Schematically each record is of the form
>
> some_title
> foo
> bar
> $$$$
> another_title
> foo
> foo
> bar
> $$$$
>
> where $$$$ is the marker for the end of the record. My code is at
> http://blog.rguha.net/?p=293 and it seems to work fine on my input data.
>
> However, I realized that when I run the program, Hadoop will 'chunk' the
> input file. As a result, the SDFRecordReader might get a chunk of input
> text, such that the last record is actually incomplete (a missing $$$$). Is
> this correct?
>
> If so, how would the RecordReader implementation recover from this
> situation? Or is there a way to indicate to Hadoop that the input file
> should be chunked keeping in mind end of record delimiters?
>
> Thanks
>
> -------------------------------------------------------------------
> Rajarshi Guha
> GPG Fingerprint: D070 5427 CC5B 7938 929C DD13 66A1 922C 51E7 9E84
> -------------------------------------------------------------------
> Q: What's polite and works for the phone company?
> A: A deferential operator.
>

--
http://daily.appspot.com/food/
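
On the recovery question in the quoted message: if the file does get split, one
common convention (the one line-oriented readers use for single lines) is that
a record belongs to the split in which it starts. A reader whose split does not
begin at offset 0 skips forward to the first position that follows a $$$$ line,
and every reader is allowed to read past the end of its split to finish its
last record. The sketch below only illustrates that rule on an in-memory
string; it is not the code from the blog post and uses no Hadoop classes. The
class and method names (SdfSplitSketch, nextDelimiter, readSplit) are made up
for the example, and it assumes single-byte text with "\n" line endings so that
string indices match byte offsets.

```java
import java.util.ArrayList;
import java.util.List;

public class SdfSplitSketch {
    // End-of-record marker line from the original post, including its newline.
    static final String DELIM = "$$$$\n";

    // Index of the next "$$$$" line that begins at or after `from`, or -1.
    // A match only counts if it sits at the start of a line.
    static int nextDelimiter(String data, int from) {
        int i = data.indexOf(DELIM, Math.max(0, from));
        while (i > 0 && data.charAt(i - 1) != '\n') {
            i = data.indexOf(DELIM, i + 1);
        }
        return i;
    }

    // Returns the records that START inside [start, end). The last record may
    // extend beyond `end`; that is intentional.
    static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        if (start > 0) {
            // Back up by one marker length so a marker straddling `start` is
            // still seen whole, then resume right after the marker found.
            int d = nextDelimiter(data, start - DELIM.length());
            pos = (d < 0) ? data.length() : d + DELIM.length();
        }
        while (pos < end && pos < data.length()) {
            int d = nextDelimiter(data, pos);
            if (d < 0) {
                break; // trailing text without a marker is not a complete record
            }
            records.add(data.substring(pos, d));
            pos = d + DELIM.length();
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "some_title\nfoo\nbar\n$$$$\n"
                    + "another_title\nfoo\nfoo\nbar\n$$$$\n";
        int mid = data.length() / 2; // arbitrary split point for the demo
        System.out.println(readSplit(data, 0, mid));
        System.out.println(readSplit(data, mid, data.length()));
    }
}
```

With this rule each record is produced by exactly one split, wherever the split
boundaries fall. In a real RecordReader the same idea would be applied to the
input stream and the split's start and length rather than to a string.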