Subject: Re: Getting custom input splits from files that are not byte-aligned or line-aligned
From: Public Network Services <publicnetworkservices@gmail.com>
To: user@hadoop.apache.org
Date: Sat, 23 Feb 2013 11:40:02 -0800 (PST)

This appears to be the case.

My main issue is not reading the records (the library offers that
functionality) but putting them into splits after reading (option 2 in my
original post).
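Roughly, what I have in mind for option 2 is the following untested sketch,
in which LibraryReader is a hypothetical stand-in for the library's "reader"
class and every other name is made up as well:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Option 2 sketch: getSplits() pulls records out of each input file with
// the library reader, rewrites them into ~64 MB chunk files on HDFS, and
// returns one FileSplit per chunk file. Assumes the extracted record
// strings contain no embedded newlines; otherwise another delimiter (or a
// SequenceFile) would be needed.
public class PreChunkingInputFormat extends FileInputFormat<LongWritable, Text> {

  private static final long CHUNK_BYTES = 64L * 1024 * 1024;

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    Configuration conf = job.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus file : listStatus(job)) {
      LibraryReader reader = new LibraryReader(fs.open(file.getPath())); // hypothetical
      int part = 0;
      long written = 0;
      Path chunk = chunkPath(conf, file.getPath(), part);
      FSDataOutputStream out = fs.create(chunk);
      for (String record = reader.next(); record != null; record = reader.next()) {
        byte[] bytes = (record + "\n").getBytes("UTF-8");
        if (written > 0 && written + bytes.length > CHUNK_BYTES) {
          // Current chunk is full: seal it into a split and start a new one.
          out.close();
          splits.add(new FileSplit(chunk, 0, written, new String[0]));
          chunk = chunkPath(conf, file.getPath(), ++part);
          out = fs.create(chunk);
          written = 0;
        }
        out.write(bytes);
        written += bytes.length;
      }
      out.close();
      if (written > 0) {
        splits.add(new FileSplit(chunk, 0, written, new String[0]));
      } else {
        fs.delete(chunk, false); // drop an empty trailing chunk
      }
      reader.close();
    }
    return splits;
  }

  // Hypothetical naming scheme for the intermediate chunk files.
  private static Path chunkPath(Configuration conf, Path source, int part) {
    return new Path(conf.get("hadoop.tmp.dir", "/tmp"),
        source.getName() + ".chunk-" + part);
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // The chunk files are newline-delimited, so plain line reading works.
    return new LineRecordReader();
  }
}

The obvious downside is that getSplits() runs single-threaded on the client
and copies the whole data set once before the job starts, which is exactly
why I am wondering whether there is a better approach.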
On Sat, Feb 23, 2013 at 11:05 AM, Wellington Chevreuil <
wellington.chevreuil@gmail.com> wrote:

> Hi,
>
> I think you'll have to implement your own custom FileInputFormat, using
> the library you mentioned to properly read your file records and split
> them across map tasks.
>
> Regards,
> Wellington.
>
> On 23/02/2013 14:14, "Public Network Services" <
> publicnetworkservices@gmail.com> wrote:
>
>> Hi...
>>
>> I use an application that processes text files containing data records
>> which are of variable size and not line-aligned.
>>
>> The application implementation includes a Java library with a "reader"
>> object that can extract records one by one in a "pull" fashion, as
>> strings, i.e., for each such "reader" object the client code can call
>>
>>     reader.next()
>>
>> and get an entire record as a String. Proceeding in this fashion, the
>> client code can consume a file of arbitrary length from start to end,
>> whereupon a null value is returned.
>>
>> Another peculiarity is that the extracted record strings may lose some
>> secondary information (e.g., trimmed whitespace), so exact byte
>> alignment of the records to the underlying data is not possible.
>>
>> How could the above code be used to efficiently split compliant text
>> files of large size (ranging from hundreds of megabytes to several
>> gigabytes, or even terabytes)?
>>
>> The source code I have seen in FileInputFormat and numerous other
>> implementations is line- or byte-aligned, so it is not applicable to
>> the above case.
>>
>> It would actually be very useful if there were a template implementation
>> that left only the string record "reader" object unspecified and did
>> everything else, but apparently there is none.
>>
>> Two alternatives that should work are:
>>
>>    1. Split the files outside Hadoop (e.g., to sizes less than 64 MB)
>>    and supply them to HDFS afterwards, returning false in the
>>    isSplitable() method of the custom InputFormat.
>>    2. Read and write records into HDFS files in the getSplits() method
>>    of the custom InputFormat and create one FileSplit reference for
>>    each of these HDFS files, once they are filled to the desired size.
>>
>> Is there any better approach and/or any example code relevant to the
>> above?
>>
>> Thanks!
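P.S. For reference, the non-splittable variant (option 1 above, which is
also what Wellington suggested) would look roughly like the following,
again as an untested sketch with LibraryReader standing in for the actual
library class:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Option 1 sketch: the files are pre-split outside Hadoop to below the
// block size, so each file becomes exactly one split and the record
// reader just pulls whole records until the library returns null.
public class LibraryRecordInputFormat extends FileInputFormat<NullWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // one map task per pre-split file
  }

  @Override
  public RecordReader<NullWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new RecordReader<NullWritable, Text>() {
      private FSDataInputStream in;
      private LibraryReader reader; // hypothetical library "reader" object
      private final Text value = new Text();

      @Override
      public void initialize(InputSplit genericSplit, TaskAttemptContext ctx)
          throws IOException {
        Path path = ((FileSplit) genericSplit).getPath();
        FileSystem fs = path.getFileSystem(ctx.getConfiguration());
        in = fs.open(path);
        reader = new LibraryReader(in);
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        String record = reader.next(); // null once the file is consumed
        if (record == null) {
          return false;
        }
        value.set(record);
        return true;
      }

      @Override
      public NullWritable getCurrentKey() {
        return NullWritable.get();
      }

      @Override
      public Text getCurrentValue() {
        return value;
      }

      @Override
      public float getProgress() {
        return 0.0f; // records are not byte-aligned, so progress is unknown
      }

      @Override
      public void close() throws IOException {
        in.close();
      }
    };
  }
}

A job would then just set this class as its input format, e.g.
job.setInputFormatClass(LibraryRecordInputFormat.class), and each map()
call would receive one whole record as its value.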