From: Mohammad Tariq <dontariq@gmail.com>
Date: Wed, 16 Jan 2013 21:08:26 +0530
Subject: Re: Loading file to HDFS with custom chunk structure
To: user@hadoop.apache.org

Hello there,

You don't have to split the file yourself. When you push anything into
HDFS, it is automatically split into blocks of uniform size (typically
64 MB or 128 MB), and the MapReduce framework tries to schedule each
task on a node that holds a local replica of the block it processes.

Do you have any specific requirement?
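For instance, a plain copy is all it takes; HDFS does the chunking
transparently. A minimal sketch with the Java FileSystem API (the paths
are hypothetical, and setting the block size explicitly is optional; the
property is dfs.block.size on older releases, dfs.blocksize on newer ones):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCopy {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Optional: request a specific block (chunk) size for new files
            // instead of the cluster default; 134217728 bytes = 128 MB.
            conf.set("dfs.block.size", "134217728");
            FileSystem fs = FileSystem.get(conf);
            // One call: HDFS splits the file into blocks behind the scenes,
            // but clients still see a single logical file at the destination.
            fs.copyFromLocalFile(new Path("/local/data/large.segy"),
                                 new Path("/user/hadoop/large.segy"));
        }
    }

The shell equivalent is simply:

    hadoop fs -copyFromLocal /local/data/large.segy /user/hadoop/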
Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Wed, Jan 16, 2013 at 9:01 PM, Kaliyug Antagonist
<kaliyugantagonist@gmail.com> wrote:

> I want to load a SegY file onto the HDFS of a 3-node Apache Hadoop
> cluster.
>
> To summarize, a SegY file consists of:
>
> 1. a 3200-byte *textual header*
> 2. a 400-byte *binary header*
> 3. variable bytes of *data*
>
> 99.99% of the file's size is due to the variable-length data, which is
> a collection of thousands of contiguous traces. For any SegY file to
> make sense, it must have the textual header + binary header + at least
> one trace of data. What I want to achieve is to split a large SegY file
> across the Hadoop cluster so that a smaller, valid SegY file is
> available on each node for local processing.
>
> The scenario is as follows:
>
> 1. The SegY file is large (above 10 GB) and rests on the local file
>    system of the NameNode machine.
> 2. The file is to be split across the nodes in such a way that each
>    node has a small SegY file with the strict structure: 3200-byte
>    *textual header* + 400-byte *binary header* + variable bytes of
>    *data*. Obviously, I can't blindly use FSDataOutputStream or
>    hadoop fs -copyFromLocal, as these may not preserve the format that
>    each chunk of the larger file requires.
>
> Please guide me as to how I must proceed.
>
> Thanks and regards!
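P.S. If each node really must hold a self-contained SegY file with the
3200-byte and 400-byte headers you describe, FSDataOutputStream can do
it, provided every chunk is rewritten with its own copy of both headers.
An untested sketch follows; the class name, the traces-per-chunk
constant, and the trace-length computation are stand-ins, since a real
version must decode samples-per-trace and the sample format from the
binary header:

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SegyChunkWriter {
        static final int TEXT_HDR = 3200;          // textual header size
        static final int BIN_HDR = 400;            // binary header size
        static final int TRACES_PER_CHUNK = 50000; // hypothetical; tune it

        // Placeholder: a real version must read samples-per-trace and the
        // sample format from the binary header to size a trace correctly.
        static int traceLength(byte[] binHdr) {
            return 240 + 4 * 1500; // 240-byte trace header + 1500 4-byte samples
        }

        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            DataInputStream in =
                new DataInputStream(new FileInputStream(args[0]));

            byte[] textHdr = new byte[TEXT_HDR];
            byte[] binHdr = new byte[BIN_HDR];
            in.readFully(textHdr);
            in.readFully(binHdr);

            byte[] trace = new byte[traceLength(binHdr)];
            FSDataOutputStream out = null;
            int chunk = 0, traces = 0;
            try {
                while (true) {
                    in.readFully(trace); // EOFException ends the loop
                    if (out == null) {
                        // Every chunk starts with its own copy of both
                        // headers, so each piece is a valid SegY file.
                        out = fs.create(new Path(
                            args[1] + "/chunk-" + chunk++ + ".segy"));
                        out.write(textHdr);
                        out.write(binHdr);
                    }
                    out.write(trace);
                    if (++traces == TRACES_PER_CHUNK) {
                        out.close();
                        out = null;
                        traces = 0;
                    }
                }
            } catch (EOFException endOfInput) {
                // ran out of traces; fall through to cleanup
            } finally {
                if (out != null) out.close();
                in.close();
            }
        }
    }

Note that HDFS decides replica placement on its own, so "one chunk per
node" would still come from MapReduce data locality (or a custom
placement step) rather than manual pinning.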