Subject: Re: Memory problems with BytesWritable and huge binary files
From: Vinod Kumar Vavilapalli
Date: Fri, 24 Jan 2014 14:42:45 -0800
To: user@hadoop.apache.org

Okay. Assuming you don't need a whole file (video) in memory for your processing, you can simply write an InputFormat/RecordReader implementation that streams through any given file and processes it.

+Vinod

On Jan 24, 2014, at 12:44 PM, Adam Retter wrote:

>> Is your data in any given file a bunch of key-value pairs?
>
> No. The content of each file itself is the value we are interested in,
> and I guess that its filename is the key.
>
>> If that isn't the
>> case, I'm wondering how writing a single large key-value into a sequence
>> file helps. It won't. Maybe you can give an example of your input data?
>
> Well from the Hadoop O'Reilly book, I rather got the impression that
> HDFS does not like small files due to its 64MB block size, and it is
> instead recommended to place small files into a Sequence file. Is that
> not the case?
>
> Our input data really varies between 130 different file types, it
> could be Microsoft Office documents, Video Recordings, Audio, CAD
> diagrams etc.
>
>> If indeed they are a bunch of smaller sized key-value pairs, you can write
>> your own custom InputFormat that reads the data from your input files one
>> k-v pair after another, and feed it to your MR job. There isn't any need for
>> converting them to sequence-files at that point.
>
> As I mentioned in my initial email, each file cannot be split up!
>
>> Thanks
>> +Vinod
>> Hortonworks Inc.
>> http://hortonworks.com/
>>
>>
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity to
>> which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the reader of
>> this message is not the intended recipient, you are hereby notified that any
>> printing, copying, dissemination, distribution, disclosure or forwarding of
>> this communication is strictly prohibited. If you have received this
>> communication in error, please contact the sender immediately and delete it
>> from your system. Thank You.
>
>
>
> --
> Adam Retter
>
> skype: adam.retter
> tweet: adamretter
> http://www.adamretter.org.uk
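
The streaming approach suggested in this thread (a RecordReader that walks through a file instead of loading it whole into a BytesWritable) can be illustrated with a rough, JDK-only sketch. It is not Hadoop code: ChunkedStreamReader, nextChunk() and the current*() accessors are hypothetical names chosen here to mirror the shape of RecordReader.nextKeyValue() / getCurrentKey() / getCurrentValue(). In a real job the same loop would presumably live inside a RecordReader returned by a FileInputFormat subclass whose isSplitable() returns false, reading from the stream opened on the split's path. The 4-byte chunk size is only there to keep the demo small.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for the read loop of a streaming RecordReader. */
public class ChunkedStreamReader implements Closeable {
    private final InputStream in;
    private final byte[] buffer; // reused for every chunk, so memory use stays bounded
    private int length;          // number of valid bytes currently in buffer
    private long position;       // offset of the next byte to be read
    private long chunkStart;     // offset at which the current chunk begins

    public ChunkedStreamReader(InputStream in, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.in = in;
        this.buffer = new byte[chunkSize];
    }

    /** Analogous to nextKeyValue(): returns false once the stream is exhausted. */
    public boolean nextChunk() throws IOException {
        chunkStart = position;
        length = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (length < buffer.length) {
            int n = in.read(buffer, length, buffer.length - length);
            if (n < 0) {
                break;
            }
            length += n;
        }
        position += length;
        return length > 0;
    }

    /** Analogous to the key: the byte offset of the current chunk. */
    public long currentOffset() { return chunkStart; }

    /** Analogous to the value: only the first currentLength() bytes are valid. */
    public byte[] currentBuffer() { return buffer; }

    public int currentLength() { return length; }

    @Override
    public void close() throws IOException { in.close(); }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10]; // stands in for a large binary file
        try (ChunkedStreamReader r =
                 new ChunkedStreamReader(new ByteArrayInputStream(data), 4)) {
            while (r.nextChunk()) {
                System.out.println(r.currentOffset() + " " + r.currentLength());
            }
        }
    }
}
```

Because the buffer is reused, the caller has to finish with (or copy) each chunk before asking for the next one. That is the trade-off that keeps peak memory at one chunk rather than one file, and it only helps if the processing really can work on a stream, as assumed at the top of this message.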