From: "John Howland" <john.d.howland@gmail.com>
To: core-user@hadoop.apache.org
Date: Sun, 7 Sep 2008 19:22:06 -0400
Subject: Getting started questions

I've been reading up on Hadoop for a while now, and I'm excited to finally be getting my feet wet with the examples plus my own variations. If anyone could answer any of the following questions, I'd greatly appreciate it.

1. I'm processing document collections that range from 10,000 to 10,000,000 documents. What is the best way to store this data for effective processing?

- The bodies of the documents usually range from 1 KB to 100 KB in size, but some outliers can be as big as 4-5 GB.
- I will also need to store some metadata for each document, which I figure could be stored as JSON or XML.
- I'll typically filter on the metadata and then do standard operations on the bodies, like word frequency and searching.

Is there a canned FileInputFormat that makes sense, or should I roll my own? How can I access the bodies as streams so I don't have to read them into RAM all at once? Am I right in thinking that I should treat each document as a record and map across documents, or do I need to be more creative about what I'm mapping across?
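To make that concrete, here's roughly what I imagine rolling my own would look like: the input records carry document paths (one per line, alongside the metadata), and the mapper streams each body from HDFS instead of loading it whole. This is an untested sketch against the 0.18 mapred API; DocBodyMapper and the path-per-line input layout are just my own invention:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Input: text records, one HDFS path to a document body per line.
// The body is read as a stream, so even a 4-5 GB outlier never has
// to fit in the mapper's heap all at once.
public class DocBodyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    Path doc = new Path(value.toString().trim());
    FSDataInputStream in = fs.open(doc);
    try {
      byte[] buf = new byte[64 * 1024];
      long bytes = 0;
      for (int n; (n = in.read(buf)) > 0; ) {
        bytes += n;            // real per-document work would happen here
        reporter.progress();   // keep the task alive during long reads
      }
      out.collect(new Text(doc.getName()), new LongWritable(bytes));
    } finally {
      in.close();
    }
  }
}

Is that the right shape, or is there a canned way to get the same effect?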
2. Some of the tasks I want to run are pure map operations (no reduce), where I'm calculating new metadata fields for each document. To end up with a good result set, I'll need to copy the entire input record plus the new fields into another set of output files. Is there a better way? (A sketch of the map-only driver I have in mind is in the P.S. below.)

I haven't wanted to go down the HBase road because it can't handle very large values (for the bodies), and it seems to make the most sense to keep the document bodies together with the metadata, to allow for the greatest locality of reference on the datanodes.

3. I'm sure this is not a new idea, but I haven't seen anything written about it: I'll need to run several MR jobs as a pipeline. Is there any way for the map tasks of a later stage to begin processing data from the previous stage's reduce tasks before those reducers have fully finished? (The sequential chaining I'm doing today is also sketched in the P.S.)

Whatever insight folks could lend me would be a big help in crossing the chasm from Word Count and the associated examples to something more "real".

A whole heap of thanks in advance,

John
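P.S. To pin down question 2: this is the map-only driver I have in mind, with setNumReduceTasks(0) so the mappers' output is written straight to the output files with no sort/shuffle. EnrichMapper is a hypothetical stand-in that appends one new field to each record:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class EnrichJob {

  // Hypothetical: pass each record through with a new field appended.
  public static class EnrichMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(value, new Text("lang=en"));  // placeholder new field
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(EnrichJob.class);
    conf.setJobName("enrich-metadata");
    conf.setMapperClass(EnrichMapper.class);
    conf.setNumReduceTasks(0);  // map-only: mapper output is the job output
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}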
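And for question 3, this is all I know how to do today: run the stages back-to-back, with each runJob() blocking until the previous stage's reducers have committed. buildStage() is a hypothetical stand-in for the real per-stage configuration; what I'm asking is whether stage 2's maps can overlap stage 1's reduces instead of waiting like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;

public class Pipeline {

  // Hypothetical helper: configure one stage reading `in`, writing `out`.
  static JobConf buildStage(String name, String in, String out) {
    JobConf conf = new JobConf(Pipeline.class);
    conf.setJobName(name);
    FileInputFormat.setInputPaths(conf, new Path(in));
    FileOutputFormat.setOutputPath(conf, new Path(out));
    // per-stage mapper/reducer setup would go here
    return conf;
  }

  public static void main(String[] args) throws Exception {
    String tmp = args[1] + "-intermediate";  // handoff directory between stages
    JobClient.runJob(buildStage("stage1", args[0], tmp));  // blocks until stage 1 is done
    JobClient.runJob(buildStage("stage2", tmp, args[1]));  // only then do stage 2's maps start
  }
}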