From: Amandeep Khurana
Date: Mon, 2 Nov 2009 16:00:47 -0800
Subject: Re: XML input to map function
To: common-user@hadoop.apache.org

Are the XML files in flat files or stored in HBase?

1. If they are in flat files, you can use the StreamXmlRecordReader, if that works for you.
2. Or you can read the XML into a single string and process it however you want. (This works whether it is in a flat file or stored in an HBase table.) I have XML documents in an HBase table and parse and process them as strings.

One mapper per file doesn't make sense. If the data is in HBase, have one mapper per region. If it is in flat files, choose the number of mappers based on how many files you have. You can tune this for your particular requirements; there is no "right" way to do it.

On Mon, Nov 2, 2009 at 3:01 PM, Vipul Sharma wrote:

> I am working on a mapreduce application that will take input from lots of
> small xml files rather than one big xml file. Each xml file has some
> records that I want to parse and insert into an hbase table. How should I
> go about parsing the xml files and feeding them to the map functions?
> Should I have one mapper per xml file, or is there another way of doing
> this? Thanks for your help and time.
>
> Regards,
> Vipul Sharma
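
To illustrate option 2 from the reply (read the XML into a single string and process it yourself), here is a minimal sketch of what the parsing step inside a map function might look like. It is written in Python using only the standard library, so it could sit in a streaming-style mapper script. The `map_xml` function, the `<record>` element, its `id` attribute, and the `file-001` document id are all hypothetical names chosen for the example; they do not come from the thread and are not part of any Hadoop or HBase API.

```python
import xml.etree.ElementTree as ET


def map_xml(doc_id, xml_text):
    """Parse one small XML document held as a string.

    Yields (row_key, fields) pairs, one per record. The <record> element
    and its "id" attribute are assumptions made for this sketch; substitute
    the element names your files actually use.
    """
    root = ET.fromstring(xml_text)
    for record in root.iter("record"):
        # Build a row key from the source document and the record id.
        row_key = "{}:{}".format(doc_id, record.get("id"))
        # Flatten the record's child elements into a column -> value dict.
        fields = {child.tag: (child.text or "").strip() for child in record}
        yield row_key, fields


# A made-up sample document standing in for one small XML file.
sample = """<records>
  <record id="1"><name>alpha</name><size>10</size></record>
  <record id="2"><name>beta</name><size>20</size></record>
</records>"""

rows = list(map_xml("file-001", sample))
for row_key, fields in rows:
    print(row_key, fields)
```

Each yielded pair could then be written to the output table as one row. Parsing the whole document at once is only reasonable because the files are small; a very large file would need a streaming parser instead.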