Subject: Re: Processing small xml files
From: Mohit Anchlia
To: common-user@hadoop.apache.org
Date: Sun, 12 Feb 2012 12:30:10 -0800

On Sun, Feb 12, 2012 at 9:24 AM, W.P. McNeill wrote:

> I've used the Mahout XMLInputFormat. It is the right tool if you have an
> XML file with one type of section repeated over and over again and want
> to turn that into a SequenceFile where each repeated section is a value.
> I've found it helpful as a preprocessing step for converting raw XML
> input into something that can be handled by Hadoop jobs.

Thanks for the input. Do you first convert it into a flat format and then
run another Hadoop job, or do you just read the XML sequence file and
perform the reduce on that? Is there an advantage to first converting it
into a flat file format?

> If you're worried about having lots of small files--specifically, about
> overwhelming your namenode because you have too many small files--the
> XMLInputFormat won't help with that. However, it may be possible to
> concatenate the small files into larger files, then have a Hadoop job
> that uses XMLInputFormat to transform the large files into sequence
> files.

How many are too many for the namenode? We have around 100M files, and
another 100M every year.
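
In case it helps anyone following the thread, below is a rough sketch of
the kind of job W.P. is describing: read concatenated XML with Mahout's
XmlInputFormat and write a SequenceFile where each value is one repeated
XML section. It assumes the XmlInputFormat that shipped in the Mahout 0.x
releases (org.apache.mahout.classifier.bayes.XmlInputFormat, configured via
the xmlinput.start / xmlinput.end properties) and the new mapreduce API.
The <record> tags and the class names are placeholders -- adjust them for
whatever element repeats in your files.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.mahout.classifier.bayes.XmlInputFormat;

    public class XmlToSequenceFile {

      // Pass-through mapper: XmlInputFormat hands us the text between the
      // configured start/end tags as the value; we just write it out again.
      public static class XmlMapper
          extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          context.write(key, value);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell XmlInputFormat which tags delimit one record.
        // "<record>" / "</record>" are placeholders for your own element.
        conf.set("xmlinput.start", "<record>");
        conf.set("xmlinput.end", "</record>");

        Job job = new Job(conf, "xml to sequence file");
        job.setJarByClass(XmlToSequenceFile.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(XmlMapper.class);
        job.setNumReduceTasks(0);   // map-only: just repackage the records
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Keeping it map-only means the output is just a handful of large
SequenceFiles, so it also addresses the small-files side of the question:
downstream jobs read the SequenceFiles instead of millions of tiny XML
files. As a rule of thumb from the Hadoop docs, every file, directory and
block costs on the order of 150 bytes of namenode heap, so 100M files (and
growing) is exactly the range where packing them into larger files first
pays off.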