From: Andrzej Jan Taramina
Organization: Chaeron Corporation
To: common-user@hadoop.apache.org
Date: Mon, 14 Sep 2009 09:40:39 -0400
Subject: Processing a large quantity of smaller XML files?

I'm new to Hadoop, so pardon the potentially dumb question....

I've gathered, from much research, that Hadoop is not always a good choice when
you need to process a whack of smaller files, which is what we need to do. More
specifically, we need to start by processing about 250K XML files, each in the
50 KB - 2 MB range, with an average size of about 100 KB. The processing we
need to do on each file is pretty CPU-intensive, with a lot of pattern
matching, and it would fall nicely into the Map/Reduce paradigm. Over time, the
volume of files will grow by an order of magnitude, into the millions, hence
the desire to use a distributed MapReduce cluster to do the analysis we need.

Normally, one could just concatenate the XML files into bigger input files.
Unfortunately, one of our constraints is that a certain percentage of these XML
files will change every night, so we need to be able to update the Hadoop data
store (HDFS, perhaps) on a regular basis. That would be difficult if the files
were all concatenated. The XML data originally comes from a number of XML
databases.

Any advice or suggestions on the best way to structure our storage of all the
XML files so that Hadoop runs efficiently and we can use Map/Reduce on a Hadoop
cluster, yet still conveniently update the changed files on a nightly basis?

Much appreciated!
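
P.S. For concreteness, here's roughly the packing step I was picturing, based
on what I've read so far: write each small XML file as one record into a big
SequenceFile on HDFS, keyed by file name. This is only a sketch against the
0.20-era API; the paths and class names are made up, and I may well be holding
it wrong -- corrections welcome.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch only: pack a local directory of small XML files into a single
// SequenceFile on HDFS, one record per file, keyed by file name.
public class PackXmlFiles {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        File srcDir = new File(args[0]);   // local directory of XML files
        Path out = new Path(args[1]);      // e.g. /xml/packed/2009-09-14.seq (made-up path)

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
            for (File xml : srcDir.listFiles()) {
                writer.append(new Text(xml.getName()), new Text(readFile(xml)));
            }
        } finally {
            writer.close();
        }
    }

    // Read a whole (smallish) file into a UTF-8 string.
    private static String readFile(File f) throws IOException {
        byte[] buf = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
            in.readFully(buf);
        } finally {
            in.close();
        }
        return new String(buf, "UTF-8");
    }
}

The nightly-update problem is the part this doesn't solve: if one file changes,
the whole packed SequenceFile has to be rewritten, which is exactly why I'm
asking about storage layout.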
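
The analysis side I'm imagining would then just be a mapper over those (file
name, XML contents) records, read back in with SequenceFileInputFormat --
something like the purely illustrative sketch below, where the pattern and the
output are placeholders for the real matching we need to do:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: each input record is (file name, complete XML document), as
// produced by the packing step above, so one map() call sees a whole file
// and can do the CPU-heavy pattern matching on it.
public class XmlPatternMapper extends Mapper<Text, Text, Text, IntWritable> {

    // Placeholder pattern; the real job would do much more involved matching.
    private static final Pattern PATTERN =
        Pattern.compile("<status>(ERROR|WARN)</status>");

    @Override
    protected void map(Text fileName, Text xmlContents, Context context)
            throws IOException, InterruptedException {
        Matcher m = PATTERN.matcher(xmlContents.toString());
        int hits = 0;
        while (m.find()) {
            hits++;
        }
        context.write(fileName, new IntWritable(hits));   // matches per file
    }
}

Does that general shape make sense, or is there a better-suited input format or
storage layout given the nightly churn?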
-- 
Andrzej Taramina
Chaeron Corporation: Enterprise System Solutions
http://www.chaeron.com