From: Aleksandr Elbakyan
Subject: Re: Processing xml files
To: common-user@hadoop.apache.org
Date: Tue, 24 May 2011 16:25:26 -0700 (PDT)

Hello,

We have the same type of data. We currently convert it to a tab-delimited file and use that as input for streaming.

Regards,
Aleksandr

--- On Tue, 5/24/11, Mohit Anchlia wrote:

From: Mohit Anchlia
Subject: Processing xml files
To: common-user@hadoop.apache.org
Date: Tuesday, May 24, 2011, 4:16 PM

I just started learning hadoop and got done with wordcount mapreduce
example. I also briefly looked at hadoop streaming. Some questions:

1) What should be my first step now? Are there more examples somewhere that I can try out?

2) The second question is about practical usability with xml files. Our xml files are not big; they are around 120k in size, but hadoop is really meant for big files, so how do I go about processing these xml files?

3) Are there any samples or advice on how to process xml files?

Looking for help and pointers.
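
For illustration, the conversion mentioned in the reply (XML to a tab-delimited file that streaming can read) might look roughly like the sketch below. The thread does not describe the actual XML layout or the tool used, so the `<record>` element, the field names in `FIELDS`, and the function names are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed layout (not from the thread): each file holds <record> elements
# whose child elements are flat text fields.
FIELDS = ["id", "name", "value"]  # hypothetical field names


def xml_to_rows(xml_text, record_tag="record", fields=FIELDS):
    """Yield one list of field values per <record> element."""
    root = ET.fromstring(xml_text)
    for rec in root.iter(record_tag):
        row = []
        for field in fields:
            text = rec.findtext(field, default="")
            # Collapse tabs and newlines so each record stays on one line.
            row.append(" ".join(text.split()))
        yield row


def xml_to_tsv(xml_text, record_tag="record", fields=FIELDS):
    """Return the records as tab-delimited text, one record per line."""
    rows = xml_to_rows(xml_text, record_tag=record_tag, fields=fields)
    return "".join("\t".join(row) + "\n" for row in rows)
```

Running something like this over many small XML files and appending the output to a single file yields one larger line-oriented input, which is generally a better fit for Hadoop than many 120k files. A streaming mapper can then split each line on tabs.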