Date: Tue, 29 Apr 2008 09:30:10 -0700
Subject: Re: Map/Reduce with XML files ..
From: Ted Dunning
Reply-To: core-user@hadoop.apache.org
In-Reply-To: <447426.82636.qm@web38604.mail.mud.yahoo.com>

https://issues.apache.org/jira/browse/HADOOP-3307

On 4/29/08 9:25 AM, "Kayla Jay" wrote:

> Thanks.
> Do you have the jira issue number for that so that I can keep an eye
> out on it?
>
> Thanks.
>
>
> ----- Original Message ----
> From: Ted Dunning
> To: core-user@hadoop.apache.org
> Sent: Tuesday, April 29, 2008 12:07:32 PM
> Subject: Re: Map/Reduce with XML files ..
>
>
> Just adapt TextInput format so that it reads to the next file boundary
> instead of the next new line.
>
> There is also a jira out for file archiving that would do all of this (and
> more) for you. If you don't want to wait, then the mod to TIF is pretty
> easy.
>
>
> On 4/28/08 5:14 PM, "Kayla Jay" wrote:
>
>> Yes, I'm talking about a collection of small xml files stored in "container"
>> files. I.e there's a lot and lots of small xml files collected into big
>> files. Not one gargantuan XML file. How would you go about using hadoop with
>> splits and processing and handling these sorts of XML files?
>>
>>
>> ----- Original Message ----
>> From: Ted Dunning
>> To: core-user@hadoop.apache.org
>> Sent: Monday, April 28, 2008 4:16:20 PM
>> Subject: Re: Map/Reduce with XML files ..
>>
>>
>> The only real problem with xml and map-reduce is if you are talking about
>> one gargantuan XML file. That makes correct splitting difficult.
>>
>> If you are talking about millions or billions of small xml files (stored in
>> some sort of container file), then hadoop should be pretty easy to use.
>>
>>
>> On 4/28/08 9:39 AM, "Kayla Jay" wrote:
>>
>>> Hello
>>>
>>> Has anyone had any experience with processing xml files within Hadoop within
>>> their maps/reduces?
>>> In particular, has anyone used any sort of XQuery/XPath processing within
>>> their maps/reduces?
>>> Say I have XML string passed to the map and now I want to find something in
>>> particular via XQuery/XPath or some sort to run numbers on occurrences or
>>> parse out a particular section within the XML.
>>>
>>> Anyone done any XML processing looking for things within XML? Then, aggregate
>>> common pieces together in the reduces ?
>>>
>>>
>>> On another note,
>>> Has anyone figured out splits for XML files?
>>> Has anyone written a custom XML reader other than the StreamXmlRecordReader?
>>> The only one I've read about and can find anything is:
>>> http://www.nabble.com/map-reduce-function-on-xml-string-td15816818.html
>>>
>>>
>>> Thanks.
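
The approach discussed in this thread can be sketched roughly as follows: read a container file one small XML document at a time (up to a document boundary rather than the next newline), run an XPath-style lookup on each document in the map step, and sum the matches in the reduce step. This is only an illustration in plain Python, not Hadoop's API and not a modified TextInputFormat. The `</doc>` end tag, the `item` element, its `type` attribute, and the function names `read_records`, `map_record`, and `reduce_counts` are all assumptions made up for the example. It also assumes each end tag sits on its own line, and it ignores the question of documents that straddle a split boundary, which a real input format would have to handle.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical boundary: every small document in the container ends with this tag.
END_TAG = "</doc>"


def read_records(stream, end_tag=END_TAG):
    """Yield one complete XML document at a time from a text stream.

    Lines are accumulated until one contains end_tag, so a record ends at a
    document boundary instead of at every newline.
    """
    buf = []
    for line in stream:
        buf.append(line)
        if end_tag in line:
            yield "".join(buf)
            buf = []
    # Trailing text without an end tag is an incomplete record and is dropped.


def map_record(xml_text, path=".//item"):
    """Emit (key, 1) for every element matching the XPath-style path."""
    root = ET.fromstring(xml_text)
    for elem in root.findall(path):
        # The "type" attribute is a made-up grouping key for this sketch.
        yield elem.get("type", "unknown"), 1


def reduce_counts(pairs):
    """Sum the counts per key, as a reducer would."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return dict(totals)


# A tiny in-memory "container" holding two small XML documents.
container = io.StringIO(
    "<doc>\n  <item type='a'/>\n  <item type='b'/>\n</doc>\n"
    "<doc>\n  <item type='a'/>\n</doc>\n"
)

records = list(read_records(container))
pairs = [kv for rec in records for kv in map_record(rec)]
totals = reduce_counts(pairs)
print(len(records), totals)
```

Note that `xml.etree.ElementTree` supports only a limited subset of XPath and no XQuery, so anything beyond simple path lookups would need a fuller XML library.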