hadoop-common-user mailing list archives

From Brian Vargas <br...@ardvaark.net>
Subject Re: Working with XML / XQuery in hadoop
Date Mon, 23 Jun 2008 20:22:32 GMT

When I first started playing with Hadoop, I created an InputFormat and
RecordReader that, given an XML file, created a series of key-value
pairs where the XPath of the node in the document was the key and the
value of the node (if it had one) was the value.  At the time, it seemed
like a good idea, but it turned out to be horribly slow because of the
insane number of keys that were created.  It also sucked to code against.
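
To make the key explosion concrete, here is a minimal sketch of the XPath-as-key
idea in plain Java. It is not the original InputFormat/RecordReader code; the
class name XPathFlattener and the positional key format (tag[n]) are assumptions
made up for illustration, and it uses only the JDK's DOM parser.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not the original RecordReader: flattens an XML
// document into (XPath, text) pairs to show how fast the key count grows.
class XPathFlattener {

    static List<String[]> flatten(byte[] xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml))
                    .getDocumentElement();
            List<String[]> pairs = new ArrayList<>();
            walk(root, "/" + root.getTagName() + "[1]", pairs);
            return pairs;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    private static void walk(Element el, String path, List<String[]> out) {
        NodeList children = el.getChildNodes();

        // The element's own value: its direct text, ignoring child elements.
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.TEXT_NODE
                    || n.getNodeType() == Node.CDATA_SECTION_NODE) {
                text.append(n.getNodeValue());
            }
        }
        // One pair per element; the value is empty when there is no text.
        out.add(new String[] { path, text.toString().trim() });

        // Recurse, numbering same-named siblings so every key is unique.
        Map<String, Integer> seen = new HashMap<>();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                String tag = ((Element) n).getTagName();
                int index = seen.merge(tag, 1, Integer::sum);
                walk((Element) n, path + "/" + tag + "[" + index + "]", out);
            }
        }
    }
}
```

Every element becomes its own record, so even a tiny document with two
<entry> elements holding a <ts> and a <msg> each already yields seven pairs.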

It turned out to be way faster, and way easier to code, to just pass in
the names of the files to be loaded and run each one through your favorite
XML parser inside the map function.  Alternatively, if the files are small
enough, you could load the XML bytes into a sequence file, and then just
read them out as BytesWritable - again, into your favorite parser.  (In
fact, if you're dealing with XML files smaller than the HDFS block size,
that's probably the better way to do it.)
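
As a rough sketch of the "bytes into your favorite parser" step, the helper
below runs an XPath expression over one whole XML document held in a byte
buffer, using only the JDK's javax.xml.xpath classes. The class and method
names (XmlBytesQuery, select) are hypothetical, and the Hadoop wiring around
it (the Mapper and the sequence file reader) is deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper meant to be called from inside a map function.
class XmlBytesQuery {

    // Evaluates expr against the first `length` bytes of buf and returns
    // the text content of every matching node.
    static List<String> select(byte[] buf, int length, String expr) {
        try {
            // Only read the valid prefix; the buffer may be padded.
            InputSource doc =
                    new InputSource(new ByteArrayInputStream(buf, 0, length));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                out.add(hits.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("XPath query failed: " + expr, e);
        }
    }
}
```

With a BytesWritable value, the call would look roughly like
select(value.getBytes(), value.getLength(), "/log/entry/msg") on Hadoop
releases that have those accessors. The explicit length matters because the
backing array can be larger than the valid data. The map function would then
emit one record per match rather than one per node in the document.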


Kayla Jay wrote:
| Hi
| Just wondering if anyone out there works with and manipulates and
| stores XML data using Hadoop?  I've seen some threads about XML
| RecordReaders and people who use that XML StreamXmlRecordReader to do
| splits.  But, has anyone implemented a query framework that will use
| the hadoop layer to query against the XML in their map/reduce jobs?
| I want to know if anyone has done an XQuery or XPath executed within
| a hadoop job to find something within the XML stored in hadoop?
| I can't find any samples or anyone else out there who uses XML data
| vs. traditional log text data.
| Are there any use cases of using hadoop to work with XML and then do
| queries against XML in a distributed manner using hadoop?
| Thanks.

