accumulo-user mailing list archives

From "Perko, Ralph J" <>
Subject XML Storage - Accumulo or HDFS
Date Wed, 06 Jun 2012 20:20:39 GMT
Hi, I am working with large chunks of XML, anywhere from 1–50 GB each. I am running several
different MapReduce jobs on the XML to pull out various pieces of data, do analytics, etc.
I am using an XML input type based on the WikipediaInputFormat from the examples. What I
have been doing is 1) loading the entire XML into HDFS as a single document, and 2) parsing the
XML on some tag <foo> and storing each of these instances as the content of a new
row in Accumulo, using the name of the instance as the row id. I then run other MR jobs that
scan this table, pull out and parse the XML, and do whatever I need to do with the data.
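For reference, the split-on-a-tag step can be sketched with the JDK's built-in StAX parser. This is a minimal, hypothetical example, not the actual job: it assumes the split tag is <foo> with a "name" attribute serving as the row id, and it returns the (row id, content) pairs that, in the real pipeline, would become Mutations written to the Accumulo table (the streaming record-reader plumbing from WikipediaInputFormat is omitted):

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class XmlSplitter {

    // Split a document on each <foo> element; key each chunk's character
    // content by the element's "name" attribute (the would-be row id).
    public static Map<String, String> split(String xml) throws XMLStreamException {
        Map<String, String> rows = new LinkedHashMap<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        StringBuilder current = null;  // non-null while inside a <foo> element
        String rowId = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "foo".equals(reader.getLocalName())) {
                rowId = reader.getAttributeValue(null, "name");
                current = new StringBuilder();
            } else if (event == XMLStreamConstants.CHARACTERS && current != null) {
                current.append(reader.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "foo".equals(reader.getLocalName())) {
                rows.put(rowId, current.toString());
                current = null;
            }
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<root><foo name=\"a\">alpha</foo><foo name=\"b\">beta</foo></root>";
        System.out.println(split(doc));
    }
}
```

A streaming parser matters here because the input can be tens of gigabytes; StAX never materializes the whole document, only the current <foo> chunk.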

My question is: is there any advantage to storing the XML in Accumulo versus just leaving
it in HDFS and parsing it from there? Either as a large block of XML or as individual chunks,
perhaps using Hadoop Archive to handle the small-file problem? The actual XML will not be
queried in and of itself but is part of other analysis processes.


Ralph Perko
Pacific Northwest National Laboratory
