accumulo-user mailing list archives

From William Slacum <>
Subject Re: XML Storage - Accumulo or HDFS
Date Thu, 07 Jun 2012 02:19:56 GMT
If your XML documents are really just lists of elements/objects, and
what you want to run your analytics on are subsets of those elements
(even across XML documents), then it makes sense to take a document
store approach similar to what the Wikipedia example has done. This
allows you to index specific portions of elements, create graphs and
apply visibility labels to specific attributes in a given object tree.
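
To make the document-store idea concrete, here is a minimal sketch of how one XML record might be flattened into Accumulo-style entries. It does not use the Accumulo client API; it only models the key layout as plain tuples. The function names (`shred`, `index_entries`), the path-as-column-family layout, the `doc1#42` row id and the `PII` label are all assumptions made for illustration, not something taken from the Wikipedia example.

```python
import xml.etree.ElementTree as ET


def shred(doc_id, xml_text, visibility_for=None):
    """Flatten one XML record into (row, family, qualifier, visibility, value) tuples.

    Hypothetical layout: row = record id, family = element path,
    qualifier = "text" or "@attribute", visibility = label looked up by tag name.
    """
    visibility_for = visibility_for or {}
    entries = []

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        vis = visibility_for.get(elem.tag, "")
        # Each XML attribute becomes its own entry so it can carry its own label.
        for name, value in sorted(elem.attrib.items()):
            entries.append((doc_id, here, "@" + name, vis, value))
        text = (elem.text or "").strip()
        if text:
            entries.append((doc_id, here, "text", vis, text))
        for child in elem:
            walk(child, here)

    walk(ET.fromstring(xml_text), "")
    return entries


def index_entries(entries, indexed_paths):
    """Build inverted entries (value, path, record id, visibility) for chosen paths."""
    return [
        (value, family, row, vis)
        for row, family, _qualifier, vis, value in entries
        if family in indexed_paths
    ]


record = '<foo id="42"><name>alpha</name><ssn>123-45-6789</ssn></foo>'
entries = shred("doc1#42", record, {"ssn": "PII"})
index = index_entries(entries, {"/foo/name"})

for entry in entries + index:
    print(entry)
```

With a layout along these lines, a job that only needs `/foo/name` can fetch that column family rather than re-parsing the whole record, and the label on `/foo/ssn` restricts that one field without hiding the rest of the object.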

On Wed, Jun 6, 2012 at 10:06 PM, David Medinets
<> wrote:
> I can't think of any advantage to storing XML inside Accumulo. I am
> interested to learn some details about your view. Storing the
> extracted information and the location of the HDFS file that sourced
> the information does make sense to me. In fact, it might be useful to
> store file positions in Accumulo so it's easy to get back to specific
> spots in the XML file. That would help, for example, if you had an XML
> file with many records in it and there was no reason to immediately
> decompose each record.
> On Wed, Jun 6, 2012 at 9:57 PM, William Slacum <> wrote:
>> There are advantages to using Accumulo to store the contents of your
>> XML documents, depending on their structure and what you want to end
>> up taking out of them. Are you trying to emulate the document store
>> pattern that the Wikipedia example uses?
>> On Wed, Jun 6, 2012 at 4:20 PM, Perko, Ralph J <> wrote:
>>> Hi, I am working with large chunks of XML, anywhere from 1 – 50 GB each.
>>> I am running several different MapReduce jobs on the XML to pull out
>>> various pieces of data, do analytics, etc. I am using an XML input type
>>> based on the WikipediaInputFormat from the examples. What I have been
>>> doing is 1) loading the entire XML into HDFS as a single document
>>> 2) parsing the XML on some tag <foo> and storing each one of these
>>> instances as the content of a new row in Accumulo, using the name of the
>>> instance as the row id. I then run other MR jobs that scan this table,
>>> pull out and parse the XML and do whatever I need to do with the data.
>>> My question is: is there any advantage to storing the XML in Accumulo
>>> versus just leaving it in HDFS and parsing it from there, either as a
>>> large block of XML or as individual chunks, perhaps using Hadoop Archive
>>> to handle the small-file problem? The actual XML will not be queried in
>>> and of itself but is part of other analysis processes.
>>> Thanks,
>>> Ralph
>>> __________________________________________________
>>> Ralph Perko
>>> Pacific Northwest National Laboratory
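
The suggestion earlier in the thread, keeping the XML in HDFS and storing file positions so it is easy to get back to specific records, can be sketched roughly as follows. This is an illustration only: `record_offsets` and `read_record` are hypothetical helpers, the stream is an in-memory stand-in for an HDFS file, and the scan assumes that `<foo>` elements are not nested inside each other and are not self-closing. It also reads the whole stream at once, which would not be practical for a 50 GB file; a real job would scan in chunks.

```python
import io
import re


def record_offsets(stream, tag):
    """Return (start, length) byte ranges for each <tag>...</tag> in the stream.

    `tag` is a bytes object such as b"foo". Assumes non-nested, non-self-closing
    elements and a stream small enough to read in one call.
    """
    data = stream.read()
    name = re.escape(tag)
    # Match "<tag" followed by whitespace or ">", then lazily up to "</tag>".
    pattern = re.compile(rb"<" + name + rb"[\s>].*?</" + name + rb">", re.DOTALL)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


def read_record(stream, start, length):
    """Seek back to a stored position and return just that record's bytes."""
    stream.seek(start)
    return stream.read(length)


data = b'<root><foo id="1">a</foo><foo id="2">bb</foo></root>'
offsets = record_offsets(io.BytesIO(data), b"foo")

# The (start, length) pairs are what would be stored alongside the file path.
second = read_record(io.BytesIO(data), *offsets[1])
print(offsets, second)
```

Only the small (path, start, length) triples and any extracted fields would live in Accumulo under this approach; the record itself is decomposed later, and only if a job actually needs it.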
