accumulo-user mailing list archives

From: Josh Elser
Subject: Re: XML Storage - Accumulo or HDFS
Date: Thu, 07 Jun 2012 02:50:23 GMT
+1, Bill. Assuming you aren't doing anything crazy in your XML files, 
the wikipedia example should get you pretty far. That being said, the 
structure used in the wikipedia example doesn't handle large lists of 
elements -- short explanation: an attribute of a document is stored as 
one key-value pair, so if you have a lot of large lists, you inflate the 
key, which does bad things. That in mind, there are small changes you can 
make to the table structure to store those lists more efficiently and 
still maintain the semantic representation (Bill's graph comment).
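
To make the inflated-key point concrete, here is a minimal sketch in plain Python (no Accumulo client involved). It models a key as (row, column family, column qualifier) and compares packing a whole list into one key against writing one entry per list element. The layouts, the "attr" and "list:" family names, and the helper functions are all assumptions made for illustration; they are not the actual schema of the wikipedia example, and this is only one of several possible table changes.

```python
from typing import List, Tuple

# (row, column family, column qualifier, value): a simplified stand-in for an
# Accumulo key-value pair that ignores visibility and timestamp.
Entry = Tuple[str, str, str, bytes]


def inline_layout(doc_id: str, attr: str, items: List[str]) -> List[Entry]:
    """Hypothetical layout: the whole list is packed into a single key."""
    qualifier = attr + "\x00" + "\x1f".join(items)
    return [(doc_id, "attr", qualifier, b"")]


def per_element_layout(doc_id: str, attr: str, items: List[str]) -> List[Entry]:
    """Hypothetical layout: one entry per element, element stored in the value.

    The zero-padded index keeps the elements in list order when the keys
    are sorted lexicographically.
    """
    return [
        (doc_id, "list:" + attr, f"{i:08d}", item.encode("utf-8"))
        for i, item in enumerate(items)
    ]


def key_bytes(entry: Entry) -> int:
    """Size of the key portion (row + family + qualifier) in bytes."""
    row, family, qualifier, _value = entry
    return sum(len(part.encode("utf-8")) for part in (row, family, qualifier))


authors = [f"author-{i}" for i in range(10_000)]
inline = inline_layout("doc42", "authors", authors)
split = per_element_layout("doc42", "authors", authors)

print("inline:", len(inline), "entry, largest key =", max(map(key_bytes, inline)), "bytes")
print("split: ", len(split), "entries, largest key =", max(map(key_bytes, split)), "bytes")
```

In the per-element layout the key size no longer depends on the list length, and the document id stays in the row, so everything belonging to one document still sorts together.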

David, ignoring any issues of data locality of the blocks in your large 
XML files, storing byte offsets into a hierarchical data structure (XML) 
seems like a sub-optimal solution to me. Aside from losing the hierarchy 
knowledge, if you have a skewed distribution of elements in the XML 
document, you can't get good locality in your query/analytic. What was 
your idea behind storing the offsets?
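
For reference, the byte-offset idea under discussion might look roughly like the sketch below. It is an illustration under stated assumptions, not anyone's actual implementation: the function names are hypothetical, the records are assumed to be non-nested, non-self-closing `<foo ...>...</foo>` elements, and the data is held in memory, which would not work as-is for a 50 GB file.

```python
import io
import re
from typing import BinaryIO, List, Tuple


def index_record_offsets(data: bytes, tag: str) -> List[Tuple[int, int]]:
    """Return (offset, length) for each <tag ...>...</tag> record in data.

    Assumes records with this tag are not nested or self-closing and that
    the tag text does not appear inside comments or CDATA sections.
    """
    open_tag = re.compile(rb"<" + re.escape(tag.encode("utf-8")) + rb"[\s>]")
    close_tag = b"</" + tag.encode("utf-8") + b">"
    offsets = []
    pos = 0
    while True:
        match = open_tag.search(data, pos)
        if match is None:
            break
        end = data.find(close_tag, match.start())
        if end == -1:
            break  # unterminated record: stop indexing
        end += len(close_tag)
        offsets.append((match.start(), end - match.start()))
        pos = end
    return offsets


def read_record(stream: BinaryIO, offset: int, length: int) -> bytes:
    """Jump straight to a stored position and read one record."""
    stream.seek(offset)
    return stream.read(length)


data = b"<root><foo id='1'>a</foo><bar/><foo id='2'>bb</foo></root>"
offsets = index_record_offsets(data, "foo")
for offset, length in offsets:
    print(offset, length, read_record(io.BytesIO(data), offset, length))
```

Each (offset, length) pair could be stored as a value next to the source file path. Note that the pair says nothing about where the record sits in the XML hierarchy, which is the loss described above.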

- Josh

On 6/6/2012 10:19 PM, William Slacum wrote:
> If your XML documents are really just lists of elements/objects, and
> what you want to run your analytics on are subsets of those elements
> (even across XML documents), then it makes sense to take a document
> store approach similar to what the Wikipedia example has done. This
> allows you to index specific portions of elements, create graphs and
> apply visibility labels to specific attributes in a given object tree.
> On Wed, Jun 6, 2012 at 10:06 PM, David Medinets wrote:
>> I can't think of any advantage to storing XML inside Accumulo. I am
>> interested to learn some details about your view. Storing the
>> extracted information and the location of the HDFS file that sourced
>> the information does make sense to me. In fact, it might be useful to
>> store file positions in Accumulo so it's easy to get back to specific
>> spots in the XML file. That would help, for example, if you had an XML file with many
>> records in it and there was no reason to immediately decompose each
>> record.
>> On Wed, Jun 6, 2012 at 9:57 PM, William Slacum wrote:
>>> There are advantages to using Accumulo to store the contents of your
>>> XML documents, depending on their structure and what you want to end
>>> up taking out of them. Are you trying to emulate the document store
>>> pattern that the Wikipedia example uses?
>>> On Wed, Jun 6, 2012 at 4:20 PM, Perko, Ralph J wrote:
>>>> Hi,
>>>>
>>>> I am working with large chunks of XML, anywhere from 1 – 50 GB each.
>>>> I am running several different MapReduce jobs on the XML to pull out
>>>> various pieces of data, do analytics, etc. I am using an XML input
>>>> type based on the WikipediaInputFormat from the examples. What I have
>>>> been doing is 1) loading the entire XML into HDFS as a single document
>>>> and 2) parsing the XML on some tag <foo> and storing each one of these
>>>> instances as the content of a new row in Accumulo, using the name of
>>>> the instance as the row id. I then run other MR jobs that scan this
>>>> table, pull out and parse the XML and do whatever I need to do with
>>>> the data.
>>>>
>>>> My question is, is there any advantage to storing the XML in Accumulo
>>>> versus just leaving it in HDFS and parsing it from there? Either as a
>>>> large block of XML or as individual chunks, perhaps using Hadoop
>>>> Archive to handle the small-file problem? The actual XML will not be
>>>> queried in and of itself but is part of other analysis processes.
>>>>
>>>> Thanks,
>>>> Ralph
>>>> __________________________________________________
>>>> Ralph Perko
>>>> Pacific Northwest National Laboratory
