accumulo-user mailing list archives

From Josh Elser <>
Subject Re: XML Storage - Accumulo or HDFS
Date Fri, 08 Jun 2012 01:57:39 GMT
My "inflated key" comment, I'll pull from Eric Newton's comment on the 
"Table design" thread:

"Accumulo will accommodate keys that are very large (like 100K) but I 
don't recommend it. It makes indexes big and slows down just about every 
operation."

As applied to your example, you might generate the following keys if you 
took the wikisearch approach:

# Represent your document as such: the row "4" being an arbitrary 
bucket, and the CF "1234abcd" being some unique identifier for your 
document (a hash of <book> for example)

4   1234abcd:title\x00basket weaving
4   1234abcd:author\x00bob
4   1234abcd:toc\x00stuff
4   1234abcd:citation\x00another book

# Then some indices inside the same row (bucket), creating an 
in-partition index over the fields of your data. You could also shove 
the tokenized content from your chapters in here.
4   fi\x00title:basket weaving\x001234abcd
4   fi\x00author:bob\x001234abcd
4   fi\x00toc:stuff\x001234abcd
4   fi\x00citation:another book\x001234abcd

# For those big chapters, store them off to the side, perhaps in their 
own locality group. This keeps that data in separate files.
4 chapters:1234abcd\x001    Value:byte[chapter one data]
4 chapters:1234abcd\x002    Value:byte[chapter two data]

# Then perhaps some records pointing to data you expect users to query 
on in a separate table (inverted index)
basket weaving    title:4\x001234abcd
bob    author:4\x001234abcd
another book    citation:4\x001234abcd
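The key layouts above can be sketched in a few lines of Python. This is a toy illustration, not the actual wikisearch code; the bucket "4" and document id "1234abcd" are the same placeholders used above, and the keys are modeled as plain (row, column family, column qualifier) tuples rather than real Accumulo Mutations:

```python
import xml.etree.ElementTree as ET

NULL = "\x00"

def book_keys(xml_text, bucket, doc_id):
    """Generate (row, col family, col qualifier) triples in the style above:
    document keys, the in-partition 'fi' index, and chapter keys."""
    book = ET.fromstring(xml_text)
    keys = []
    for field in ("title", "author", "toc", "citation"):
        value = book.findtext(field)
        if value is None:
            continue
        # document key: row=bucket, cf=doc id, cq=field\x00value
        keys.append((bucket, doc_id, field + NULL + value))
        # in-partition index: cf=fi\x00field, cq=value\x00doc id
        keys.append((bucket, "fi" + NULL + field, value + NULL + doc_id))
    # big chapters off to the side, under their own column family
    for chapter in book.findall("chapter"):
        num = chapter.get("number", "")
        keys.append((bucket, "chapters", doc_id + NULL + num))
    return keys

def inverted_index_keys(xml_text, bucket, doc_id):
    """Records for the separate inverted-index table:
    row=value, cf=field, cq=bucket\x00doc id."""
    book = ET.fromstring(xml_text)
    keys = []
    for field in ("title", "author", "citation"):
        value = book.findtext(field)
        if value is not None:
            keys.append((value, field, bucket + NULL + doc_id))
    return keys

XML = """<book>
  <title>basket weaving</title>
  <author>bob</author>
  <chapter number="1">lots of text here</chapter>
  <chapter number="2">even more text here</chapter>
  <citation>another book</citation>
</book>"""

for k in book_keys(XML, "4", "1234abcd"):
    print(k)
for k in inverted_index_keys(XML, "4", "1234abcd"):
    print(k)
```

In a real ingest job each tuple would become a Mutation written through a BatchWriter, with the chapter data carried in the Value rather than the key, which is the whole point of avoiding inflated keys.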

- Josh

On 6/7/2012 10:48 AM, Perko, Ralph J wrote:
> My use-case is very similar to the Wikipedia example. I'm not sure what
> you mean by the inflated key.  Can you expand on that?  I am not really
> pulling out individual elements/attributes to simply store them apart from
> the XML.  Any element I pull out is part of a larger analytic process and
> it is this result I store.  I am doing some graph work based on
> relationships between elements.
> Example:
> <books>
>    <book>
>      <title>basket weaving</title>
>      <author>bob</author>
>      <toc>…</toc>
>      <chapter number="1">lots of text here</chapter>
>      <chapter number="2">even more text here</chapter>
>      <citation>another book</citation>
>    </book>
> </books>
> Each "book" is a record.  The book title is the row id.  The content is
> the XML<book>..</book>
> My table then has other columns such as "word count" or "character count"
> stored in the table.
> Table example:
> Row: basket weaving
> Col family: content
> Col qual: xml
> Value:<book>…</book>
> Row: basket weaving
> Col family: metrics
> Col qual: word count
> Value: 12345
> Row: basket weaving
> Col family:cites
> Col qual: another book
> Value: -- nothing meaningful
> Row: another book
> Col family:cited by
> Col qual: basket weaving
> Value: -- nothing meaningful
> I use the "cites" and "cited by" qualifiers for graphs.
> On 6/6/12 7:50 PM, "Josh Elser"<>  wrote:
>> +1, Bill. Assuming you aren't doing anything crazy in your XML files,
>> the wikipedia example should get you pretty far. That being said, the
>> structure used in the wikipedia example doesn't handle large lists of
>> elements -- short explanation: an attribute of a document is stored as
>> one key-value pair, so if you have lots of large lists, you inflate the
>> key which does bad things. That in mind, there are small changes you can
>> make to the table structure to store those lists more efficiently and
>> still maintain the semantic representation (Bill's graph comment).
>> David, ignoring any issues of data locality of the blocks in your large
>> XML files, storing byte offsets into a hierarchical data structure (XML)
>> seems like a sub-optimal solution to me. Aside from losing the hierarchy
>> knowledge, if you have a skewed distribution of elements in the XML
>> document, you can't get good locality in your query/analytic. What was
>> your idea behind storing the offsets?
>> - Josh
>> On 6/6/2012 10:19 PM, William Slacum wrote:
>>> If your XML documents are really just lists of elements/objects, and
>>> what you want to run your analytics on are subsets of those elements
>>> (even across XML documents), then it makes sense to take a document
>>> store approach similar to what the Wikipedia example has done. This
>>> allows you to index specific portions of elements, create graphs and
>>> apply visibility labels to specific attributes in a given object tree.
>>> On Wed, Jun 6, 2012 at 10:06 PM, David Medinets
>>> <>   wrote:
>>>> I can't think of any advantage to storing XML inside Accumulo. I am
>>>> interested to learn some details about your view. Storing the
>>>> extracted information and the location of the HDFS file that sourced
>>>> the information does make sense to me. In fact, it might be useful to
>>>> store file positions in Accumulo so it's easy to get back to specific
>>>> spots in the XML file. For example, if you had an XML file with many
>>>> records in it and there was no reason to immediately decompose each
>>>> record.
>>>> On Wed, Jun 6, 2012 at 9:57 PM, William Slacum<>
>>>> wrote:
>>>>> There are advantages to using Accumulo to store the contents of your
>>>>> XML documents, depending on their structure and what you want to end
>>>>> up taking out of them. Are you trying to emulate the document store
>>>>> pattern that the Wikipedia example uses?
>>>>> On Wed, Jun 6, 2012 at 4:20 PM, Perko, Ralph J<>
>>>>> wrote:
>>>>>> Hi,  I am working with large chunks of XML, anywhere from 1 - 50
>>>>>> each.  I am running several different MapReduce jobs on the XML to
>>>>>> pull out various pieces of data, do analytics, etc.  I am using an
>>>>>> XML input type based on the WikipediaInputFormat from the examples.
>>>>>> What I have been doing is 1) loading the entire XML into HDFS as a
>>>>>> single document 2) parsing the XML on some tag <foo> and storing
>>>>>> each of these instances as the content of a new row in Accumulo, using
>>>>>> the name of the instance as the row id.  I then run other MR jobs
>>>>>> that scan this table, pull out and parse the XML and do whatever I
>>>>>> need to do with the data.
>>>>>> My question is, is there any advantage to storing the XML in
>>>>>> Accumulo versus just leaving it in HDFS and parsing it from there?
>>>>>> Either as a large block of XML or as individual chunks, perhaps
>>>>>> using Hadoop Archive to handle the small-file problem?  The actual
>>>>>> XML will not be queried in and of itself but is part of other analysis
>>>>>> processes.
>>>>>> Thanks,
>>>>>> Ralph
>>>>>> __________________________________________________
>>>>>> Ralph Perko
>>>>>> Pacific Northwest National Laboratory
