accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: XML Storage - Accumulo or HDFS
Date Thu, 07 Jun 2012 12:06:51 GMT
So, if your XML looks like the snippet you posted, it's extremely easy 
to fetch records based on the KEY_FIELD or TAG element. A (relatively) 
flat XML document is rather trivial to map into the wikipedia example. 
As I was saying previously, it gets trickier when you have a deep or a 
deep and wide structure.
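
To make the flat case concrete, here is a minimal sketch in Python. It is 
not the wikipedia example's actual schema: the "fields" column family, the 
flatten() helper and the sample values are all made up for illustration. 
It only shows how each child element of a RECORD could become one 
(row, column family, column qualifier, value) entry keyed by KEY_FIELD.

```python
import xml.etree.ElementTree as ET

# Sample data shaped like the snippet in this thread; the values are invented.
SAMPLE = """<RECORDS>
  <RECORD><KEY_FIELD>k1</KEY_FIELD><TAG>red</TAG></RECORD>
  <RECORD><KEY_FIELD>k2</KEY_FIELD><TAG>blue</TAG></RECORD>
</RECORDS>"""


def flatten(xml_text):
    """Yield one (row, family, qualifier, value) tuple per child element.

    The row is the record's KEY_FIELD; "fields" is a hypothetical column
    family name, not something taken from the wikipedia example.
    """
    root = ET.fromstring(xml_text)
    for record in root.iter("RECORD"):
        row = record.findtext("KEY_FIELD")
        for child in record:
            yield (row, "fields", child.tag, child.text or "")


entries = list(flatten(SAMPLE))
```

With a layout like this, fetching a record by KEY_FIELD is a lookup on the 
row, which is why the flat case is the easy one.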

If your requirement is to find the /n/ RECORDs before and after a given 
RECORD, then yes, the wiki example wouldn't make much sense; however, 
you could add an attribute to each RECORD to denote positional 
information in the original file which would alleviate this problem. 
From an application standpoint, it usually doesn't make sense to index 
documents purely off of their positional information in the source data 
(as you suggested using the byte file offsets) because that's not how 
you're going to want to query it in your application. I would assume 
you'd want to be querying off of KEY_FIELD or TAG.
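
A rough sketch of the positional-attribute idea, again in Python and 
in memory only. The "position" attribute name, the helper functions and 
the sample values are assumptions for illustration; nothing here talks to 
Accumulo. Records are looked up by KEY_FIELD, and the stored position is 
only used to find the /n/ neighbours on either side.

```python
import xml.etree.ElementTree as ET

# Five invented records in the same shape as the snippet in this thread.
SAMPLE = "<RECORDS>" + "".join(
    f"<RECORD><KEY_FIELD>{k}</KEY_FIELD><TAG/></RECORD>" for k in "abcde"
) + "</RECORDS>"


def load_with_positions(xml_text):
    """Parse the file and tag each RECORD with its ordinal position."""
    root = ET.fromstring(xml_text)
    records = []
    for position, record in enumerate(root.iter("RECORD")):
        record.set("position", str(position))  # hypothetical attribute name
        records.append(record)
    return records


def neighbors(records, key, n):
    """Return the record with this KEY_FIELD plus up to n records each side."""
    by_key = {r.findtext("KEY_FIELD"): int(r.get("position")) for r in records}
    pos = by_key[key]
    start = max(0, pos - n)  # clamp at the start of the file
    return records[start:pos + n + 1]


records = load_with_positions(SAMPLE)
window = [r.findtext("KEY_FIELD") for r in neighbors(records, "c", 1)]
```

The query still starts from KEY_FIELD; the position is just extra data 
carried along with each RECORD rather than the thing being indexed.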

- Josh

On 6/7/12 7:29 AM, David Medinets wrote:
> On Wed, Jun 6, 2012 at 10:50 PM, Josh Elser<josh.elser@gmail.com>  wrote:
>>   Aside from losing the hierarchy
>> knowledge, if you have a skewed distribution of elements in the XML
>> document, you can't get good locality in your query/analytic. What was your
>> idea behind storing the offsets?
> <RECORDS>
>   <RECORD>
>    <KEY_FIELD/>
>    <TAG/>
>   </RECORD>
>   <RECORD>
>    <KEY_FIELD/>
>    <TAG/>
>   </RECORD>
> </RECORDS>
>
> My XML looks like that. I don't know how the information in the XML
> will be used in the future and I don't want to re-scan large numbers
> of XML files to find a single record. For example, yesterday we found a
> potential bug. My bug analysis showed the source data was in record X
> of 450,000 records. Since I know which XML file held that record, I
> was able to get that file locally and use command-line tools to find
> surrounding information. My XML file might have 200 tags but normally
> I only need 45 of them. My XML is without hierarchy.
