Subject: Re: Can I connect an InputStream to a Mutation value?
From: Adam Fuchs
To: user@accumulo.apache.org
Date: Tue, 19 Jun 2012 08:14:00 -0400

There's also the concern of elements of the document that are too large by themselves.
A general purpose streaming solution would include support for any kind of object passed in, not just XML with small elements. I think the fact that it is an XML document is probably a red herring in this case. In the past, we have solved this on the application side by breaking up large objects into chunks and then using a key structure that groups the chunks and maintains their order. This usually means appending a sequence number to the column qualifier using an integer encoding. The filedata example that Billie referred to does this. Accumulo would benefit from some sort of general purpose fragmentation solution for streaming large objects, and an InputStream/OutputStream solution might be good for that. Sounds like a fun project!

Adam

On Mon, Jun 18, 2012 at 2:06 PM, Marc P. wrote:
> I'm sorry, I must be missing something.
>
> Why does the schema matter? If you were to build keys from all
> attributes and elements, you could, at any point, rebuild the XML
> document. You could store the hierarchy by virtue of your keys.
>
> If you were to do that, the previous suggestions would be applicable.
> Realistically, if you stored the entire XML file in a given
> key/value pair, your heap objects would be created upon Thrift reception
> (at the client); therefore, streaming would only add complexity and
> additional memory overhead. It wouldn't give you what you want.
>
> Splitting the file among keys can maintain hierarchy, allow you to
> rebuild the XML doc, and store large records in the value.
>
> On Mon, Jun 18, 2012 at 2:00 PM, David Medinets wrote:
> > Thanks for the offer. I'm thinking of a situation where I don't know the
> > schema ahead of time. For example, a JMS queue where I simply want to
> > store the XML somewhere and let some other program parse it. This is
> > a thought experiment.
> >
> > On Sun, Jun 17, 2012 at 1:06 PM, Jim Klucar wrote:
> > > David,
> > >
> > > Can you give a taste of the schema of the XML?
> > > With that we may be
> > > able to help break the XML file up into keys and help create an index
> > > for it. IMHO that's the power you would get from Accumulo. If you just
> > > want it as one big lump, and don't need to search it or only retrieve
> > > portions of the file, then putting it in Accumulo is just adding
> > > overhead on top of HDFS.
> > >
> > > Sent from my iPhone
> > >
> > > On Jun 17, 2012, at 9:54 AM, David Medinets wrote:
> > >
> > > > Some of the XML records that I work with are over 50M. I was hoping to
> > > > store them inside of Accumulo instead of the text-based HDFS XML super
> > > > file currently being used. However, since they are so large, I can't
> > > > create a Value object without running out of memory. Storing values
> > > > this large may simply be using the wrong tool; please let me know.
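[Editor's note: a minimal sketch of the chunking scheme Adam describes above. The class and helper names are illustrative assumptions, not the filedata example's actual code, and the Accumulo `mutation.put(...)` call is only referenced in a comment so the sketch stays self-contained.]

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class ChunkWriter {
    // Tiny chunk size for illustration; a real ingester would use something
    // closer to 64KB-1MB so each Value stays comfortably in memory.
    static final int CHUNK_SIZE = 4;

    // Encode the sequence number as 4 big-endian bytes so chunk keys sort
    // in write order when qualifiers are compared lexicographically,
    // which is how Accumulo orders keys.
    static byte[] seqQualifier(int seq) {
        return new byte[] {
            (byte) (seq >>> 24), (byte) (seq >>> 16),
            (byte) (seq >>> 8), (byte) seq
        };
    }

    // Split a stream into fixed-size chunks. Each returned element stands in
    // for one Mutation's value; real code would do something like
    // mutation.put(family, seqQualifier(i), new Value(chunks.get(i))).
    static List<byte[]> chunk(InputStream in) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        byte[] buf = new byte[CHUNK_SIZE];
        int n;
        while ((n = in.read(buf)) != -1) {
            byte[] c = new byte[n];
            System.arraycopy(buf, 0, c, 0, n);
            chunks.add(c);
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<root><a>1</a></root>".getBytes();
        List<byte[]> chunks = chunk(new ByteArrayInputStream(doc));
        System.out.println(chunks.size());
    }
}
```

Because the big-endian qualifier bytes sort the same way as the integers they encode, a scan over the row returns the chunks in write order, so a consumer never has to hold the whole object in memory at once.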

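[Editor's note: on the read side, reassembly is just concatenation in key order. This sketch simulates a Scanner's sorted iteration with a TreeMap over raw qualifier bytes; the chunk data and class name are made up for illustration.]

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.TreeMap;

public class ChunkReassembler {
    // Lexicographic comparison on unsigned bytes, mirroring how Accumulo
    // orders column qualifiers during a scan.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Concatenate chunk values in qualifier order to rebuild the original bytes.
    static byte[] reassemble(TreeMap<byte[], byte[]> chunks) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks.values()) {
            out.write(chunk);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Inserted out of order on purpose; the big-endian sequence
        // qualifiers restore write order.
        TreeMap<byte[], byte[]> chunks = new TreeMap<>(ChunkReassembler::compare);
        chunks.put(new byte[] {0, 0, 0, 1}, "world".getBytes());
        chunks.put(new byte[] {0, 0, 0, 0}, "hello ".getBytes());
        System.out.println(new String(reassemble(chunks)));
    }
}
```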