Subject: Re: Can I connect an InputStream to a Mutation value?
From: Adam Fuchs
To: user@accumulo.apache.org
Date: Tue, 19 Jun 2012 08:14:00 -0400

There's also the concern of elements of the document that are too large by themselves.
A general purpose streaming solution would include support for any kind of object passed in, not just XML with small elements. I think the fact that it is an XML document is probably a red herring in this case. In the past, we have solved this on the application side by breaking up large objects into chunks and then using a key structure that groups the chunks and maintains their order. This usually means appending a sequence number to the column qualifier using an integer encoding. The filedata example that Billie referred to does this. Accumulo would benefit from some sort of general purpose fragmentation solution for streaming large objects, and an InputStream/OutputStream solution might be good for that. Sounds like a fun project!

Adam

On Mon, Jun 18, 2012 at 2:06 PM, Marc P. wrote:
> I'm sorry, I must be missing something.
>
> Why does the schema matter? If you were to build keys from all
> attributes and elements, you could, at any point, rebuild the XML
> document. You could store the hierarchy by virtue of your keys.
>
> If you were to do that, the previous suggestions would be applicable.
> Realistically, if you stored the entire XML file in a given
> key/value pair, your heap objects would be created upon Thrift reception
> (at the client); therefore, streaming would only add complexity and
> additional memory overhead. It wouldn't give you what you want.
>
> Splitting the file among keys can maintain hierarchy, allow you to
> rebuild the XML doc, and store large records in the value.
>
> On Mon, Jun 18, 2012 at 2:00 PM, David Medinets wrote:
> > Thanks for the offer. I'm thinking of a situation where I don't know the
> > schema ahead of time. For example, a JMS queue where I simply want to
> > store the XML somewhere and let some other program parse it. This is
> > a thought experiment.
> >
> > On Sun, Jun 17, 2012 at 1:06 PM, Jim Klucar wrote:
> > > David,
> > >
> > > Can you give a taste of the schema of the XML?
> > > With that we may be
> > > able to help break the XML file up into keys and help create an index
> > > for it. IMHO that's the power you would get from Accumulo. If you just
> > > want it as one big lump, and don't need to search it or only retrieve
> > > portions of the file, then putting it in Accumulo is just adding
> > > overhead on top of HDFS.
> > >
> > > Sent from my iPhone
> > >
> > > On Jun 17, 2012, at 9:54 AM, David Medinets wrote:
> > >
> > > > Some of the XML records that I work with are over 50M. I was hoping to
> > > > store them inside of Accumulo instead of the text-based HDFS XML super
> > > > file currently being used. However, since they are so large, I can't
> > > > create a Value object without running out of memory. Storing values
> > > > this large may simply be using the wrong tool; please let me know.
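[Editor's note: a minimal sketch of the chunking scheme Adam describes above. The class and helper names are illustrative assumptions, not the filedata example's actual code, and the Accumulo `mutation.put(...)` call is only referenced in a comment so the sketch stays self-contained.]

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class ChunkWriter {
    // Tiny chunk size for illustration; a real ingester would use something
    // closer to 64KB-1MB so each Value stays comfortably in memory.
    static final int CHUNK_SIZE = 4;

    // Encode the sequence number as 4 big-endian bytes so chunk keys sort
    // in write order when qualifiers are compared lexicographically,
    // which is how Accumulo orders keys.
    static byte[] seqQualifier(int seq) {
        return new byte[] {
            (byte) (seq >>> 24), (byte) (seq >>> 16),
            (byte) (seq >>> 8), (byte) seq
        };
    }

    // Split a stream into fixed-size chunks. Each returned element stands in
    // for one Mutation's value; real code would do something like
    // mutation.put(family, seqQualifier(i), new Value(chunks.get(i))).
    static List<byte[]> chunk(InputStream in) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        byte[] buf = new byte[CHUNK_SIZE];
        int n;
        while ((n = in.read(buf)) != -1) {
            byte[] c = new byte[n];
            System.arraycopy(buf, 0, c, 0, n);
            chunks.add(c);
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<root><a>1</a></root>".getBytes();
        List<byte[]> chunks = chunk(new ByteArrayInputStream(doc));
        System.out.println(chunks.size());
    }
}
```

Because the big-endian qualifier bytes sort the same way as the integers they encode, a scan over the row returns the chunks in write order, so a consumer never has to hold the whole object in memory at once.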

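[Editor's note: on the read side, reassembly is just concatenation in key order. This sketch simulates a Scanner's sorted iteration with a TreeMap over raw qualifier bytes; the chunk data and class name are made up for illustration.]

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.TreeMap;

public class ChunkReassembler {
    // Lexicographic comparison on unsigned bytes, mirroring how Accumulo
    // orders column qualifiers during a scan.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Concatenate chunk values in qualifier order to rebuild the original bytes.
    static byte[] reassemble(TreeMap<byte[], byte[]> chunks) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks.values()) {
            out.write(chunk);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Inserted out of order on purpose; the big-endian sequence
        // qualifiers restore write order.
        TreeMap<byte[], byte[]> chunks = new TreeMap<>(ChunkReassembler::compare);
        chunks.put(new byte[] {0, 0, 0, 1}, "world".getBytes());
        chunks.put(new byte[] {0, 0, 0, 0}, "hello ".getBytes());
        System.out.println(new String(reassemble(chunks)));
    }
}
```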