accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Best practices in sizing values?
Date Mon, 10 Jun 2013 00:45:15 GMT
One thing I wanted to add is that you will likely fare quite well
storing your very large files as a linked list of chunks (multiple
key-value pairs making up one of your large blobs of text). You can
even use that segmentation to seek more efficiently within the file, if
applicable to your application.
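
As a rough sketch of what I mean (the column family, chunk size, and
helper are all hypothetical; a 1.5-era client API is assumed):

    import java.util.Arrays;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.MutationsRejectedException;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    // Hypothetical helper: write one large document as many key-value
    // pairs under a single row, one chunk per column qualifier.
    static void writeChunked(BatchWriter writer, String docId, byte[] data)
        throws MutationsRejectedException {
      final int chunkSize = 64 * 1024; // arbitrary, for illustration
      Text cf = new Text("chunk");
      for (int off = 0, seq = 0; off < data.length; off += chunkSize, seq++) {
        int end = Math.min(off + chunkSize, data.length);
        Mutation m = new Mutation(new Text(docId));
        // A zero-padded sequence number keeps chunks sorted in order, so
        // a scanner can seek straight to a given offset in the document.
        m.put(cf, new Text(String.format("%09d", seq)),
            new Value(Arrays.copyOfRange(data, off, end)));
        writer.addMutation(m);
      }
    }

Reading the document back is just a scan over the row, concatenating
the chunk values in qualifier order; seeking to a byte offset means
starting the scan at the qualifier for offset / chunkSize.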

I personally don't like the idea of storing HDFS URIs in Accumulo. If
you think about what Accumulo is providing you, one of the things it's
great at is abstracting away the notion of the underlying filesystem.
Just a thought.

On 06/09/2013 08:21 PM, Frank Smith wrote:
> So, what are your thoughts on storing a bunch of small files on the
> HDFS?  Sequence Files, Avro?
>
> I will note that these are essentially write once and read heavy chunks
> of text.
>
>  > Date: Sun, 9 Jun 2013 17:08:42 -0400
>  > Subject: Re: Best practices in sizing values?
>  > From: ctubbsii@apache.org
>  > To: user@accumulo.apache.org
>  >
>  > At the very least, I would keep it under the size of your compressed
>  > data blocks in your RFiles (this may mean you should increase the
>  > value of table.file.compress.blocksize to be larger than the default
>  > of 100K).
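
A minimal sketch of raising that block size for one table via the Java
client API; the Connector and the table name "mytable" are hypothetical:

    import org.apache.accumulo.core.client.AccumuloException;
    import org.apache.accumulo.core.client.AccumuloSecurityException;
    import org.apache.accumulo.core.client.Connector;

    // Hypothetical: let ~1MB values fit in a single compressed RFile
    // block by raising the per-table setting above its 100K default.
    static void raiseBlockSize(Connector connector)
        throws AccumuloException, AccumuloSecurityException {
      connector.tableOperations().setProperty("mytable",
          "table.file.compress.blocksize", "1M");
    }

The same property can also be set from the Accumulo shell.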
>  >
>  > You could also tweak this according to your application. Say, for
>  > example, you wanted to limit the additional work of resolving the
>  > pointer and retrieving from HDFS to only 5% of reads: you could
>  > sample your data and choose a cutoff value that keeps 95% of your
>  > data in the Accumulo table.
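
A minimal sketch of that sampling step (the helper and the 95% figure
are just for illustration):

    import java.util.Arrays;

    // Hypothetical: given a sample of value sizes in bytes, pick the
    // cutoff that keeps the desired fraction of values in the table.
    static long inlineCutoff(long[] sampledSizes, double keepFraction) {
      long[] sorted = sampledSizes.clone();
      Arrays.sort(sorted);
      int idx = (int) Math.ceil(keepFraction * sorted.length) - 1;
      return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }

Values at or under inlineCutoff(sizes, 0.95) would stay in the table;
anything larger would be written to HDFS with a pointer in the value.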
>  >
>  > Personally, I like to keep things under 1MB in the value, and under 1K
>  > in the key, as a crude rule of thumb, but it very much depends on the
>  > application.
>  >
>  > --
>  > Christopher L Tubbs II
>  > http://gravatar.com/ctubbsii
>  >
>  >
>  > On Sun, Jun 9, 2013 at 4:37 PM, Frank Smith
>  > <francis.h.smith@outlook.com> wrote:
>  > > I have an application where I have a block of unstructured text.
>  > > Normally that text is relatively small (under 500 KB), but there
>  > > are conditions where it can be up to GBs of text.
>  > >
>  > > I was considering using a threshold where I simply switch from
>  > > storing the text in the value of my mutation to just adding a
>  > > reference to the HDFS location, but I wanted to get some advice on
>  > > where that threshold should (best practice) or must (system
>  > > limitation) be?
>  > >
>  > > Also, can I stream data into a value, instead of passing a byte
>  > > array? Similar to how CLOBs and BLOBs are handled in an RDBMS.
>  > >
>  > > Thanks,
>  > >
>  > > Frank
