accumulo-user mailing list archives

From Billie Rinaldi <billie.rina...@gmail.com>
Subject Re: Best practices in sizing values?
Date Mon, 10 Jun 2013 01:45:49 GMT
See also the filedata example, which splits a file into chunks in much the
same way Josh describes.
http://accumulo.apache.org/1.5/examples/filedata.html
There is more information about the table structure for this example under
Data Table in the dirlist example.
http://accumulo.apache.org/1.5/examples/dirlist.html


On Sun, Jun 9, 2013 at 6:33 PM, Josh Elser <josh.elser@gmail.com> wrote:

> You would likely want to keep some common prefix in the key. This would
> make seeking to an arbitrary point in the file easier.
>
> e.g.
>
> doc1 data:0000001 [] _bytes_
> doc1 data:0000002 [] _bytes_
> doc1 data:0000003 [] _bytes_
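>
> A minimal sketch of writing a file that way with the 1.5 BatchWriter API
> might look like the following (the "docs" table name and 64K chunk size
> are illustrative):
>
> import java.io.InputStream;
> import org.apache.accumulo.core.client.BatchWriter;
> import org.apache.accumulo.core.client.BatchWriterConfig;
> import org.apache.accumulo.core.client.Connector;
> import org.apache.accumulo.core.data.Mutation;
> import org.apache.accumulo.core.data.Value;
>
> public class ChunkedWriter {
>   static final int CHUNK_SIZE = 64 * 1024;  // illustrative
>
>   static void writeChunks(Connector conn, String docId, InputStream in)
>       throws Exception {
>     BatchWriter bw = conn.createBatchWriter("docs", new BatchWriterConfig());
>     byte[] buf = new byte[CHUNK_SIZE];
>     int n, chunk = 0;
>     while ((n = in.read(buf)) > 0) {
>       byte[] copy = new byte[n];  // read() may return less than CHUNK_SIZE
>       System.arraycopy(buf, 0, copy, 0, n);
>       Mutation m = new Mutation(docId);
>       // zero-padded counter so lexicographic order matches numeric order
>       m.put("data", String.format("%07d", chunk++), new Value(copy));
>       bw.addMutation(m);
>     }
>     bw.close();
>   }
> }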
>
> As far as chunk size, Christopher's advice is probably better than
> anything I could provide without direct experimentation with the HDFS block
> size, Accumulo table.file.compress.blocksize, and size of each Value. The
> best choice for you likely depends on your usage patterns.
>
> You could even store additional metadata for each "document" you store,
> such as chunk size, number of chunks, etc. There's a lot of room in how
> you approach this, given the flexibility Accumulo gives you in the
> columns you can use.
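>
> Continuing the sketch above, a per-document metadata entry might look
> like this (the column names are made up for illustration):
>
> Mutation meta = new Mutation("doc1");
> meta.put("meta", "chunkSize", new Value("65536".getBytes()));
> meta.put("meta", "numChunks", new Value("42".getBytes()));
> bw.addMutation(meta);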
>
>
> On 06/09/2013 08:56 PM, Frank Smith wrote:
>
>> Josh,
>>
>> That is an interesting idea.  Would you link them through the keys, or
>> append the key to the end of the value of the previous part?
>>
>> You have thoughts on how big the chunks should be?
>>
>> I definitely agree that it would be better to keep the data in Accumulo,
>> vice references to the HDFS.  Accumulo already gives me a scheme for
>> organizing files very effectively on HDFS, so rolling my own doesn't
>> make sense, unless I don't have a good sense of the limitations of a
>> tablet server managing those large files.
>>
>> Thanks,
>>
>> Frank
>>
>>  > Date: Sun, 9 Jun 2013 20:45:15 -0400
>>  > From: josh.elser@gmail.com
>>  > To: user@accumulo.apache.org
>>  > Subject: Re: Best practices in sizing values?
>>  >
>>  > One thing I wanted to add is that you will likely fare quite well
>>  > storing your very large files as a linked-list of bytes (multiple
>>  > key-value pairs make up one of your large blobs of text). You can even
>>  > use your segmentation of the large chunks of text to do more efficient
>>  > seek'ing within the file, if applicable to your application.
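>>  >
>>  > A sketch of that kind of seek, assuming fixed-size chunks stored
>>  > under zero-padded qualifiers as in the example above (the names are
>>  > illustrative):
>>  >
>>  > import java.util.Map;
>>  > import org.apache.accumulo.core.client.Scanner;
>>  > import org.apache.accumulo.core.data.Key;
>>  > import org.apache.accumulo.core.data.Range;
>>  > import org.apache.accumulo.core.data.Value;
>>  > import org.apache.accumulo.core.security.Authorizations;
>>  > import org.apache.hadoop.io.Text;
>>  >
>>  > long offset = 10L * 1024 * 1024;  // byte offset we want to reach
>>  > String qual = String.format("%07d", offset / CHUNK_SIZE);
>>  > Scanner s = conn.createScanner("docs", new Authorizations());
>>  > // start at the chunk containing the offset; "data\0" sorts just past
>>  > // the data family, so the range stays within this document's row
>>  > s.setRange(new Range(
>>  >     new Key(new Text("doc1"), new Text("data"), new Text(qual)),
>>  >     new Key(new Text("doc1"), new Text("data\0"))));
>>  > for (Map.Entry<Key,Value> e : s) {
>>  >   byte[] chunk = e.getValue().get();  // bytes from that chunk onward
>>  > }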
>>  >
>>  > I personally don't like the idea of storing HDFS URIs in
>>  > Accumulo. If you think about what Accumulo is providing you, one of the
>>  > things it's great at is abstracting away the notion of that underlying
>>  > filesystem. Just a thought.
>>  >
>>  > On 06/09/2013 08:21 PM, Frank Smith wrote:
>>  > > So, what are your thoughts on storing a bunch of small files on the
>>  > > HDFS? Sequence Files, Avro?
>>  > >
>>  > > I will note that these are essentially write-once, read-heavy
>>  > > chunks of text.
>>  > >
>>  > > > Date: Sun, 9 Jun 2013 17:08:42 -0400
>>  > > > Subject: Re: Best practices in sizing values?
>>  > > > From: ctubbsii@apache.org
>>  > > > To: user@accumulo.apache.org
>>  > > >
>>  > > > At the very least, I would keep it under the size of your
>>  > > > compressed data blocks in your RFiles (this may mean you should
>>  > > > increase the value of table.file.compress.blocksize to be larger
>>  > > > than the default of 100K).
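>>  > > >
>>  > > > A sketch of raising that property through the Java API (the table
>>  > > > name and 1M value are illustrative):
>>  > > >
>>  > > > conn.tableOperations().setProperty("docs",
>>  > > >     "table.file.compress.blocksize", "1M");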
>>  > > >
>>  > > > You could also tweak this according to your application. Say, for
>>  > > > example, you wanted to limit the additional work of resolving the
>>  > > > pointer and retrieving from HDFS to only 5% of reads: you could
>>  > > > sample your data and choose a cutoff value that keeps 95% of your
>>  > > > data in the Accumulo table.
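>>  > > >
>>  > > > A sketch of picking such a cutoff from a sample of value sizes
>>  > > > (the sampling helper is hypothetical):
>>  > > >
>>  > > > long[] sizes = collectSampleOfValueSizes();  // hypothetical helper
>>  > > > java.util.Arrays.sort(sizes);
>>  > > > long cutoff = sizes[(int) (sizes.length * 0.95)];
>>  > > > // values <= cutoff stay inline; larger ones get spilled to HDFS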
>>  > > >
>>  > > > Personally, I like to keep things under 1MB in the value and under
>>  > > > 1K in the key, as a crude rule of thumb, but it very much depends
>>  > > > on the application.
>>  > > >
>>  > > > --
>>  > > > Christopher L Tubbs II
>>  > > > http://gravatar.com/ctubbsii
>>  > > >
>>  > > >
>>  > > > On Sun, Jun 9, 2013 at 4:37 PM, Frank Smith
>>  > > > <francis.h.smith@outlook.com> wrote:
>>  > > > > I have an application where I have a block of unstructured text.
>>  > > > > Normally that text is relatively small (<500K), but there are
>>  > > > > conditions where it can be up to GBs of text.
>>  > > > >
>>  > > > > I was considering using a threshold where I simply decide to
>>  > > > > change from storing the text in the value of my mutation to just
>>  > > > > adding a reference to the HDFS location, but I wanted to get some
>>  > > > > advice on where that threshold should (best practice) or must
>>  > > > > (system limitation) be?
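>>  > > > >
>>  > > > > Roughly what I have in mind (THRESHOLD and the HDFS helper are
>>  > > > > hypothetical):
>>  > > > >
>>  > > > > Mutation m = new Mutation(docId);
>>  > > > > if (text.length <= THRESHOLD) {
>>  > > > >   m.put("data", "inline", new Value(text));
>>  > > > > } else {
>>  > > > >   String uri = writeToHdfs(text);  // hypothetical helper
>>  > > > >   m.put("data", "hdfsRef", new Value(uri.getBytes()));
>>  > > > > }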
>>  > > > >
>>  > > > > Also, can I stream data into a value, vice passing a byte array?
>>  > > > > Similar to how CLOBs and BLOBs are handled in an RDBMS.
>>  > > > >
>>  > > > > Thanks,
>>  > > > >
>>  > > > > Frank
>>
>
