accumulo-user mailing list archives

From Billie J Rinaldi <>
Subject Re: appending data to tables (partitioning?)
Date Fri, 20 Jul 2012 20:23:02 GMT
On Friday, July 20, 2012 2:59:29 PM, "Ed Seidl" <>: 
> Hi all, I have another question about dealing with large amounts of
> data…
> I'm trying to store large blobs of data inside of accumulo, by means
> of doing directory imports. These blobs are binary and are referenced
> by other tables. They can also get quite large. In an effort to cut
> down on the amount of time spent doing compactions on this data, I've
> taken to using what amounts to an increasing sequence number for the
> rowIDs, so now a major compaction amounts to a copy of the data, but
> no merging has to happen. I can also play with the
> table.split.threshold property for the table to keep tablets from
> splitting. But sometimes a compaction will occur, which results in a
> lot of data being unnecessarily copied from one rfile to another. So,
> my question…is there any way to signal to accumulo that rfiles that
> I'm trying to do an importdirectory on should just be used as is and
> no compaction is desired (i.e. just move the rfiles into the table
> directory rather than moving them to a temp directory for later
> merging upon compaction)? The paradigm I'm shooting for here is like
> oracle partitioned tables, where you can fill a tmp table with new
> data, and then swap that tmp table with an empty partition on the
> target table….the whole process taking seconds since no data moves,
> just pointers in the guts of the DB.

One thing to consider is arranging for only one file per tablet, i.e. creating a new split
point for every new file that you import.  This should be doable if your files are fairly
large and you don't end up with too many tablets.  If there is only one file per tablet, it
won't compact unless you tell it to.
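
As a rough illustration of the one-file-per-tablet idea, here is a minimal Python sketch that
derives the split points to add before importing a batch of files. It assumes you already know
the first and last row of each file and that the files do not overlap; the function name and
the sample rows are hypothetical, not part of Accumulo. It relies on a split point being the
last row of its tablet, so the last row of each file except the final one becomes a split.

```python
def split_points_for_files(file_ranges):
    """Return the split points that put each file in its own tablet.

    file_ranges: list of (first_row, last_row) byte strings, one pair per file.
    Hypothetical helper for illustration only; it does not talk to Accumulo.
    """
    ordered = sorted(file_ranges)
    for first, last in ordered:
        if first > last:
            raise ValueError("first row sorts after last row: %r" % ((first, last),))
    splits = []
    # Compare each file with the one that follows it in row order.
    for (_, last), (next_first, _) in zip(ordered, ordered[1:]):
        if next_first <= last:
            raise ValueError("files overlap, so they cannot be one per tablet")
        # A split point is the last row of the earlier tablet.
        splits.append(last)
    return splits


if __name__ == "__main__":
    ranges = [(b"0001", b"0100"), (b"0101", b"0200"), (b"0201", b"0300")]
    print(split_points_for_files(ranges))
```

The resulting rows would then be added as splits to the table before the directory import,
so that each imported file falls into exactly one tablet.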

If you want multiple files per tablet, there are several parameters to consider.  Make sure
you don't end up with too many files per tablet, though, because 1) query performance will
suffer and 2) there is a limit on the number of files that a tablet server will open.  That
limit is adjustable.  For scans, it defaults to 100 files across all the tablets, and for
major compaction it defaults to 10 files per tablet (but the compaction can be performed in
stages).

To change the compaction criteria, adjust table.file.max and table.compaction.major.ratio.
table.file.max is the maximum number of files that a tablet can have.  If a tablet has more
files than this, it will compact.  table.compaction.major.ratio governs when compaction occurs
while a tablet has fewer files than the maximum.  It also governs which files are compacted
together in either case.  Raising the ratio makes compactions happen less often.  If
table.file.max is larger than the number of files you expect to have per tablet, setting
table.compaction.major.ratio to the same value as table.file.max should keep the tablet from
compacting unless there is high variation in your file sizes.  A set of files is compacted
into a single file if the size of the largest file times the ratio is <= the sum of the sizes
of the files.
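
To make the ratio rule concrete, here is a small Python sketch of just that one test. It only
checks a single candidate set of files against the ratio; how the tablet server picks which
sets to consider is not covered. The function name and the file sizes are made up for
illustration, and treating fewer than two files as "nothing to merge" is an assumption of
this sketch.

```python
def should_compact(file_sizes, ratio):
    """Apply the ratio test to one candidate set of files.

    Returns True when largest_file_size * ratio <= sum of all file sizes.
    Illustrative only; this is not the tablet server's code.
    """
    if len(file_sizes) < 2:
        # Assumption of this sketch: a single file is never merged by the ratio rule.
        return False
    return max(file_sizes) * ratio <= sum(file_sizes)


if __name__ == "__main__":
    equal_files = [10, 10, 10, 10, 10]
    print(should_compact(equal_files, 3))    # 10 * 3 = 30 <= 50
    print(should_compact(equal_files, 10))   # 10 * 10 = 100 > 50
    print(should_compact([100, 1, 1], 3))    # 100 * 3 = 300 > 102
```

With five equal-sized files, a ratio of 3 satisfies the test and a ratio of 10 does not, which
is why raising the ratio toward table.file.max suppresses compactions for similarly sized files.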


> If there's no current way to do this, would such a mechanism be
> desirable to anyone other than me? I wouldn't mind taking a stab at
> implementing this, but don't want to start if it's a feature that no
> one would want or is thought to be totally stupid in the first place
> :) (As an aside, yes, I've thought of storing the data in hdfs and
> keeping a pointer to it in accumulo, but the way I want to interact w/
> the data is way easier if it's all in accumulo tables.)
> Thanks,
> Ed
