accumulo-user mailing list archives

From "Seidl, Ed" <>
Subject appending data to tables (partitioning?)
Date Fri, 20 Jul 2012 18:59:29 GMT
Hi all, I have another question about dealing with large amounts of data…

I'm trying to store large blobs of data inside of Accumulo by means of directory (bulk) imports.
These blobs are binary, are referenced by other tables, and can get quite large. In an effort
to cut down on the amount of time spent doing compactions on this data, I've taken to using
what amounts to an increasing sequence number for the rowIDs, so a major compaction amounts
to a straight copy of the data, with no merging required. I can also play with the
table.split.threshold property for the table to keep tablets from splitting. But sometimes
a compaction will still occur, which results in a lot of data being unnecessarily copied from
one rfile to another. So, my question…is there any way to signal to Accumulo that the rfiles
I'm bulk importing should just be used as-is, with no compaction desired (i.e., just move the
rfiles into the table directory rather than moving them to a temp directory for later merging
upon compaction)? The paradigm I'm shooting for here is like Oracle partitioned tables, where
you can fill a tmp table with new data and then swap that tmp table with an empty partition
on the target table…the whole process taking seconds since no data moves, just pointers in
the guts of the DB.
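For concreteness, the setup I'm describing looks roughly like this from the Accumulo shell (the table name, directory paths, and threshold value are just placeholders for my actual setup):

```shell
# Raise the split threshold so tablets don't split under the blob data
# (pick a value larger than your biggest expected import)
config -t blobtable -s table.split.threshold=10G

# Bulk import pre-built rfiles: the first directory holds the rfiles,
# the second receives any files that fail to be assigned to a tablet;
# "true" tells the tserver to set the entries' timestamps at import time
importdirectory /tmp/bulk/files /tmp/bulk/failures true
```
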

If there's no current way to do this, would such a mechanism be desirable to anyone other
than me? I wouldn't mind taking a stab at implementing it, but I don't want to start if it's
a feature that no one would want, or one that's thought to be totally stupid in the first
place :) (As an aside, yes, I've thought of storing the data in HDFS and keeping a pointer
to it in Accumulo, but the way I want to interact with the data is much easier if it's all
in Accumulo.)

