accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: stupid/dangerous batch load question
Date Wed, 28 May 2014 18:25:00 GMT
On 5/28/14, 2:22 PM, Mike Drob wrote:
> Are you partitioning the resultant files by the existing table splits,
> or just sending everything to one file?

Emphasis on this. Importing a large file that overlaps every tablet in a
table can be very expensive. Aligning the files you generate with the
table's splits will help alleviate that cost.
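
As a rough sketch of what aligning with the splits means: given the table's
sorted split points, each row can be routed to the one output file that
matches its tablet, so that no file overlaps more than one tablet. The class
below is a hypothetical stand-in, not part of Accumulo, and it assumes rows
and split points are plain Java Strings (Accumulo itself compares bytes). It
treats each split point as the inclusive end row of its tablet.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Accumulo: maps a row to the index of the
// tablet (and therefore the output file) that would hold it.
class SplitAlignedPartitioner {
    private final List<String> splits; // sorted split points

    SplitAlignedPartitioner(List<String> splitPoints) {
        this.splits = new ArrayList<>(splitPoints);
        Collections.sort(this.splits);
    }

    // N split points define N + 1 tablets.
    int numPartitions() {
        return splits.size() + 1;
    }

    int partitionFor(String row) {
        int idx = Collections.binarySearch(splits, row);
        // Exact match: the row is the inclusive end row of tablet idx.
        // Otherwise binarySearch returns -(insertionPoint) - 1, and the
        // insertion point is the index of the tablet containing the row.
        return idx >= 0 ? idx : -(idx + 1);
    }
}
```

In a MapReduce job the same lookup would sit in the partitioner, with one
reducer (and so one rfile) per tablet.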

> If you are importing multiple files, then there is potential that some
> of the files succeed and others fail. Depending on how your data is laid
> out, this may cause application level corruption, but the underlying
> key/value store should be ok.
>
>
> On Wed, May 28, 2014 at 12:49 PM, Seidl, Ed <seidl2@llnl.gov
> <mailto:seidl2@llnl.gov>> wrote:
>
>     I have a large amount of data that I am batch loading into Accumulo.
>     I'm using MapReduce to read in chunks of data and write out rfiles
>     to be loaded with importdirectory.  I've noticed that the import
>     will hang for longer and longer times as more data is added.  For
>     instance, one table, which currently has ~2500 tablets, now takes
>     around 2 hours to process the importdirectory.
>
>     In poking around in the source for TableOperationsImpl (1.5.0), I
>     see that there is an option to not wait on certain operations (like
>     compact).  Would it be dangerous to (optionally) return immediately
>     from importdirectory, and instead check the fail directory to detect
>     errors in the import?  I know this will eventually cause a backup in
>     the staging directories, but is there any potential to corrupt the
>     tables?
>
>     Thanks,
>     Ed
>
>
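
For the "don't wait, check the fail directory afterwards" idea, one option
that needs no change to TableOperationsImpl is to run the blocking call on a
background thread and look at the failure directory once it completes. The
sketch below is only an illustration under assumptions: blockingImport is a
hypothetical stand-in for whatever call performs the import, and the failure
directory is treated as a local path, whereas in a real deployment it would
live in HDFS and be listed through the Hadoop FileSystem API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Stream;

class AsyncImportSketch {
    // Lists whatever is left in the failure directory after an import attempt.
    static List<String> failedFiles(Path failDir) throws IOException {
        List<String> names = new ArrayList<>();
        try (Stream<Path> entries = Files.list(failDir)) {
            entries.forEach(p -> names.add(p.getFileName().toString()));
        }
        Collections.sort(names);
        return names;
    }

    // Runs the (hypothetical) blocking import in the background, then
    // reports the contents of the failure directory when it finishes.
    static CompletableFuture<List<String>> importAsync(Runnable blockingImport, Path failDir) {
        return CompletableFuture.runAsync(blockingImport).thenApply(ignored -> {
            try {
                return failedFiles(failDir);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```

The caller can carry on with other work and call join() on the returned
future later. An empty list means nothing was left in the failure directory.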
