accumulo-user mailing list archives

From "Seidl, Ed" <>
Subject Re: stupid/dangerous batch load question
Date Wed, 28 May 2014 18:34:58 GMT
That's the rub.  I have 120 reducers running, so I wind up with 120 RFiles to import.  I haven't
tried playing with a custom partitioner that sends adjacent key ranges to the same reducer, so the
RFiles won't have overlapping keys.  Perhaps that would help?
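A minimal sketch of that partitioner idea, with hypothetical names: in a real job the split points would come from Accumulo's TableOperations.listSplits (or KeyRangePartitioner in later versions), and this would live inside a Hadoop Partitioner. The core is just a binary search over the sorted split points, so each reducer owns one contiguous, non-overlapping key range:

```java
import java.util.Arrays;

// Sketch: route each row key to the reducer whose range contains it, so
// every reducer emits an RFile with a non-overlapping key range.
// Split points are assumed sorted; in a real job they'd come from the
// table's current splits, and this logic would sit in Partitioner.getPartition.
public class RangePartitionSketch {
    private final String[] splits; // sorted table split points

    public RangePartitionSketch(String[] splits) {
        this.splits = splits.clone();
        Arrays.sort(this.splits);
    }

    // Keys before the first split go to partition 0; a key equal to a split
    // point maps to that split's partition (splits are inclusive upper bounds).
    public int partition(String rowKey) {
        int idx = Arrays.binarySearch(splits, rowKey);
        return idx >= 0 ? idx : -(idx + 1);
    }

    public static void main(String[] args) {
        RangePartitionSketch p = new RangePartitionSketch(new String[] {"g", "n", "t"});
        System.out.println(p.partition("apple")); // falls in (-inf, "g"] -> 0
        System.out.println(p.partition("horse")); // falls in ("g", "n"]  -> 1
        System.out.println(p.partition("zebra")); // falls after "t"      -> 3
    }
}
```

With 120 reducers you'd want roughly 120 evenly spaced split points, otherwise some reducers (and their RFiles) end up much larger than others.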


From: Mike Drob <>
Reply-To: <>
Date: Wednesday, May 28, 2014 11:22 AM
To: <>
Subject: Re: stupid/dangerous batch load question

Are you partitioning the resultant files by the existing table splits, or just sending everything
to one file?

If you are importing multiple files, then there is potential that some of the files succeed
and others fail. Depending on how your data is laid out, this may cause application level
corruption, but the underlying key/value store should be ok.

On Wed, May 28, 2014 at 12:49 PM, Seidl, Ed <> wrote:
I have a large amount of data that I am batch loading into Accumulo.  I'm using MapReduce
to read in chunks of data and write out RFiles to be loaded with importDirectory.  I've noticed
that the import will hang for longer and longer as more data is added.  For instance, one table,
which currently has ~2500 tablets, now takes around 2 hours to complete an importDirectory call.

In poking around in the source for TableOperationsImpl (1.5.0), I see that there is an option
to not wait on certain operations (like compact).  Would it be dangerous to (optionally) return
immediately from importDirectory, and instead check the fail directory to detect errors in
the import?  I know this will eventually cause a backup in the staging directories, but is
there any potential to corrupt the tables?
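Assuming the fire-and-forget variant existed, the error check could be as simple as listing what the importer left behind in the fail directory, since importDirectory moves rejected RFiles there. A rough sketch (directory layout and the ".rf" suffix are assumptions; this uses plain java.nio rather than the HDFS FileSystem API a real check would need):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch: after kicking off a (hypothetical) non-blocking importDirectory,
// detect per-file failures by listing RFiles left in the failures directory.
public class ImportFailureCheck {
    // Returns the names of RFiles the importer rejected into failDir.
    static List<String> failedFiles(Path failDir) throws IOException {
        List<String> failed = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(failDir, "*.rf")) {
            for (Path p : stream) {
                failed.add(p.getFileName().toString());
            }
        }
        return failed;
    }

    public static void main(String[] args) throws IOException {
        Path failDir = Files.createTempDirectory("bulk-fail");
        Files.createFile(failDir.resolve("I0000abc.rf")); // simulate a rejected file
        System.out.println(failedFiles(failDir));
    }
}
```

One caveat this doesn't address: a file in the fail directory tells you what didn't load, but not whether the other RFiles from the same job did, so the application would still need its own bookkeeping to decide what to re-stage.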

