accumulo-user mailing list archives

From "Cardon, Tejay E" <>
Subject RE: EXTERNAL: Re: Large files in Accumulo
Date Thu, 23 Aug 2012 21:34:39 GMT
Thanks Eric,
I was afraid that would be the case.  If I understand you correctly, putting a GB file into
Accumulo would be a bad idea.  Given that fact, are there any strategies available to ensure
that a given file in HDFS is co-located with the index info for that file in Accumulo? (I
would assume not).  In my case, I could use Accumulo to store my indexes for fast query, but
then have them return a URL/URI to the actual file.  However, I have to process each of those
files further to get to my final result, and I was hoping to do the second stage of processing
without having to return intermediate results.  Am I correct in assuming that this can't be


From: Eric Newton []
Sent: Thursday, August 23, 2012 3:06 PM
Subject: Re: EXTERNAL: Re: Large files in Accumulo

An entire mutation needs to fit in memory several times, so you should not attempt to push
in a single mutation larger than 100MB unless you have a lot of memory in your tserver/logger.

And while I'm at it, large keys will create large indexes, so try to keep your (row,cf,cq,cv)
under 100K.
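
One way to stay inside both limits, which is not something proposed in this thread, is to split a
large document into fixed-size chunks and write each chunk as its own mutation. The sketch below
only plans the entries and does not call the Accumulo client. The chunk size, the "chunk" column
family, the zero-padded qualifier, and the function name are all assumptions made for illustration.

```python
import io
from typing import BinaryIO, Iterator, Tuple

# Assumed limits, loosely based on the advice above; tune for your cluster.
CHUNK_BYTES = 64 * 1024 * 1024   # keep each value well below ~100MB
MAX_KEY_BYTES = 100 * 1024       # keep (row, cf, cq, cv) under ~100K

Entry = Tuple[str, str, str, str, bytes]  # (row, cf, cq, cv, value)


def chunk_entries(doc_id: str, stream: BinaryIO, cv: str = "",
                  chunk_bytes: int = CHUNK_BYTES) -> Iterator[Entry]:
    """Yield one (row, cf, cq, cv, value) entry per chunk read from stream."""
    index = 0
    while True:
        data = stream.read(chunk_bytes)
        if not data:
            break
        # Hypothetical key layout: the zero-padded index keeps chunks in
        # lexicographic order, so a scan over the row returns them in sequence.
        row, cf, cq = doc_id, "chunk", f"{index:010d}"
        key_size = sum(len(part.encode("utf-8")) for part in (row, cf, cq, cv))
        if key_size > MAX_KEY_BYTES:
            raise ValueError(f"key is {key_size} bytes, over {MAX_KEY_BYTES}")
        yield row, cf, cq, cv, data
        index += 1


if __name__ == "__main__":
    # Tiny in-memory example with a 4-byte chunk size.
    for row, cf, cq, cv, value in chunk_entries("doc1", io.BytesIO(b"abcdefghij"),
                                                chunk_bytes=4):
        print(row, cf, cq, len(value))
```

Each yielded entry would then go into a separate mutation, so no single mutation carries more than
one chunk. Whether this performs acceptably for multi-GB documents is untested here.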

On Thu, Aug 23, 2012 at 4:37 PM, Cardon, Tejay E wrote:
In my case I'll be doing a document based index store (like the wikisearch example), but my
documents may be as large as several GB.  I just wanted to pick the collective brain of the
group to see if I'm walking into a major headache.  If it's never been tried before, then
I'll give it a shot and report back.


From: William Slacum [<>]
Sent: Thursday, August 23, 2012 2:07 PM
Subject: EXTERNAL: Re: Large files in Accumulo

Are these RFiles as a whole? I know at some point HBase needed to have entire rows fit into
memory; Accumulo does not have this restriction.

On Thu, Aug 23, 2012 at 12:55 PM, Cardon, Tejay E wrote:
Alright, this one's a quick question.  I've been told that HBase does not perform well if
large (> 100MB) files are stored in it.  Does Accumulo have similar trouble?  If so, can
it be overcome by storing the large files in their own locality group?

