accumulo-user mailing list archives

From Christopher Tubbs <ctubb...@gmail.com>
Subject Re: EXTERNAL: Re: Large files in Accumulo
Date Thu, 23 Aug 2012 22:11:04 GMT
You can still index a 1GB file... you just shouldn't try to push it
all in as a single mutation, nor should you try to store it using a
scheme that uses large keys.
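
Something like this (untested; written against the 1.4-era client API,
which was current at the time) is what I mean by chunking. The table
layout (row = docId, family "chunk", zero-padded chunk index as the
qualifier) and the 1MB chunk size are placeholders, not a prescribed
schema:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class ChunkedIngest {
  private static final int CHUNK_SIZE = 1024 * 1024; // 1MB per entry, far under the limits

  // Store one large file as many small entries; each chunk rides in its
  // own small mutation, and every key stays tiny.
  public static void ingest(Connector conn, String table, String docId, String path)
      throws IOException, TableNotFoundException, MutationsRejectedException {
    // 1.4 signature: maxMemory (bytes), maxLatency (ms), write threads
    BatchWriter writer = conn.createBatchWriter(table, 50 * 1024 * 1024, 60 * 1000, 4);
    InputStream in = new FileInputStream(path);
    try {
      byte[] buf = new byte[CHUNK_SIZE];
      int n;
      int chunk = 0;
      while ((n = in.read(buf)) > 0) {
        byte[] data = new byte[n];
        System.arraycopy(buf, 0, data, 0, n);
        Mutation m = new Mutation(new Text(docId));
        // Zero-padding the chunk index keeps the chunks in scan order.
        m.put(new Text("chunk"), new Text(String.format("%010d", chunk++)), new Value(data));
        writer.addMutation(m);
      }
    } finally {
      in.close();
      writer.close();
    }
  }
}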

You can even still store the whole raw file in Accumulo, particularly
if you chunk it up across multiple entries, but you may need a 2-stage
lookup: the first query returns intermediate results, and a second
query fetches the final result. It seems to me that this 2-stage
lookup would be simple enough to implement as a client-side tool,
provided you get the storage/indexes figured out.
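
The 2-stage lookup itself is just two scans, continuing the same
made-up layout (an index table whose column qualifier carries the
docId, plus the chunk table above). Again untested, and the table and
column names are placeholders:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class TwoStageLookup {
  // Stage 1: scan the index table for a term; the qualifier holds the docId.
  public static List<String> findDocIds(Connector conn, String indexTable, String term)
      throws TableNotFoundException {
    Scanner s = conn.createScanner(indexTable, new Authorizations());
    s.setRange(new Range(new Text(term)));
    List<String> docIds = new ArrayList<String>();
    for (Entry<Key,Value> e : s) {
      docIds.add(e.getKey().getColumnQualifier().toString());
    }
    return docIds;
  }

  // Stage 2: scan the chunk table for one docId and reassemble the bytes.
  // Zero-padded qualifiers mean the chunks come back in order.
  public static byte[] fetchDocument(Connector conn, String chunkTable, String docId)
      throws TableNotFoundException, IOException {
    Scanner s = conn.createScanner(chunkTable, new Authorizations());
    s.setRange(new Range(new Text(docId)));
    s.fetchColumnFamily(new Text("chunk"));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (Entry<Key,Value> e : s) {
      out.write(e.getValue().get());
    }
    return out.toByteArray();
  }
}

Reassembling into a byte[] only makes sense for modest files; for the
multi-GB case you'd stream each chunk straight to a local file or into
the next processing stage instead.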

On Thu, Aug 23, 2012 at 5:34 PM, Cardon, Tejay E
<tejay.e.cardon@lmco.com> wrote:
> Thanks Eric,
>
> I was afraid that would be the case.  If I understand you correctly, putting
> a GB file into Accumulo would be a bad idea.  Given that fact, are there any
> strategies available to ensure that a given file in HDFS is co-located with
> the index info for that file in Accumulo? (I would assume not).  In my case,
> I could use Accumulo to store my indexes for fast query, but then have them
> return a URL/URI to the actual file.  However, I have to process each of
> those files further to get to my final result, and I was hoping to do the
> second stage of processing without having to return intermediate results.
> Am I correct in assuming that this can’t be done?
>
>
>
> Thanks,
>
> Tejay
>
>
>
> From: Eric Newton [mailto:eric.newton@gmail.com]
> Sent: Thursday, August 23, 2012 3:06 PM
> To: user@accumulo.apache.org
> Subject: Re: EXTERNAL: Re: Large files in Accumulo
>
>
>
> An entire mutation needs to fit in memory several times, so you should not
> attempt to push in a single mutation larger than 100MB unless you have a
> lot of memory in your tserver/logger.
>
>
>
> And while I'm at it, large keys will create large indexes, so try to keep
> your (row,cf,cq,cv) under 100K.
>
>
>
> -Eric
>
> On Thu, Aug 23, 2012 at 4:37 PM, Cardon, Tejay E <tejay.e.cardon@lmco.com>
> wrote:
>
> In my case I’ll be doing a document based index store (like the wikisearch
> example), but my documents may be as large as several GB.  I just wanted to
> pick the collective brain of the group to see if I’m walking into a major
> headache.  If it’s never been tried before, then I’ll give it a shot and
> report back.
>
>
> Tejay
>
>
>
> From: William Slacum [mailto:wilhelm.von.cloud@accumulo.net]
> Sent: Thursday, August 23, 2012 2:07 PM
> To: user@accumulo.apache.org
> Subject: EXTERNAL: Re: Large files in Accumulo
>
>
>
> Are these whole RFiles? I know at some point HBase needed to have
> entire rows fit into memory; Accumulo does not have this restriction.
>
> On Thu, Aug 23, 2012 at 12:55 PM, Cardon, Tejay E <tejay.e.cardon@lmco.com>
> wrote:
>
> Alright, this one’s a quick question.  I’ve been told that HBase does not
> perform well if large (> 100MB) files are stored in it.  Does Accumulo have
> similar trouble?  If so, can it be overcome by storing the large files in
> their own locality group?
>
>
>
> Thanks,
>
> Tejay
>
>
>
>
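
One more note on Eric's figures above: they read as soft budgets
rather than hard API limits. If you want to enforce them client-side
before writing, a guard along these lines (untested; numBytes()
reports a mutation's serialized size) would do:

import org.apache.accumulo.core.data.Mutation;

public class SizeGuards {
  private static final long MAX_MUTATION_BYTES = 100L * 1024 * 1024; // ~100MB per mutation
  private static final int MAX_KEY_BYTES = 100 * 1024;               // ~100K across row+cf+cq+cv

  // Fail fast instead of letting an oversized mutation strain the tserver/logger.
  public static void checkMutation(Mutation m) {
    if (m.numBytes() > MAX_MUTATION_BYTES) {
      throw new IllegalArgumentException("mutation is " + m.numBytes() + " bytes");
    }
  }

  // Large keys inflate the RFile indexes, so budget the components together.
  public static void checkKey(byte[] row, byte[] cf, byte[] cq, byte[] cv) {
    long keyLen = (long) row.length + cf.length + cq.length + cv.length;
    if (keyLen > MAX_KEY_BYTES) {
      throw new IllegalArgumentException("key is " + keyLen + " bytes");
    }
  }
}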
