crunch-dev mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Avoid Multiple HBase Scan
Date Fri, 02 May 2014 20:10:54 GMT
Hi Jinal,

Could you give a bit more info on what exactly is going on in the call
to convertingDataIntoPTable? Is a Sort library call used anywhere in
your pipeline?

- Gabriel

On Fri, May 2, 2014 at 8:48 PM, Jinal Shah <jinalshah2007@gmail.com> wrote:
> Hi,
>
> I'm working on something and seeing some strange behavior, and I'm not
> sure whether it should happen that way or not. Here is what I'm doing:
>
> I'm scanning 2 HBase tables, and both go through the same process. One
> table has little data and the other has a comparatively large amount of
> data. Here is the process:
>
> *--- Start Processing*
>
> Scan[] data = comingFromHbase();
>
> //Now I'm converting this into a PTable
> PTable<Key, Value> table = convertingDataIntoPTable(data);
>
> //Expecting this to cache the data so it is not read again from HBase
> //for further processing
> table.cache();
>
> //Now I'm splitting the data into 2 parts
> PCollection<Pair<PTable<Key,Value>,PTable<Key,Value>>> split =
> someFunctionDoingTheSplitOnTableData(table);
>
> //There are further operations on the split
>
> *--- END Processing*
>
> Now, for the above process, the expected behavior is that if the data has
> to be read again, it is read from the cache as opposed to directly from
> HBase, even if processes run in parallel.
>
> This behavior is seen for the table with less data, and the whole process
> works as expected. For the large table, however, it starts two different
> jobs in parallel, and each job reads from HBase separately and does its
> processing separately.
>
> Any idea why this is happening?
>
> Thanks
> Jinal
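
For illustration only, the difference between the expected and the observed behavior can be reproduced outside Crunch with a small plain-Java sketch. Nothing here uses the Crunch or HBase APIs, and it is not a description of how Crunch plans jobs: CountingSource, materializeOnce and countReads are hypothetical names standing in for the HBase scan, the effect hoped for from table.cache(), and the two branches of the split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class ScanCacheSketch {

    // Hypothetical stand-in for an HBase scan: counts how often it is executed.
    static final class CountingSource implements Supplier<List<Integer>> {
        final AtomicInteger reads = new AtomicInteger();

        @Override
        public List<Integer> get() {
            reads.incrementAndGet();
            List<Integer> rows = new ArrayList<>();
            for (int i = 0; i < 10; i++) {
                rows.add(i);
            }
            return rows;
        }
    }

    // Hypothetical stand-in for the hoped-for caching effect:
    // the wrapped source is read at most once and the result is reused.
    static <T> Supplier<T> materializeOnce(Supplier<T> source) {
        return new Supplier<T>() {
            private T value;
            private boolean loaded;

            @Override
            public synchronized T get() {
                if (!loaded) {
                    value = source.get();
                    loaded = true;
                }
                return value;
            }
        };
    }

    // Runs two downstream branches over the same table and returns
    // how many times the underlying source was actually read.
    static int countReads(boolean cached) {
        CountingSource scan = new CountingSource();
        Supplier<List<Integer>> table = cached ? materializeOnce(scan) : scan;

        // Two branches of the "split", each pulling the table independently.
        long evens = table.get().stream().filter(v -> v % 2 == 0).count();
        long odds = table.get().stream().filter(v -> v % 2 != 0).count();
        if (evens + odds != 10) {
            throw new IllegalStateException("split lost rows");
        }
        return scan.reads.get();
    }

    public static void main(String[] args) {
        System.out.println("reads without cache: " + countReads(false));
        System.out.println("reads with cache:    " + countReads(true));
    }
}
```

In this sketch the uncached variant reads the source once per branch, which mirrors the two separate HBase reads reported for the large table, while the materialized variant reads it once, which mirrors what was seen for the small table.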
