crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jinal Shah <>
Subject Avoid Multiple HBase Scan
Date Fri, 02 May 2014 18:48:49 GMT

I'm working on something and there is a strange behavior that is happening
and not sure if it should happen that way or not. Here is what I'm doing

There are 2 HBase tables I'm scanning and both are going through the same
process. 1 tables has less data and the other has comparatively large
amount of data and here is the process

*--- Start Processing*

Scan[] data = comingFromHbase();

//Now I'm converting this into a PTable
PTable<Key, Value> table = convertingDataIntoPTable(data);

table.cache(); //Expecting this to cache and not reading it again from
HBase for further processing

//Now I'm splitting the data into 2 parts
PCollection<Pair<PTable<Key,Value>,PTable<Key,Value>> split =

//There are further operation on the split

*--- END Processing*

Now for the above process the expected behavior is that if it has to read
it will read it from the cache as oppose to reading from HBase directly if
processes work in parallel.

This behavior is seen for the table with less data and the whole process
works as expected but for the large table it is starting two different jobs
in parallel and reading from Hbase twice separately for each job and doing
the processing separately.

Any idea why this is happening?


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message