crunch-dev mailing list archives

From Jinal Shah <jinalshah2...@gmail.com>
Subject Re: Avoid Multiple HBase Scan
Date Fri, 02 May 2014 20:32:35 GMT
Gabriel,

This is what is happening there:

PTable<ImmutableBytesWritable, Result> rawResults =
    pipeline.read(new HBaseSourceTarget(table, scans[0]));

followed by a parallelDo that converts rawResults into a PTable<Key, Value>.
That's it.


As for the other question: we do an implicit sort by date during the split,
inside the map function itself, if that is what you are asking about. Other
than that, no; we are not making any Sort library calls. That is how I
understood the question, so let me know if I got it wrong.
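
As an illustration of the implicit sort mentioned above, here is a minimal
plain-Java sketch. It is not the actual pipeline code and does not use Crunch:
the Rec type, the cutoff date, and the "older/newer" routing rule are all
hypothetical, and it assumes one reading of "implicit sorting", namely that
each record is routed to one of two outputs by comparing its date in a single
pass, with no sort call anywhere.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DateSplitSketch {

    /** Hypothetical record: a row key plus the date used for routing. */
    static final class Rec {
        final String key;
        final LocalDate date;

        Rec(String key, LocalDate date) {
            this.key = key;
            this.date = date;
        }
    }

    /** Holds the two outputs of the split. */
    static final class Split {
        final List<Rec> older = new ArrayList<>();
        final List<Rec> newer = new ArrayList<>();
    }

    /**
     * Routes each record by date in one pass over the input.
     * Records dated before the cutoff go to "older", the rest to "newer".
     * No sorting is performed; input order is preserved within each output.
     */
    static Split split(Iterable<Rec> records, LocalDate cutoff) {
        Split out = new Split();
        for (Rec r : records) {
            if (r.date.isBefore(cutoff)) {
                out.older.add(r);
            } else {
                out.newer.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> input = Arrays.asList(
            new Rec("row1", LocalDate.of(2014, 4, 30)),
            new Rec("row2", LocalDate.of(2014, 5, 2)),
            new Rec("row3", LocalDate.of(2014, 5, 1)));
        Split s = split(input, LocalDate.of(2014, 5, 1));
        System.out.println("older=" + s.older.size() + " newer=" + s.newer.size());
    }
}
```

In a Crunch pipeline the same comparison would sit inside the map-side
function, so by itself it should not add a shuffle or an extra job.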


Thanks

Jinal


On Fri, May 2, 2014 at 3:10 PM, Gabriel Reid <gabriel.reid@gmail.com> wrote:

> Hi Jinal,
>
> Could you give a bit more info on what exactly is going on in the call
> to convertingDataIntoPTable? Is a Sort library call used anywhere in
> your pipeline?
>
> - Gabriel
>
> On Fri, May 2, 2014 at 8:48 PM, Jinal Shah <jinalshah2007@gmail.com>
> wrote:
> > Hi,
> >
> > I'm working on something and seeing some strange behavior, and I'm not
> > sure whether it is supposed to happen that way. Here is what I'm doing:
> >
> > I'm scanning 2 HBase tables, and both go through the same process. One
> > table has little data and the other has a comparatively large amount.
> > Here is the process:
> >
> > *--- Start Processing*
> >
> > Scan[] data = comingFromHbase();
> >
> > //Now I'm converting this into a PTable
> > PTable<Key, Value> table = convertingDataIntoPTable(data);
> >
> > //Expecting this to cache the data so that it is not read again from
> > //HBase for further processing
> > table.cache();
> >
> > //Now I'm splitting the data into 2 parts
> > PCollection<Pair<PTable<Key,Value>,PTable<Key,Value>>> split =
> >     someFunctionDoingTheSplitOnTableData(table);
> >
> > //There are further operations on the split
> >
> > *--- END Processing*
> >
> > For the process above, the expected behavior is that if the data has to
> > be read again, it is read from the cache as opposed to directly from
> > HBase when processes run in parallel.
> >
> > This is what I see for the table with less data, and the whole process
> > works as expected. For the large table, however, it starts two different
> > jobs in parallel, and each job reads from HBase separately and does its
> > processing separately.
> >
> > Any idea why this is happening?
> >
> > Thanks
> > Jinal
>
