crunch-dev mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Avoid Multiple HBase Scan
Date Sat, 03 May 2014 04:55:11 GMT
Hi Jinal,

Would it be possible to dump the dot file of the job plan in both
situations (i.e. where it runs as a single job and as multiple jobs)
and post them somewhere?

You can get the job's dot file by calling
pipeline.getConfiguration().get(PlanningParameters.PIPELINE_PLAN_DOTFILE),
or it's also accessible via a PipelineExecution if you're doing
asynchronous execution.
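
[Editor's note: a minimal sketch of dumping the retrieved plan to disk so it can be rendered with Graphviz. Plain Java, no Crunch on the classpath; the `dotFileContents` string is a hypothetical stand-in for the value returned by the `PlanningParameters.PIPELINE_PLAN_DOTFILE` configuration lookup, which is only populated after planning has run.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DumpPlan {
    public static void main(String[] args) throws IOException {
        // Stand-in for:
        //   pipeline.getConfiguration().get(PlanningParameters.PIPELINE_PLAN_DOTFILE)
        // which holds the job plan in Graphviz dot format after planning.
        String dotFileContents = "digraph G { scan -> parallelDo -> groupBy; }";

        // Write the plan out so it can be rendered, e.g.:
        //   dot -Tpng plan.dot -o plan.png
        Path out = Path.of("plan.dot");
        Files.writeString(out, dotFileContents);
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```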

- Gabriel


On Fri, May 2, 2014 at 10:32 PM, Jinal Shah <jinalshah2007@gmail.com> wrote:
> Gabriel,
>
> This is what is happening here
>
> PTable<ImmutableBytesWritable, Result> rawResults =
> pipeline.read(new HBaseSourceTarget(table, scans[0]));
>
> and then a parallelDo to convert rawResults into a PTable<Key, Value>;
> that's it.
>
>
> For the other question: we do an implicit sort by date during the
> splitting, in the map function itself, if that's what you are looking
> for; other than that, no. We are not making any Sort library calls for
> this. That is my understanding, so let me know if I'm wrong.
>
>
> Thanks
>
> Jinal
>
>
> On Fri, May 2, 2014 at 3:10 PM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>
>> Hi Jinal,
>>
>> Could you give a bit more info on what exactly is going on in the call
>> to convertingDataIntoPTable? Is a Sort library call used anywhere in
>> your pipeline?
>>
>> - Gabriel
>>
>> On Fri, May 2, 2014 at 8:48 PM, Jinal Shah <jinalshah2007@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > I'm working on something, and there is a strange behavior happening
>> > that I'm not sure should happen. Here is what I'm doing:
>> >
>> > There are 2 HBase tables I'm scanning, and both go through the same
>> > process. One table has little data and the other has a comparatively
>> > large amount. Here is the process:
>> >
>> > *--- Start Processing*
>> >
>> > Scan[] data = comingFromHbase();
>> >
>> > //Now I'm converting this into a PTable
>> > PTable<Key, Value> table = convertingDataIntoPTable(data);
>> >
>> > table.cache(); // Expecting this to be cached rather than re-read
>> > from HBase for further processing
>> >
>> > //Now I'm splitting the data into 2 parts
>> > PCollection<Pair<PTable<Key, Value>, PTable<Key, Value>>> split =
>> > someFunctionDoingTheSplitOnTableData(table);
>> >
>> > //There are further operation on the split
>> >
>> > *--- END Processing*
>> >
>> > For the above process, the expected behavior is that if the data has
>> > to be re-read, it is read from the cache as opposed to reading from
>> > HBase directly when processes run in parallel.
>> >
>> > This is what happens for the table with less data, and the whole
>> > process works as expected; but for the large table, two different
>> > jobs start in parallel, each reading from HBase separately and doing
>> > its processing separately.
>> >
>> > Any idea why this is happening?
>> >
>> > Thanks
>> > Jinal
>>
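
[Editor's note: the read-once behavior Jinal expects from table.cache() can be illustrated abstractly, in plain Java with no Crunch dependency. The Supplier below is a hypothetical stand-in for the HBase source; the counter shows the difference between each consumer triggering its own scan versus consumers sharing one materialized result.]

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class CacheSketch {
    public static void main(String[] args) {
        AtomicInteger scanCount = new AtomicInteger();

        // Stand-in for the HBase scan: counts how many times the
        // source is actually read.
        Supplier<String> scan = () -> {
            scanCount.incrementAndGet();
            return "rows-from-hbase";
        };

        // Without caching, each downstream consumer triggers its own read.
        scan.get();
        scan.get();
        System.out.println("uncached reads: " + scanCount.get()); // 2

        // With caching (what table.cache() is intended to provide), the
        // first read is materialized and later consumers reuse the result.
        scanCount.set(0);
        String cached = scan.get();
        Supplier<String> cachedScan = () -> cached;
        cachedScan.get();
        cachedScan.get();
        System.out.println("cached reads: " + scanCount.get()); // 1
    }
}
```

Whether the planner actually honors the cache hint, or instead splits the plan into two jobs that each rescan the source (as observed with the larger table), is exactly what the dot file of the job plan would show.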
