pig-dev mailing list archives

From "Rohini Palaniswamy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4853) Fetch inputs before starting outputs
Date Mon, 11 Apr 2016 21:30:25 GMT

    [ https://issues.apache.org/jira/browse/PIG-4853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236005#comment-15236005 ]

Rohini Palaniswamy commented on PIG-4853:

bq. Can you explain why the input buffers can be released? We are still in the loop to fetch
inputs, so it seems the input and output buffers have to coexist during data processing.
   Input buffers are primarily used while fetching and merging the inputs. Once the inputs have
been fetched and merged to disk, the data is read from the IFile, which does not use much memory.
In some cases, though, the input can be kept entirely in memory and read using InMemoryReader.
We still have to look into those scenarios.

Calling reader.next() on an input forces the shuffle and merge of that input to complete,
which makes it possible to release those buffers before buffers are allocated for the outputs.
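
The ordering described above can be sketched roughly as follows. This is only an illustration: SketchReader, SketchOutput and fetchInputsThenStartOutputs are hypothetical stand-ins invented for this example, not the Tez or Pig interfaces, and the lambdas merely record the order of calls instead of doing any shuffle work.

```java
import java.util.ArrayList;
import java.util.List;

class FetchBeforeStartSketch {
    /** Hypothetical stand-in for an input reader; not the Tez interface. */
    interface SketchReader {
        boolean next();
    }

    /** Hypothetical stand-in for an output; not the Tez interface. */
    interface SketchOutput {
        void start();
    }

    /**
     * Calls next() once on every reader before starting any output.
     * The result of each first next() call is returned so the caller
     * knows whether a first record is already positioned on each input.
     */
    static boolean[] fetchInputsThenStartOutputs(List<SketchReader> readers,
                                                 List<SketchOutput> outputs) {
        boolean[] hasFirstRecord = new boolean[readers.size()];
        for (int i = 0; i < readers.size(); i++) {
            hasFirstRecord[i] = readers.get(i).next();
        }
        for (SketchOutput output : outputs) {
            output.start();
        }
        return hasFirstRecord;
    }

    /** Runs the sketch with two fake inputs and one fake output and returns the call order. */
    static List<String> demo() {
        List<String> events = new ArrayList<>();
        List<SketchReader> readers = new ArrayList<>();
        readers.add(() -> { events.add("fetch:join-input-1"); return true; });
        readers.add(() -> { events.add("fetch:join-input-2"); return false; });
        List<SketchOutput> outputs = new ArrayList<>();
        outputs.add(() -> events.add("start:group-by-output"));
        fetchInputsThenStartOutputs(readers, outputs);
        return events;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because next() advances the reader, code following this pattern has to remember the result of that first call rather than calling next() again before reading the first record.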

[~jlowe] has yet to create a Tez jira and upload the internal patch. We plan to run
with these changes for a week or two and see how they affect jobs. Most likely this will go
into Tez 0.7.1.

> Fetch inputs before starting outputs
> ------------------------------------
>                 Key: PIG-4853
>                 URL: https://issues.apache.org/jira/browse/PIG-4853
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.16.0
>         Attachments: PIG-4853-1.patch
>     Force fetching of inputs before starting outputs so that we can choose to allocate more
space for buffers by setting tez.task.scale.memory.input-output-concurrent=false, which is
a new option in Tez. With the default value of true, for a TezConfiguration.TEZ_TASK_SCALE_MEMORY_RESERVE_FRACTION
of 0.5 and 1G of memory, WeightedScalingMemoryDistributor in Tez will split the available 512MB
between inputs and outputs. If set to false, it will allocate 512MB to inputs and 512MB to
outputs. For example, for two join inputs and one group by output:
> tez.task.scale.memory.input-output-concurrent=true
> {code}
> 2016-03-28 01:15:58,842 [INFO] [TezChild] |resources.MemoryDistributor|: Allocations=[scope-32:org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput:OUTPUT:268435456:83684722],
> {code}
> tez.task.scale.memory.input-output-concurrent=false
> {code}
> 2016-03-28 01:25:36,665 [INFO] [TezChild] |resources.MemoryDistributor|: Allocations=[scope-32:org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput:OUTPUT:268435456:268435456],
> {code}
> To ensure we don't hit OOM, we need to finish fetching the inputs by calling reader.next()
before calling output.start(). That will make sure the input buffers are released before output
buffers are allocated. 
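
The arithmetic in the description can be sketched as below. This is a simplified illustration, not the WeightedScalingMemoryDistributor algorithm: it assumes plain proportional scaling with no per-component weights, so it does not reproduce the 83684722 figure from the log. The class and method names are made up for this example, and the two 512MB input requests are hypothetical values. Only the 1G task memory, the 0.5 reserve fraction and the 268435456-byte output request come from the description.

```java
import java.util.Arrays;

class MemoryPoolSketch {
    /** Bytes left for input and output buffers after the reserve fraction is held back. */
    static long availableBytes(long taskMemoryBytes, double reserveFraction) {
        return (long) (taskMemoryBytes * (1.0 - reserveFraction));
    }

    /** Assumed rule: scale requests down proportionally when their sum exceeds the pool. */
    static long[] scale(long[] requests, long pool) {
        long total = 0;
        for (long request : requests) {
            total += request;
        }
        double ratio = total > pool ? (double) pool / total : 1.0;
        long[] granted = new long[requests.length];
        for (int i = 0; i < requests.length; i++) {
            granted[i] = (long) (requests[i] * ratio);
        }
        return granted;
    }

    /**
     * Grant for a single output. When concurrent is true the output competes
     * with the inputs for one pool; when false it is sized against the
     * whole pool on its own.
     */
    static long outputGrant(long[] inputRequests, long outputRequest,
                            long pool, boolean concurrent) {
        if (!concurrent) {
            return scale(new long[] {outputRequest}, pool)[0];
        }
        long[] all = Arrays.copyOf(inputRequests, inputRequests.length + 1);
        all[inputRequests.length] = outputRequest;
        return scale(all, pool)[inputRequests.length];
    }

    public static void main(String[] args) {
        long pool = availableBytes(1L << 30, 0.5);        // 1G task, reserve fraction 0.5
        long[] inputs = {512L << 20, 512L << 20};         // hypothetical input requests
        long output = 268435456L;                         // output request seen in the log
        System.out.println("pool=" + pool);
        System.out.println("concurrent=true  output grant=" + outputGrant(inputs, output, pool, true));
        System.out.println("concurrent=false output grant=" + outputGrant(inputs, output, pool, false));
    }
}
```

Under these assumptions the output gets its full request only in the non-concurrent case, which matches the second log line above. That is safe only if the input buffers really have been released before output.start() is called.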

This message was sent by Atlassian JIRA