hadoop-pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency
Date Fri, 10 Sep 2010 18:19:32 GMT

     [ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1605:
----------------------------

    Description: 
In the scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603]
is trying to solve the problem by adding an LOScalar operator. Here is a different approach.
We will add a soft link to the plan, and the soft link is visible only to the walkers. By doing
this, we can make sure we visit the LOStore that generates the scalar first, and then the LOForEach
that uses the scalar. No other part of the logical plan knows that the soft link exists.
The benefits are:

1. The logical plan does not need to deal with LOScalar, which makes the logical plan cleaner.
2. Conceptually, a scalar dependency is different. A regular link represents data flow in the pipeline.
With a scalar, the dependency means that an operator depends on a file generated by another operator.
It is a different type of data dependency.
3. Soft links can solve other dependency problems in the future. If we introduce another UDF
that depends on a file generated by another operator, we can use this mechanism to handle it.


Currently, there are two cases where we can use a soft link:
1. Scalar dependency, where the ReadScalar UDF uses a file generated by an LOStore.
2. Store-load dependency, where we load a file that is generated by a store in the same
script. This happens in the multi-store case. Currently we handle it with a regular link; a soft
link would be better.

  was:
In the scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603]
is trying to solve the problem by adding an LOScalar operator. Here is a different approach.
We will add a soft link to the plan, and the soft link is visible only to the walkers. No other
part of the logical plan knows that the soft link exists. The benefits are:

1. The logical plan does not need to deal with LOScalar, which makes the logical plan cleaner.
2. Conceptually, a scalar dependency is different. A regular link represents data flow in the pipeline.
With a scalar, the dependency means that an operator depends on a file generated by another operator.
It is a different type of data dependency.
3. Soft links can solve other dependency problems in the future. If we introduce another UDF
that depends on a file generated by another operator, we can use this mechanism to handle it.


Currently, there are two cases where we can use a soft link:
1. Scalar dependency, where the ReadScalar UDF uses a file generated by an LOStore.
2. Store-load dependency, where we load a file that is generated by a store in the same
script. This happens in the multi-store case. Currently we handle it with a regular link; a soft
link would be better.


> Adding soft link to plan to solve input file dependency
> -------------------------------------------------------
>
>                 Key: PIG-1605
>                 URL: https://issues.apache.org/jira/browse/PIG-1605
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>             Fix For: 0.8.0
>
>
> In the scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603]
> is trying to solve the problem by adding an LOScalar operator. Here is a different approach.
> We will add a soft link to the plan, and the soft link is visible only to the walkers. By doing
> this, we can make sure we visit the LOStore that generates the scalar first, and then the LOForEach
> that uses the scalar. No other part of the logical plan knows that the soft link exists.
> The benefits are:
> 1. The logical plan does not need to deal with LOScalar, which makes the logical plan cleaner.
> 2. Conceptually, a scalar dependency is different. A regular link represents data flow in the
> pipeline. With a scalar, the dependency means that an operator depends on a file generated by
> another operator. It is a different type of data dependency.
> 3. Soft links can solve other dependency problems in the future. If we introduce another
> UDF that depends on a file generated by another operator, we can use this mechanism to handle
> it.
> Currently, there are two cases where we can use a soft link:
> 1. Scalar dependency, where the ReadScalar UDF uses a file generated by an LOStore.
> 2. Store-load dependency, where we load a file that is generated by a store in
> the same script. This happens in the multi-store case. Currently we handle it with a regular link;
> a soft link would be better.
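
To make the soft-link idea concrete, here is a minimal sketch in Python. It is not Pig's
actual code: the Plan class, its method names (connect, create_soft_link, successors,
walker_successors) and the operator labels are all hypothetical, chosen only for illustration.
The sketch keeps regular links and soft links in separate maps, exposes only regular links
to ordinary plan code, and lets a dependency-order walk follow both kinds, so the store
that writes the scalar file is visited before the foreach that reads it.

```python
from collections import defaultdict, deque


class Plan:
    """Toy operator plan (hypothetical, not Pig's OperatorPlan).

    Regular links model data flow in the pipeline. Soft links model
    "this operator reads a file written by that operator".
    """

    def __init__(self):
        self.ops = []                       # operators in insertion order
        self._regular = defaultdict(list)   # op -> regular successors
        self._soft = defaultdict(list)      # op -> soft successors

    def _add(self, op):
        if op not in self.ops:
            self.ops.append(op)

    def connect(self, src, dst):
        """Add a regular (data-flow) link from src to dst."""
        self._add(src)
        self._add(dst)
        self._regular[src].append(dst)

    def create_soft_link(self, src, dst):
        """Add a soft (file-dependency) link from src to dst."""
        self._add(src)
        self._add(dst)
        self._soft[src].append(dst)

    def successors(self, op):
        """What the rest of the plan sees: regular links only."""
        return list(self._regular[op])

    def walker_successors(self, op):
        """What a walker sees: regular links plus soft links."""
        return self._regular[op] + self._soft[op]


def dependency_order(plan):
    """Topological order (Kahn's algorithm) over regular and soft links."""
    indegree = {op: 0 for op in plan.ops}
    for op in plan.ops:
        for succ in plan.walker_successors(op):
            indegree[succ] += 1

    ready = deque(op for op in plan.ops if indegree[op] == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for succ in plan.walker_successors(op):
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    if len(order) != len(plan.ops):
        raise ValueError("plan contains a cycle")
    return order


def build_example_plan():
    """Scalar case: one pipeline stores a scalar, another pipeline reads it."""
    plan = Plan()
    plan.connect("LOLoad(a)", "LOForEach(compute)")
    plan.connect("LOForEach(compute)", "LOStore(scalar)")
    plan.connect("LOLoad(b)", "LOForEach(use)")
    plan.connect("LOForEach(use)", "LOStore(out)")
    # No data flows between the two pipelines; the only dependency is the file.
    plan.create_soft_link("LOStore(scalar)", "LOForEach(use)")
    return plan


if __name__ == "__main__":
    print(dependency_order(build_example_plan()))
```

In this sketch, successors("LOStore(scalar)") stays empty, so code that only understands
data flow is unaffected, while dependency_order places "LOStore(scalar)" ahead of
"LOForEach(use)". The store-load case could be modelled the same way, with the soft link
running from the store to the load that reads its output.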

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

