pig-dev mailing list archives

From "Cheolsoo Park (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)
Date Thu, 02 Jan 2014 17:45:55 GMT

    [ https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860392#comment-13860392 ]

Cheolsoo Park commented on PIG-3642:

[~azaroth], thank you for raising this concern, but I still think we should commit this patch
for the following reasons:

# Fetch optimization happens after the physical plan is fully built. If the plan is fetchable
(i.e. meets all the conditions Lorand listed in the description), Pig will launch the job via
FetchLauncher instead of MapReduceLauncher. Given this code path, I think the possibility
of introducing a weird optimization bug is minimal. In addition, the optimization is only
applicable to fairly small queries.
# There are indeed changes to some backend operators such as POStream. This is because the
logic about when to pull data from the pipeline is different in some cases. But these changes
are fairly minimal too.
# IMO, the benefit of this optimization is big. Users constantly ask me about this
feature. It is true that it won't improve the performance of production ETL jobs, but it will shorten
development iterations. In addition, launching a full MR job for a simple load/dump query definitely
makes a bad impression on new users.
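
To make the code path in the first point concrete, here is a rough sketch of the decision in Python. It is not taken from the patch. PlanSummary, is_fetchable, choose_launcher and the operator labels are all hypothetical names made up for illustration; only FetchLauncher, MapReduceLauncher and opt.fetch come from this issue. Treating LOAD as fetchable is an assumption, and the "UNION only if no split is generated" and "FOREACH with expression operators" refinements are left out.

```python
from dataclasses import dataclass

# Operators the issue description lists as fetch-compatible. These are
# illustrative labels, not Pig's internal class names. LOAD is an assumption:
# the description does not list it, but every script has to read its input.
FETCHABLE_OPS = {"LOAD", "LIMIT", "FILTER", "UNION", "STREAM", "FOREACH"}


@dataclass
class PlanSummary:
    """Hypothetical summary of a fully built physical plan."""
    operators: list
    has_scalar_aliases: bool = False
    uses_sample_loader: bool = False
    leaf_count: int = 1
    ends_with_dump: bool = True   # False means the script ends with STORE
    fetch_enabled: bool = True    # models `set opt.fetch true/false;`


def is_fetchable(plan):
    """Return True only if every condition from the description holds."""
    return (
        plan.fetch_enabled
        and all(op in FETCHABLE_OPS for op in plan.operators)
        and not plan.has_scalar_aliases
        and not plan.uses_sample_loader
        and plan.leaf_count == 1
        and plan.ends_with_dump
    )


def choose_launcher(plan):
    """Pick the launcher name once the plan has been built."""
    return "FetchLauncher" if is_fetchable(plan) else "MapReduceLauncher"
```

In this sketch a LOAD/FILTER/LIMIT script ending in DUMP goes to FetchLauncher, while anything with an unlisted operator, a STORE, or fetch switched off falls back to MapReduceLauncher. The check runs on a finished plan and only chooses between two launchers, which is why I expect the risk of plan-level bugs to be small.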

> Direct HDFS access for small jobs (fetch) 
> ------------------------------------------
>                 Key: PIG-3642
>                 URL: https://issues.apache.org/jira/browse/PIG-3642
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Lorand Bendig
>            Assignee: Lorand Bendig
>             Fix For: 0.13.0
>         Attachments: PIG-3642.patch
> With this patch I'd like to add the possibility to directly read data from HDFS instead
of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch).
This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off
when the following holds for a script:
> * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested)
FOREACH with expression operators, custom UDFs, etc.
> * no scalar aliases
> * no SampleLoader
> * single leaf job
> * DUMP (no STORE)
> The feature is enabled by default and can be toggled with:
> * -N or -no_fetch 
> * set opt.fetch true/false; 
> There's no STORE support because I wanted to make it explicit that this "optimization"
is for launching small/simple scripts during development, rather than querying and filtering
a large number of rows on the client machine. However, a threshold could be given on the input
size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's
'{{hive.fetch.task.conversion.threshold}}' does. (through Pig's LoadMetadata#getStatistic
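
The input-size threshold suggested at the end of the description could look roughly like the Python sketch below. This is an illustration only, not part of the patch: prefer_fetch and both parameter names are hypothetical, and falling back to MR when no size estimate is available is an assumption.

```python
def prefer_fetch(estimated_input_bytes, threshold_bytes):
    """Decide whether an otherwise fetchable script should use fetch.

    estimated_input_bytes: estimated input size, or None if no estimate
        is available (hypothetical convention).
    threshold_bytes: largest input size for which fetch is preferred.
    """
    # Assumption: with no estimate, stay on the MR path rather than risk
    # pulling a large input onto the client machine.
    if estimated_input_bytes is None:
        return False
    return estimated_input_bytes <= threshold_bytes
```

For example, with a 1 MB threshold, prefer_fetch(10 * 1024, 1024 * 1024) is True and prefer_fetch(None, 1024 * 1024) is False.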

This message was sent by Atlassian JIRA
