hadoop-pig-dev mailing list archives

From Gang Luo <lgpub...@yahoo.com.cn>
Subject compile load to mr plan
Date Fri, 25 Jun 2010 13:32:15 GMT
Each load operator in a script starts its own stream, so multiple loads start the same
number of streams. Some of these streams are merged later (e.g. by a join) and some are not.
How do we know which MR operator each load should be placed in? For example, suppose we have
a script like this:
a = load 'file1';
b = load 'file2';

If we join a and b somewhere between the loads and the dump, the two loads (a and b) should
be placed in the same MR operator. If we sort a and b independently, the two loads should be
placed in separate MR operators. How do we identify whether these two streams are correlated?
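
To make the two cases concrete, here is a minimal Pig Latin sketch of each. The schemas,
the field name k, and the aliases c, a_sorted and b_sorted are assumptions made up for
illustration; only the two loads come from the script above.

```pig
-- Case 1: the two streams are merged by a join, so both loads
-- would feed the same MR operator.
-- (Schemas and the join key k are hypothetical.)
a = load 'file1' as (k:chararray, v:int);
b = load 'file2' as (k:chararray, w:int);
c = join a by k, b by k;
dump c;
```

```pig
-- Case 2: the two streams never meet, so each load would sit
-- in its own MR operator.
-- (Schemas and the sort key k are hypothetical.)
a = load 'file1' as (k:chararray, v:int);
b = load 'file2' as (k:chararray, w:int);
a_sorted = order a by k;
b_sorted = order b by k;
dump a_sorted;
dump b_sorted;
```

In case 1 the two loads share a downstream operator (the join); in case 2 they share none.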

A further question: can we specify a directory so that load reads all the files in that
directory? Each reducer of an MR job produces a single output file, so when the subsequent
MR job needs to read all of these files, what do we do?


