hadoop-common-user mailing list archives

From Pedro Figueiredo <p...@89clouds.com>
Subject Re: Splitting data input to Distcp
Date Fri, 04 May 2012 07:55:25 GMT
On 3 May 2012, at 23:47, Himanshu Vijay wrote:

> Pedro,
> 
> Thanks for the response. Unfortunately I am running it on an in-house cluster
> and from there I need to upload to S3.
> 

Hi,

Last night I was thinking about this... what happens if you copy

s3://region.elasticmapreduce/libs/s3distcp/1.0.1/s3distcp.jar

to your cluster and run

hadoop jar s3distcp.jar --src hdfs:///path/to/files --dest s3://bucket/path --outputCodec lzo

(or whichever codec you prefer)

?

Alternatively, you could run the following Pig or Hive jobs (using output compression):

--- pig ---
local_data = load '/path/to/files' as ( ... );
store local_data into 's3://bucket/path' using ...;
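To make that concrete, here's what it might look like for, say, tab-delimited log records; the schema, delimiter, and codec below are just illustrative assumptions:

--- pig (hypothetical example) ---
-- enable compressed output for the store
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

-- load tab-delimited records from HDFS; the schema is made up
local_data = load '/path/to/files' using PigStorage('\t')
             as (ts:chararray, url:chararray, bytes:long);

-- storing to an s3 location performs the upload
store local_data into 's3://bucket/path' using PigStorage('\t');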

--- hive ---
create external table foo (
  ...
)
[row format ... | serde]
location '/path/to/files';

create external table s3_foo (
  ...
)
[row format ... | serde]
location 's3://bucket/path';

insert overwrite table s3_foo
select * from foo;
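Filled in with a hypothetical tab-delimited schema, and with output compression switched on, that would be something like:

--- hive (hypothetical example) ---
set hive.exec.compress.output=true;

create external table foo (
  ts string,
  url string,
  bytes bigint
)
row format delimited fields terminated by '\t'
location '/path/to/files';

create external table s3_foo (
  ts string,
  url string,
  bytes bigint
)
row format delimited fields terminated by '\t'
location 's3://bucket/path';

insert overwrite table s3_foo
select * from foo;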

Obviously, an equivalent native MapReduce or Streaming job is trivial to write, too.
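For instance, a map-only Streaming job that just cats the data across might look like the following (the streaming jar path and codec are assumptions about your install):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -input /path/to/files \
  -output s3://bucket/path \
  -mapper /bin/cat \
  -numReduceTasks 0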

Cheers,

Pedro Figueiredo
Skype: pfig.89clouds
http://89clouds.com/ - Big Data Consulting




