hadoop-common-user mailing list archives

From Patrick Angeles <patr...@cloudera.com>
Subject Re: Efficiently Stream into Sequence Files?
Date Mon, 15 Mar 2010 17:36:55 GMT

The code you have below should work, provided that 'outputPath' points
to an HDFS file. The trick is to get FTP/SCP access to the remote files
using a Java client and read the contents into a byte buffer. You can then
wrap that buffer in your BytesWritable and call writer.append().
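A minimal sketch of the buffering step described above. The FTP/SCP client itself is omitted; any client that exposes the remote file as an InputStream will do. The class and method names here are illustrative, not from any particular library:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class RemoteToBuffer {

    // Drain an InputStream (e.g. from an FTP/SCP client) fully into a byte[].
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    // The resulting bytes never touch local disk; they go straight into
    // the sequence file, roughly:
    //
    //   byte[] remoteFileBytes = readFully(remoteStream);
    //   writer.append(new Text(remoteFileName),
    //                 new BytesWritable(remoteFileBytes));
}
```

This keeps each small file entirely in memory between the fetch and the append, which is fine for small files but worth reconsidering for large ones.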

On Fri, Mar 12, 2010 at 9:22 AM, Scott Whitecross <scott@dataxu.com> wrote:

> Hi -
> I'd like to create a job that pulls small files from a remote server (using
> FTP, SCP, etc.) and stores them directly to sequence files on HDFS.  Looking
> at the sequence file API, I don't see an obvious way to do this.  It looks
> like what I have to do is pull the remote file to disk, then read the file
> into memory to place in the sequence file.  Is there a better way?
> Looking at the API, am I forced to use the append method?
>            FileSystem hdfs = FileSystem.get(context.getConfiguration());
>            FSDataOutputStream outputStream =
>                    hdfs.create(new Path(outputPath));
>            writer = SequenceFile.createWriter(context.getConfiguration(),
>                    outputStream, Text.class, BytesWritable.class, null, null);
>            // read in file to remotefilebytes
>            writer.append(filekey, remotefilebytes);
> The alternative would be to have one job pull the remote files, and a
> secondary job write them into sequence files.
> I'm using the latest Cloudera release, which I believe is Hadoop 0.20.1
> Thanks.
