hadoop-common-user mailing list archives

From Scott Whitecross <sc...@dataxu.com>
Subject Efficiently Stream into Sequence Files?
Date Fri, 12 Mar 2010 13:22:16 GMT
Hi -

I'd like to create a job that pulls small files from a remote server (using FTP, SCP, etc.)
and stores them directly to sequence files on HDFS.  Looking at the sequence file API, I don't
see an obvious way to do this.  It looks like what I have to do is pull the remote file to
disk, then read the file into memory to place in the sequence file.  Is there a better way?

Looking at the API, am I forced to use the append method?

            FileSystem hdfs = FileSystem.get(context.getConfiguration());
            FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
            writer = SequenceFile.createWriter(context.getConfiguration(), outputStream,
                    Text.class, BytesWritable.class, null, null);

            // read in file to remotefilebytes

            writer.append(filekey, remotefilebytes);

The alternative would be to have one job pull the remote files, and a secondary job write
them into sequence files.    
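For what it's worth, a middle path is to buffer the remote stream entirely in memory and hand the resulting bytes to the writer, so the file never touches local disk. Below is a minimal sketch of that buffering step in plain Java; the helper name readFully and the 8 KB chunk size are my own choices, and the Hadoop-specific append call is shown only as a comment since it depends on the writer set up above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamToBytes {

    // Drain an InputStream fully into a byte array in memory. The result
    // can be wrapped in a BytesWritable and passed to
    // SequenceFile.Writer#append, skipping the intermediate disk file.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the remote stream; in real use this would be
        // whatever your FTP/SCP client hands back as an InputStream.
        InputStream remote = new ByteArrayInputStream("hello".getBytes());

        byte[] remotefilebytes = readFully(remote);
        System.out.println(remotefilebytes.length);

        // Hadoop side (not compiled here):
        //   writer.append(new Text(fileName), new BytesWritable(remotefilebytes));
    }
}
```

The obvious caveat is that each file must fit in the task's heap, which is presumably fine for the small files described above.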

I'm using the latest Cloudera release, which I believe is Hadoop 0.20.1

