hadoop-common-user mailing list archives

From Zak Stone <zst...@gmail.com>
Subject Re: Efficiently Stream into Sequence Files?
Date Mon, 15 Mar 2010 15:15:42 GMT
Well, do consider buffering a batch of files in however much memory you
have for each map task, and then waiting for Hadoop to stream some of
them into a SequenceFile before you download more. Your tasks can work
with batches of files that are small enough to fit in memory but large
enough to hide download latency and to keep Hadoop writing constantly.

Depending on your local filesystem and how many small files you have,
it could be very inefficient to write small files to the local disk
and then open them all again later.
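
For what it's worth, here is a minimal sketch of the kind of map task I
have in mind. It assumes the job's input is a text file listing one
remote URL per line, that the job uses SequenceFileOutputFormat with
zero reduces, and that fetch() stands in for whatever FTP/SCP client
you end up using -- all of that is my assumption, not something from
your code, so adjust as needed:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;

    import org.apache.commons.io.IOUtils;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Each map() call pulls one remote file straight into memory and
    // emits it as a (url, bytes) pair.  With SequenceFileOutputFormat
    // and no reduces, the framework streams the pairs into SequenceFiles
    // on HDFS while the task keeps downloading, so nothing ever touches
    // the local disk.
    public class FetchToSequenceFileMapper
        extends Mapper<LongWritable, Text, Text, BytesWritable> {

      @Override
      protected void map(LongWritable offset, Text urlLine, Context context)
          throws IOException, InterruptedException {
        String url = urlLine.toString().trim();
        if (url.isEmpty()) {
          return;
        }
        byte[] data = fetch(url);  // hypothetical helper, defined below
        context.write(new Text(url), new BytesWritable(data));
      }

      // Stand-in for a real FTP/SCP client; java.net.URL handles
      // http:// and ftp:// URLs well enough for a sketch.
      private byte[] fetch(String url) throws IOException {
        InputStream in = new URL(url).openStream();
        try {
          return IOUtils.toByteArray(in);
        } finally {
          in.close();
        }
      }
    }

Set the number of reduce tasks to zero and point the job's output path
at HDFS, and each map task then writes its own SequenceFile as it goes.
If individual files really can outgrow a task's heap, you would have to
chunk them or spill to disk, but for genuinely small files this keeps
everything in memory.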

Zak


On Mon, Mar 15, 2010 at 9:44 AM, Scott Whitecross <scott@dataxu.com> wrote:
> I could; however, the "small" files could grow beyond what I want to
> allocate memory for.  I could drop the files to disk and load them in
> the job as well, but that seems less efficient than just saving the
> files and processing them with a secondary job to create sequence
> files.
>
> Thanks.
>
> On Mar 12, 2010, at 2:20 PM, Zak Stone wrote:
>
>> Why not write a Hadoop map task that fetches the remote files into
>> memory and then emits them as key-value pairs into a SequenceFile?
>>
>> Zak
>>
>>
>> On Fri, Mar 12, 2010 at 8:22 AM, Scott Whitecross <scott@dataxu.com> wrote:
>>> Hi -
>>>
>>> I'd like to create a job that pulls small files from a remote server
>>> (using FTP, SCP, etc.) and stores them directly into sequence files
>>> on HDFS.  Looking at the sequence file API, I don't see an obvious
>>> way to do this.  It looks like what I would have to do is pull the
>>> remote file to disk, then read the file into memory to place it in
>>> the sequence file.  Is there a better way?
>>>
>>> Looking at the API, am I forced to use the append method?
>>>
>>>            FileSystem hdfs = FileSystem.get(context.getConfiguration());
>>>            FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
>>>            SequenceFile.Writer writer = SequenceFile.createWriter(context.getConfiguration(),
>>>                outputStream, Text.class, BytesWritable.class, null, null);
>>>
>>>            // read the remote file into remotefilebytes
>>>
>>>            writer.append(filekey, remotefilebytes);
>>>
>>>
>>> The alternative would be to have one job pull the remote files, and a
>>> secondary job write them into sequence files.
>>>
>>> I'm using the latest Cloudera release, which I believe is Hadoop 0.20.1.
>>>
>>> Thanks.
