hadoop-common-dev mailing list archives

From Johan Oskarsson <jo...@oskarsson.nu>
Subject Re: Merge sequence files
Date Tue, 15 May 2007 22:07:44 GMT
Doug Cutting wrote:
> Johan Oskarsson wrote:
>> I'm considering using the sequence file output of hadoop jobs to 
>> serve data from as it would mean I could skip the conversion from 
>> sequence file -> other file format step.
>> To do this efficiently I would need the data to be in one file.
> I think it should be more efficient to keep things in separate files. 
> If you use MapFileOutputFormat, there are methods to randomly access 
> entries from job output:
> http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/MapFileOutputFormat.html

> SequenceFileOutputFormat will also let you open all readers, but 
> there's no random access, since a SequenceFile has no index.
> http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html

> Will these suffice?
> Doug

You're probably right that the best way would be to just leave the files 
as they are. I was mostly worried about hitting the limit on the number of 
open files, but I did a quick calculation just now and I had overestimated 
how many files we would have. I think we'd run into other problems before 
open files became an issue.

I have considered using MapFiles, but the key to do lookups on would 
often be different from the key needed when calculating the data and 
when using it as input in other hadoop programs. For example, if the key 
writable is called UserResource, I might have to do lookups on just the 
user id when serving.
I was planning on doing something similar to a MapFile, but with the 
addition that I can specify which parts of the key to index on. And just 
like a MapFile, it would be read as a SequenceFile when used as input in 
other hadoop programs.
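
A rough sketch of the partial-key idea, in plain Java with no Hadoop 
classes. Everything here is an assumption for illustration: the 
UserResource fields, the PartialKeyIndex class, and the use of list 
positions where a real version would store byte offsets into the 
SequenceFile. It also assumes the user id is the leading part of the sort 
order, so all records for one user are adjacent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch: index records on one field of a composite key. */
public class PartialKeyIndex {

    /** Stand-in for the UserResource key; the field names are assumed. */
    public static final class UserResource {
        public final String userId;
        public final String resource;

        public UserResource(String userId, String resource) {
            this.userId = userId;
            this.resource = resource;
        }
    }

    // Maps a user id to the position of that user's first record.
    // A real version would store a byte offset into the data file instead.
    private final TreeMap<String, Integer> firstPosition = new TreeMap<>();
    private final List<UserResource> keys;

    /** The keys must already be sorted so that equal user ids are adjacent. */
    public PartialKeyIndex(List<UserResource> sortedKeys) {
        this.keys = sortedKeys;
        for (int i = 0; i < sortedKeys.size(); i++) {
            // Only the first occurrence of each user id is indexed.
            firstPosition.putIfAbsent(sortedKeys.get(i).userId, i);
        }
    }

    /** Seeks to the indexed position and scans forward while the user id matches. */
    public List<UserResource> lookup(String userId) {
        List<UserResource> result = new ArrayList<>();
        Integer start = firstPosition.get(userId);
        if (start == null) {
            return result; // unknown user: empty result
        }
        for (int i = start; i < keys.size() && keys.get(i).userId.equals(userId); i++) {
            result.add(keys.get(i));
        }
        return result;
    }
}
```

The data itself keeps the full key, so other jobs could still read it 
unchanged; only the index looks at the user id alone.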

Currently we just output everything as text in one big file and index 
that for serving.
It's a simple fixed-width index that we use to look up the start position 
of the data for a user id.
This is of course a big waste of disk space and bandwidth/time.
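
For illustration, a fixed-width index of this kind could look roughly 
like the sketch below. The layout is an assumption, not our actual 
format: each entry is a 16-byte pair of (numeric user id, start offset), 
entries are sorted by user id, and the FixedWidthIndex class name is made 
up. Because every entry has the same width, entry i starts at byte 
i * 16, which is what makes a binary search over the raw bytes possible.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of a fixed-width index of sorted (userId, offset) pairs. */
public class FixedWidthIndex {

    // Assumed layout: 8-byte user id followed by 8-byte start offset.
    private static final int ENTRY_SIZE = 2 * Long.BYTES;

    /** Builds the index bytes from user ids sorted ascending and their start offsets. */
    public static ByteBuffer build(long[] sortedUserIds, long[] offsets) {
        ByteBuffer buf = ByteBuffer.allocate(sortedUserIds.length * ENTRY_SIZE);
        for (int i = 0; i < sortedUserIds.length; i++) {
            buf.putLong(sortedUserIds[i]).putLong(offsets[i]);
        }
        buf.flip(); // limit now marks the end of the written entries
        return buf;
    }

    /** Binary search over the entries; returns the start offset, or -1 if the id is absent. */
    public static long lookup(ByteBuffer index, long userId) {
        int lo = 0;
        int hi = index.limit() / ENTRY_SIZE - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long id = index.getLong(mid * ENTRY_SIZE); // absolute read, position unchanged
            if (id < userId) {
                lo = mid + 1;
            } else if (id > userId) {
                hi = mid - 1;
            } else {
                return index.getLong(mid * ENTRY_SIZE + Long.BYTES);
            }
        }
        return -1;
    }
}
```

The returned offset would then be used to seek into the big data file and 
read that user's records from there.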

Thanks for taking the time to answer my questions.

