hadoop-common-user mailing list archives

From Sina Samangooei ...@ecs.soton.ac.uk>
Subject Re: Generating an Index for sequence files
Date Mon, 04 Oct 2010 09:03:52 GMT

Thanks for the quick response.

It's good that provisions are being made for the kind of problem
I'm trying to solve. However, I can't find any sort of
TFileInputFormat or MapFileInputFormat. Does this mean TFiles and
MapFiles can't be used for random access and for map-reduce tasks
at the same time?

If this is the case, TFiles and MapFiles are not suitable for my
purposes. I require the ability to perform large-scale map-reduce
operations on ALL of the files, while at the same time having the
ability to quickly access an individual file. These are two separate
use cases, but both are quite important. An option might be to
duplicate the data? Literally hold two copies, but that just doesn't
sit right.

Therefore, for now at least, I will continue with my index generation
scheme. I think I've found a workaround that involves generating the
index outside of Hadoop (i.e. not through a map-reduce task). This is
slightly slower than generating the index as part of a map-reduce
task, but once generated, the index should make file access and
various other operations much faster.
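
To make the idea concrete, here is a minimal sketch of building such an
index in a single sequential pass outside of any map-reduce job. It is
only an illustration under stated assumptions: it does not use Hadoop's
SequenceFile classes, the record layout (UTF key, int length, value
bytes) is a simplified stand-in, and the class and method names
(OffsetIndexSketch, writeRecords, buildIndex, lookup) are hypothetical.
With a real SequenceFile the recorded offset would presumably come from
SequenceFile.Reader.getPosition() instead.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: a key -> byte-offset index over a simple record file. */
class OffsetIndexSketch {

    /** Writes records as [UTF key][int length][value bytes]; a stand-in format, not SequenceFile's. */
    static void writeRecords(Path file, Map<String, byte[]> records) throws IOException {
        try (RandomAccessFile out = new RandomAccessFile(file.toFile(), "rw")) {
            out.setLength(0); // start from an empty file
            for (Map.Entry<String, byte[]> e : records.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeInt(e.getValue().length);
                out.write(e.getValue());
            }
        }
    }

    /** One sequential pass over the file, remembering where each record starts. */
    static Map<String, Long> buildIndex(Path file) throws IOException {
        Map<String, Long> index = new LinkedHashMap<>();
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            long length = in.length();
            while (in.getFilePointer() < length) {
                long start = in.getFilePointer();
                String key = in.readUTF();
                int valueLength = in.readInt();
                in.seek(in.getFilePointer() + valueLength); // skip the value without reading it
                index.put(key, start);
            }
        }
        return index;
    }

    /** Random access: seek straight to the record and read its value; null if the key is absent. */
    static byte[] lookup(Path file, Map<String, Long> index, String key) throws IOException {
        Long offset = index.get(key);
        if (offset == null) {
            return null;
        }
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            in.seek(offset);
            in.readUTF(); // skip the key stored in the record
            byte[] value = new byte[in.readInt()];
            in.readFully(value);
            return value;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("records", ".bin");
        Map<String, byte[]> records = new LinkedHashMap<>();
        records.put("a.jpg", "first".getBytes(StandardCharsets.UTF_8));
        records.put("b.jpg", "second".getBytes(StandardCharsets.UTF_8));
        writeRecords(file, records);

        Map<String, Long> index = buildIndex(file);
        byte[] value = lookup(file, index, "b.jpg");
        System.out.println(new String(value, StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```

The data file itself is left untouched, so the same file can still be
read sequentially by a bulk job while the separate index serves
single-record lookups, which avoids holding two copies of the data.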

Thanks again,

- Sina

On 2 Oct 2010, at 17:36, Owen O'Malley wrote:

> On Sat, Oct 2, 2010 at 5:25 AM, Harsh J <qwertymaniac@gmail.com>  
> wrote:
>> Maybe you should take a look at the TFile classes?
> The TFiles give you the meta information you want including row counts
> and an index that is integrated with the compression. The only
> downside is that you'll need to handle the serialization yourself,
> because TFiles only handle binary data. I'm working on a patch that
> includes OFiles, which are TFiles that include serialization. The patch
> also includes support for Writables, Avro, ProtocolBuffers, and Thrift
> in SequenceFiles, MapFiles, and OFiles. (See
> https://issues.apache.org/jira/browse/HADOOP-6685 .)
> MapFiles are SequenceFiles with an index. They are actually
> implemented as two SequenceFiles: one as the index (key and position)
> and one as the data (key and value). MapFiles don't record the number
> of rows.
> -- Owen
