hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: MapFile.get() has a bug?
Date Tue, 28 Nov 2006 18:34:04 GMT
Albert Chern wrote:
> Every time the size of the map file hits a multiple of the index
> interval, an index entry is written.  Therefore, it is possible that
> an index entry is added not for the first occurrence of a key, but for
> one of the later ones.  The reader will then seek to one of those
> instead of the first.
> 
> This does seem to be inconsistent with the fact that you are
> allowed to insert equal key records.

Yes, I agree that this is confusing and arguably a bug.
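
To make the failure concrete, here is a minimal sketch in plain Java.
It does not use the Hadoop classes: IndexSketch, Entry, buildIndex and
get are hypothetical names, record numbers stand in for file positions,
and the lookup (start from the last index entry whose key is not greater
than the target, then scan forward) is an assumption about how a reader
would use the index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a MapFile; not the Hadoop class.
class IndexSketch {
    // One index entry: a key and the record number it points at.
    static final class Entry {
        final String key;
        final int position;
        Entry(String key, int position) {
            this.key = key;
            this.position = position;
        }
    }

    // Writes an index entry whenever the record count is a multiple of
    // interval, regardless of whether the key starts a run of equal keys.
    static List<Entry> buildIndex(List<String> sortedKeys, int interval) {
        List<Entry> index = new ArrayList<>();
        for (int i = 0; i < sortedKeys.size(); i++) {
            if (i % interval == 0) {
                index.add(new Entry(sortedKeys.get(i), i));
            }
        }
        return index;
    }

    // Returns the record number a reader lands on for key, or -1.
    // Assumed strategy: jump to the last index entry with key <= target,
    // then scan forward for the first match.
    static int get(List<String> sortedKeys, List<Entry> index, String key) {
        int start = 0;
        for (Entry e : index) {
            if (e.key.compareTo(key) > 0) {
                break;
            }
            start = e.position;
        }
        for (int i = start; i < sortedKeys.size(); i++) {
            int c = sortedKeys.get(i).compareTo(key);
            if (c == 0) {
                return i;
            }
            if (c > 0) {
                break;
            }
        }
        return -1;
    }
}
```

With the keys a, b, b, b, c and an interval of 2, the index gets an
entry for b at record 2, so get lands on record 2 even though the first
b is record 1.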

> I suspect perhaps the developers
> meant for MapFile records to be uniquely keyed, but in
> MapFile.Writer.checkKey() they used a > where they intended a >= or
> something.

I think what actually happened was that I originally coded it to
prohibit equal keys, then at some point found an application (somewhere
in Nutch) where equal keys were useful and changed MapFile to support
them, not realizing the consequences.  Sigh.  I don't know whether Nutch
still relies on this or not.

MapFile could probably be fixed by changing the way the index is 
created, to write the location of the first instance of any run of equal 
keys.  We could also avoid recording two instances of equal keys in the 
index: for a long run of equal keys, we could wait until the key changes 
before emitting a new index entry.

Doug
