lucene-dev mailing list archives

From Gregor Heinrich <>
Subject Re: Numerical ids for terms?
Date Tue, 12 Apr 2011 16:16:12 GMT
Thanks for the quick response. Could you be a bit more concrete than "some form"
of term<->id mapping? Do you mean subclassing SegmentReader with an appropriate
Map implementation, or is there a tested structure in the existing API that I've
overlooked? Regarding a Directory abstraction backed by a memory-mapping API: my
question is about using the Lucene API because, even if it may be perceived as
"dumb", it hides a lot of boilerplate code. Are there any efforts under way in
this direction?
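
To make concrete what kind of mapping I am asking about, here is a minimal sketch in plain Java. It does not use any Lucene API: the class name TermIdMap and its methods are hypothetical, and the assumption is that ids simply follow the sorted order of the distinct terms of one field, running from 0 to numTerms()-1.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper, not part of Lucene: bijective term <-> id mapping for one field. */
final class TermIdMap {
    private final List<String> idToTerm;         // id -> term
    private final Map<String, Integer> termToId; // term -> id

    TermIdMap(Collection<String> terms) {
        // Sort and de-duplicate so that ids follow term order (assumption of this sketch).
        idToTerm = new ArrayList<>(new TreeSet<>(terms));
        termToId = new HashMap<>();
        for (int id = 0; id < idToTerm.size(); id++) {
            termToId.put(idToTerm.get(id), id);
        }
    }

    /** Number of distinct terms; valid ids are 0 .. numTerms()-1. */
    int numTerms() {
        return idToTerm.size();
    }

    /** Returns the id of the term, or -1 if the term is unknown. */
    int idOf(String term) {
        Integer id = termToId.get(term);
        return id == null ? -1 : id;
    }

    /** Returns the term for an id; throws IndexOutOfBoundsException for an invalid id. */
    String termOf(int id) {
        return idToTerm.get(id);
    }
}
```

With one such map per field, a term-document matrix could use idOf(term) as its row index. Whether an existing Lucene structure already offers this is exactly my question.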



On 4/12/11 1:21 PM, Earwin Burrfoot wrote:
> On Tue, Apr 12, 2011 at 13:41, Gregor Heinrich<>  wrote:
>> Hi -- has there been any effort to create a numerical representation of
>> Lucene indices? That is, to use the Lucene Directory backend as a large
>> term-document matrix at index level. As this would require a bijective mapping
>> between terms (per-field, as customary in Lucene) and a numerical index
>> (integer, monotonic from 0 to numTerms()-1), I guess this requires some
>> special modifications to the Lucene core.
> Lucene index already provides term<->id mapping in some form.
>> Another interesting feature would be to use Lucene's Directory backend for
>> storage of large dense matrices, for instance for data-mining tasks from
>> within Lucene.
> Lucene's Directory is a dumb abstraction for random-access named
> write-once byte streams.
> It doesn't add /any/ value over mmap.
>> Any suggestions?
> *troll mode on* Use numpy/scipy? :)
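
For comparison, the plain mmap route for a dense matrix might look roughly like the sketch below, using only java.nio. The class MappedDenseMatrix is hypothetical, the layout is assumed to be row-major doubles, and a single mapping is limited to Integer.MAX_VALUE bytes, so a really large matrix would need several mappings. This is the kind of boilerplate I would rather have hidden behind an API.

```java
import java.io.IOException;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: a dense row-major matrix of doubles backed by a memory-mapped file. */
final class MappedDenseMatrix {
    private final int rows;
    private final int cols;
    private DoubleBuffer data;

    MappedDenseMatrix(Path file, int rows, int cols) throws IOException {
        this.rows = rows;
        this.cols = cols;
        long bytes = (long) rows * cols * Double.BYTES;
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed;
            // a READ_WRITE mapping grows the file to the requested size if needed.
            data = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes).asDoubleBuffer();
        }
    }

    double get(int row, int col) {
        return data.get(index(row, col));
    }

    void set(int row, int col, double value) {
        data.put(index(row, col), value);
    }

    // Row-major offset with explicit bounds checks.
    private int index(int row, int col) {
        if (row < 0 || row >= rows || col < 0 || col >= cols) {
            throw new IndexOutOfBoundsException("(" + row + ", " + col + ")");
        }
        return row * cols + col;
    }
}
```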
