lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
Date Fri, 30 Nov 2012 15:48:42 GMT
"I will probably have to implement my own datastructure and 
parser/tokenizer/stemmer"

Why? I mean, I think the point of the Lucene architecture is that the codec 
level is completely independent of the analysis level.

The end result of analysis is a value to be stored from the application 
perspective, a "logical value" so to speak, but NOT the bit sequence, the 
"physical value" so to speak, that the codec will actually store.

So, go ahead and have your own codec that does whatever it wants with 
values, but the input for storage and query should be the output of a 
standard Lucene analyzer.

-- Jack Krupansky

-----Original Message----- 
From: Johannes.Lichtenberger
Sent: Friday, November 30, 2012 10:15 AM
To: java-user@lucene.apache.org
Cc: Michael McCandless
Subject: Re: What is "flexible indexing" in Lucene 4.0 if it's not the 
ability to make new postings codecs?

On 11/28/2012 01:11 AM, Michael McCandless wrote:
> Flexible indexing is the ability to make your own codec, which
> controls the reading and writing of all index parts (postings, stored
> fields, term vectors, deleted docs, etc.).
>
> So for example if you want to store some postings as a bit set instead
> of the block format that's the default coming up in 4.1, that's easy
> to do.
>
> But what is less easy (as I described below) is changing what is
> actually stored in the postings, eg adding a new per-position
> attribute.
>
> The original goal was to allow arbitrary attributes beyond the known
> docs/freqs/positions/offsets that Lucene supports today, so that you
> could easily make new application-dependent per-term, per-doc,
> per-position things, pull them from the analyzer, save them to the
> index, and access them from an IndexReader / query, but while some
> APIs do expose this, it's not very well explored yet (eg, you'd have
> to make a custom indexing chain to get the attributes "through"
> IndexWriter down to your codec).  It would be great to make progress
> making this easier, so ideas are very welcome :)

Regarding my questin/thread, is it also possible to change the backend
system? I'd like to use Lucene for a versioned DBMS, thus I would need
the ability to serialize/deserialize the bytes in my backend whereas
keys/values are stored in pages (for instance in an upcoming B+-tree, or
in  simple "unordered" pages via a record-ID/record mapping). But as no
one suggested anything as of now and I've also asked a year ago or so,
after implementing the B+-tree I will probably have to implement my own
datastructure and parser/tokenizer/stemmer... :-(

kind regards,
Johannes


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message