lucene-dev mailing list archives

From Nicolas Lalevée <nicolas.lale...@anyware-tech.com>
Subject Re: Payloads
Date Wed, 20 Dec 2006 15:38:45 GMT
On Wednesday 20 December 2006 15:31, Grant Ingersoll wrote:
> Hi Michael,
>
> Have a look at https://issues.apache.org/jira/browse/LUCENE-662
>
> I am planning on starting on this soon (I know, I have been saying
> that for a while, but I really am.)  At any rate, another set of eyes
> would be good and I would be interested in hearing how your version
> compares/works with this patch from Nicolas.

In fact, the work I have done concerns the storing part of Lucene more than the 
indexing part. But I think that the mechanism I introduced in my patch for 
defining an "IndexFormat" in Java will be useful in defining how the payload 
should be read and written.

As for my patch, it needs to be synchronized with the current trunk. I will 
update it soon; it just needs some cleanup.

Nicolas

>
> -Grant
>
> On Dec 20, 2006, at 9:19 AM, Michael Busch wrote:
> > Hi all,
> >
> > currently it is not possible to add generic payloads to a posting
> > list. However, this feature would be useful for various use cases.
> > Some examples:
> > - XML search
> >  to index XML documents and allow structured search (e.g. XPath) it
> > is necessary to store the depths of the terms
> > - part-of-speech
> >  payloads can be used to store the part of speech of a term occurrence
> > - term boost
> >  for terms that occur e.g. in bold font a payload containing a
> > boost value can be stored
> > - ...
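
A minimal sketch of how the values from the use cases above could be packed
into byte[] payloads. The class and method names (PayloadCodecSketch,
encodeDepth, encodeBoost, encodePartOfSpeech) are made up for illustration and
are not part of Lucene or of the patch; the one-byte depth and four-byte float
layouts are assumptions too, chosen only to keep the example small.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class PayloadCodecSketch {

    // XML depth: assumed to fit in one unsigned byte (0..255).
    public static byte[] encodeDepth(int depth) {
        if (depth < 0 || depth > 255) {
            throw new IllegalArgumentException("depth out of range: " + depth);
        }
        return new byte[] { (byte) depth };
    }

    public static int decodeDepth(byte[] payload) {
        return payload[0] & 0xFF; // mask to read the byte as unsigned
    }

    // Term boost: stored as a 4-byte big-endian IEEE 754 float.
    public static byte[] encodeBoost(float boost) {
        return ByteBuffer.allocate(4).putFloat(boost).array();
    }

    public static float decodeBoost(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    // Part of speech: a short ASCII tag such as "NN" or "VB".
    public static byte[] encodePartOfSpeech(String tag) {
        return tag.getBytes(StandardCharsets.US_ASCII);
    }

    public static String decodePartOfSpeech(byte[] payload) {
        return new String(payload, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(decodeDepth(encodeDepth(3)));
        System.out.println(decodeBoost(encodeBoost(2.5f)));
        System.out.println(decodePartOfSpeech(encodePartOfSpeech("NN")));
    }
}
```

The three encodings have different lengths (1, 4 and a variable number of
bytes), which is the kind of mix a variable-length payload has to cope with.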
> >
> > The payloads feature has been requested and discussed a couple of
> > times, e.g. in
> > - http://www.gossamer-threads.com/lists/lucene/java-dev/29465
> > - http://www.gossamer-threads.com/lists/lucene/java-dev/37409
> >
> > In the latter thread I proposed a design a couple of months ago
> > that lets Lucene store variable-length payloads inline in the
> > posting list of a term. However, this design had some drawbacks:
> > the already complex Field API was extended and the payload
> > encoding was not optimal in terms of disk space.
> > Furthermore, the overall Lucene runtime performance suffered due to
> > the growth of the .prx file. In the meantime the patch LUCENE-687
> > (Lazy skipping on proximity file) was committed, which reduces the
> > number of reads and seeks on the .prx file. This minimizes the
> > performance degradation of a bigger .prx file. Also, LUCENE-695
> > (Improve BufferedIndexInput.readBytes() performance) was committed,
> > which speeds up reading mid-size chunks of bytes and is
> > beneficial for payloads that are bigger than just a few bytes.
> >
> > Some weeks ago I started working on an improved design which I
> > would like to propose now. The new design simplifies the API
> > extensions (the Field API remains unchanged) and uses less disk
> > space in most use cases. Now there are only two classes that get
> > new methods:
> > - Token.setPayload()
> >  Use this method to add arbitrary metadata to a Token in the form
> > of a byte[].
> > - TermPositions.getPayload()
> >  Use this method to retrieve the payload of a term occurrence.
> > The implementation is very flexible: the user does not have to
> > enable payloads explicitly for a field and can add payloads to all,
> > some or no Tokens. Due to the improved encoding those use cases are
> > handled efficiently in terms of disk space.
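
A rough sketch of the write/read round trip described above: a payload is set
on some tokens and later read back per term occurrence. SketchToken and
SketchTermPositions are hypothetical in-memory stand-ins, not the Lucene Token
and TermPositions classes, and positionsOf is a made-up helper; only the
method names setPayload() and getPayload() are taken from the proposal.
Returning null for an occurrence without a payload is an assumption of this
sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class PayloadApiSketch {

    /** Hypothetical stand-in for a token; NOT Lucene's Token class. */
    public static final class SketchToken {
        final String term;
        private byte[] payload; // null means "no payload for this occurrence"

        public SketchToken(String term) { this.term = term; }

        public void setPayload(byte[] payload) { this.payload = payload; }
    }

    /** Hypothetical stand-in for the positions of one term in one document. */
    public static final class SketchTermPositions {
        private final List<Integer> positions = new ArrayList<>();
        private final List<byte[]> payloads = new ArrayList<>();
        private int current = -1;

        /** Number of occurrences of the term. */
        public int freq() { return positions.size(); }

        /** Advances to the next occurrence and returns its position. */
        public int nextPosition() { return positions.get(++current); }

        /** Payload of the current occurrence, or null if none was set.
         *  Must be called after nextPosition(). */
        public byte[] getPayload() { return payloads.get(current); }
    }

    /** Collects position and payload for every occurrence of the term. */
    public static SketchTermPositions positionsOf(String term, List<SketchToken> tokens) {
        SketchTermPositions tp = new SketchTermPositions();
        for (int pos = 0; pos < tokens.size(); pos++) {
            SketchToken t = tokens.get(pos);
            if (t.term.equals(term)) {
                tp.positions.add(pos);
                tp.payloads.add(t.payload);
            }
        }
        return tp;
    }

    public static void main(String[] args) {
        List<SketchToken> tokens = new ArrayList<>();
        SketchToken bold = new SketchToken("lucene");
        bold.setPayload(new byte[] { 1 }); // e.g. a flag for "bold font"
        tokens.add(bold);
        tokens.add(new SketchToken("is"));
        tokens.add(new SketchToken("lucene")); // same term, no payload

        SketchTermPositions tp = positionsOf("lucene", tokens);
        for (int i = 0; i < tp.freq(); i++) {
            int pos = tp.nextPosition();
            byte[] p = tp.getPayload();
            System.out.println(pos + " -> " + (p == null ? "no payload" : "payload of " + p.length + " byte(s)"));
        }
    }
}
```

Only one of the two "lucene" occurrences carries a payload here, which mirrors
the "all, some or no Tokens" case.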
> >
> > Another thing I would like to point out is that this feature is
> > backwards compatible, meaning that the file format only changes if
> > the user explicitly adds payloads to the index. If no payloads are
> > used, all data structures remain unchanged.
> >
> > I'm going to open a new JIRA issue soon containing the patch and
> > details about implementation and file format changes.
> >
> > One more comment: It is a rather big patch and this is the initial
> > version, so I'm sure there will be a lot of discussions. I would
> > like to encourage people who consider this feature useful to try
> > it out and give me some feedback about possible improvements.
> >
> > Best regards,
> > - Michael
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
>
>
>

-- 
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com


