lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Term Based Meta Data
Date Sat, 09 Aug 2008 17:17:00 GMT
Yeah, unfortunately, TermPositions and TermPositionVector are two
distinct things, completely unrelated in terms of storage, access, etc.
I've often felt that getting at position/offset information is hard in
Lucene, which makes it harder to do things like highlighting,
co-occurrence analysis, etc.

LUCENE-1001 (which Mark Miller has revived a bit after I lost time on  
it: https://issues.apache.org/jira/browse/LUCENE-1001) is one step  
towards remedying this.  By using Span Queries and (a working  
version)  of that patch, one should be able to get at payload info  
related to the matches.  Still not perfect, but better.

Purely wondering out loud here, but it may be possible to use the
TermVectorMapper in conjunction with TermPositions to correlate the
two, but I am not sure.  The tricky part is that you somehow need to
optimize access to the TermPositions.
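
As a rough illustration of that correlation idea: term positions are the one thing both structures share, so they could act as a join key. The sketch below is plain Java with no Lucene calls. `docVector` stands in for the term-to-positions data a TermVectorMapper would receive for one document, and `payloadsByTerm` for the position-to-payload pairs one would collect by walking TermPositions for that same document. The class and method names are hypothetical, and the sketch does nothing to make the TermPositions access cheap, which is the hard part.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: join term-vector positions with payloads, keyed by position. */
final class PositionJoin {

    /** One term occurrence in a single document, with the payload found for it. */
    static final class Occurrence {
        final String term;
        final int position;
        final byte[] payload;

        Occurrence(String term, int position, byte[] payload) {
            this.term = term;
            this.position = position;
            this.payload = payload;
        }
    }

    /**
     * docVector:      term -> positions in one document (stand-in for term vector data).
     * payloadsByTerm: term -> (position -> payload) for the same document
     *                 (stand-in for what was read from the postings).
     */
    static List<Occurrence> correlate(Map<String, int[]> docVector,
                                      Map<String, Map<Integer, byte[]>> payloadsByTerm) {
        List<Occurrence> out = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : docVector.entrySet()) {
            Map<Integer, byte[]> byPosition = payloadsByTerm.get(entry.getKey());
            if (byPosition == null) {
                continue; // no payloads were collected for this term
            }
            for (int position : entry.getValue()) {
                byte[] payload = byPosition.get(position);
                if (payload != null) {
                    out.add(new Occurrence(entry.getKey(), position, payload));
                }
            }
        }
        // Return occurrences in document order.
        out.sort(Comparator.comparingInt((Occurrence o) -> o.position));
        return out;
    }

    public static void main(String[] args) {
        Map<String, int[]> docVector = new HashMap<>();
        docVector.put("lucene", new int[] {0, 7});
        docVector.put("payload", new int[] {3});

        Map<Integer, byte[]> lucenePayloads = new HashMap<>();
        lucenePayloads.put(0, new byte[] {1});
        lucenePayloads.put(7, new byte[] {2});
        Map<Integer, byte[]> payloadPayloads = new HashMap<>();
        payloadPayloads.put(3, new byte[] {3});

        Map<String, Map<Integer, byte[]>> payloadsByTerm = new HashMap<>();
        payloadsByTerm.put("lucene", lucenePayloads);
        payloadsByTerm.put("payload", payloadPayloads);

        for (Occurrence o : correlate(docVector, payloadsByTerm)) {
            System.out.println(o.position + " " + o.term + " " + o.payload.length + " byte(s)");
        }
    }
}
```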

Another thought is to add payload storage to TermVectors.  Now that we
have offsets and positions, it's conceivable to also store payloads in
a document-centric way.  The tradeoff is that more storage space is
required.  Still, if it's what you need, then it's what you need.
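
To put a rough number on that tradeoff for the OCR case in this thread, here is a minimal sketch of packing one bounding box into an opaque byte[] payload. It assumes four unsigned 16-bit values (x, y, width, height) are enough for the pixel co-ordinates; that layout, and the `OcrBoxPayload` name, are my assumptions and not anything Lucene defines. It uses only java.nio, no Lucene classes.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical helper: packs an OCR bounding box into an opaque byte[] payload. */
final class OcrBoxPayload {

    /** Four unsigned 16-bit fields: x, y, width, height. */
    static final int BYTES = 4 * Short.BYTES;

    static byte[] encode(int x, int y, int width, int height) {
        for (int value : new int[] {x, y, width, height}) {
            if (value < 0 || value > 0xFFFF) {
                throw new IllegalArgumentException("value out of 16-bit range: " + value);
            }
        }
        return ByteBuffer.allocate(BYTES)
                .putShort((short) x)
                .putShort((short) y)
                .putShort((short) width)
                .putShort((short) height)
                .array();
    }

    /** Returns {x, y, width, height}. */
    static int[] decode(byte[] payload) {
        if (payload.length != BYTES) {
            throw new IllegalArgumentException("expected " + BYTES + " bytes, got " + payload.length);
        }
        ByteBuffer buffer = ByteBuffer.wrap(payload);
        int[] box = new int[4];
        for (int i = 0; i < box.length; i++) {
            box[i] = buffer.getShort() & 0xFFFF; // mask to read the short as unsigned
        }
        return box;
    }

    public static void main(String[] args) {
        byte[] payload = encode(120, 340, 85, 22);
        System.out.println(payload.length + " bytes -> " + Arrays.toString(decode(payload)));
    }
}
```

With this layout each term occurrence costs 8 bytes of payload, and storing the same bytes again in a document-centric structure would roughly double that cost.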

-Grant

On Aug 8, 2008, at 10:39 PM, Tricia Williams wrote:

> Hi,
>
>   Following the history of Payloads from its beginnings
> (https://issues.apache.org/jira/browse/LUCENE-755,
> https://issues.apache.org/jira/browse/LUCENE-761,
> https://issues.apache.org/jira/browse/LUCENE-834,
> http://wiki.apache.org/lucene-java/Payload_Planning), it looks like
> TermPositionVector was never considered as part of the Payload
> functionality.  I think this is because of the underlying index file
> structure?  I don't see any way to get at a Payload other than
> through a TermPositions object.  I don't think there is a way to
> translate code which uses TermPositions to using TermPositionVector
> with regards to payloads, but I welcome someone to show me how
> they could.
>
>   It looks like this conversation has happened at least twice
> before:
> http://www.nabble.com/Best-Practices-for-getting-Strings-from-a-position-range-td11608273.html#a11609245
> and
> http://www.nabble.com/Best-way-to-get-payloads-td17399361.html#a17399361
> Unfortunately, it doesn't look like anything came of it.
>
>   Maybe there is some other workaround.  What are you trying to
> accomplish "historically" with TermPositionVector instead of
> TermPositions?
>
> Tricia
>
> Martin Owens wrote:
>> Dear Lucene Users and Tricia Williams,
>>
>> The way we're operating our Lucene index is one where we index all
>> the terms but do not store the text. From your SOLR-380 patch example,
>> Tricia, I was able to get a very good idea of how to set things up.
>> Historically I have used TermPositionVector instead of TermPositions
>> because that data is available without storing the text in the index.
>>
>> Is it possible to translate code which uses TermPositions to using
>> TermPositionVector with regards to payloads?
>>
>> Best Regards, Martin Owens
>>
>> On Tue, 2008-08-05 at 11:14 -0600, Tricia Williams wrote:
>>
>>> Hi Martin,
>>>
>>>    Take a look at what I've done with SOLR-380
>>> (https://issues.apache.org/jira/browse/SOLR-380). It might solve
>>> your problem, or at least give you a good starting point.
>>>
>>> Tricia
>>>
>>> Michael McCandless wrote:
>>>
>>>> I think you could use payloads (= arbitrary/opaque byte[]) for this?
>>>>
>>>> You can attach a payload to each term occurrence during  
>>>> tokenization (indexing), and then retrieve the payload during  
>>>> searching.
>>>>
>>>> Mike
>>>>
>>>> Martin Owens wrote:
>>>>
>>>>
>>>>> Hello Users,
>>>>>
>>>>> I'm working on a project which attempts to store data that comes
>>>>> from an OCR process which describes the pixel co-ordinates of each
>>>>> term in the document. It's used for hit highlighting.
>>>>>
>>>>> What I would like to do is store this co-ordinate information
>>>>> alongside the terms. I know there is existing metadata stored per
>>>>> term (word offset and char offsets). The problem is that if I
>>>>> create a separate index and try to use the word offset or char
>>>>> offsets, not only is it slower, but it doesn't match because of the
>>>>> way the terms are processed both inside Lucene and the OCR program.
>>>>>
>>>>> So, is it possible to store the data alongside the terms in Lucene
>>>>> and then recall it when doing certain searches? And how much
>>>>> custom code needs to be written to do it?
>>>>>
>>>>> Best Regards, Martin Owens
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







