lucene-java-user mailing list archives

From Michael Sokolov <soko...@ifactory.com>
Subject Re: More about storing NLP-type stuff in the index
Date Fri, 04 Jan 2013 01:58:26 GMT
On 1/3/2013 6:16 PM, Wu, Stephen T., Ph.D. wrote:
> I think we've been saying that if we put something in a Payload, it will be
> indexed.  From what I understand of the indexing format, that means that
> what you put in the Payload will be stored in the Lucene index... But it
> won't *itself* be indexed & optimized for search.
>
> That's good, but can we build inverted indices on the contents of the
> Payloads (or the Attributes) as well?
>   Ex1: Say I put semantic role labels like ARG0 into my index. Say my search
> is looking for all instances of ARG0.
>   Ex2: Say I add payloads to terms indicating that they're named entities
> belonging to a semantic group.  Then say my query looks for all instances of
> the "Medications" semantic group.
>
> It's almost like just putting these things in different fields, with the
> exception that the things in different fields need to be linked so you know
> what the original text was.  Maybe the linking can be done via Payloads
> (offsets in the original text)?  If I want to store multiple things at the
> same startOffset then I just use something like SynonymFilter?
>
I've been working on a different but (in a way) related problem:
indexing text in XML documents.  In that case, we want to associate the
names of enclosing elements with each term, so that it's possible to
search for (say) "ermine" in the context /doc/title as distinct from
"ermine" in the context of //paragraph.  What I've done doesn't use
payloads.  I index two fields that are relevant to this: a full-text
field, which is just the usual per-document text index, and an
element-text field, which indexes each term as a concatenation of the
element name and the term value, so title:ermine, doc:ermine, and
paragraph:ermine would be typical terms.  I index all of the enclosing
element names for each word at the same position (as SynonymFilter
does).  This relies on a magic separator character (":") that isn't
allowed to appear in any token, which is too bad, but not terribly
restrictive.
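
To make the element-text idea concrete, here is a minimal sketch in plain
Java.  It deliberately does not use the Lucene API: the class
ElementTextTerms, its Token type and the expand method are hypothetical
names invented for illustration, and a position increment of 0 stands in
for what a real TokenFilter would set through its position-increment
attribute.  It assumes every word has at least one enclosing element.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these names are Lucene API. */
final class ElementTextTerms {
    /** Separator that must never occur inside a word token. */
    static final char SEP = ':';

    /** A term plus its position increment (0 = same position as the previous token). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Emits one "element:word" term per enclosing element. The first term
     * advances the position by 1; the rest are stacked on the same position,
     * the way injected synonyms are.
     */
    static List<Token> expand(String word, List<String> enclosingElements) {
        if (word.indexOf(SEP) >= 0) {
            throw new IllegalArgumentException("word must not contain '" + SEP + "'");
        }
        List<Token> out = new ArrayList<>();
        for (String element : enclosingElements) {
            int posInc = out.isEmpty() ? 1 : 0;
            out.add(new Token(element + SEP + word, posInc));
        }
        return out;
    }

    public static void main(String[] args) {
        // "ermine" occurring inside /doc/title
        for (Token t : expand("ermine", List.of("doc", "title"))) {
            System.out.println(t.term + " (posInc=" + t.positionIncrement + ")");
        }
    }
}
```

Because all of the prefixed terms for one word share a position, a phrase
or span query over terms with the same prefix should still see adjacent
words as adjacent.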

Something like this might work for you.  The prefixing also has the nice
feature that when you enumerate terms, they are ordered first by prefix.
Of course, you could flip the order (word first, then prefix) if it were
more interesting to list all "contexts" for a word rather than all words
in a context (or with some POS tag).
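
The ordering point can be sketched with a sorted set standing in for the
term dictionary.  This is plain Java rather than Lucene's term
enumeration API; PrefixedTermOrder, termsInContext and flip are
hypothetical names.  It assumes ASCII terms, for which Java's String
ordering and a byte-wise term ordering agree.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical illustration only; a TreeSet stands in for the sorted term dictionary. */
final class PrefixedTermOrder {
    /**
     * All terms for one element, found as a single contiguous range.
     * ';' is the character immediately after ':' in ASCII, so
     * ["element:", "element;") covers exactly the terms with that prefix.
     */
    static SortedSet<String> termsInContext(SortedSet<String> terms, String element) {
        return terms.subSet(element + ":", element + ";");
    }

    /** Turns "element:word" into "word:element" so that sorting groups by word. */
    static String flip(String term) {
        int i = term.indexOf(':');
        return term.substring(i + 1) + ":" + term.substring(0, i);
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(List.of(
            "title:ermine", "doc:ermine", "paragraph:ermine",
            "paragraph:weasel", "doc:stoat"));

        // Words occurring inside <paragraph>, read off as one range scan.
        System.out.println(termsInContext(terms, "paragraph"));

        // Flipped order: all contexts for a given word become adjacent instead.
        SortedSet<String> flipped = new TreeSet<>();
        for (String t : terms) {
            flipped.add(flip(t));
        }
        System.out.println(flipped);
    }
}
```

Which order to index depends on which enumeration matters more; nothing
prevents indexing both forms in separate fields if both are needed.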

-Mike
