lucene-java-user mailing list archives

From Tricia Williams <pgwil...@uwaterloo.ca>
Subject Payloads, Tokenizers, and Filters. Oh My!
Date Fri, 16 Nov 2007 23:03:08 GMT
Hi All,

    I'll explain what I'm working on, and then I'll ask my two questions.

    I'm working on the issue 
https://issues.apache.org/jira/browse/SOLR-380, a feature request to 
index a "Structured Document" (anything that can be represented as XML) 
so that hits in the result set carry more context.  This lets us query 
the index for "Canada" and report not only that the query matched a 
document titled "Some Nonsense" but also that the query term appeared on 
page 7 of chapter 1.  We can then take this one step further and 
mark up/highlight the image of that page based on our OCR data and the 
hit position. 

For example:

<book title='Some Nonsense'><chapter title='One'><page name='1'>Some 
text from page one of a book.</page><page name='7'>Some more text from 
page seven of a book. Oh and I'm from Canada.</page></chapter></book>

    I accomplished this by creating a custom Tokenizer which strips the 
XML elements and stores them as a Payload on each of the Tokens created 
from the character data in the input.  The payload is the string that 
describes the XPath at that location.  So for the token <Canada> the 
payload is 
"/book[title='Some Nonsense']/chapter[title='One']/page[name='7']"

    The other part of this work is the SolrHighlighter, which is less 
relevant to this list.  I retrieve the TermPositions for the Query's 
Terms and use them to get back the payload for each hit, then build 
output which shows hit positions grouped by the payload they are 
associated with.
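
The tokenizer described above can be sketched without any Lucene 
classes.  The following is only an illustration, not the SOLR-380 code: 
it walks the XML with the JDK's StAX reader, keeps a stack of open 
elements, and attaches the current path to every whitespace-separated 
term.  The names XPathPayloadSketch and PathToken, and the choice to 
split on whitespace, are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class XPathPayloadSketch {

    /** Hypothetical holder for a term and the path it occurred under (not a Lucene class). */
    static final class PathToken {
        final String term;
        final String path;

        PathToken(String term, String path) {
            this.term = term;
            this.path = path;
        }
    }

    static List<PathToken> tokenize(String xml) {
        List<PathToken> tokens = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // one path step per open element
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Deliver adjacent character data as one event so words are not split.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    StringBuilder step = new StringBuilder(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        step.append('[').append(reader.getAttributeLocalName(i))
                            .append("='").append(reader.getAttributeValue(i)).append("']");
                    }
                    open.addLast(step.toString());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    open.removeLast();
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    String path = "/" + String.join("/", open);
                    for (String term : reader.getText().trim().split("\\s+")) {
                        if (!term.isEmpty()) {
                            tokens.add(new PathToken(term, path));
                        }
                    }
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return tokens;
    }
}
```

In a real Tokenizer the path string would be converted to bytes and set 
as the payload of each emitted token rather than kept in a list.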

QUESTION 1:  Applying TokenFilters to my Tokenizer creates some strange 
(in my opinion) behavior.  First, the TermPositions change, and second, 
the Payload is removed.  Is this the expected behavior, or is this a 
bug?  With Payloads being an "experimental feature" I can understand if 
carrying them through filters just hasn't been implemented yet.  Has it 
been, or will it be?

In the following example I will denote a token by {pos,<term 
text>,<payload>}:

input: <class name='mammalia'>Dog, and Cat</class>

XmlPayloadTokenizer:
{1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<and>,</class[name='mammalia'][startPos='0']>},{3,<Cat>,</class[name='mammalia'][startPos='0']>}
StopFilter:
{1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<Cat>,</class[name='mammalia'][startPos='0']>}
WordDelimiterFilter:
{1,<Dog>,<>} {2,<Cat>,</class[name='mammalia'][startPos='0']>}
LowerCaseFilter:
{1,<dog>,<>} {2,<cat>,</class[name='mammalia'][startPos='0']>}
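
For comparison, the behavior I would have expected can be modeled in 
plain Java.  This is a sketch under stated assumptions, not Lucene's 
filter API: Tok is a hypothetical token holding a term, a position 
increment, and a payload string.  The stop filter adds each dropped 
token's increment to the next kept token, and the term-rewriting filter 
copies increment and payload unchanged.  It also assumes every term 
keeps at least one character once punctuation is stripped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PayloadPreservingFilters {

    /** Hypothetical token model for illustration; this is not Lucene's Token class. */
    static final class Tok {
        final String term;
        final int posInc;     // distance from the previously emitted token
        final String payload;

        Tok(String term, int posInc, String payload) {
            this.term = term;
            this.posInc = posInc;
            this.payload = payload;
        }
    }

    /** Drops stop words and adds each dropped token's increment to the next kept token. */
    static List<Tok> removeStopWords(List<Tok> in, Set<String> stopWords) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (Tok t : in) {
            if (stopWords.contains(t.term.toLowerCase(Locale.ROOT))) {
                skipped += t.posInc;
            } else {
                out.add(new Tok(t.term, t.posInc + skipped, t.payload));
                skipped = 0;
            }
        }
        return out;
    }

    /** Rewrites only the term text; increment and payload are copied unchanged. */
    static List<Tok> stripPunctuationAndLowerCase(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            String term = t.term.replaceAll("\\p{Punct}", "").toLowerCase(Locale.ROOT);
            out.add(new Tok(term, t.posInc, t.payload));
        }
        return out;
    }

    /** Converts position increments to absolute positions, starting at 1. */
    static int[] absolutePositions(List<Tok> tokens) {
        int[] positions = new int[tokens.size()];
        int pos = 0;
        for (int i = 0; i < tokens.size(); i++) {
            pos += tokens.get(i).posInc;
            positions[i] = pos;
        }
        return positions;
    }
}
```

Run on the three tokens from the example above, this model keeps the 
payload on both surviving tokens and leaves a gap where "and" was 
removed, so "Cat" stays at position 3 instead of moving to 2.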


QUESTION 2:  As I explained, I'm storing the String representing the 
XPath of the token as the Payload (well, the byte array of the String) 
of each token.  Is there a more efficient way to do this?  Am I 
misusing the Payload functionality, and will it turn around and bite me 
when I get to indexing hundreds of thousands of documents?  Perhaps I 
shouldn't be relying on the Payload functionality while it is still 
considered experimental?
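
To make the size concern concrete, here is one possible alternative, 
sketched only as an illustration and not something that has been tried 
against Lucene: keep each distinct path once in a dictionary and store 
just its integer id, variable-length encoded, as the payload.  The 
class PathDictionary and the idea of keeping the dictionary somewhere 
outside the payload (for example alongside the document) are 
assumptions of the sketch.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PathDictionary {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> paths = new ArrayList<>();

    /** Returns the id for a path, assigning the next free id on first use. */
    int idFor(String path) {
        return ids.computeIfAbsent(path, p -> {
            paths.add(p);
            return paths.size() - 1;
        });
    }

    /** Looks a path up again from its id, e.g. at highlight time. */
    String pathFor(int id) {
        return paths.get(id);
    }

    /** Encodes a non-negative int in 7-bit groups, low bits first. */
    static byte[] encodeVarInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    /** Decodes bytes produced by encodeVarInt. */
    static int decodeVarInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }
}
```

With this scheme the per-token payload is one byte for the first 128 
distinct paths and two bytes up to 16,384, at the cost of having to 
store and load the dictionary separately.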

I feel these questions are both related to Lucene proper rather than 
Solr, which is why I've posted here.  If you think solr-user is a better 
place to post my questions let me know.

Thanks for your input!
Tricia
