lucene-general mailing list archives

From Mike Korcynski <Mike.Korcyn...@tufts.edu>
Subject Re: exact match on a stored / tokenized field.
Date Thu, 14 May 2009 02:22:38 GMT
Ted,

Thanks. There are some limitations in how I can index documents that would,
oddly enough, make Approach 1 difficult: I'm working against a generic
abstraction layer written on top of Lucene.  Approach 2 sounds like it
may work, though I still don't fully understand why I can get results
for the current chunks and yet they fail to match an exact string.
Regardless, thanks for the help; I will work on
implementing Approach 2 for my case.

-Mike



Ted Dunning wrote:
> I would rearrange your data a bit.  Approach 1 fits what you say more
> precisely, but approach 2 is probably a better solution for normal
> applications.
>
> Approach 1:
>
> First, I would make one document per page to hold the text for the page.  It
> should also have meta-data that says that the document type is a page, and
> what book it came from.  It might be helpful to replicate all of the book
> information onto each page in the book, depending on what kind of combined
> searches you want to do.
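>
> In code, the per-page indexing could look something like this untested
> sketch (written against a Lucene 2.x-era API; the field names "book_id",
> "author", "page", and "text" are just placeholders):
>
>   import org.apache.lucene.document.Document;
>   import org.apache.lucene.document.Field;
>   import org.apache.lucene.index.IndexWriter;
>
>   void indexPage(IndexWriter writer, String bookId, String author,
>                  int pageNum, String pageText) throws Exception {
>     Document doc = new Document();
>     // mark the document type so page and book docs can share one index
>     doc.add(new Field("type", "page", Field.Store.YES, Field.Index.NOT_ANALYZED));
>     // replicate book meta-data onto every page for combined searches
>     doc.add(new Field("book_id", bookId, Field.Store.YES, Field.Index.NOT_ANALYZED));
>     doc.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
>     doc.add(new Field("page", Integer.toString(pageNum), Field.Store.YES, Field.Index.NOT_ANALYZED));
>     // the page text itself, tokenized so phrase queries work
>     doc.add(new Field("text", pageText, Field.Store.YES, Field.Index.ANALYZED));
>     writer.addDocument(doc);
>   }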
>
> Then I would probably make a single document for each book as well, but
> without the text.  This would be useful for book level searches that do not
> mention text at all.
>
> Searching for books that have a phrase on a page would consist of searching
> for pages and grouping results by book.  If you denormalized your data by
> putting book meta-data like author on each page, then searching for books by
> an author that mention a phrase would be easy as well.  This approach would
> not find phrases that span a page boundary.
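>
> The page search plus grouping might look roughly like this (again an
> untested sketch; note that StandardAnalyzer lower-cases at index time,
> hence the lower-cased query terms):
>
>   import java.util.ArrayList;
>   import java.util.LinkedHashMap;
>   import java.util.List;
>   import java.util.Map;
>   import org.apache.lucene.document.Document;
>   import org.apache.lucene.index.Term;
>   import org.apache.lucene.search.IndexSearcher;
>   import org.apache.lucene.search.PhraseQuery;
>   import org.apache.lucene.search.ScoreDoc;
>   import org.apache.lucene.search.TopDocs;
>
>   Map<String, List<String>> pagesByBook(IndexSearcher searcher) throws Exception {
>     PhraseQuery pq = new PhraseQuery();   // default slop 0 == exact phrase
>     pq.add(new Term("text", "latin"));
>     pq.add(new Term("text", "school"));
>     TopDocs hits = searcher.search(pq, null, 1000);
>     // group the matching pages under the book they came from
>     Map<String, List<String>> byBook = new LinkedHashMap<String, List<String>>();
>     for (ScoreDoc sd : hits.scoreDocs) {
>       Document d = searcher.doc(sd.doc);
>       if (!byBook.containsKey(d.get("book_id")))
>         byBook.put(d.get("book_id"), new ArrayList<String>());
>       byBook.get(d.get("book_id")).add(d.get("page"));
>     }
>     return byBook;
>   }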
>
> Approach 2:
>
> Put all of the text for an entire book into a document and insert markers
> into the text to indicate page numbers.  This could be done by special tokens
> that take up no space or could be done using the payload capability.  You
> could also store the word offsets of page boundaries in a side field.  When
> you find books that match, you would have to adapt normal highlighting
> software to convert references into page numbers.  Using the page offset
> data, you could extract just the text for a single page pretty easily for
> display purposes.
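>
> The side-field variant could be as simple as this sketch (the
> "page_offsets" field name is made up, and the whitespace split is a
> crude stand-in for counting tokens with the real analyzer):
>
>   import java.util.List;
>   import org.apache.lucene.document.Document;
>   import org.apache.lucene.document.Field;
>   import org.apache.lucene.index.IndexWriter;
>
>   void indexBook(IndexWriter writer, String bookId, List<String> pages) throws Exception {
>     Document doc = new Document();
>     doc.add(new Field("book_id", bookId, Field.Store.YES, Field.Index.NOT_ANALYZED));
>     StringBuilder text = new StringBuilder();
>     StringBuilder offsets = new StringBuilder();
>     int words = 0;
>     for (String page : pages) {
>       offsets.append(words).append(' ');   // word offset where this page starts
>       text.append(page).append('\n');
>       words += page.split("\\s+").length;  // crude token count
>     }
>     // whole-book text is indexed but need not be stored
>     doc.add(new Field("text", text.toString(), Field.Store.NO, Field.Index.ANALYZED));
>     // page-boundary offsets are stored but not indexed
>     doc.add(new Field("page_offsets", offsets.toString().trim(), Field.Store.YES, Field.Index.NO));
>     writer.addDocument(doc);
>   }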
>
> Searching for books that have a phrase on a page would consist of searching
> for phrase hits in the books and filtering out bogus hits that have the
> phrase split across a page boundary.  If you don't care about page breaks in phrases,
> then you wouldn't have to do that.
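>
> One way to do that filtering, sketched with span queries (the pageStarts
> array would be parsed out of the stored "page_offsets" field above):
>
>   import org.apache.lucene.index.IndexReader;
>   import org.apache.lucene.index.Term;
>   import org.apache.lucene.search.spans.SpanNearQuery;
>   import org.apache.lucene.search.spans.SpanQuery;
>   import org.apache.lucene.search.spans.SpanTermQuery;
>   import org.apache.lucene.search.spans.Spans;
>
>   // does this book contain the phrase entirely within one page?
>   boolean matchesWithinPage(IndexReader reader, int docId, int[] pageStarts) throws Exception {
>     SpanQuery phrase = new SpanNearQuery(new SpanQuery[] {
>         new SpanTermQuery(new Term("text", "latin")),
>         new SpanTermQuery(new Term("text", "school"))
>       }, 0, true);                         // slop 0, in order == exact phrase
>     Spans spans = phrase.getSpans(reader);
>     while (spans.next()) {
>       if (spans.doc() != docId) continue;
>       // a hit is bogus if a page boundary falls inside [start, end)
>       boolean crossesBoundary = false;
>       for (int boundary : pageStarts) {
>         if (boundary > spans.start() && boundary < spans.end()) {
>           crossesBoundary = true;
>           break;
>         }
>       }
>       if (!crossesBoundary) return true;
>     }
>     return false;
>   }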
>
> Approach 3:
>
> Use approach 2, but use a side file containing page texts in an accessible
> form.  Searching would go against the book-level index; display would go
> against the page-level data store.
>
>
> Does this help?
>
> On Wed, May 13, 2009 at 6:11 PM, Mike Korcynski <Mike.Korcynski@tufts.edu> wrote:
>
>   
>> These fields are part of a single document; for this purpose, consider a
>> chunk to be a page in a book.  I have many books, but they all have different
>> numbers of pages.  Say, though, that for this purpose no book has more than 500
>> pages.  They also have a bunch of standard metadata fields.  That is how the
>> original documents have been deconstructed.
>>
>> I want to search the books for the exact term "Latin School" and return the
>> pages in those books that contain the string.  Searching generally works
>> fine; for instance, I can ask for all books where the author field is
>> Mark Twain and the text contains Blah, and that will work.  But when I try to
>> do a quoted string for an exact match, or for that matter a proximity match,
>> it doesn't work: it returns pages containing Latin OR School instead of
>> returning the exact match.
>>
>> -Mike
>>
>>
>>
>> Ted Dunning wrote:
>>
>>     
>>> It seems to me that you have defined your fields a bit oddly.
>>>
>>> Fields are normally part of a single document and are there to facilitate
>>> searching on a part of a document such as a title.  In some cases, fields
>>> are used to store different versions of a part of a document so that you can
>>> recover the exact original text but still index a transformed version of
>>> the text, as with stemming.
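>>>
>>> Concretely, that dual-version trick is usually just adding the same text
>>> twice under two field names -- an untested sketch, names illustrative:
>>>
>>>   import org.apache.lucene.document.Document;
>>>   import org.apache.lucene.document.Field;
>>>
>>>   void addTitle(Document doc, String title) {
>>>     // exact original, kept verbatim for display and exact matching
>>>     doc.add(new Field("title_exact", title, Field.Store.YES, Field.Index.NOT_ANALYZED));
>>>     // analyzed (stemmed, lower-cased, etc.) version for ordinary search
>>>     doc.add(new Field("title", title, Field.Store.NO, Field.Index.ANALYZED));
>>>   }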
>>>
>>> In your case, it is easy to pose a query that searches for all documents
>>> that have the phrase "Latin School" in the chunk.2 field.  This becomes much
>>> more difficult if you don't have uniformity between documents in terms of
>>> which fields exist.  If all documents have chunk.2 and chunk.12 fields,
>>> then it would be easy to pose two queries, one that searches for all
>>> documents that match because of chunk.2, and one that searches for all
>>> documents that match by virtue of chunk.12.
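>>>
>>> Such a per-field phrase query could be as small as this sketch (terms are
>>> lower-cased because that is what StandardAnalyzer indexed):
>>>
>>>   import org.apache.lucene.index.Term;
>>>   import org.apache.lucene.search.IndexSearcher;
>>>   import org.apache.lucene.search.PhraseQuery;
>>>   import org.apache.lucene.search.TopDocs;
>>>
>>>   TopDocs findLatinSchool(IndexSearcher searcher) throws Exception {
>>>     PhraseQuery q = new PhraseQuery();   // slop 0 == exact adjacent phrase
>>>     q.add(new Term("chunk.2", "latin"));
>>>     q.add(new Term("chunk.2", "school"));
>>>     return searcher.search(q, null, 10);
>>>   }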
>>>
>>> I suspect, however, that the way that you have constructed your documents
>>> will make this impossible.
>>>
>>> Is it possible to step back a bit and describe how you have deconstructed
>>> your original documents and why?
>>>
>>> On Wed, May 13, 2009 at 7:05 AM, Mike Korcynski <Mike.Korcynski@tufts.edu> wrote:
>>>
>>>       
>>>> Hi,
>>>>
>>>> I have fields that are stored and tokenized; I've indexed using the
>>>> StandardAnalyzer.  Now I'm trying to do an exact string match.  For example,
>>>> my document has two fields:
>>>>
>>>> chunk.12     rights regarding immigration. Unlike other Latin Americans,
>>>> Puerto Ricans are US. citizens. The right
>>>> chunk.2       the Latin School for collaborating with us, especially Maira
>>>> Perez and Melissa Lee. They have
>>>>
>>>> I want to do an exact string search for "Latin School" and have it return
>>>> me chunk.2 as part of the results but not chunk.12.  Now, it would seem that
>>>> this wouldn't be possible because of the tokenization.  So my initial
>>>> inclination was to store the fields as both tokenized and untokenized so
>>>> that I could do an exact match against the untokenized fields.  However,
>>>> since wildcard searches can't start with *, I can't do *Latin School*, and
>>>> so I can't figure out how I'd get chunk.2 to return when it's
>>>> untokenized.  Is there a best practice or a generic design pattern to
>>>> follow for setting up your index to allow for exact searching?
>>>>
>>>> Any help would be appreciated.
>>>>
>>>> Thanks,
>>>>
>>>> Mike

