lucene-general mailing list archives

From: Ted Dunning <ted.dunn...@gmail.com>
Subject: Re: exact match on a stored / tokenized field.
Date: Thu, 14 May 2009 01:27:44 GMT
I would rearrange your data a bit.  Approach 1 fits what you describe more
precisely, but Approach 2 is probably a better solution for normal
applications.

Approach 1:

First, I would make one document per page to hold the text for that page.
It should also have metadata saying that the document type is a page and
which book it came from.  It might be helpful to replicate all of the book
information onto each page in the book, depending on what kinds of combined
searches you want to do.

Then I would probably make a single document for each book as well, but
without the text.  This would be useful for book-level searches that do not
involve the text at all.
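
Here is a minimal sketch of that indexing scheme against Lucene's
Document/Field API.  The field names ("type", "book_id", "text", and so on)
are just illustrative, not anything your schema has to use:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    class BookIndexer {
      // One document per page, plus one metadata-only document per book.
      static void addBook(IndexWriter writer, String bookId, String author,
                          String[] pageTexts) throws Exception {
        for (int i = 0; i < pageTexts.length; i++) {
          Document page = new Document();
          page.add(new Field("type", "page", Field.Store.YES, Field.Index.NOT_ANALYZED));
          page.add(new Field("book_id", bookId, Field.Store.YES, Field.Index.NOT_ANALYZED));
          // Denormalized book metadata so combined searches stay simple.
          page.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
          page.add(new Field("page", Integer.toString(i + 1), Field.Store.YES,
                             Field.Index.NOT_ANALYZED));
          page.add(new Field("text", pageTexts[i], Field.Store.YES, Field.Index.ANALYZED));
          writer.addDocument(page);
        }
        Document book = new Document();
        book.add(new Field("type", "book", Field.Store.YES, Field.Index.NOT_ANALYZED));
        book.add(new Field("book_id", bookId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        book.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(book);
      }
    }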

Searching for books that have a phrase on a page would consist of searching
for pages and grouping the results by book.  If you denormalize your data by
putting book metadata like the author on each page, then searching for books
by an author that mention a phrase is easy as well.  Note that this approach
will not find phrases that span a page boundary.
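
Concretely, the search side might look something like the sketch below.
The field names carry over from the indexing sketch, and the 1000-hit cap is
an arbitrary assumption:

    import java.util.LinkedHashSet;
    import java.util.Set;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    class PhraseByBook {
      static Set<String> booksWithPhrase(IndexSearcher searcher) throws Exception {
        PhraseQuery q = new PhraseQuery();
        q.add(new Term("text", "latin"));   // terms as StandardAnalyzer emits them
        q.add(new Term("text", "school"));
        TopDocs hits = searcher.search(q, 1000);
        Set<String> books = new LinkedHashSet<String>();
        for (ScoreDoc sd : hits.scoreDocs) {
          books.add(searcher.doc(sd.doc).get("book_id"));  // group page hits by book
        }
        return books;
      }
    }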

Approach 2:

Put all of the text for an entire book into a single document and insert
markers into the text to indicate page numbers.  This could be done with
special marker tokens that take up no space (a position increment of zero),
or by using the payload capability.  You could also store the word offsets
of the page boundaries in a side field.  When you find books that match, you
would have to adapt normal highlighting software to convert match positions
into page numbers.  Using the page offset data, you could extract just the
text for a single page pretty easily for display purposes.
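
For the side-field variant, indexing might look roughly like this sketch.
The whitespace-based word count is a stand-in for however your analyzer
actually tokenizes:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    class BookWithOffsets {
      // One book-level document with a stored-only side field holding the
      // word offset at which each page starts.
      static Document bookDoc(String bookId, String[] pageTexts) {
        StringBuilder text = new StringBuilder();
        StringBuilder offsets = new StringBuilder();
        int words = 0;
        for (String page : pageTexts) {
          offsets.append(words).append(' ');   // word offset where this page begins
          text.append(page).append('\n');
          words += page.split("\\s+").length;  // crude count; match your analyzer
        }
        Document doc = new Document();
        doc.add(new Field("book_id", bookId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("text", text.toString(), Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("page_offsets", offsets.toString().trim(),
                          Field.Store.YES, Field.Index.NO));
        return doc;
      }
    }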

Searching for books that have a phrase on a page would consist of searching
for phrase hits in the books and filtering out bogus hits where the phrase
is split across a page boundary.  If you don't care about page breaks within
phrases, then you wouldn't have to do that.
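
One way to do that filtering is with span queries, which expose match
positions.  Here is a rough sketch that checks each span against the stored
page_offsets field from the previous sketch:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.search.spans.Spans;

    class InPageMatches {
      static void find(IndexReader reader) throws Exception {
        SpanQuery q = new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term("text", "latin")),
            new SpanTermQuery(new Term("text", "school"))
        }, 0, true);  // slop 0, in order: behaves like an exact phrase
        Spans spans = q.getSpans(reader);
        while (spans.next()) {
          String[] offsets =
              reader.document(spans.doc()).get("page_offsets").split(" ");
          int startPage = pageOf(offsets, spans.start());
          int endPage = pageOf(offsets, spans.end() - 1);  // end() is exclusive
          if (startPage == endPage) {  // drop matches straddling a page break
            System.out.println("doc " + spans.doc() + " page " + (startPage + 1));
          }
        }
      }

      static int pageOf(String[] offsets, int wordPos) {
        int page = 0;
        while (page + 1 < offsets.length
            && Integer.parseInt(offsets[page + 1]) <= wordPos) {
          page++;
        }
        return page;
      }
    }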

Approach 3:

Use Approach 2, but keep a side file containing the page texts in an
accessible form.  Searching would go against the book-level index; display
would go against the page-level data store.
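
The side store can be as simple as a map from (book, page) to text; on disk
it could just as well be one file per page or a database table.  A trivial
sketch:

    import java.util.HashMap;
    import java.util.Map;

    class PageStore {
      private final Map<String, String> pages = new HashMap<String, String>();

      void put(String bookId, int pageNo, String text) {
        pages.put(bookId + ":" + pageNo, text);
      }

      String get(String bookId, int pageNo) {
        return pages.get(bookId + ":" + pageNo);  // page text for display only
      }
    }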


Does this help?

On Wed, May 13, 2009 at 6:11 PM, Mike Korcynski <Mike.Korcynski@tufts.edu> wrote:

> These fields are part of a single document; for this purpose I'd consider a
> chunk to be a page in a book.  I have many books, but they all have
> different numbers of pages.  Say, though, that for this purpose no book has
> more than 500 pages.  They also have a bunch of standard metadata fields.
> That is how the original documents have been deconstructed.
>
> I want to search the books for the exact term "Latin School" and return
> the pages in those books that contain the string.  Searching generally
> works fine; for instance, I can say return all books where the author field
> is Mark Twain and the text contains Blah, and that'll work.  But when I try
> to do a quoted string for an exact match, or for that matter a proximity
> match, it doesn't work: it returns pages containing Latin OR School instead
> of the exact match.
>
> -Mike
>
>
>
> Ted Dunning wrote:
>
>> It seems to me that you have defined your fields a bit oddly.
>>
>> Fields are normally part of a single document and are there to facilitate
>> searching on a part of a document, such as a title.  In some cases, fields
>> are used to store different versions of a part of a document so that you
>> can recover the exact original text but still index a transformed version
>> of the text, as with stemming.
>>
>> In your case, it is easy to pose a query that searches for all documents
>> that have the phrase "Latin School" in the chunk.2 field.  This becomes
>> very much more difficult if you don't have uniformity between documents in
>> terms of which fields exist.  If all documents have chunk.2 and chunk.12
>> fields, then it would be easy to pose two queries, one that searches for
>> all documents that match because of chunk.2, and one that searches for all
>> documents that match by virtue of chunk.12.
>>
>> I suspect, however, that the way that you have constructed your documents
>> will make this impossible.
>>
>> Is it possible to step back a bit and describe how you have deconstructed
>> your original documents and why?
>>
>> On Wed, May 13, 2009 at 7:05 AM, Mike Korcynski <Mike.Korcynski@tufts.edu> wrote:
>>>
>>> Hi,
>>>
>>> I have fields that are stored and tokenized; I've indexed using the
>>> StandardAnalyzer.  Now I'm trying to do an exact string match.  For
>>> example, my document has two fields:
>>>
>>> chunk.12    rights regarding immigration. Unlike other Latin Americans, Puerto Ricans are US. citizens. The right
>>> chunk.2     the Latin School for collaborating with us, especially Maira Perez and Melissa Lee. They have
>>>
>>> I want to do an exact string search for "Latin School" and have it
>>> return chunk.2 as part of the results but not chunk.12.  Now, it would
>>> seem that this wouldn't be possible because of the tokenization.  So my
>>> initial inclination was to store the fields as both tokenized and
>>> untokenized so that I could do an exact match against the untokenized
>>> fields.  However, since wildcard searches can't start with *, I can't do
>>> *Latin School*, and so I can't figure out how I'd get chunk.2 to return
>>> when the fields are untokenized.  Is there a best practice or a generic
>>> design pattern to follow for setting up your index to allow for exact
>>> searching?
>>>
>>> Any help would be appreciated.
>>>
>>> Thanks,
>>>
>>> Mike
>>>


-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
