lucene-general mailing list archives

From Mike Korcynski <Mike.Korcyn...@tufts.edu>
Subject Re: exact match on a stored / tokenized field.
Date Thu, 14 May 2009 01:11:02 GMT
These fields are part of a single document; for this purpose, consider a 
chunk to be a page in a book.  I have many books, and they all have 
different numbers of pages, but say for this purpose that no book has 
more than 500 pages.  They also have a bunch of standard metadata 
fields.  That is how the original documents have been deconstructed.

I want to search the books for the exact phrase "Latin School" and return 
the pages in those books that contain it.  Searching generally works 
fine; for instance, I can ask for all books where the author field is 
Mark Twain and the text contains Blah, and that works.  But when I try a 
quoted string for an exact match, or for that matter a proximity match, 
it doesn't work: it returns pages containing Latin OR School instead of 
the exact phrase.

-Mike
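[Editor's note: the symptom above — a quoted query degrading to Latin OR School — usually means the query never reached the index as a phrase (e.g. the quotes were stripped or escaped before parsing), because a field tokenized with StandardAnalyzer does record term positions, which is exactly what phrase matching needs. The sketch below is a minimal pure-Python illustration of position-based phrase matching, the mechanism behind quoted queries on tokenized fields; all names are illustrative, not Lucene's API.]

```python
# Sketch of phrase matching over a positional inverted index: the phrase
# matches only where its terms occur at consecutive positions.
# Illustrative names only -- this is not Lucene's API.

def tokenize(text):
    # Crude stand-in for an analyzer: lowercase, split on whitespace.
    return text.lower().split()

def build_index(pages):
    """Map each term to {page_id: [positions]}."""
    index = {}
    for page_id, text in pages.items():
        for pos, term in enumerate(tokenize(text)):
            index.setdefault(term, {}).setdefault(page_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return page ids where the phrase terms occur at consecutive positions."""
    terms = tokenize(phrase)
    if not terms or terms[0] not in index:
        return set()
    hits = set()
    for page_id, positions in index[terms[0]].items():
        for start in positions:
            if all(term in index
                   and start + i in index[term].get(page_id, [])
                   for i, term in enumerate(terms)):
                hits.add(page_id)
    return hits

pages = {
    "chunk.2": "the Latin School for collaborating with us",
    "chunk.12": "Unlike other Latin Americans, Puerto Ricans are US citizens",
}
index = build_index(pages)
print(phrase_search(index, "Latin School"))  # only chunk.2 matches
```

Note that chunk.12 contains "Latin" but is correctly rejected, because no occurrence of "Latin" there is immediately followed by "School" — no untokenized copy of the field is needed for this.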


Ted Dunning wrote:
> It seems to me that you have defined your fields a bit oddly.
>
> Fields are normally part of a single document and are there to facilitate
> searching on a part of a document such as a title.  In some cases, fields
> are used to store different versions of a part of a document so that you can
> recover the exact original text, but still index a transformed version of
> the text as with stemming.
>
> In your case, it is easy to pose a query that searches for all documents
> that have the phrase "Latin School" in the chunk.2 field.  This becomes much
> more difficult if you don't have uniformity between documents in terms
> of which fields exist.  If all documents have chunk.2 and chunk.12 fields,
> then it would be easy to pose two queries, one that searches for all
> documents that match because of chunk.2, and one that searches for all
> documents that match by virtue of chunk.12.
>
> I suspect, however, that the way that you have constructed your documents
> will make this impossible.
>
> Is it possible to step back a bit and describe how you have deconstructed
> your original documents and why?
>
> On Wed, May 13, 2009 at 7:05 AM, Mike Korcynski <Mike.Korcynski@tufts.edu> wrote:
>
>   
>> Hi,
>>
>> I have fields that are stored and tokenized, I've indexed using the
>> StandardAnalyzer.  Now I'm trying to do an exact string match.  For example,
>> my document has two fields:
>>
>> chunk.12     rights regarding immigration. Unlike other Latin Americans,
>> Puerto Ricans are US. citizens. The right
>> chunk.2       the Latin School for collaborating with us, especially Maira
>> Perez and Melissa Lee. They have
>>
>> I want to do an exact string search for "Latin School" and have it return
>> me chunk.2 as part of the results but not chunk.12.  Now, it would seem that
>> this wouldn't be possible because of the tokenization.  So my initial
>> inclination was to store the fields as both tokenized and untokenized so
>> that I could do an exact match against the untokenized fields.  However,
>> since wildcard searches can't start with *, I can't do *Latin School*, and
>> so I can't figure out how I'd get chunk.2 to return when the fields are
>> untokenized.  Is there a best practice or a generic design pattern to
>> follow when setting up your index to allow for exact searching?
>>
>> Any help would be appreciated.
>>
>> Thanks,
>>
>> Mike
>>
>>
>>
>>     
>
>
>   

