From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: exact match on a stored / tokenized field.
Date Wed, 13 May 2009 18:42:44 GMT
It seems to me that you have defined your fields a bit oddly.

Fields are normally part of a single document and are there to facilitate
searching on a part of a document such as a title.  In some cases, fields
are used to store different versions of a part of a document so that you can
recover the exact original text, but still index a transformed version of
the text as with stemming.
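
To make that concrete, here is a minimal sketch of what I mean, assuming the
Lucene 2.4-era Field API (the field name and class name are just placeholders):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class StoredAnalyzedFieldSketch {
        // Build a document whose chunk field is both stored (the exact
        // original text can be read back from the index) and analyzed
        // (searches run against the tokenized form that the analyzer
        // produces, e.g. lowercased or stemmed terms).
        static Document buildDoc(String chunkText) {
            Document doc = new Document();
            doc.add(new Field("chunk.2", chunkText,
                              Field.Store.YES, Field.Index.ANALYZED));
            return doc;
        }
    }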

In your case, it is easy to pose a query that searches for all documents
that have the phrase "Latin School" in the chunk.2 field.  This becomes much
more difficult if documents are not uniform in which fields exist.  If all
documents have chunk.2 and chunk.12 fields, then it would be easy to pose two
queries: one that searches for all documents that match because of chunk.2,
and one that searches for all documents that match by virtue of chunk.12.
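
In rough code, those per-field phrase queries might look like the following
(a sketch against the Lucene 2.4-era query API, assuming StandardAnalyzer
lowercased the terms at index time; the class and method names are just
placeholders):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;

    public class PerFieldPhraseSketch {
        // Build a "latin school" phrase query against one named field.
        static PhraseQuery phraseIn(String field) {
            PhraseQuery q = new PhraseQuery();
            q.add(new Term(field, "latin"));
            q.add(new Term(field, "school"));
            return q;
        }

        public static void main(String[] args) {
            // Run the two queries separately to know which field matched,
            // or OR them together so a document matches if the phrase
            // occurs in either field.
            BooleanQuery either = new BooleanQuery();
            either.add(phraseIn("chunk.2"), BooleanClause.Occur.SHOULD);
            either.add(phraseIn("chunk.12"), BooleanClause.Occur.SHOULD);
            System.out.println(either);
        }
    }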

I suspect, however, that the way that you have constructed your documents
will make this impossible.

Is it possible to step back a bit and describe how you have deconstructed
your original documents and why?

On Wed, May 13, 2009 at 7:05 AM, Mike Korcynski <Mike.Korcynski@tufts.edu> wrote:

> Hi,
>
> I have fields that are stored and tokenized, I've indexed using the
> StandardAnalyzer.  Now I'm trying to do an exact string match.  For example,
> my document has two fields:
>
> chunk.12     rights regarding immigration. Unlike other Latin Americans,
> Puerto Ricans are US. citizens. The right
> chunk.2       the Latin School for collaborating with us, especially Maira
> Perez and Melissa Lee. They have
>
> I want to do an exact string search for "Latin School" and have it return
> me chunk.2 as part of the results but not chunk.12.  Now, it would seem that
> this wouldn't be possible because of the tokenization.  So my initial
> inclination was to store the fields as both tokenized and untokenized so
> that I could do an exact match against the untokenized fields.  However,
> since wildcard searches can't start with *, I can't do *Latin School*, and
> so I can't figure out how I'd get chunk.2 to return when the fields are
> untokenized.  Is there a best practice or a generic design pattern to
> follow for setting up your index to allow for exact searching?
>
> Any help would be appreciated.
>
> Thanks,
>
> Mike
>
>
>


-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
