lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Phrase query with terms at same location
Date Thu, 19 Nov 2009 15:30:27 GMT
Ahhh, I should have followed the link. I was interpreting your first note as
emitting two tokens NOT at the same offset. My mistake, ignore my nonsense
about unexpected consequences. Your original assumption is correct, zero
offsets are pretty transparent.

What do you really want to do here? Mark's email (at the link) allows
you to create queries queries expressing "find all phrases
of the form noun-verb-adverb" say. The slop allows for intervening words.

Your original post seems to want different semantics.

<<<I would like to do a search that retrieves documents when a given word is
used with a specific part of speech, e.g. all docs where "report" is used as
a noun>>>.

For that, my suggestion seems simpler, which is not surprising since it
addresses a less general problem. So instead of including a general
part of speech token, just suffix your original word with your marker and
use that for your "synonym.

Then expressing your intent is simply tacking on the part of speech
marker to the words you care about (e.g. report_n when you wanted
report as a noun). No phrases or slop required, at the expense of
more terms.

Hmmmm, if you wanted to, say, "find all the nouns in the index", you
could *prefix* the word (e.g. n_report) which would group all the
nouns together in the term enumerations....

Sorry for the confusion
Erick


On Thu, Nov 19, 2009 at 9:38 AM, Christopher Tignor <ctignor@thinkmap.com>wrote:

> Thanks, Erick -
>
> Indeed every word will have a part of speech token but Is this how the slop
> actually works?  My understanding was that if I have two tokens in the same
> location then each will not effect searches involving other in terms of the
> slop as slop indicates the number of words *between* search terms in a
> phrase.
>
>
Are tokens at the same location actually adjacent in their ordinal values,
> thus affecting the slop as you describe?
>
> If so, Is there a predictable way to determine which comes before the other
> - perhaps the order they are inserted when being tokenized?
>
> thanks,
>
> C>T>
>
> On Thu, Nov 19, 2009 at 8:35 AM, Erick Erickson <erickerickson@gmail.com
> >wrote:
>
> > If I'm reading this right, your tokenizer creates two tokens. One
> > "report" and one "_n"... I suspect if so that this will create some
> > "interesting"
> > behaviors. For instance, if you put two tokens in place, are you going
> > to double the slop when you don't care about part of speech? Is every
> > word going to get a marker? etc.
> >
> > I'm not sure payloads would be useful here, but you might check it out...
> >
> > What I'd think about, though, is a variant of synonyms. That is, index
> > report and report_n (note no space) at the same location. Then, when
> > you wanted to create a part-of-speech-aware query, you'd attach the
> > various markers to your terms (_n, _v, _adj, _adv etc.) and not have to
> > worry about unexpected side-effects.
> >
> > HTH
> > Erick
> >
> > On Wed, Nov 18, 2009 at 5:20 PM, Christopher Tignor <
> ctignor@thinkmap.com
> > >wrote:
> >
> > > Hello,
> > >
> > > I have indexed words in my documents with part of speech tags at the
> same
> > > location as these words using a custom Tokenizer as described, very
> > > helpfully, here:
> > >
> > >
> > >
> >
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200607.mbox/%3C20060712115026.38897.qmail@web26002.mail.ukl.yahoo.com%3E
> > >
> > > I would like to do a search that retrieves documents when a given word
> is
> > > used with a specific part of speech, e.g. all docs where "report" is
> used
> > > as
> > > a noun.
> > >
> > > I was hoping I could use something like a PhraseQuery with "report _n"
> > (_n
> > > is my noun part of speech tag) with some sort of identifier that
> > describes
> > > the words as having to be at the same location - like a null slop or
> > > something.
> > >
> > > Any thoughts on how to do this?
> > >
> > > thanks so much,
> > >
> > > C>T>
> > >
> > > --
> > > TH!NKMAP
> > >
> > > Christopher Tignor | Senior Software Architect
> > > 155 Spring Street NY, NY 10012
> > > p.212-285-8600 x385 f.212-285-8999
> > >
> >
>
>
>
> --
> TH!NKMAP
>
> Christopher Tignor | Senior Software Architect
> 155 Spring Street NY, NY 10012
> p.212-285-8600 x385 f.212-285-8999
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message