lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Phrase query with terms at same location
Date Thu, 19 Nov 2009 13:35:13 GMT
If I'm reading this right, your tokenizer creates two tokens: one
"report" and one "_n". If so, I suspect this will create some
"interesting" behaviors. For instance, if you put two tokens in place,
are you going to double the slop when you don't care about part of
speech? Is every word going to get a marker? Etc.

I'm not sure payloads would be useful here, but you might check them out...

What I'd think about, though, is a variant of synonyms. That is, index
report and report_n (note no space) at the same location. Then, when
you wanted to create a part-of-speech-aware query, you'd attach the
various markers to your terms (_n, _v, _adj, _adv etc.) and not have to
worry about unexpected side-effects.
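
To make the idea concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class PosTagIndexSketch, the Tok type, and the analyze/index helpers are all hypothetical names invented for illustration. It only models the position bookkeeping, assuming each word arrives with its tag already attached. In Lucene itself the equivalent step would be emitting the tagged token with a position increment of 0 so it stacks on the bare word, the same way synonyms are indexed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PosTagIndexSketch {

    /** A term plus its position increment relative to the previous token. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /**
     * Turns each {word, tag} pair into two tokens: the bare word, which
     * advances the position by 1, and word+tag (e.g. "report_n"), which
     * uses increment 0 and therefore lands on the same position.
     */
    static List<Tok> analyze(String[][] taggedWords) {
        List<Tok> out = new ArrayList<>();
        for (String[] wt : taggedWords) {
            out.add(new Tok(wt[0], 1));
            out.add(new Tok(wt[0] + wt[1], 0));
        }
        return out;
    }

    /** Builds a toy positional index: term -> list of positions. */
    static Map<String, List<Integer>> index(List<Tok> toks) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int pos = -1; // so that the first token ends up at position 0
        for (Tok t : toks) {
            pos += t.posInc;
            postings.computeIfAbsent(t.term, k -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    public static void main(String[] args) {
        String[][] doc = {
            {"they", "_pron"}, {"report", "_v"}, {"the", "_det"}, {"report", "_n"}
        };
        Map<String, List<Integer>> postings = index(analyze(doc));
        System.out.println(postings.get("report"));   // [1, 3]
        System.out.println(postings.get("report_n")); // [3]
        System.out.println(postings.get("report_v")); // [1]
    }
}
```

With this layout, "report as a noun" becomes an ordinary single-term lookup on report_n, while queries on the bare word report still match both occurrences. Word positions are unchanged, so phrase slop behaves as it would without the tags.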

HTH
Erick

On Wed, Nov 18, 2009 at 5:20 PM, Christopher Tignor <ctignor@thinkmap.com> wrote:

> Hello,
>
> I have indexed words in my documents with part of speech tags at the same
> location as these words using a custom Tokenizer as described, very
> helpfully, here:
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200607.mbox/%3C20060712115026.38897.qmail@web26002.mail.ukl.yahoo.com%3E
>
> I would like to do a search that retrieves documents when a given word is
> used with a specific part of speech, e.g. all docs where "report" is used
> as a noun.
>
> I was hoping I could use something like a PhraseQuery with "report _n" (_n
> is my noun part of speech tag) with some sort of identifier that describes
> the words as having to be at the same location - like a null slop or
> something.
>
> Any thoughts on how to do this?
>
> thanks so much,
>
> C>T>
>
> --
> TH!NKMAP
>
> Christopher Tignor | Senior Software Architect
> 155 Spring Street NY, NY 10012
> p.212-285-8600 x385 f.212-285-8999
>
