lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Hints on implementing XQuery full-text search
Date Tue, 19 Jan 2010 22:41:47 GMT

: I'm about to embark on implementing the full-text search feature of XQuery:

Good luck with that.

Here are some quick suggestions on how i'd try to tackle the things you 
asked about, w/o putting much thought into them...

: 	title ftcontains "usability" occurs at least 2 times

assuming this is just term based (and not complex subclauses) i would 
write a custom subclass of TermQuery that enforces a minimum term frequency.
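
The check such a query would enforce can be sketched in plain Java. This is only an illustration of the condition, not Lucene code: the class and method names are hypothetical, and it assumes the field has already been tokenized into a list. In a real query the frequency would presumably come from the index rather than from rescanning tokens.

```java
import java.util.Arrays;
import java.util.List;

class MinTermFrequency {
    /** Returns true if term occurs at least minFreq times in the tokenized field. */
    static boolean occursAtLeast(List<String> fieldTokens, String term, int minFreq) {
        int freq = 0;
        for (String token : fieldTokens) {
            // Stop scanning as soon as the threshold is reached.
            if (token.equals(term) && ++freq >= minFreq) {
                return true;
            }
        }
        // Only a non-positive threshold can be satisfied without any match.
        return minFreq <= 0;
    }

    public static void main(String[] args) {
        List<String> title = Arrays.asList("usability", "testing", "for", "web", "usability");
        System.out.println(occursAtLeast(title, "usability", 2)); // prints "true"
        System.out.println(occursAtLeast(title, "usability", 3)); // prints "false"
    }
}
```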

: 	title ftcontains "improve" with stemming

index two versions of every field - one with stemming and one w/o

: This allows you to specify -- at query-time -- one of "case 
: insensitive", "case sensitive", "lowercase", "uppercase".

I have no idea what it would mean to match something "uppercase" or 
"lowercase" -- unless that's just syntactic sugar for "uppercase the input, 
and then look for a case sensitive match" -- but again: two fields, one 
case sensitive and one case insensitive

: This is similar to the Case Option except it's "diacritics insensitive" 
: or "diacritics sensitive".  How about implementing this?

two fields, again.

...at this point, if you need to support all permutations of these options 
you are looking at 2*2*2 index fields per source field ... so you start 
getting into the realm where i might consider keeping them all in one 
field, using Payloads to note the various attributes that each Term has.
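
To make the 2*2*2 count concrete, here is a minimal Java sketch of the separate-fields approach: it picks an index field name from the three query-time options and enumerates all eight variants for one source field. The class name and the suffix scheme ("_stem", "_cs", "_ds" and so on) are made up for illustration. The Payload alternative is not shown.

```java
import java.util.ArrayList;
import java.util.List;

class FieldVariants {
    /** Maps the three query-time options to a (hypothetical) index field name. */
    static String fieldName(String source, boolean stemmed,
                            boolean caseSensitive, boolean diacriticsSensitive) {
        return source
            + (stemmed ? "_stem" : "_exact")
            + (caseSensitive ? "_cs" : "_ci")
            + (diacriticsSensitive ? "_ds" : "_di");
    }

    /** Lists every field that would have to be indexed for one source field. */
    static List<String> allVariants(String source) {
        List<String> names = new ArrayList<>();
        boolean[] flags = {false, true};
        for (boolean stem : flags) {
            for (boolean cs : flags) {
                for (boolean ds : flags) {
                    names.add(fieldName(source, stem, cs, ds));
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(allVariants("title").size());           // prints "8"
        System.out.println(fieldName("title", true, false, true)); // prints "title_stem_ci_ds"
    }
}
```

Every further two-way option doubles the number of fields again, which is the point where a single field with per-term attributes starts to look more attractive.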

: 	abstract ftcontains "propagating of errors"
: 	with stop words ("a", "the", "of")
: 
: would match a document with an abstract that contains "propagating few 
: errors". It seems odd, I know.  It's as if the stop words become 
: wildcards, i.e.:

are you serious? ... so if i query for "A of the B" with stop words ("of", 
"the") then that has to match "A totally ridiculous B" ? ...  that makes 
no sense whatsoever.  why require so much verbosity just to get a "gap" 
that matches anything?

that seems like a straight query parsing problem ... if you see one of the 
terms in the stop word list, strip it out, and increase the phrase slop on 
the PhraseQuery you are building.
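
The parsing step can be sketched in plain Java, independent of Lucene. The class and method names below are hypothetical. The sketch assumes whitespace tokenization, lowercased terms, and one unit of slop per stop word that sits between two kept terms. Leading and trailing stop words are dropped without adding slop. The resulting terms and slop would then be handed to whatever builds the PhraseQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordPhrase {
    /** Terms that survive stop-word removal, plus the slop needed to bridge the gaps. */
    static final class Parsed {
        final List<String> terms;
        final int slop;

        Parsed(List<String> terms, int slop) {
            this.terms = terms;
            this.slop = slop;
        }
    }

    static Parsed parse(String phrase, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        int slop = 0;
        int pendingGaps = 0; // stop words seen since the last kept term
        for (String token : phrase.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty phrase splits into a single empty token
            }
            if (stopWords.contains(token)) {
                // A leading stop word has no kept term before it, so it adds no gap.
                if (!terms.isEmpty()) {
                    pendingGaps++;
                }
            } else {
                // Gaps only count once a kept term follows them.
                slop += pendingGaps;
                pendingGaps = 0;
                terms.add(token);
            }
        }
        return new Parsed(terms, slop);
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("a", "the", "of"));
        Parsed p = parse("propagating of errors", stop);
        System.out.println(p.terms + " slop=" + p.slop); // prints "[propagating, errors] slop=1"
    }
}
```

Note that slop is an upper bound on how far terms may move, so a phrase built this way would also match "propagating errors" with no word in between. If the gap has to be exactly one word, a different construction would be needed.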

: 	body ftcontains "Mexico" not in "New Mexico"

SpanNotQuery


: 	title ftcontains ("web site" ftand "usability") ordered

SpanNearQuery


: 	abstract ftcontains "usability" ftand "web site" same sentence
: 
: You can also do any combination of {same|different} 
: {sentence|paragraph}.  My guess for this would also be to keep track of 
: sentence/paragraph data in a payload.  Yes?

sounds right.


: 	book ftcontains "Web Usability" without content $x//annotation

depends on how you plan on indexing all of the context stuff ... if the 
tags are Terms then a SpanNotQuery would work ... if they are Payloads you 
just need some sort of SpanTermNotMatchingPayload query.



-Hoss

