lucene-java-user mailing list archives

From Scott Ganyo <scott.ga...@etapestry.com>
Subject Re: Searching Ranges
Date Mon, 11 Nov 2002 17:42:00 GMT
Ok.  If I understand what you are saying, I believe that would be 
correct if you were iterating over Documents and for each Document you 
were trying to match a Term in the Range.

That is reversed, though:  the actual flow is Term->Document[].  You
iterate over the set of Terms, and for each Term there is a
set of Documents that contain that Term.  Therefore, the RangeQuery
creates a TermQuery for each Term that exists within the specified
range.  The net result is that all Documents that have Terms within the
Range are included in the result.
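
To make the Term->Document[] flow concrete, here is a minimal sketch.  It
is only an illustration and not Lucene's actual code: it uses a toy
in-memory index built from plain java.util collections, and the class
name RangeExpansionSketch, the search() method, and the second document
"d2" are assumptions made up for the example.  It walks the sorted terms
inside the range and unions their document sets, and it shows what would
be lost if the walk stopped at the first matching term.

```java
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index (hypothetical, not a Lucene class):
// each term maps to the set of documents that contain it.
class RangeExpansionSketch {
    private final TreeMap<String, Set<String>> postings = new TreeMap<>();

    // Index a document by registering it under each of its terms.
    public void add(String docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Walk the sorted terms in [lower, upper] and union their document sets.
    // With stopAtFirstTerm set, the walk ends after the first in-range term,
    // which mimics adding a "break" after the first match.
    public Set<String> search(String lower, String upper, boolean stopAtFirstTerm) {
        Set<String> hits = new TreeSet<>();
        for (Set<String> docs : postings.subMap(lower, true, upper, true).values()) {
            hits.addAll(docs);
            if (stopAtFirstTerm) {
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        RangeExpansionSketch index = new RangeExpansionSketch();
        index.add("d1", "001", "002", "003", "004", "005");
        index.add("d2", "007", "008");
        System.out.println(index.search("003", "010", false)); // [d1, d2]
        System.out.println(index.search("003", "010", true));  // [d1]
    }
}
```

In this sketch the early exit still finds "d1", but it misses "d2",
whose only in-range terms come later in the term order.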

Does that make sense?
Scott

Alex Winston wrote:

> good thoughts and something that i would like to explore further. let me
> create a more concrete example that we can use to help visualize the
> problem and a possible solution and then get feedback.  it may be that
> the changes work in my case but not for all cases, or that there is
> something else that could help alleviate the overhead i am experiencing.
>
> lets say that i have a document named "d1", which contains a field named
> "references".  within the "references" field i maintain a list of terms
> that represent my range from 001-005, more specifically the field would
> contain the terms "001 002 003 004 005".
>
> i would now like to search this range to determine if it falls within
> the range 003-010, so my query would look like "references:[003 010]".
>
> in this case RangeQuery begins to iterate the terms contained within
> "references" to determine if "d1" is a match.  as it iterates each term
> is analyzed, 001 and 002 fail so we continue to iterate at which point
> 003 is determined to be a match.
>
> at this point there is no need to continue to search the terms because
> we have determined there is a match, not only that but it helps reduce
> the size of the byte[] cache that is created i believe.  like i
> mentioned earlier, i am not sure of the ramifications this may have on
> scoring, but it works for most cases i can think of, but if i am missing
> something that is what i need feedback on :).
>
> you can imagine how this improves the avg efficiency in my case if i
> have 10000 terms in "references".  although i may be doing something
> that was either not intended or ill-designed.
>
> thanks, any thoughts?
> alex
>
>
>
> On Mon, 2002-11-11 at 10:50, Scott Ganyo wrote:
>
> >Hi Alex,
> >
> >I just looked at this and had the following thought:
> >
> >The RangeQuery must continue to iterate after the first match is found
> >in order to match everything within the specified range.  In other
> >words, if you have a range of "a" to "d", you can't stop with "a", you
> >need to continue to "d".  At the point you move beyond "d" is the point
> >where the query should stop iterating.  That is reflected in lines
> >160-162.  It seems to me that your solution would only work where your
> >range consists of a single term.
> >
> >Please let me know if I'm just misunderstanding the situation.
> >
> >Scott
> >
> >Alex Winston wrote:
> >
> >
> >>thanks for the reply, my apologies for not explaining myself very
> >>clearly, it has been a long day.
> >>
> >>you expressed exactly our situation, unfortunately this is not an option
> >>because we want to have multiple ranges for each document as well,
> >>there is a possible extension of what you suggested but that is a last
> >>resort.  kinda crazy i know, but you have to meet requirements :).
> >>
> >>but i also had a thought while i was looking through the lucene code,
> >>and any comments are welcome.
> >>
> >>i may be very mistaken because it has been a long day but if you look at
> >>the current cvs version of RangeQuery it appears that even if a match is
> >>found it will continue to iterate over terms within a field, and in my
> >>case it is on the order of thousands.  if i add a break after a match
> >>has been found it appears as though the search is improved on avg an
> >>order of magnitude, my math has left me so i cannot be theoretical at
> >>the moment.  i have unit tested the change on my side and on the lucene
> >>side and it works.  note: one hard example is that a query went from 20
> >>seconds to .5 seconds.  any initial thoughts to if there is a case where
> >>this would not work?
> >>
> >>beginning line 164:
> >>TermQuery tq = new TermQuery(term);	  // found a match
> >>tq.setBoost(boost);			   // set the boost
> >>q.add(tq, false, false);		  // add to q
> >>break;  // ADDED!
> >>
> >>
> >>On Fri, 2002-11-08 at 15:09, Mike Barry wrote:
> >>
> >>
> >>>Alex,
> >>>
> >>>It is rather confusing. It sounds like you've indexed
> >>>a field that can be between two values (let's say
> >>>E-J) and then when you have a search term such as G
> >>>you want the docs containing E-J (or A-H or F-K but not A-H
> >>>nor A-C nor J-Z)
> >>>
> >>>Just off the top of my head but could you index the upper and
> >>>lower bounds as separate fields then when you search do a
> >>>compound query:
> >>>
> >>>    lower_bound:{ - search_term } AND upper_bound:{ search_term - }
> >>>
> >>>just a thought.
> >>>
> >>>
> >>>-MikeB.
> >>>
> >>>
> >>>Alex Winston wrote:
> >>>
> >>>
> >>>
> >>>>i was hoping that someone could briefly review my current solution to a
> >>>>problem that we have encountered to see if anyone could suggest a
> >>>>possible alternative, because as it stands we have pushed lucene past
> >>>>its current limits.
> >>>>
> >>>>PROBLEM:
> >>>>
> >>>>we were wanting to represent a range of values for a particular field
> >>>>that is searchable over a particular range.
> >>>>
> >>>>an example follows for clarification:
> >>>>we were wanting to store a range of chapters and verses of a book for a
> >>>>particular document, and in turn search to see if a query range includes
> >>>>the range that is represented in the index.
> >>>>
> >>>>if this is unclear please ask for clarification
> >>>>
> >>>>IMPRACTICAL SOLUTION:
> >>>>
> >>>>although this solution seems somewhat impractical it is all we could
> >>>>come up with.
> >>>>
> >>>>our solution involved storing each possible range value within the term
> >>>>which would allow for RangeQuerys to be performed on this particular
> >>>>field.  for very small ranges this seems somewhat practical after
> >>>>profiling.  although once the field ranges began to span multiple
> >>>>chapters and verses, the search times became unreasonable because we
> >>>>were storing thousands of entries for each representative range.
> >>>>
> >>>>i can elaborate on anything that is unclear,
> >>>>but any thoughts on a possible alternative solution within lucene that
> >>>>we overlooked would be extremely helpful.
> >>>>	
> >>>>
> >>>>alex
> >>>
> >>>
> >>>
> >>
>

-- 
Brain: Pinky, are you pondering what I’m pondering?
Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were 
they thinking?


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

