lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: Improving TimeLimitedCollector
Date Sat, 27 Jun 2009 12:16:06 GMT
>
> A downside of breaking it out into static methods like this is that a
> thread cannot run >1 time-limited activity simultaneously but I guess that
> might be a reasonable restriction.
>

I'm not sure I understand that - how can a thread run >1 activity
simultaneously anyway, and how's your impl in TimeLimitingIndexReader allows
it to do so? You use the thread as a key to the map. Am I missing something?

Anyway, I think we can let go of the static methods and make them instance
methods. I think .. if I want to use time limited activities, I should
create a TimeLimitedThreadActivity instance and pass it around, to
TimeLimitingIndexReader (and maybe in the future to a similar **IndexWriter)
and any other custom code I have which I want to put a time limit on.

A static class has the advantage of not needing to pass it around
everywhere, and is accessible from everywhere, so that if we discover that
limiting on IndexReader is not enough, and we want some of the scorers to
check more frequently if they should stop, we won't need to pass that
instance all the way down to them.

I don't mind keeping it static, but I also don't mind if it will be an
instance passed around, since currently it's only passed to IndexReader.

Are you going to open an issue for that? Seems like a nice addition to me.
Do you think it should belong in core or contrib? If 'core' then if that's
possible I'd like to see all timeout classes under one package, including
TimeLimitingCollector (which until 2.9 we can safely move around as we
want).
I don't mind working on that w/ you, if you want.

Shai

On Sat, Jun 27, 2009 at 2:24 PM, Mark Harwood <markharw00d@yahoo.co.uk>wrote:

> Thanks for the feedback, Shai.
> So I guess you're suggesting breaking this out into a general utility class
> e.g. something like:
>
> class TimeLimitedThreadActivity
> {
>         //called by client
>         public static void startTimeLimitedActivity(long maxTimePermitted).
>         public static void endTimeLimitedActivity()
>
>        //called by resources (reader/writers) that need to be shared fairly
> by threads
>       public static void checkActivityNotElapsed(); //throws some form of
> runtime exception
> }
>
> A downside of breaking it out into static methods like this is that a
> thread cannot run >1 time-limited activity simultaneously but I guess that
> might be a reasonable restriction.
>
>
> >>Aside, how about using a PQ for the threads' times, or a TreeMap. That
> will save looping over the collection to find the next candidate. Just an
> implementation detail though.
>
> Yep, that was one of the rough edges - I just wanted to get raw timings
> first for the all the "is timed out?" checks we're injecting into reader
> calls.
>
> Cheers
> Mark
>
>
> On 27 Jun 2009, at 11:37, Shai Erera wrote:
>
> I like the overall approach. However it's very local to an IndexReader.
> I.e., if someone wanted to limit other operations (say indexing), or does
> not use an IndexReader (for a Scorer impl maybe), one cannot reuse it.
>
> What if we factor out the timeout logic to a Timeout class (I think it can
> be a static class, with the way you implemented it) and use it in
> TimeLimitingIndexReader? That class can offer a method check() which will do
> the internal logic (the 'if' check and throw exception). It will be similar
> to the current ensureOpen() followed by an operation.
>
> It might be considered more expensive since it won't check a boolean, but
> instead call a check() method, but it will be more reusable. Also,
> ensureOpen today is also a method call, so I don't think Timeout.check() is
> that bad. We can even later create a TimeLimitingIndexWriter and document
> Timeout class for other usage by external code.
>
> Aside, how about using a PQ for the threads' times, or a TreeMap. That will
> save looping over the collection to find the next candidate. Just an
> implementation detail though.
>
> Shai
>
> On Sat, Jun 27, 2009 at 3:31 AM, Mark Harwood <markharw00d@yahoo.co.uk>wrote:
>
>> Going back to my post re TimeLimitedIndexReaders - here's an incomplete
>> but functional prototype:
>> http://www.inperspective.com/lucene/TimeLimitedIndexReader.java
>> http://www.inperspective.com/lucene/TestTimeLimitedIndexReader.java
>>
>>
>> The principle is that all reader accesses check a volatile variable
>> indicating something may have timed out (no need to check thread locals
>> etc.) If and only if a time out has been noted threadlocals are checked to
>> see which thread should throw a timeout exception.
>>
>> All time-limited use of reader must be wrapped in try...finally calls to
>> indicate the start and stop of a timed set of activities. A background
>> thread maintains the next anticipated timeout deadline and simply waits
>> until this is reached or the list of planned activities changes with new
>> deadlines.
>>
>>
>> Performance seems reasonable on my Wikipedia index:
>>
>> //some tests for heavy use of termenum/term docs
>> Read term docs for 200000 terms  in 4755 ms using no timeout limit (warm
>> up)
>> Read term docs for 200000 terms  in 4320 ms using no timeout limit (warm
>> up)
>> Read term docs for 200000 terms  in 4320 ms using no timeout limit
>> Read term docs for 200000 terms  in 4388 ms using  reader with
>> time-limited access
>>
>> //Example query with heavy use of termEnum/termDocs
>> +text:f* +text:a* +text:b* no time limit matched 1090041 docs in 2000 ms
>> +text:f* +text:a* +text:b* time limited collector matched 1090041 docs in
>> 1963 ms
>> +text:f* +text:a* +text:b* time limited reader matched 1090041 docs in
>> 2121 ms
>>
>> //Example fuzzy match burning CPU reading TermEnum
>> text:accomodation~0.5 no time limit matched 192084 docs in 6428 ms
>> text:accomodation~0.5 time limited collector matched 192084 docs in 5923
>> ms
>> text:accomodation~0.5 time limited reader matched 192084 docs in 5945 ms
>>
>>
>> The reader approach to limiting time is slower but has these advantages :
>>
>> 1) Multiple reader activities can be time-limited rather than just single
>> searches
>> 2) No code changes required to scorers/queries/filters etc
>> 3) Tasks that spend plenty of  time burning CPU before collection happens
>> can be killed earlier
>>
>> I'm sure there's some thread safety issues to work through in my code and
>> not all reader classes are wrapped (e.g. TermPositions) but the basics are
>> there and seem to be functioning
>>
>> Thoughts?
>>
>
>
>

Mime
View raw message