How about this:
Make 1 pass through the big chunk which is 166333 rows (or could be millions): For each row, decide whether or not it belongs to the final 1000 chunk. To do this efficiently, the tricky part needs to be on the 1000 chunk side.
Steps:
1. Keep and maintain max-min values for this chunk (the 1000 rows result) at all times.
2. If value is above max, accept right away, add to the top. If required (when chunk is already full): drop last value from bottom.
3. If value is below min and we already have 1000 in hand, reject right away. If we have less than 1000: add this value to the bottom.
4. If value falls in between max and min: Interpolate (or simple binary search - or some kind of a hashtable that keeps values sorted) to find exact location for this value in our 1000 chunk. Insert value in proper location. If required: drop last value from bottom and adjust min-max.
5. When we are done scanning through 166333 rows will have our 1000 chunk in our hand ready to return.
 
This looks more scalable than sorting 100s of thousands of rows when Derby returns small chunks out of big big tables. It may be slower when chunk size is bigger. If hash sort etc (constant time) is used to decide position of a value in the final chunk, then even if slower, it will still be scalable (it should not be grossly bad even if optimizer picks this algorithm over normal sort when it should not have done so).
 
For parallelism, it gives the opportunity (simple enough to outline here on the fly) to divide the whole task (divide 166333 into chunks of 16K rows for 10 threads to work on for example) into multiple threads where min-max, insert, drop from bottom kinda stuff needs to be protected and modifications on the final chunk structure need to be communicated to other threads. Assuming reading a row from 166333 result needs IO, when one thread is doing IO, another thread will be trying to decide where to insert it's newly found value and it *might* bring performance gains. Since final chunk needs to be synchronized at various points, two threads cannot insert at the same time and one thread cannot read the chunk structure while another is inserting a value into it... Come to think of it, it might run faster in single thread version when synchronization is not involved.
Regards,
Ali

Daniel John Debrunner <djd@debrunners.com> wrote:
Suavi Ali Demir wrote:

> Actually, it sounds like the problem of finding top 1000 rows out of
> 166333 rows is different than sorting 166333 rows and maybe it could be
> optimized. There is no need to sort all 166333 but the information that
> we are only looking 1000 rows would have to be passed all the way down
> to the point where Derby decides to sort. I have not thought through the
> details of an algorithm but when nRows we want is substantially smaller
> than TotalRows then i just feel there should be a better way to pick
> those nRows. For example, if nRows were 1, then all we had to do would
> be 1 single pass on 166333 rows to find the max. That is quite different
> than sorting all and this idea should be possible to generalize on
> 1<=nRows
One optimization would be to pass the 1,000 down from
Statement.setMaxRows to the sorter. Then the sorter could keep the sort
set at 1000 rows, discarding any rows moved to or inserted at the end.
This would most likely enable the sorted set to remain in memory, rather
than spilling to disk. Even more likley if the application sets a max
rows to a reasonable number to be views in a single on-screen page (e.g.
20-50).

Or an even wilder idea would be to have a predicate that is modified on
the fly, to represent the maximum once the sorted set reaches 1,000.
E.g. an additional predicate of <= MAX_INT for an INT column.

Dan.