lucene-java-user mailing list archives

From "Rob Staveley (Tom)" <rstave...@seseit.com>
Subject RE: Sorting
Date Tue, 01 Aug 2006 09:16:51 GMT
>  file seeks instead of array lookups

I'm with you now. So you do seeks in your comparator. For a large index you
might as well use java.io.RandomAccessFile for the "array", because there
would be little value in buffering when the comparator is liable to jump all
around the file. This sounds very expensive, though. If you don't open a
Searcher too frequently, it makes sense (in my muddled mind) to pre-sort to
reduce the number of seeks. That was the half-baked idea of the third file,
which essentially orders document IDs.
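
A minimal sketch of that third-file idea, under one reading of it: the file
stores, at the offset for each document ID, that document's rank in compound
sort order, so comparing two hits costs one seek per document rather than one
per sort key. The class and method names (RankFile, ranks, write, rankOf) are
made up for illustration and are not part of Lucene, and the ranking is done
in memory here only to keep the example short; a genuinely large index would
need an on-disk sort.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical "third file": the int at byte offset docId * 4 is that document's rank. */
final class RankFile {
    /** Ranks every document by (first, second); done in memory here purely for brevity. */
    static int[] ranks(int[] first, int[] second) {
        Integer[] order = new Integer[first.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort document IDs by the first key, breaking ties with the second.
        Arrays.sort(order, (a, b) -> {
            int c = Integer.compare(first[a], first[b]);
            return c != 0 ? c : Integer.compare(second[a], second[b]);
        });
        // Invert the ordering: rank[docId] is the position of docId in sorted order.
        int[] rank = new int[order.length];
        for (int pos = 0; pos < order.length; pos++) rank[order[pos]] = pos;
        return rank;
    }

    /** Writes one 4-byte int per document ID to a temporary file. */
    static RandomAccessFile write(int[] rank) {
        try {
            File f = File.createTempFile("rank", ".bin");
            f.deleteOnExit();
            RandomAccessFile raf = new RandomAccessFile(f, "rw");
            for (int r : rank) raf.writeInt(r);
            return raf;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** One seek per document, instead of one per sort key. */
    static int rankOf(RandomAccessFile raf, int docId) {
        try {
            raf.seek(4L * docId);
            return raf.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Illustrative values: "YYMMDDHHI" timestamps (leading zero dropped) and hashes.
        int[] timestamps = {60801092, 60801092, 60801090, 60801091};
        int[] hashes = {2, 1, 9, 0};
        try (RandomAccessFile raf = write(ranks(timestamps, hashes))) {
            for (int doc = 0; doc < timestamps.length; doc++) {
                System.out.println("doc " + doc + " -> rank " + rankOf(raf, doc));
            }
        }
    }
}
```

The cost is the up-front sort each time a Searcher is opened, which is why it
only pays off if Searchers are reopened rarely.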

> Bear in mind, there have been some improvements recently to the ability to
> grab individual stored fields per document....

I can't see anything like that in 2.0. Is that something in the Lucene HEAD
build?

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: 01 August 2006 09:37
To: java-user@lucene.apache.org
Subject: RE: Sorting


: I take your point that Berkeley DB would be much less clumsy, but an
: application that's already using a relational database for other purposes
: might as well use that relational database, no?

if you already have some need to access data about each matching doc from a
relational DB, then sure you might as well let it sort for you -- but just
because your APP has some DB connections open doesn't mean that's a
worthwhile reason to ask it to do the sort ... your app might have some
network connections open to an IMAP server as well .. that doesn't mean you
should convert the docs to email messages and ask the IMAP server to sort
them :)

: I'm not really with you on the random access file, Chris. Here's where I am
: up to with my [mis-]understanding...
:
: I want to sort on 2 terms. Happily these can be ints (the first is an INT
: corresponding to a 10 minute timestamp "YYMMDDHHI" and the second INT is a
: hash of a string, used to group similar documents together within those 10
: minute timestamps). When I initially warm up the FieldCache (first search
: after opening the Searcher), I start by generating two random access files
: with int values at offsets corresponding to document IDs for each of these;
: the first file would have ints corresponding to the timestamp and the second
: would have integers corresponding to the hash. I'd then need to generate a
: third file which is equivalent to an array dimensioned by document ID, with
: document IDs in compound sort order??

i'm not sure why you think you need the third file ... you should be able to
use the two files you created exactly the way the existing code would use
the two arrays if you were using an in memory FieldCache (with file seeks
instead of array lookups) .. i think the class you want to look at is
FieldSortedHitQueue
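
As a rough sketch of "file seeks instead of array lookups", using only the JDK:
each sort key lives in its own file with one 4-byte int per document ID, and a
comparator reads both keys by seeking. DiskIntArray and compoundOrder are
hypothetical names, not Lucene classes, and Arrays.sort merely stands in for
whatever FieldSortedHitQueue does with the comparison; the sample values are
made up.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical disk-backed stand-in for an int[] indexed by document ID. */
final class DiskIntArray implements AutoCloseable {
    private final RandomAccessFile file;

    private DiskIntArray(RandomAccessFile file) {
        this.file = file;
    }

    /** Writes values[docId] at byte offset docId * 4 in a temporary file. */
    static DiskIntArray write(int[] values) {
        try {
            File f = File.createTempFile("sortkey", ".bin");
            f.deleteOnExit();
            RandomAccessFile raf = new RandomAccessFile(f, "rw");
            for (int v : values) raf.writeInt(v);
            return new DiskIntArray(raf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** The file-seek equivalent of values[docId]. */
    int get(int docId) {
        try {
            file.seek(4L * docId);
            return file.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Orders document IDs by the first key, breaking ties with the second. */
    static Comparator<Integer> compoundOrder(DiskIntArray first, DiskIntArray second) {
        return (a, b) -> {
            int c = Integer.compare(first.get(a), first.get(b));
            return c != 0 ? c : Integer.compare(second.get(a), second.get(b));
        };
    }

    public static void main(String[] args) {
        // Illustrative values: "YYMMDDHHI" timestamps (leading zero dropped) and hashes.
        int[] timestamps = {60801091, 60801090, 60801091, 60801090};
        int[] hashes = {7, 9, 3, 1};
        try (DiskIntArray ts = write(timestamps); DiskIntArray h = write(hashes)) {
            Integer[] hits = {0, 1, 2, 3};
            Arrays.sort(hits, compoundOrder(ts, h));
            System.out.println(Arrays.toString(hits)); // prints [3, 1, 2, 0]
        }
    }
}
```

Every comparison here costs up to four seeks, which is the speed trade-off
against keeping the same values in RAM.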

: In a big index, it will take a while to walk through all of the documents to
: generate the first two random access files and the sort process required to
: generate the sorted file is going to be hard work.

well .. yes.  but that's the trade off, the reason for the RAM based
FieldCache is speed .. if you don't have that RAM to use, then doing the
same things on disk gets slower.


Bear in mind, there have been some improvements recently to the ability to
grab individual stored fields per document (FieldSelector is the name of the
class i think) ... i haven't tried those out yet, but they could make
Sorting on a stored field (which wouldn't require building up any cache -
RAM or Disk based) feasible regardless of the size of your result sets ...
but i haven't tried that yet.



-Hoss


