Subject: search performance & caching
Date: Mon, 28 Apr 2008 14:38:56 -0400
Message-ID: 
 <CC2671C08F06704ABFD37ED1EA09AA564EA502@birexchange.BIRPLAZA.local>
From: "Beard, Brian" <Brian.Beard@mybir.com>
To: <java-user@lucene.apache.org>


I'm using Lucene 2.2.0 and have two questions:

1) Should search times be linear with respect to the number of queries
hitting a single searcher? I've run multiple search threads against a
single searcher, and the search times scale almost exactly linearly:
10x slower for 10 threads vs. 1 thread, etc. I'm using a
ParallelMultiSearcher with a custom hit collector.
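
A minimal sketch of how such a comparison could be measured, so that
per-query latency and overall throughput are reported separately. It
uses only the JDK; the class name SearchTimingSketch and the Runnable
standing in for one search call are hypothetical, and no Lucene API is
involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SearchTimingSketch {

    /** Timings for one run, in nanoseconds. */
    static final class Result {
        final int queries;           // total queries executed
        final long wallNanos;        // elapsed time for the whole run
        final long summedQueryNanos; // sum of the individual query times

        Result(int queries, long wallNanos, long summedQueryNanos) {
            this.queries = queries;
            this.wallNanos = wallNanos;
            this.summedQueryNanos = summedQueryNanos;
        }

        double meanQueryMillis() { return summedQueryNanos / 1e6 / queries; }
        double queriesPerSecond() { return queries / (wallNanos / 1e9); }
    }

    /**
     * Runs the query queriesPerThread times on each of the given number of
     * threads. The Runnable is an assumed stand-in for one search against
     * the shared searcher.
     */
    static Result measure(final Runnable query, int threads, final int queriesPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<Future<Long>>();
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        long total = 0;
                        for (int i = 0; i < queriesPerThread; i++) {
                            long t0 = System.nanoTime();
                            query.run();
                            total += System.nanoTime() - t0;
                        }
                        return Long.valueOf(total);
                    }
                }));
            }
            long summed = 0;
            for (Future<Long> f : futures) {
                summed += f.get().longValue(); // waits for that thread to finish
            }
            long wall = System.nanoTime() - start;
            return new Result(threads * queriesPerThread, wall, summed);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling measure(query, 1, n) and measure(query, 10, n) and comparing
both numbers separates the two effects: if meanQueryMillis() grows
about 10x while queriesPerSecond() stays flat, that suggests the
searches are effectively running one at a time rather than in parallel.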

2) I'm performing some field caching during search warmup. For an index
of 3.4 million docs and 7 GB, it's taking up to 30 minutes to execute
the code snippet below. Most of this time is spent in the
multiReader.document call (where it says "THIS TAKES THE MOST TIME").

I want to know if anyone has any ideas for speeding this up. There are
multiple documents containing the same recordId. For each recordId, I
want to find which of its documents has a documentName of CORE and
which has a documentName of WL.

Then for each document in the index I store three pieces of information:
- its associated recordId
- the CORE doc number for this recordId
- the WL doc number for this recordId

Ideally, since the multiReader.document call is taking the most time,
I'd like to avoid it altogether, but I can't figure out how to get
around needing to read in the recordId.

What I really need is something like a two-dimensional termEnum I could
iterate over, covering both the recordId and documentName fields.
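
One way to picture such a two-field iteration is sketched below. It
rests on an assumption that the post does not confirm: that
DOCUMENT_NAME is indexed as a single untokenized term, so its postings
can be walked the same way as those of RECORD_ID. Plain maps from term
text to doc ids stand in for the TermEnum/TermDocs walks, and the names
RecordCacheSketch, build and doc2WlCoreDoc are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;

final class RecordCacheSketch {
    private static final byte CORE = 1;
    private static final byte WL = 2;

    final String[] doc2RecordId;
    final int[] doc2RegularCoreDoc; // -1 when the record has no CORE doc
    final int[] doc2WlCoreDoc;      // -1 when the record has no WL doc

    private RecordCacheSketch(int maxDoc) {
        doc2RecordId = new String[maxDoc];
        doc2RegularCoreDoc = new int[maxDoc];
        doc2WlCoreDoc = new int[maxDoc];
        Arrays.fill(doc2RegularCoreDoc, -1);
        Arrays.fill(doc2WlCoreDoc, -1);
    }

    /**
     * recordIdPostings and namePostings map a term's text to the ids of the
     * documents containing it. They are stand-ins for enumerating the terms
     * of RECORD_ID and DOCUMENT_NAME; no stored documents are loaded.
     */
    static RecordCacheSketch build(int maxDoc,
                                   Map<String, int[]> recordIdPostings,
                                   Map<String, int[]> namePostings) {
        // Pass 1: tag each doc as CORE, WL, or neither from the name postings.
        byte[] kind = new byte[maxDoc];
        mark(kind, namePostings.get("CORE"), CORE);
        mark(kind, namePostings.get("WL"), WL);

        // Pass 2: for each recordId, find its CORE and WL docs, then record
        // the three values for every doc that shares the recordId.
        RecordCacheSketch cache = new RecordCacheSketch(maxDoc);
        for (Map.Entry<String, int[]> entry : recordIdPostings.entrySet()) {
            int[] docs = entry.getValue();
            int coreDoc = -1;
            int wlDoc = -1;
            for (int doc : docs) {
                if (kind[doc] == CORE) {
                    coreDoc = doc;
                } else if (kind[doc] == WL) {
                    wlDoc = doc;
                }
            }
            for (int doc : docs) {
                cache.doc2RecordId[doc] = entry.getKey();
                cache.doc2RegularCoreDoc[doc] = coreDoc;
                cache.doc2WlCoreDoc[doc] = wlDoc;
            }
        }
        return cache;
    }

    private static void mark(byte[] kind, int[] docs, byte value) {
        if (docs == null) {
            return; // term not present in the index
        }
        for (int doc : docs) {
            kind[doc] = value;
        }
    }
}
```

The extra cost is one byte per document for the first pass; whether
this beats loading stored fields depends on DOCUMENT_NAME actually
being indexed in that form.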

Any ideas are appreciated.

// Now loop through all documents in the indexes and set the cache values.
TermDocs termDocs = multiReader.termDocs();
TermEnum termEnum = multiReader.terms(new Term("RECORD_ID", ""));
try {
    FieldSelector fieldSelector = getFieldSelector();
    List<Integer> docList = new ArrayList<Integer>();
    int regularCoreDocId = -1;
    int wlCoreDocId = -1;
    int docId = -1;
    Document document = null;
    String documentName = null;

    // Loop through each RECORD_ID with termEnums
    do {
        docList.clear();
        regularCoreDocId = -1;
        wlCoreDocId = -1;

        Term term = termEnum.term();
        // Stop once the enumeration runs past the RECORD_ID field.
        if (term == null || !"RECORD_ID".equals(term.field())) {
            break;
        }
        String recordId = term.text();

        // Now loop through all documents with the same recordId
        // using the termDocs.
        termDocs.seek(termEnum);
        while (termDocs.next()) {
            docId = termDocs.doc();
            docList.add(Integer.valueOf(docId));
            // THIS TAKES THE MOST TIME
            document = multiReader.document(docId, fieldSelector);
            documentName = document.get("DOCUMENT_NAME");
            if ("CORE".equals(documentName)) {
                regularCoreDocId = docId;
            } else if ("WL".equals(documentName)) {
                wlCoreDocId = docId;
            }
        }

        // Map all docIds associated with this recordId
        for (Integer i : docList) {
            doc2RecordId[i] = recordId;
        }

        // Map from the docId to the coreData docId for
        // regular core and wl core documents.
        // doc2WlCoreDoc is an assumed name for the WL cache array; it
        // must differ from the int wlCoreDocId declared above.
        for (Integer i : docList) {
            doc2RegularCoreDoc[i] = regularCoreDocId;
            doc2WlCoreDoc[i] = wlCoreDocId;
        }
    } while (termEnum.next());
} finally {
    termDocs.close();
    termEnum.close();
}

