From lucene-user-return-302-qmlist-jakarta-archive-lucene-user=jakarta.apache.org@jakarta.apache.org Mon Nov 12 19:58:34 2001 Return-Path: Delivered-To: apmail-jakarta-lucene-user-archive@apache.org Received: (qmail 37710 invoked from network); 12 Nov 2001 19:58:34 -0000 Received: from unknown (HELO nagoya.betaversion.org) (192.18.49.131) by daedalus.apache.org with SMTP; 12 Nov 2001 19:58:34 -0000 Received: (qmail 18841 invoked by uid 97); 12 Nov 2001 19:58:37 -0000 Delivered-To: qmlist-jakarta-archive-lucene-user@jakarta.apache.org Received: (qmail 18811 invoked by uid 97); 12 Nov 2001 19:58:36 -0000 Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm Precedence: bulk List-Unsubscribe: List-Subscribe: List-Help: List-Post: List-Id: "Lucene Users List" Reply-To: "Lucene Users List" Delivered-To: mailing list lucene-user@jakarta.apache.org Received: (qmail 18800 invoked from network); 12 Nov 2001 19:58:36 -0000 Message-ID: <4BC270C6AB8AD411AD0B00B0D0493DF0EE7CFC@mail.grandcentral.com> From: Doug Cutting To: 'Lucene Users List' Subject: RE: Memory Usage? Date: Mon, 12 Nov 2001 11:46:39 -0800 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2653.19) Content-Type: multipart/mixed; boundary="----_=_NextPart_000_01C16BB2.BEFBFAA0" X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N ------_=_NextPart_000_01C16BB2.BEFBFAA0 Content-Type: text/plain; charset="iso-8859-1" > From: Anders Nielsen [mailto:anders@visator.dk] > > this was a big boolean query, with several prefixqueries but > no wildcard > queries in the or-branches. Well it looks like those prefixes are expanding to a lot of terms, a total of over 40,000! (A prefix query expands into a BooleanQuery with all the terms matching the prefix.) If most of these expansions are low-frequency, then a simple fix should improve things considerably. I've attached an optimized version of TermQuery that will hold less memory per low-frequency term. In particular, if a term occurs fewer than 128 times then a 1024 byte InputStream buffer is freed immediately. Tell me how this works. Please send another heap dump. Longer term, or if lots of the expanded terms occur more than 128 times, perhaps BooleanScorer should use a different algorithm when there are thousands of terms. In this case it might use less memory to construct an array of score buckets for all documents. If (query.termCount() * 1024) > (12 * getMaxDoc()) then this would use less memory. In your case, with 500,000 documents and a 40,000 term query, it's currently taking 40MB/query, and could be done in 6MB/query. This optimization would not be too difficult, as it could be mostly isolated to BooleanQuery and BooleanScorer. Doug ------_=_NextPart_000_01C16BB2.BEFBFAA0 Content-Type: application/octet-stream; name="PhraseQuery.java" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="PhraseQuery.java" package org.apache.lucene.search; /* = =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D * The Apache Software License, Version 1.1 * * Copyright (c) 2001 The Apache Software Foundation. All rights * reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in * the documentation and/or other materials provided with the * distribution. * * 3. The end-user documentation included with the redistribution, * if any, must include the following acknowledgment: * "This product includes software developed by the * Apache Software Foundation (http://www.apache.org/)." * Alternately, this acknowledgment may appear in the software = itself, * if and wherever such third-party acknowledgments normally appear. * * 4. The names "Apache" and "Apache Software Foundation" and * "Apache Lucene" must not be used to endorse or promote products * derived from this software without prior written permission. For * written permission, please contact apache@apache.org. * * 5. Products derived from this software may not be called "Apache", * "Apache Lucene", nor may "Apache" appear in their name, without * prior written permission of the Apache Software Foundation. * * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE * DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * = =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D * * This software consists of voluntary contributions made by many * individuals on behalf of the Apache Software Foundation. For more * information on the Apache Software Foundation, please see * . */ import java.io.IOException; import java.util.Vector; import org.apache.lucene.index.Term; import org.apache.lucene.index.TermDocs; import org.apache.lucene.index.TermPositions; import org.apache.lucene.index.IndexReader; /** A Query that matches documents containing a particular sequence of = terms. This may be combined with other terms with a {@link BooleanQuery}. */ final public class PhraseQuery extends Query { private String field; private Vector terms =3D new Vector(); private float idf =3D 0.0f; private float weight =3D 0.0f; private float boost =3D 1.0f; private int slop =3D 0; /** Constructs an empty phrase query. */ public PhraseQuery() { } /** Sets the boost for this term to b. Documents = containing this term will (in addition to the normal weightings) have their = score multiplied by b. */ public final void setBoost(float b) { boost =3D b; } /** Gets the boost for this term. Documents containing this term will (in addition to the normal weightings) have their = score multiplied by b. The boost is 1.0 by default. */ public final float getBoost() { return boost; } =20 /** Sets the number of other words permitted between words in query = phrase. If zero, then this is an exact phrase search. For larger values = this works like a WITHIN or NEAR operator.

The slop is in fact an edit-distance, where the units correspond = to moves of terms in the query phrase out of position. For example, = to switch the order of two words requires two moves (the first move places = the words atop one another), so to permit re-orderings of phrases, the slop = must be at least two.

More exact matches are scored higher than sloppier matches, thus = search results are sorted by exactness.

The slop is zero by default, requiring exact matches.*/ public final void setSlop(int s) { slop =3D s; } /** Returns the slop. See setSlop(). */ public final int getSlop() { return slop; } /** Adds a term to the end of the query phrase. */ public final void add(Term term) { if (terms.size() =3D=3D 0) field =3D term.field(); else if (term.field() !=3D field) throw new IllegalArgumentException ("All phrase terms must be in the same field: " + term); terms.addElement(term); } final float sumOfSquaredWeights(Searcher searcher) throws IOException = { for (int i =3D 0; i < terms.size(); i++) // sum term IDFs idf +=3D Similarity.idf((Term)terms.elementAt(i), searcher); weight =3D idf * boost; return weight * weight; // square term weights } final void normalize(float norm) { weight *=3D norm; // normalize for query weight *=3D idf; // factor from document } final Scorer scorer(IndexReader reader) throws IOException { if (terms.size() =3D=3D 0) // optimize zero-term case return null; if (terms.size() =3D=3D 1) { // optimize one-term case Term term =3D (Term)terms.elementAt(0); TermDocs docs =3D reader.termDocs(term); if (docs =3D=3D null) return null; return new TermScorer(docs, reader.docFreq(term), reader.norms(term.field()), weight); } TermPositions[] tps =3D new TermPositions[terms.size()]; for (int i =3D 0; i < terms.size(); i++) { TermPositions p =3D = reader.termPositions((Term)terms.elementAt(i)); if (p =3D=3D null) return null; tps[i] =3D p; } if (slop =3D=3D 0) // optimize exact case return new ExactPhraseScorer(tps, reader.norms(field), weight); else return new SloppyPhraseScorer(tps, slop, reader.norms(field), weight); } /** Prints a user-readable version of this query. */ public final String toString(String f) { StringBuffer buffer =3D new StringBuffer(); if (!field.equals(f)) { buffer.append(field); buffer.append(":"); } buffer.append("\""); for (int i =3D 0; i < terms.size(); i++) { buffer.append(((Term)terms.elementAt(i)).text()); if (i !=3D terms.size()-1) buffer.append(" "); } buffer.append("\""); if (boost !=3D 1.0f) { buffer.append("^"); buffer.append(Float.toString(boost)); } return buffer.toString(); } } ------_=_NextPart_000_01C16BB2.BEFBFAA0 Content-Type: application/octet-stream; name="TermQuery.java" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="TermQuery.java" package org.apache.lucene.search; /* = =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D * The Apache Software License, Version 1.1 * * Copyright (c) 2001 The Apache Software Foundation. All rights * reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in * the documentation and/or other materials provided with the * distribution. * * 3. The end-user documentation included with the redistribution, * if any, must include the following acknowledgment: * "This product includes software developed by the * Apache Software Foundation (http://www.apache.org/)." * Alternately, this acknowledgment may appear in the software = itself, * if and wherever such third-party acknowledgments normally appear. * * 4. The names "Apache" and "Apache Software Foundation" and * "Apache Lucene" must not be used to endorse or promote products * derived from this software without prior written permission. For * written permission, please contact apache@apache.org. * * 5. Products derived from this software may not be called "Apache", * "Apache Lucene", nor may "Apache" appear in their name, without * prior written permission of the Apache Software Foundation. * * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE * DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * = =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D * * This software consists of voluntary contributions made by many * individuals on behalf of the Apache Software Foundation. For more * information on the Apache Software Foundation, please see * . */ import java.io.IOException; import org.apache.lucene.index.Term; import org.apache.lucene.index.TermDocs; import org.apache.lucene.index.IndexReader; /** A Query that matches documents containing a term. This may be combined with other terms with a {@link BooleanQuery}. */ final public class TermQuery extends Query { private Term term; private int docFreq; private float boost =3D 1.0f; private float idf =3D 0.0f; private float weight =3D 0.0f; /** Constructs a query for the term t. */ public TermQuery(Term t) { term =3D t; } /** Sets the boost for this term to b. Documents = containing this term will (in addition to the normal weightings) have their = score multiplied by b. */ public void setBoost(float b) { boost =3D b; } /** Gets the boost for this term. Documents containing this term will (in addition to the normal weightings) have their = score multiplied by b. The boost is 1.0 by default. */ public float getBoost() { return boost; } =20 final float sumOfSquaredWeights(Searcher searcher) throws IOException = { docFreq =3D searcher.docFreq(term); idf =3D Similarity.idf(docFreq, searcher.maxDoc()); weight =3D idf; return weight * weight; // square term weights } final void normalize(float norm) { weight *=3D norm; // normalize for query weight *=3D idf; // factor from document weight *=3D boost; // boost query } Scorer scorer(IndexReader reader) throws IOException { TermDocs termDocs =3D reader.termDocs(term); if (termDocs =3D=3D null) return null; =20 return new TermScorer(termDocs, docFreq, reader.norms(term.field()), weight); } /** Prints a user-readable version of this query. */ public String toString(String field) { StringBuffer buffer =3D new StringBuffer(); if (!term.field().equals(field)) { buffer.append(term.field()); buffer.append(":"); } buffer.append(term.text()); if (boost !=3D 1.0f) { buffer.append("^"); buffer.append(Float.toString(boost)); } return buffer.toString(); } } ------_=_NextPart_000_01C16BB2.BEFBFAA0 Content-Type: application/octet-stream; name="TermScorer.java" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="TermScorer.java" package org.apache.lucene.search; /* = =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D * The Apache Software License, Version 1.1 * * Copyright (c) 2001 The Apache Software Foundation. All rights * reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in * the documentation and/or other materials provided with the * distribution. * * 3. The end-user documentation included with the redistribution, * if any, must include the following acknowledgment: * "This product includes software developed by the * Apache Software Foundation (http://www.apache.org/)." * Alternately, this acknowledgment may appear in the software = itself, * if and wherever such third-party acknowledgments normally appear. * * 4. The names "Apache" and "Apache Software Foundation" and * "Apache Lucene" must not be used to endorse or promote products * derived from this software without prior written permission. For * written permission, please contact apache@apache.org. * * 5. Products derived from this software may not be called "Apache", * "Apache Lucene", nor may "Apache" appear in their name, without * prior written permission of the Apache Software Foundation. * * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE * DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * = =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D * * This software consists of voluntary contributions made by many * individuals on behalf of the Apache Software Foundation. For more * information on the Apache Software Foundation, please see * . */ import java.io.IOException; import org.apache.lucene.index.TermDocs; final class TermScorer extends Scorer { private TermDocs termDocs; private byte[] norms; private float weight; private int doc; private static final int BUFFER_SIZE =3D 128; private static final int[] SENTINEL =3D {Integer.MAX_VALUE}; private int[] docs; // buffered doc = numbers private int[] freqs; // buffered term = freqs private int pointer; // pointer into = buffers private int pointerMax; // end of buffers private static final int SCORE_CACHE_SIZE =3D 32; private float[] scoreCache =3D new float[SCORE_CACHE_SIZE]; TermScorer(TermDocs td, int docFreq, byte[] n, float w) throws = IOException { termDocs =3D td; norms =3D n; weight =3D w; for (int i =3D 0; i < SCORE_CACHE_SIZE; i++) scoreCache[i] =3D Similarity.tf(i) * weight; // minimize buffer sizes int bufferSize =3D (docFreq < BUFFER_SIZE) ? (docFreq+1) : = BUFFER_SIZE; docs =3D new int[bufferSize]; freqs =3D new int[bufferSize]; refill(); } private void refill() throws IOException { if (termDocs !=3D null) { pointerMax =3D termDocs.read(docs, freqs); // try to fill = buffers if (pointerMax < docs.length) { // read last into = buffers termDocs.close(); // close stream termDocs =3D null; // gc stream's = buffer } } else { pointerMax =3D 0; } if (pointerMax =3D=3D 0) { // last buffer = consumed docs =3D SENTINEL; // gc docs, set = sentinel freqs =3D null; // gc freqs } pointer =3D 0; doc =3D docs[pointer]; } final void score(HitCollector c, final int end) throws IOException { int d =3D doc; // cache doc in local while (d < end) { // for docs in window final int f =3D freqs[pointer]; float score =3D // compute tf(f)*weight f < SCORE_CACHE_SIZE // check cache ? scoreCache[f] // cache hit : Similarity.tf(f)*weight; // cache miss byte normByte =3D norms[d]; if (normByte !=3D 0) { // if not deleted score *=3D Similarity.norm(normByte); // normalize for field c.collect(d, score); // collect score } if (++pointer =3D=3D pointerMax) refill(); d =3D docs[pointer]; } doc =3D d; // flush cache } } ------_=_NextPart_000_01C16BB2.BEFBFAA0 Content-Type: text/plain; charset=us-ascii -- To unsubscribe, e-mail: For additional commands, e-mail: ------_=_NextPart_000_01C16BB2.BEFBFAA0--