From: karl wettin
Subject: Re: Using Lucene for searching tokens, not storing them.
Cc: java-dev@lucene.apache.org
Date: Mon, 17 Apr 2006 08:16:04 +0200
To: java-user@lucene.apache.org

On 16 Apr 2006, at 19:18, karl wettin wrote:

> For any interested party, I do this because I have a fairly small
> corpus with very heavy load. I think there is a lot to win by not
> creating new instances of whatnot, seeking in the file-centric
> Directory, parsing pseudo-UTF8, etc. at query time. I simply store
> all instances of everything (the index) in a bunch of Lists and Maps.
> Bits are cheaper than ticks.

I will most definitely follow this path.

My tests used the IMDB TV series as corpus. It contains about 45 000 documents and has plenty of unique terms. On my G4 the 190 000 queries took:

  193 476 milliseconds on a RAMDirectory
  123 193 milliseconds with my code branch

That is about 36% less time.

The code still contains lots of things that can be optimized for both memory and CPU. I am pretty sure it can be cranked down to use a fraction of the ticks spent by a RAMDirectory. I aim at 1/3.

The FSDirectory takes 5 MB with no fields stored. My implementation occupies about 100 MB of RAM, but that includes me treating all fields as Store.YES, so the two are not comparable at this stage.

I did not time the indexing, but it felt as if it was about three to five times as fast.

Personally I'll be using Prevayler for persistence (java.io.Serializable with transactions).

Basically, this is what I did:

    public final class Document implements java.io.Serializable {
      private static final long serialVersionUID = 1L;
      private Integer documentNumber;
      private Map termsPositions;
      private Map termFrequecyVectorsByField;
      private TermFreqVector[] termFrequencyVectors;

    public final class Term implements Comparable, java.io.Serializable {
      private static final long serialVersionUID = 1L;
      private int orderIndex;
      private ArrayList documents;

    public class MemImplManager implements Serializable {
      private static final long serialVersionUID = 1L;
      private transient Map normsByFieldCache;
      private Map normsByField;
      private ArrayList orderedTerms;
      private ArrayList documents;
      private Map termsByFieldAndName;

      private class MemImplReader extends IndexReader {
        ...

Not everything is implemented yet, so my test contains only SpanQueries.

    for (int i = 0; i < 10000; i++) {
      placeQuery(new String[]{"csi", "ny"});
      placeQuery(new String[]{"csi", "new", "york"});
      placeQuery(new String[]{"star", "trek", "enterprise"});
      placeQuery(new String[]{"star", "trek", "deep", "space"});
      placeQuery(new String[]{"lust", "in", "space"});
      placeQuery(new String[]{"lost", "in", "space"});
      placeQuery(new String[]{"lost"});
      placeQuery(new String[]{"that", "70", "show"});
      placeQuery(new String[]{"the", "y-files"});
      placeQuery(new String[]{"csi", "las", "vegas"});
      placeQuery(new String[]{"stargate", "sg-1"});
      placeQuery(new String[]{"stargate", "atlantis"});
      placeQuery(new String[]{"miami", "vice"});
      placeQuery(new String[]{"miami", "voice"});
      placeQuery(new String[]{"big", "brother"});
      placeQuery(new String[]{"my", "name", "is", "earl"});
      placeQuery(new String[]{"falcon", "crest"});
      placeQuery(new String[]{"dallas"});
      placeQuery(new String[]{"v"});
    }

    protected Query buildQuery(String[] nameTokens) {
      BooleanQuery q = new BooleanQuery();
      BooleanQuery bqStrategies = new BooleanQuery();

      /** name in order ^10 */
      {
        SpanQuery[] spanQueries = new SpanQuery[nameTokens.length];
        for (int i = 0; i < spanQueries.length; i++) {
          spanQueries[i] = new SpanTermQuery(new Term("name", nameTokens[i]));
        }
        // slop 0, in order: the tokens must be adjacent and in the given order
        SpanQuery nameQuery = new SpanNearQuery(spanQueries, 0, true);
        nameQuery.setBoost(10);
        bqStrategies.add(new BooleanClause(nameQuery, BooleanClause.Occur.SHOULD));
      }

      /** aka name in order ^1 */
      {
        SpanQuery[] spanQueries = new SpanQuery[nameTokens.length];
        for (int i = 0; i < spanQueries.length; i++) {
          spanQueries[i] = new SpanTermQuery(new Term("akaName", nameTokens[i]));
        }
        SpanQuery nameQuery = new SpanNearQuery(spanQueries, 0, true);
        nameQuery.setBoost(1);
        bqStrategies.add(new BooleanClause(nameQuery, BooleanClause.Occur.SHOULD));
      }

      // at least one of the two strategies has to match
      q.add(new BooleanClause(bqStrategies, BooleanClause.Occur.MUST));
      return q;
    }
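
For readers who want to see the general idea of keeping the whole index in Lists and Maps, here is a minimal sketch. It is not the code branch described above and uses no Lucene classes: the class InMemoryPhraseIndex and its methods addDocument and searchInOrder are hypothetical names made up for this illustration, it assumes Java 8 or later, and it only handles pre-tokenized documents in a single field. The in-order search roughly mirrors what a SpanNearQuery with slop 0 and inOrder=true asks for, without any scoring or boosts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical toy index: term -> (document number -> token positions), all on the heap. */
final class InMemoryPhraseIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();
    private int nextDocumentNumber = 0;

    /** Adds a pre-tokenized document and returns its document number. */
    int addDocument(String[] tokens) {
        int doc = nextDocumentNumber++;
        for (int position = 0; position < tokens.length; position++) {
            postings.computeIfAbsent(tokens[position], t -> new HashMap<>())
                    .computeIfAbsent(doc, d -> new ArrayList<>())
                    .add(position);
        }
        return doc;
    }

    /** Returns, in ascending order, the documents where the tokens occur adjacently and in order. */
    List<Integer> searchInOrder(String[] tokens) {
        List<Integer> hits = new ArrayList<>();
        if (tokens.length == 0) {
            return hits;
        }
        Map<Integer, List<Integer>> first = postings.get(tokens[0]);
        if (first == null) {
            return hits;
        }
        // TreeSet gives a stable, ascending document order for the result list.
        for (Integer doc : new TreeSet<>(first.keySet())) {
            for (int start : first.get(doc)) {
                if (matchesAt(doc, start, tokens)) {
                    hits.add(doc);
                    break; // one match per document is enough
                }
            }
        }
        return hits;
    }

    /** Checks that tokens[1..] sit at positions start+1, start+2, ... in the given document. */
    private boolean matchesAt(int doc, int start, String[] tokens) {
        for (int i = 1; i < tokens.length; i++) {
            Map<Integer, List<Integer>> byDoc = postings.get(tokens[i]);
            if (byDoc == null) {
                return false;
            }
            List<Integer> positions = byDoc.get(doc);
            if (positions == null || !positions.contains(start + i)) {
                return false;
            }
        }
        return true;
    }
}
```

Nothing here touches a Directory or decodes bytes at query time; every lookup is a plain map access, which is the trade of memory for CPU ticks discussed above. A real implementation would also need norms, term frequencies and per-field postings, which this sketch leaves out.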