lucene-dev mailing list archives

From Wolfgang Hoschek <whosc...@lbl.gov>
Subject Re: [Performance] Streaming main memory indexing of single strings
Date Fri, 15 Apr 2005 22:58:14 GMT
A primary reason for the tight API proposed below is that it allows for 
maximum efficiency (which is the point of the exercise in the first 
place):

- One can extend Searcher rather than IndexReader: There's no need to 
subclass IndexReader and carry the baggage of the superclass doing 
locking and all sorts of unnecessary stuff with its internal 
RAMDirectory.
- Even more extreme: Don't extend Searcher but implement the 
functionality directly using low-level APIs, as sketched below. This 
avoids unnecessary baggage for collecting hits, etc.
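
For example, given a Searcher and a Query, a minimal (untested) sketch 
against the Lucene 1.4 search(Query, HitCollector) API could grab the 
single document's score without materializing Hits at all:

	// collect the best score directly - no Hits caching or score
	// normalization along the way
	final float[] max = new float[1];
	searcher.search(query, new HitCollector() {
		public void collect(int doc, float score) {
			if (score > max[0]) max[0] = score;
		}
	});
	return max[0]; // 0.0f if nothing matched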

Wolfgang.

On Apr 15, 2005, at 3:15 PM, Wolfgang Hoschek wrote:

> Cool! For my use case it would need to be able to handle arbitrary 
> queries (previously parsed from a general lucene query string). 
> Something like:
>
> 	float match(String text, Query query)
>
> it's fine with me if it also works for
>
> 	float[] match(String[] texts, Query query) or
> 	float match(Document doc, Query query)
>
> but that isn't required by the use case.
>
> Wolfgang.
>
>> I am intrigued by this and decided to mock up a quick and dirty 
>> example of such an IndexReader.  After a little trial and error I got 
>> it working, at least for TermQuery and WildcardQuery.  I've pasted my 
>> code below as an example, but there is much room for improvement, 
>> especially in terms of performance and in keeping track of term 
>> frequency; it would also be nicer if it handled the analysis 
>> internally.
>>
>> I think something like this would make a handy addition to our 
>> contrib area at least.  I'd be happy to receive improvements to this 
>> and then add it to a contrib subproject.
>>
>> Perhaps this would be a handy way to handle situations where users 
>> have queries saved in a system and need to be alerted whenever a new 
>> document arrives matching the saved queries?
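>>
>> Roughly, that alerting loop might look like this (untested sketch; 
>> savedQueries is a hypothetical List of saved Query objects, and 
>> searcher wraps the one-document index built from the new document):
>>
>>     // run every saved query against the one-document index of the
>>     // newly arrived document; collect the queries that matched
>>     List alerts = new ArrayList();
>>     for (Iterator it = savedQueries.iterator(); it.hasNext();) {
>>       Query query = (Query) it.next();
>>       if (searcher.search(query).length() > 0)
>>         alerts.add(query); // notify the owner of this saved query
>>     }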
>>
>> 	Erik
>>
>>
>>
>>>
>>>
>>> -----Original Message-----
>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>> Sent: Thursday, April 14, 2005 4:04 PM
>>> To: java-dev@lucene.apache.org
>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>> strings
>>>
>>>
>>> This seems to be a promising avenue worth exploring. My gut feeling
>>> is that this could easily be 10-100 times faster.
>>>
>>> The drawback is that it requires a fair amount of understanding of
>>> intricate Lucene internals, pulling those pieces together and 
>>> adapting
>>> them as required for the seemingly simple "float match(String text,
>>> Query query)".
>>>
>>> I might give it a shot but I'm not sure I'll be able to pull this 
>>> off!
>>> Is there any similar code I could look at as a starting point?
>>>
>>> Wolfgang.
>>>
>>> On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
>>>
>>>> I think you are not approaching this the correct way.
>>>>
>>>> Pseudo code:
>>>>
>>>> Subclass IndexReader.
>>>>
>>>> Get tokens from String 'document' using Lucene analyzers.
>>>>
>>>> Build simple hash-map based data structures from the tokens, for
>>>> terms and term positions (sketched below).
>>>>
>>>> Reimplement termDocs() and termPositions() to use the structures
>>>> from above.
>>>>
>>>> Run searches.
>>>>
>>>> Start again with the next document.
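>>>>
>>>> The tokenizing and hash-map building steps might look roughly like
>>>> this (untested sketch against the Lucene 1.4 analysis API):
>>>>
>>>> TokenStream stream = analyzer.tokenStream("content",
>>>>     new StringReader(text));
>>>> Map termPositions = new HashMap(); // term -> List of positions
>>>> Token token;
>>>> int pos = -1;
>>>> while ((token = stream.next()) != null) {
>>>>     pos += token.getPositionIncrement();
>>>>     List positions = (List) termPositions.get(token.termText());
>>>>     if (positions == null) {
>>>>         positions = new ArrayList();
>>>>         termPositions.put(token.termText(), positions);
>>>>     }
>>>>     positions.add(new Integer(pos));
>>>> }
>>>> stream.close();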
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>>> Sent: Thursday, April 14, 2005 2:56 PM
>>>> To: java-dev@lucene.apache.org
>>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>>> strings
>>>>
>>>>
>>>> Otis, this might be a misunderstanding.
>>>>
>>>> - I'm not calling optimize(). That piece is commented out if you
>>>> look again at the code.
>>>> - The *streaming* use case requires that for each query I add one 
>>>> (and
>>>> only one) document (aka string) to an empty index:
>>>>
>>>> repeat N times (where N is millions or billions):
>>>> 	add a single string (aka document) to an empty index
>>>> 	query the index
>>>> 	drop index (or delete its document)
>>>>
>>>> with the following API being called N times: float match(String
>>>> text, Query query)
>>>>
>>>> So there's no possibility of adding many documents and thereafter
>>>> running the query. This in turn seems to mean that the IndexWriter
>>>> can't be kept open - unless I manually delete each document after 
>>>> each
>>>> query to repeatedly reuse the RAMDirectory, which I've also tried
>>>> before without any significant performance gain - deletion seems to
>>>> have substantial overhead in itself. Perhaps it would be better if
>>>> there were a Directory.deleteAllDocuments() or similar. Did you have
>>>> some other approach in mind?
>>>>
>>>> As I said, Lucene's design doesn't seem to fit this streaming use 
>>>> case
>>>> pattern well. In *this* scenario one could easily do without any
>>>> locking, and without byte level organization in RAMDirectory and
>>>> RAMFile, etc because a single small string isn't a large persistent
>>>> multi-document index.
>>>>
>>>> For some background, here's a small example for the kind of XQuery
>>>> functionality Nux/Lucene integration enables:
>>>>
>>>> (: An XQuery that finds all books authored by James that have 
>>>> something
>>>> to do with "fish", sorted by relevance :)
>>>> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
>>>> declare variable $query := "fish*~";
>>>>
>>>> for $book in /books/book[author="James" and lucene:match(string(.),
>>>> $query) > 0.0]
>>>> let $score := lucene:match(string($book), $query)
>>>> order by $score descending
>>>> return (<score>{$score}</score>, $book)
>>>>
>>>> More interestingly one can use this for classifying and routing XML
>>>> messages based on rules (i.e. queries) inspecting their content...
>>>>
>>>> Any other clues about potential improvements would be greatly
>>>> appreciated.
>>>>
>>>> Wolfgang.
>>>>
>>>> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>>>>
>>>>> It looks like you are calling that IndexWriter code in some loops,
>>>>> opening it and closing it in every iteration of the loop and also
>>>>> calling optimize.  All of those things could be improved.
>>>>> Keep your IndexWriter open, don't close it, and optimize the index
>>>>> only
>>>>> once you are done adding documents to it.
>>>>>
>>>>> See the highlights and the snippets in the first hit:
>>>>>   http://www.lucenebook.com/search?query=when+to+optimize
>>>>>
>>>>> Otis
>>>>>
>>>>>
>>>>> --- Wolfgang Hoschek <whoschek@lbl.gov> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm wondering if anyone could let me know how to improve Lucene
>>>>>> performance for "streaming main memory indexing of single 
>>>>>> strings".
>>>>>> This would help to effectively integrate Lucene with the Nux 
>>>>>> XQuery
>>>>>> engine.
>>>>>>
>>>>>> Below is a small microbenchmark simulating STREAMING XQuery fulltext
>>>>>> search as typical for XML network routers, message queuing systems,
>>>>>> P2P networks, etc. In this on-the-fly main memory indexing scenario,
>>>>>> each individual string is immediately matched as soon as it becomes
>>>>>> available, without any persistence involved. This usage scenario and
>>>>>> corresponding performance profile is quite different in comparison to
>>>>>> fulltext search over persistent (read-mostly) indexes.
>>>>>>
>>>>>> The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3),
>>>>>> which is unfortunate news considering the XQuery engine can easily
>>>>>> walk hundreds of thousands of XML nodes per second. Ideally I'd like
>>>>>> to run at some 100000 queries/sec. Running this through the JDK 1.5
>>>>>> profiler, it seems that most time is spent in and below the
>>>>>> following calls:
>>>>>>
>>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>>> writer.addDocument(...);
>>>>>> writer.close();
>>>>>>
>>>>>> I tried quite a few variants of the benchmark with various options,
>>>>>> unfortunately with little or no effect. Lucene just does not seem to
>>>>>> be designed to do this sort of "transient single string index" thing.
>>>>>> All code paths related to opening, closing, reading, writing,
>>>>>> querying and object creation seem to be designed for large
>>>>>> persistent indexes.
>>>>>>
>>>>>> Any advice on what I'm missing or what could be done about it would
>>>>>> be greatly appreciated.
>>>>>>
>>>>>> Wolfgang.
>>>>>>
>>>>>> P.S. the benchmark code is attached as a file below:
>>>>>>
>>>>>> package nux.xom.pool;
>>>>>>
>>>>>> import java.io.IOException;
>>>>>> //import java.io.Reader;
>>>>>>
>>>>>> import org.apache.lucene.analysis.Analyzer;
>>>>>> //import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>>> //import org.apache.lucene.analysis.PorterStemFilter;
>>>>>> //import org.apache.lucene.analysis.SimpleAnalyzer;
>>>>>> //import org.apache.lucene.analysis.TokenStream;
>>>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>>> import org.apache.lucene.document.Document;
>>>>>> import org.apache.lucene.document.Field;
>>>>>> //import org.apache.lucene.index.IndexReader;
>>>>>> import org.apache.lucene.index.IndexWriter;
>>>>>> import org.apache.lucene.queryParser.ParseException;
>>>>>> import org.apache.lucene.queryParser.QueryParser;
>>>>>> import org.apache.lucene.search.Hits;
>>>>>> import org.apache.lucene.search.IndexSearcher;
>>>>>> import org.apache.lucene.search.Query;
>>>>>> import org.apache.lucene.search.Searcher;
>>>>>> import org.apache.lucene.store.Directory;
>>>>>> import org.apache.lucene.store.RAMDirectory;
>>>>>>
>>>>>> public final class LuceneMatcher { // TODO: make non-public
>>>>>>
>>>>>> 	private final Analyzer analyzer;
>>>>>> //	private final Directory dir = new RAMDirectory();
>>>>>>
>>>>>> 	public LuceneMatcher() {
>>>>>> 		this(new StandardAnalyzer());
>>>>>> //		this(new SimpleAnalyzer());
>>>>>> //		this(new StopAnalyzer());
>>>>>> //		this(new Analyzer() {
>>>>>> //			public final TokenStream tokenStream(String fieldName, Reader reader) {
>>>>>> //				return new PorterStemFilter(new LowerCaseTokenizer(reader));
>>>>>> //			}
>>>>>> //		});
>>>>>> 	}
>>>>>>
>>>>>> 	public LuceneMatcher(Analyzer analyzer) {
>>>>>> 		if (analyzer == null)
>>>>>> 			throw new IllegalArgumentException("analyzer must not be null");
>>>>>> 		this.analyzer = analyzer;
>>>>>> 	}
>>>>>>
>>>>>> 	public Query parseQuery(String expression) throws ParseException {
>>>>>> 		QueryParser parser = new QueryParser("content", analyzer);
>>>>>> //		parser.setPhraseSlop(0);
>>>>>> 		return parser.parse(expression);
>>>>>> 	}
>>>>>>
>>>>>> 	/**
>>>>>> 	 * Returns the relevance score by matching the given index against
>>>>>> 	 * the given Lucene query expression. The index must not contain
>>>>>> 	 * more than one Lucene "document" (aka string to be searched).
>>>>>> 	 */
>>>>>> 	public float match(Directory index, Query query) {
>>>>>> 		Searcher searcher = null;
>>>>>> 		try {
>>>>>> 			searcher = new IndexSearcher(index);
>>>>>> 			Hits hits = searcher.search(query);
>>>>>> 			float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>>>>>> 			return score;
>>>>>> 		} catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> 			throw new RuntimeException(e);
>>>>>> 		} finally {
>>>>>> 			try {
>>>>>> 				if (searcher != null) searcher.close();
>>>>>> 			} catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> 				throw new RuntimeException(e);
>>>>>> 			}
>>>>>> 		}
>>>>>> 	}
>>>>>>
>>>>>> //	public float match(String text, Query query) {
>>>>>> //		return match(createIndex(text), query);
>>>>>> //	}
>>>>>>
>>>>>> 	public Directory createIndex(String text) {
>>>>>> 		Directory dir = new RAMDirectory();
>>>>>> 		IndexWriter writer = null;
>>>>>> 		try {
>>>>>> 			writer = new IndexWriter(dir, analyzer, true);
>>>>>> //			writer.setUseCompoundFile(false);
>>>>>> //			writer.mergeFactor = 2;
>>>>>> //			writer.minMergeDocs = 1;
>>>>>> //			writer.maxMergeDocs = 1;
>>>>>>
>>>>>> 			writer.addDocument(createDocument(text));
>>>>>> //			writer.optimize();
>>>>>> 			return dir;
>>>>>> 		} catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> 			throw new RuntimeException(e);
>>>>>> 		} finally {
>>>>>> 			try {
>>>>>> 				if (writer != null) writer.close();
>>>>>> 			} catch (IOException e) { // should never happen (RAMDirectory)
>>>>>> 				throw new RuntimeException(e);
>>>>>> 			}
>>>>>> 		}
>>>>>> 	}
>>>>>>
>>>>>> 	private Document createDocument(String content) {
>>>>>> 		Document doc = new Document();
>>>>>> 		doc.add(Field.UnStored("content", content));
>>>>>> //		doc.add(Field.Text("x", content));
>>>>>> 		return doc;
>>>>>> 	}
>>>>>>
>>>>>> 	/**
>>>>>> 	 * Lucene microbenchmark simulating STREAMING XQuery fulltext search
>>>>>> 	 * as typical for XML network routers, message queuing systems, P2P
>>>>>> 	 * networks, etc. In this on-the-fly main memory indexing scenario,
>>>>>> 	 * each individual string is immediately matched as soon as it
>>>>>> 	 * becomes available without any persistence involved. This usage
>>>>>> 	 * scenario and corresponding performance profile is quite different
>>>>>> 	 * in comparison to fulltext search over persistent (read-mostly)
>>>>>> 	 * indexes.
>>>>>> 	 *
>>>>>> 	 * Example XPath: count(/table/row[lucene:match(string(./firstname),
>>>>>> 	 * "James") > 0.0])
>>>>>> 	 */
>>>>>> 	public static void main(String[] args) throws Exception {
>>>>>> 		int k = -1;
>>>>>> 		int runs = 5;
>>>>>> 		if (args.length > ++k) runs = Integer.parseInt(args[k]);
>>>>>>
>>>>>> 		int nodes = 10000;
>>>>>> 		if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>>>>>>
>>>>>> 		String content = "James is out in the woods";
>>>>>> 		if (args.length > ++k) content = args[k];
>>>>>>
>>>>>> 		String expression = "James";
>>>>>> 		if (args.length > ++k) expression = args[k];
>>>>>>
>>>>>> 		LuceneMatcher matcher = new LuceneMatcher();
>>>>>> 		Query query = matcher.parseQuery(expression); // to be reused N times
>>>>>>
>>>>>> 		for (int r = 0; r < runs; r++) {
>>>>>> 			long start = System.currentTimeMillis();
>>>>>> 			int matches = 0;
>>>>>>
>>>>>> 			for (int i = 0; i < nodes; i++) {
>>>>>> //				if (LuceneUtil.match(content + i, expression) > 0.0f) {
>>>>>> 				if (matcher.match(matcher.createIndex(content + i), query) > 0.0f) {
>>>>>> 					matches++;
>>>>>> 				}
>>>>>> 			}
>>>>>>
>>>>>> 			long end = System.currentTimeMillis();
>>>>>> 			System.out.println("matches=" + matches);
>>>>>> 			System.out.println("secs=" + ((end-start) / 1000.0f));
>>>>>> 			System.out.println("queries/sec=" + (nodes / ((end-start) / 1000.0f)));
>>>>>> 			System.out.println();
>>>>>> 		}
>>>>>> 	}
>>>>>> }
>>
>> public class StringIndexReader extends IndexReader {
>>   private List terms;
>>   public StringIndexReader(String strings[]) {
>>     super(null);
>>     terms = Arrays.asList(strings);
>>     Collections.sort(terms);
>>   }
>>
>>   public TermFreqVector[] getTermFreqVectors(int docNumber) throws IOException {
>>     System.out.println("StringIndexReader.getTermFreqVectors");
>>     return new TermFreqVector[0];
>>   }
>>
>>   public TermFreqVector getTermFreqVector(int docNumber, String field) throws IOException {
>>     System.out.println("StringIndexReader.getTermFreqVector");
>>     return null;
>>   }
>>
>>   public int numDocs() {
>>     System.out.println("StringIndexReader.numDocs");
>>     return 1;
>>   }
>>
>>   public int maxDoc() {
>>     System.out.println("StringIndexReader.maxDoc");
>>     return 1;
>>   }
>>
>>   public Document document(int n) throws IOException {
>>     System.out.println("StringIndexReader.document");
>>     return null;
>>   }
>>
>>   public boolean isDeleted(int n) {
>>     System.out.println("StringIndexReader.isDeleted");
>>     return false;
>>   }
>>
>>   public boolean hasDeletions() {
>>     System.out.println("StringIndexReader.hasDeletions");
>>     return false;
>>   }
>>
>>   public byte[] norms(String field) throws IOException {
>>     // TODO: what value to use for this?
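>>     // (guess: Similarity.encodeNorm(1.0f) would be a sensible neutral
>>     // value; a raw byte of 1 decodes to a near-zero norm factor)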
>>     System.out.println("StringIndexReader.norms: " + field);
>>     return new byte[] { 1 };
>>   }
>>
>>   public void norms(String field, byte[] bytes, int offset) throws IOException {
>>     System.out.println("StringIndexReader.norms: " + field + "*");
>>   }
>>
>>   protected void doSetNorm(int doc, String field, byte value) throws IOException {
>>     System.out.println("StringIndexReader.doSetNorm");
>>
>>   }
>>
>>   public TermEnum terms() throws IOException {
>>     System.out.println("StringIndexReader.terms");
>>     return terms(null);
>>   }
>>
>>   public TermEnum terms(final Term term) throws IOException {
>>     System.out.println("StringIndexReader.terms: " + term);
>>
>>     TermEnum termEnum = new TermEnum() {
>>       private String currentTerm;
>>       private Iterator iter;
>>
>>       public boolean next() {
>>         System.out.println("TermEnum.next");
>>         // advance first, then report success; checking hasNext() after
>>         // the advance would silently drop the last term
>>         if (!iter.hasNext())
>>           return false;
>>         currentTerm = (String) iter.next();
>>         return true;
>>       }
>>
>>       public Term term() {
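>>         // lazily positions the enumeration at the first term with the
>>         // requested prefix; assumes a non-null start term (true for
>>         // the wildcard demo in main)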
>>         if (iter == null) {
>>           iter = terms.iterator();
>>           while (next()) {
>>             if (currentTerm.startsWith(term.text()))
>>               break;
>>           }
>>         }
>>         System.out.println("TermEnum.term: " + currentTerm);
>>         return new Term(term.field(), currentTerm);
>>       }
>>
>>       public int docFreq() {
>>         System.out.println("TermEnum.docFreq");
>>         return 1;
>>       }
>>
>>       public void close() {
>>         System.out.println("TermEnum.close");
>>       }
>>     };
>>     return termEnum;
>>   }
>>
>>   public int docFreq(Term term) throws IOException {
>>     System.out.println("StringIndexReader.docFreq: " + term);
>>     return terms.contains(term.text()) ? 1 : 0;
>>   }
>>
>>   public TermDocs termDocs() throws IOException {
>>     System.out.println("StringIndexReader.termDocs");
>>
>>     TermDocs td = new TermDocs() {
>>       private boolean done = false;
>>       String currentTerm;
>>
>>       public void seek(Term term) {
>>         System.out.println(".seek: " + term);
>>         currentTerm = term.text();
>>         done = false;
>>       }
>>
>>       public void seek(TermEnum termEnum) {
>>         seek(termEnum.term());
>>       }
>>
>>       public int doc() {
>>         System.out.println(".doc");
>>         return 0;
>>       }
>>
>>       public int freq() {
>>         System.out.println(".freq");
>>         return 1;
>>       }
>>
>>       public boolean next() {
>>         System.out.println(".next");
>>         return false;
>>       }
>>
>>       public int read(int[] docs, int[] freqs) {
>>         System.out.println(".read: " + docs.length);
>>
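>>         // single-document index: hand out doc 0 exactly once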
>>         if (done) return 0;
>>
>>         done = true;
>>         docs[0] = 0;
>>         freqs[0] = freq();
>>         return 1;
>>       }
>>
>>       public boolean skipTo(int target) {
>>         System.out.println(".skipTo");
>>         return false;
>>       }
>>
>>       public void close() {
>>         System.out.println(".close");
>>
>>       }
>>     };
>>     return td;
>>   }
>>
>>   public TermPositions termPositions() throws IOException {
>>     System.out.println("StringIndexReader.termPositions");
>>     return null;
>>   }
>>
>>   protected void doDelete(int docNum) throws IOException {
>>     System.out.println("StringIndexReader.doDelete");
>>
>>   }
>>
>>   protected void doUndeleteAll() throws IOException {
>>     System.out.println("StringIndexReader.doUndeleteAll");
>>
>>   }
>>
>>   protected void doCommit() throws IOException {
>>     System.out.println("StringIndexReader.doCommit");
>>
>>   }
>>
>>   protected void doClose() throws IOException {
>>     System.out.println("StringIndexReader.doClose");
>>
>>   }
>>
>>   public Collection getFieldNames() throws IOException {
>>     System.out.println("StringIndexReader.getFieldNames");
>>     return null;
>>   }
>>
>>   public Collection getFieldNames(boolean indexed) throws IOException {
>>     System.out.println("StringIndexReader.getFieldNames");
>>     return null;
>>   }
>>
>>   public Collection getIndexedFieldNames(Field.TermVector tvSpec) {
>>     System.out.println("StringIndexReader.getIndexedFieldNames");
>>     return null;
>>   }
>>
>>   public Collection getFieldNames(FieldOption fldOption) {
>>     System.out.println("StringIndexReader.getFieldNames");
>>     return null;
>>   }
>>
>>   public static void main(String[] args) {
>>     IndexReader reader = new StringIndexReader(new String[] {"foo", "bar", "baz"});
>>     IndexSearcher searcher = new IndexSearcher(reader);
>>
>>     try {
>>       Hits hits = searcher.search(new WildcardQuery(new Term("field", "ba*")));
>>       System.out.println("found " + hits.length());
>>     } catch (IOException e) {
>>       e.printStackTrace();
>>     }
>>   }
>> }


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

