lucene-dev mailing list archives

From "Robert Engels" <reng...@ix.netcom.com>
Subject RE: [Performance] Streaming main memory indexing of single strings
Date Thu, 14 Apr 2005 21:11:44 GMT
It is really not that involved. Just implement the abstract methods of
IndexReader. Many can be no-op'd because they will never be called in a
"read only" situation.

Methods related to normalization and such can also be no-op'd because you
are only dealing with a single document.

I would think you will find this approach at least an order of magnitude
faster, if not two.
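To make the data-structure step concrete, here is a minimal, Lucene-free sketch under stated assumptions: the class name SingleDocIndex is made up, and a lowercase/whitespace split stands in for a real Lucene Analyzer. A read-only IndexReader subclass along these lines would serve its termDocs() and termPositions() answers from maps like these:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name; a Lucene-free stand-in for the per-document
// structures a read-only IndexReader subclass would serve from.
public final class SingleDocIndex {

    // term -> positions at which the term occurs in the single document
    private final Map<String, List<Integer>> positions = new HashMap<>();

    public SingleDocIndex(String document) {
        // Plain lowercase/whitespace split, standing in for a Lucene Analyzer.
        String[] tokens = document.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            positions.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
    }

    // Roughly what termDocs() would answer for the one document: the term frequency.
    public int freq(String term) {
        List<Integer> p = positions.get(term);
        return p == null ? 0 : p.size();
    }

    // Roughly what termPositions() would answer.
    public List<Integer> positions(String term) {
        return positions.getOrDefault(term, Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        SingleDocIndex idx = new SingleDocIndex("James is out in the woods James");
        System.out.println(idx.freq("james"));      // prints 2
        System.out.println(idx.positions("james")); // prints [0, 6]
    }
}
```

Actual scoring would additionally need docFreq() and friends, but as noted above, the norm-related methods can be no-op'd when there is only ever one document.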


-----Original Message-----
From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
Sent: Thursday, April 14, 2005 4:04 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single
strings


This seems to be a promising avenue worth exploring. My gut feeling is
that this could easily be 10-100 times faster.

The drawback is that it requires a fair amount of understanding of
intricate Lucene internals, pulling those pieces together and adapting
them as required for the seemingly simple "float match(String text,
Query query)".

I might give it a shot but I'm not sure I'll be able to pull this off!
Is there any similar code I could look at as a starting point?

Wolfgang.

On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:

> I think you are not approaching this the correct way.
>
> Pseudo code:
>
> Subclass IndexReader.
>
> Get tokens from String 'document' using Lucene analyzers.
>
> Build simple hash-map based data structures using the tokens, for terms
> and term positions.
>
> Reimplement termDocs() and termPositions() to use the structures from
> above.
>
> Run searches.
>
> Start again with the next document.
>
>
>
> -----Original Message-----
> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
> Sent: Thursday, April 14, 2005 2:56 PM
> To: java-dev@lucene.apache.org
> Subject: Re: [Performance] Streaming main memory indexing of single
> strings
>
>
> Otis, this might be a misunderstanding.
>
> - I'm not calling optimize(). That piece is commented out if you look
> again at the code.
> - The *streaming* use case requires that for each query I add one (and
> only one) document (aka string) to an empty index:
>
> repeat N times (where N is millions or billions):
> 	add a single string (aka document) to an empty index
> 	query the index
> 	drop index (or delete its document)
>
> with the following API being called N times: float match(String text,
> Query query)
>
> So there's no possibility of adding many documents and thereafter
> running the query. This in turn seems to mean that the IndexWriter
> can't be kept open - unless I manually delete each document after each
> query to repeatedly reuse the RAMDirectory, which I've also tried
> before without any significant performance gain - deletion seems to
> have substantial overhead in itself. Perhaps it would be better if
> there were a Directory.deleteAllDocuments() or similar. Did you have
> some other approach in mind?
>
> As I said, Lucene's design doesn't seem to fit this streaming use case
> pattern well. In *this* scenario one could easily do without any
> locking, and without byte level organization in RAMDirectory and
> RAMFile, etc because a single small string isn't a large persistent
> multi-document index.
>
> For some background, here's a small example for the kind of XQuery
> functionality Nux/Lucene integration enables:
>
> (: An XQuery that finds all books authored by James that have something
> to do with "fish", sorted by relevance :)
> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
> declare variable $query := "fish*~";
>
> for $book in /books/book[author="James" and lucene:match(string(.),
> $query) > 0.0]
> let $score := lucene:match(string($book), $query)
> order by $score descending
> return (<score>{$score}</score>, $book)
>
> More interestingly one can use this for classifying and routing XML
> messages based on rules (i.e. queries) inspecting their content...
>
> Any other clues about potential improvements would be greatly
> appreciated.
>
> Wolfgang.
>
> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>
>> It looks like you are calling that IndexWriter code in some loops,
>> opening it and closing it in every iteration of the loop and also
>> calling optimize.  All of those things could be improved.
>> Keep your IndexWriter open, don't close it, and optimize the index
>> only
>> once you are done adding documents to it.
>>
>> See the highlights and the snipets in the first hit:
>>   http://www.lucenebook.com/search?query=when+to+optimize
>>
>> Otis
>>
>>
>> --- Wolfgang Hoschek <whoschek@lbl.gov> wrote:
>>
>>> Hi,
>>>
>>> I'm wondering if anyone could let me know how to improve Lucene
>>> performance for "streaming main memory indexing of single strings".
>>> This would help to effectively integrate Lucene with the Nux XQuery
>>> engine.
>>>
>>> Below is a small microbenchmark simulating STREAMING XQuery fulltext
>>> search as typical for XML network routers, message queuing systems,
>>> P2P networks, etc. In this on-the-fly main memory indexing scenario,
>>> each individual string is immediately matched as soon as it becomes
>>> available, without any persistence involved. This usage scenario and
>>> corresponding performance profile are quite different in comparison
>>> to fulltext search over persistent (read-mostly) indexes.
>>>
>>> The benchmark runs at some 3000 Lucene queries/sec (lucene-1.4.3),
>>> which is unfortunate news considering the XQuery engine can easily
>>> walk hundreds of thousands of XML nodes per second. Ideally I'd like
>>> to run at some 100000 queries/sec. Running this through the JDK 1.5
>>> profiler, it seems that most time is spent in and below the following
>>> calls:
>>>
>>> writer = new IndexWriter(dir, analyzer, true);
>>> writer.addDocument(...);
>>> writer.close();
>>>
>>> I tried quite a few variants of the benchmark with various options,
>>> unfortunately with little or no effect. Lucene just does not seem to
>>> be designed for this sort of "transient single string index" thing.
>>> All code paths related to opening, closing, reading, writing,
>>> querying and object creation seem to be designed for large persistent
>>> indexes.
>>>
>>> Any advice on what I'm missing or what could be done about it would
>>> be
>>> greatly appreciated.
>>>
>>> Wolfgang.
>>>
>>> P.S. the benchmark code is attached as a file below:
>>>
>>> package nux.xom.pool;
>>>
>>> import java.io.IOException;
>>> //import java.io.Reader;
>>>
>>> import org.apache.lucene.analysis.Analyzer;
>>> //import org.apache.lucene.analysis.LowerCaseTokenizer;
>>> //import org.apache.lucene.analysis.PorterStemFilter;
>>> //import org.apache.lucene.analysis.SimpleAnalyzer;
>>> //import org.apache.lucene.analysis.TokenStream;
>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>> import org.apache.lucene.document.Document;
>>> import org.apache.lucene.document.Field;
>>> //import org.apache.lucene.index.IndexReader;
>>> import org.apache.lucene.index.IndexWriter;
>>> import org.apache.lucene.queryParser.ParseException;
>>> import org.apache.lucene.queryParser.QueryParser;
>>> import org.apache.lucene.search.Hits;
>>> import org.apache.lucene.search.IndexSearcher;
>>> import org.apache.lucene.search.Query;
>>> import org.apache.lucene.search.Searcher;
>>> import org.apache.lucene.store.Directory;
>>> import org.apache.lucene.store.RAMDirectory;
>>>
>>> public final class LuceneMatcher { // TODO: make non-public
>>>
>>> 	private final Analyzer analyzer;
>>> //	private final Directory dir = new RAMDirectory();
>>>
>>> 	public LuceneMatcher() {
>>> 		this(new StandardAnalyzer());
>>> //		this(new SimpleAnalyzer());
>>> //		this(new StopAnalyzer());
>>> //		this(new Analyzer() {
>>> //			public final TokenStream tokenStream(String fieldName, Reader
>>> reader) {
>>> //				return new PorterStemFilter(new LowerCaseTokenizer(reader));
>>> //			}
>>> //		});
>>> 	}
>>>
>>> 	public LuceneMatcher(Analyzer analyzer) {
>>> 		if (analyzer == null)
>>> 			throw new IllegalArgumentException("analyzer must not be null");
>>> 		this.analyzer = analyzer;
>>> 	}
>>>
>>> 	public Query parseQuery(String expression) throws ParseException {
>>> 		QueryParser parser = new QueryParser("content", analyzer);
>>> //		parser.setPhraseSlop(0);
>>> 		return parser.parse(expression);
>>> 	}
>>>
>>> 	/**
>>> 	 * Returns the relevance score by matching the given index against
>>> the given
>>> 	 * Lucene query expression. The index must not contain more than one
>>> Lucene
>>> 	 * "document" (aka string to be searched).
>>> 	 */
>>> 	public float match(Directory index, Query query) {
>>> 		Searcher searcher = null;
>>> 		try {
>>> 			searcher = new IndexSearcher(index);
>>> 			Hits hits = searcher.search(query);
>>> 			float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>>> 			return score;
>>> 		} catch (IOException e) { // should never happen (RAMDirectory)
>>> 			throw new RuntimeException(e);
>>> 		} finally {
>>> 			try {
>>> 				if (searcher != null) searcher.close();
>>> 			} catch (IOException e) { // should never happen (RAMDirectory)
>>> 				throw new RuntimeException(e);
>>> 			}
>>> 		}
>>> 	}
>>>
>>> //	public float match(String text, Query query) {
>>> //		return match(createIndex(text), query);
>>> //	}
>>>
>>> 	public Directory createIndex(String text) {
>>> 		Directory dir = new RAMDirectory();
>>> 		IndexWriter writer = null;
>>> 		try {
>>> 			writer = new IndexWriter(dir, analyzer, true);
>>> //			writer.setUseCompoundFile(false);
>>> //			writer.mergeFactor = 2;
>>> //			writer.minMergeDocs = 1;
>>> //			writer.maxMergeDocs = 1;
>>>
>>> 			writer.addDocument(createDocument(text));
>>> //			writer.optimize();
>>> 			return dir;
>>> 		} catch (IOException e) { // should never happen (RAMDirectory)
>>> 			throw new RuntimeException(e);
>>> 		} finally {
>>> 			try {
>>> 				if (writer != null) writer.close();
>>> 			} catch (IOException e) { // should never happen (RAMDirectory)
>>> 				throw new RuntimeException(e);
>>> 			}
>>> 		}
>>> 	}
>>>
>>> 	private Document createDocument(String content) {
>>> 		Document doc = new Document();
>>> 		doc.add(Field.UnStored("content", content));
>>> //		doc.add(Field.Text("x", content));
>>> 		return doc;
>>> 	}
>>>
>>> 	/**
>>> 	 * Lucene microbenchmark simulating STREAMING XQuery fulltext
>>> 	 * search as typical for XML network routers, message queuing
>>> 	 * systems, P2P networks, etc. In this on-the-fly main memory
>>> 	 * indexing scenario, each individual string is immediately
>>> 	 * matched as soon as it becomes available, without any
>>> 	 * persistence involved. This usage scenario and corresponding
>>> 	 * performance profile are quite different in comparison to
>>> 	 * fulltext search over persistent (read-mostly) indexes.
>>> 	 *
>>> 	 * Example XPath: count(/table/row[lucene:match(string(./firstname),
>>> 	 * "James") > 0.0])
>>> 	 */
>>> 	public static void main(String[] args) throws Exception {
>>> 		int k = -1;
>>> 		int runs = 5;
>>> 		if (args.length > ++k) runs = Integer.parseInt(args[k]);
>>>
>>> 		int nodes = 10000;
>>> 		if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>>>
>>> 		String content = "James is out in the woods";
>>> 		if (args.length > ++k) content = args[k];
>>>
>>> 		String expression = "James";
>>> 		if (args.length > ++k) expression = args[k];
>>>
>>> 		LuceneMatcher matcher = new LuceneMatcher();
>>> 		Query query = matcher.parseQuery(expression); // to be reused N
>>> times
>>>
>>> 		for (int r = 0; r < runs; r++) {
>>> 			long start = System.currentTimeMillis();
>>> 			int matches = 0;
>>>
>>> 			for (int i = 0; i < nodes; i++) {
>>> //				if (LuceneUtil.match(content + i, expression) > 0.0f) {
>>> 				if (matcher.match(matcher.createIndex(content + i), query) >
>>> 0.0f) {
>>> 					matches++;
>>> 				}
>>> 			}
>>>
>>> 			long end = System.currentTimeMillis();
>>> 			System.out.println("matches=" + matches);
>>> 			System.out.println("secs=" + ((end-start) / 1000.0f));
>>> 			System.out.println("queries/sec=" + (nodes / ((end-start) /
>>> 1000.0f)));
>>> 			System.out.println();
>>> 		}
>>> 	}
>>> }
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org