lucene-dev mailing list archives

From Wolfgang Hoschek <whosc...@lbl.gov>
Subject Re: [Performance] Streaming main memory indexing of single strings
Date Fri, 15 Apr 2005 22:15:29 GMT
Cool! For my use case it would need to be able to handle arbitrary 
queries (previously parsed from a general lucene query string). 
Something like:

	float match(String text, Query query)

it's fine with me if it also works for

	float[] match(String[] texts, Query query) or
	float match(Document doc, Query query)

but that isn't required by the use case.
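
For reference, in terms of the benchmark code quoted below, that 
single-string facade would just be (this is the commented-out variant 
in LuceneMatcher):

	public float match(String text, Query query) {
		return match(createIndex(text), query);
	}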

Wolfgang.

> I am intrigued by this and decided to mock up a quick and dirty 
> example of such an IndexReader.  After a little trial-and-error I got 
> it working at least for TermQuery and WildcardQuery.  I've pasted my 
> code below as an example, but there is much room for improvement, 
> especially in performance and in keeping track of term frequency; it 
> would also be nicer if it handled the analysis internally.
>
> I think something like this would make a handy addition to our contrib 
> area at least.  I'd be happy to receive improvements to this and then 
> add it to a contrib subproject.
>
> Perhaps this would be a handy way to handle situations where users 
> have queries saved in a system and need to be alerted whenever a new 
> document arrives matching the saved queries?
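>
> In sketch form, such an alerting pass might look like this 
> (savedQueries and alert() are hypothetical names; search() may throw 
> IOException):
>
> 	for (Iterator i = savedQueries.iterator(); i.hasNext();) {
> 		Query query = (Query) i.next();
> 		Hits hits = searcher.search(query); // searcher wraps the new document
> 		if (hits.length() > 0)
> 			alert(query); // notify whoever saved this query
> 	}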
>
> 	Erik
>
>
>
>>
>>
>> -----Original Message-----
>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>> Sent: Thursday, April 14, 2005 4:04 PM
>> To: java-dev@lucene.apache.org
>> Subject: Re: [Performance] Streaming main memory indexing of single
>> strings
>>
>>
>> This seems to be a promising avenue worth exploring. My gut feeling 
>> is that this could easily be 10-100 times faster.
>>
>> The drawback is that it requires a fair amount of understanding of
>> intricate Lucene internals, pulling those pieces together and adapting
>> them as required for the seemingly simple "float match(String text,
>> Query query)".
>>
>> I might give it a shot but I'm not sure I'll be able to pull this off!
>> Is there any similar code I could look at as a starting point?
>>
>> Wolfgang.
>>
>> On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
>>
>>> I think you are not approaching this the correct way.
>>>
>>> Pseudo code:
>>>
>>> Subclass IndexReader.
>>>
>>> Get tokens from String 'document' using Lucene analyzers.
>>>
>>> Build simple hash-map based data structures using tokens for terms 
>>> and term positions.
>>>
>>> Reimplement termDocs() and termPositions() to use the structures 
>>> from above.
>>>
>>> Run searches.
>>>
>>> Start again with the next document.
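>>>
>>> For instance, the tokenizing step might look roughly like this (a
>>> sketch against the Lucene 1.4 analyzer API; buildTermPositions and
>>> the map layout are just illustrative):
>>>
>>> 	// assumes java.io.*, java.util.*, org.apache.lucene.analysis.*
>>> 	static Map buildTermPositions(Analyzer analyzer, String text)
>>> 			throws IOException {
>>> 		Map positionsByTerm = new HashMap(); // term text -> List of positions
>>> 		TokenStream stream =
>>> 			analyzer.tokenStream("content", new StringReader(text));
>>> 		int pos = -1;
>>> 		for (Token t = stream.next(); t != null; t = stream.next()) {
>>> 			pos += t.getPositionIncrement(); // normally 1
>>> 			List positions = (List) positionsByTerm.get(t.termText());
>>> 			if (positions == null) {
>>> 				positions = new ArrayList();
>>> 				positionsByTerm.put(t.termText(), positions);
>>> 			}
>>> 			positions.add(new Integer(pos));
>>> 		}
>>> 		stream.close();
>>> 		return positionsByTerm;
>>> 	}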
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Wolfgang Hoschek [mailto:whoschek@lbl.gov]
>>> Sent: Thursday, April 14, 2005 2:56 PM
>>> To: java-dev@lucene.apache.org
>>> Subject: Re: [Performance] Streaming main memory indexing of single
>>> strings
>>>
>>>
>>> Otis, this might be a misunderstanding.
>>>
>>> - I'm not calling optimize(). That piece is commented out if you 
>>> look again at the code.
>>> - The *streaming* use case requires that for each query I add one 
>>> (and only one) document (aka string) to an empty index:
>>>
>>> repeat N times (where N is millions or billions):
>>> 	add a single string (aka document) to an empty index
>>> 	query the index
>>> 	drop index (or delete its document)
>>>
>>> with the following API being called N times: float match(String text,
>>> Query query)
>>>
>>> So there's no possibility of adding many documents and thereafter 
>>> running the query. This in turn seems to mean that the IndexWriter 
>>> can't be kept open - unless I manually delete each document after 
>>> each query to repeatedly reuse the RAMDirectory, which I've also 
>>> tried before without any significant performance gain - deletion 
>>> seems to have substantial overhead in itself. Perhaps it would be 
>>> better if there were a Directory.deleteAllDocuments() or similar. 
>>> Did you have some other approach in mind?
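>>>
>>> (For the record, the delete-and-reuse variant amounts to roughly
>>> this sketch against the Lucene 1.4 API:
>>>
>>> 	IndexReader reader = IndexReader.open(dir);
>>> 	reader.delete(0); // drop the lone document, doc id 0
>>> 	reader.close();   // closing flushes the deletion
>>>
>>> and even that round-trip has noticeable cost per string.)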
>>>
>>> As I said, Lucene's design doesn't seem to fit this streaming use 
>>> case pattern well. In *this* scenario one could easily do without 
>>> any locking, and without byte-level organization in RAMDirectory 
>>> and RAMFile, etc., because a single small string isn't a large 
>>> persistent multi-document index.
>>>
>>> For some background, here's a small example of the kind of XQuery 
>>> functionality the Nux/Lucene integration enables:
>>>
>>> (: An XQuery that finds all books authored by James that have 
>>> something to do with "fish", sorted by relevance :)
>>> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
>>> declare variable $query := "fish*~";
>>>
>>> for $book in /books/book[author="James" and lucene:match(string(.),
>>> $query) > 0.0]
>>> let $score := lucene:match(string($book), $query)
>>> order by $score descending
>>> return (<score>{$score}</score>, $book)
>>>
>>> More interestingly, one can use this for classifying and routing 
>>> XML messages based on rules (i.e. queries) inspecting their 
>>> content...
>>>
>>> Any other clues about potential improvements would be greatly
>>> appreciated.
>>>
>>> Wolfgang.
>>>
>>> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>>>
>>>> It looks like you are calling that IndexWriter code in some loops,
>>>> opening it and closing it in every iteration of the loop and also
>>>> calling optimize.  All of those things could be improved.
>>>> Keep your IndexWriter open, don't close it, and optimize the index 
>>>> only once you are done adding documents to it.
>>>>
>>>> See the highlights and the snippets in the first hit:
>>>>   http://www.lucenebook.com/search?query=when+to+optimize
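>>>>
>>>> In code, the batch pattern is roughly (an illustrative sketch; 
>>>> docs stands in for your documents):
>>>>
>>>> 	IndexWriter writer = new IndexWriter(dir, analyzer, true); // open once
>>>> 	for (int i = 0; i < docs.length; i++)
>>>> 		writer.addDocument(docs[i]); // many adds, no close in between
>>>> 	writer.optimize(); // once, after all adds are done
>>>> 	writer.close();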
>>>>
>>>> Otis
>>>>
>>>>
>>>> --- Wolfgang Hoschek <whoschek@lbl.gov> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm wondering if anyone could let me know how to improve Lucene
>>>>> performance for "streaming main memory indexing of single strings".
>>>>> This would help to effectively integrate Lucene with the Nux XQuery
>>>>> engine.
>>>>>
>>>>> Below is a small microbenchmark simulating STREAMING XQuery 
>>>>> fulltext search as typical for XML network routers, message 
>>>>> queuing systems, P2P networks, etc. In this on-the-fly main 
>>>>> memory indexing scenario, each individual string is immediately 
>>>>> matched as soon as it becomes available, without any persistence 
>>>>> involved. This usage scenario and corresponding performance 
>>>>> profile is quite different in comparison to fulltext search over 
>>>>> persistent (read-mostly) indexes.
>>>>>
>>>>> The benchmark runs at some 3000 Lucene queries/sec 
>>>>> (lucene-1.4.3), which is unfortunate news considering the XQuery 
>>>>> engine can easily walk hundreds of thousands of XML nodes per 
>>>>> second. Ideally I'd like to run at some 100000 queries/sec. 
>>>>> Running this through the JDK 1.5 profiler it seems that most time 
>>>>> is spent in and below the following calls:
>>>>>
>>>>> writer = new IndexWriter(dir, analyzer, true);
>>>>> writer.addDocument(...);
>>>>> writer.close();
>>>>>
>>>>> I tried quite a few variants of the benchmark with various 
>>>>> options, unfortunately with little or no effect. Lucene just does 
>>>>> not seem to be designed to do this sort of "transient single 
>>>>> string index" thing. All code paths related to opening, closing, 
>>>>> reading, writing, querying and object creation seem to be 
>>>>> designed for large persistent indexes.
>>>>>
>>>>> Any advice on what I'm missing or what could be done about it 
>>>>> would be greatly appreciated.
>>>>>
>>>>> Wolfgang.
>>>>>
>>>>> P.S. the benchmark code is attached as a file below:
>>>>>
>>>>> package nux.xom.pool;
>>>>>
>>>>> import java.io.IOException;
>>>>> //import java.io.Reader;
>>>>>
>>>>> import org.apache.lucene.analysis.Analyzer;
>>>>> //import org.apache.lucene.analysis.LowerCaseTokenizer;
>>>>> //import org.apache.lucene.analysis.PorterStemFilter;
>>>>> //import org.apache.lucene.analysis.SimpleAnalyzer;
>>>>> //import org.apache.lucene.analysis.TokenStream;
>>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>> import org.apache.lucene.document.Document;
>>>>> import org.apache.lucene.document.Field;
>>>>> //import org.apache.lucene.index.IndexReader;
>>>>> import org.apache.lucene.index.IndexWriter;
>>>>> import org.apache.lucene.queryParser.ParseException;
>>>>> import org.apache.lucene.queryParser.QueryParser;
>>>>> import org.apache.lucene.search.Hits;
>>>>> import org.apache.lucene.search.IndexSearcher;
>>>>> import org.apache.lucene.search.Query;
>>>>> import org.apache.lucene.search.Searcher;
>>>>> import org.apache.lucene.store.Directory;
>>>>> import org.apache.lucene.store.RAMDirectory;
>>>>>
>>>>> public final class LuceneMatcher { // TODO: make non-public
>>>>>
>>>>> 	private final Analyzer analyzer;
>>>>> //	private final Directory dir = new RAMDirectory();
>>>>>
>>>>> 	public LuceneMatcher() {
>>>>> 		this(new StandardAnalyzer());
>>>>> //		this(new SimpleAnalyzer());
>>>>> //		this(new StopAnalyzer());
>>>>> //		this(new Analyzer() {
>>>>> //			public final TokenStream tokenStream(String fieldName, Reader reader) {
>>>>> //				return new PorterStemFilter(new LowerCaseTokenizer(reader));
>>>>> //			}
>>>>> //		});
>>>>> 	}
>>>>>
>>>>> 	public LuceneMatcher(Analyzer analyzer) {
>>>>> 		if (analyzer == null)
>>>>> 			throw new IllegalArgumentException("analyzer must not be null");
>>>>> 		this.analyzer = analyzer;
>>>>> 	}
>>>>>
>>>>> 	public Query parseQuery(String expression) throws ParseException {
>>>>> 		QueryParser parser = new QueryParser("content", analyzer);
>>>>> //		parser.setPhraseSlop(0);
>>>>> 		return parser.parse(expression);
>>>>> 	}
>>>>>
>>>>> 	/**
>>>>> 	 * Returns the relevance score by matching the given index
>>>>> 	 * against the given Lucene query expression. The index must not
>>>>> 	 * contain more than one Lucene "document" (aka string to be
>>>>> 	 * searched).
>>>>> 	 */
>>>>> 	public float match(Directory index, Query query) {
>>>>> 		Searcher searcher = null;
>>>>> 		try {
>>>>> 			searcher = new IndexSearcher(index);
>>>>> 			Hits hits = searcher.search(query);
>>>>> 			float score = hits.length() > 0 ? hits.score(0) : 0.0f;
>>>>> 			return score;
>>>>> 		} catch (IOException e) { // should never happen (RAMDirectory)
>>>>> 			throw new RuntimeException(e);
>>>>> 		} finally {
>>>>> 			try {
>>>>> 				if (searcher != null) searcher.close();
>>>>> 			} catch (IOException e) { // should never happen (RAMDirectory)
>>>>> 				throw new RuntimeException(e);
>>>>> 			}
>>>>> 		}
>>>>> 	}
>>>>>
>>>>> //	public float match(String text, Query query) {
>>>>> //		return match(createIndex(text), query);
>>>>> //	}
>>>>>
>>>>> 	public Directory createIndex(String text) {
>>>>> 		Directory dir = new RAMDirectory();
>>>>> 		IndexWriter writer = null;
>>>>> 		try {
>>>>> 			writer = new IndexWriter(dir, analyzer, true);
>>>>> //			writer.setUseCompoundFile(false);
>>>>> //			writer.mergeFactor = 2;
>>>>> //			writer.minMergeDocs = 1;
>>>>> //			writer.maxMergeDocs = 1;
>>>>>
>>>>> 			writer.addDocument(createDocument(text));
>>>>> //			writer.optimize();
>>>>> 			return dir;
>>>>> 		} catch (IOException e) { // should never happen (RAMDirectory)
>>>>> 			throw new RuntimeException(e);
>>>>> 		} finally {
>>>>> 			try {
>>>>> 				if (writer != null) writer.close();
>>>>> 			} catch (IOException e) { // should never happen (RAMDirectory)
>>>>> 				throw new RuntimeException(e);
>>>>> 			}
>>>>> 		}
>>>>> 	}
>>>>>
>>>>> 	private Document createDocument(String content) {
>>>>> 		Document doc = new Document();
>>>>> 		doc.add(Field.UnStored("content", content));
>>>>> //		doc.add(Field.Text("x", content));
>>>>> 		return doc;
>>>>> 	}
>>>>>
>>>>> 	/**
>>>>> 	 * Lucene microbenchmark simulating STREAMING XQuery fulltext
>>>>> 	 * search as typical for XML network routers, message queuing
>>>>> 	 * systems, P2P networks, etc. In this on-the-fly main memory
>>>>> 	 * indexing scenario, each individual string is immediately
>>>>> 	 * matched as soon as it becomes available without any
>>>>> 	 * persistence involved. This usage scenario and corresponding
>>>>> 	 * performance profile is quite different in comparison to
>>>>> 	 * fulltext search over persistent (read-mostly) indexes.
>>>>> 	 *
>>>>> 	 * Example XPath:
>>>>> 	 * count(/table/row[lucene:match(string(./firstname), "James") > 0.0])
>>>>> 	 */
>>>>> 	public static void main(String[] args) throws Exception {
>>>>> 		int k = -1;
>>>>> 		int runs = 5;
>>>>> 		if (args.length > ++k) runs = Integer.parseInt(args[k]);
>>>>>
>>>>> 		int nodes = 10000;
>>>>> 		if (args.length > ++k) nodes = Integer.parseInt(args[k]);
>>>>>
>>>>> 		String content = "James is out in the woods";
>>>>> 		if (args.length > ++k) content = args[k];
>>>>>
>>>>> 		String expression = "James";
>>>>> 		if (args.length > ++k) expression = args[k];
>>>>>
>>>>> 		LuceneMatcher matcher = new LuceneMatcher();
>>>>> 		Query query = matcher.parseQuery(expression); // to be reused N times
>>>>>
>>>>> 		for (int r = 0; r < runs; r++) {
>>>>> 			long start = System.currentTimeMillis();
>>>>> 			int matches = 0;
>>>>>
>>>>> 			for (int i = 0; i < nodes; i++) {
>>>>> //				if (LuceneUtil.match(content + i, expression) > 0.0f) {
>>>>> 				if (matcher.match(matcher.createIndex(content + i), query) > 0.0f) {
>>>>> 					matches++;
>>>>> 				}
>>>>> 			}
>>>>>
>>>>> 			long end = System.currentTimeMillis();
>>>>> 			System.out.println("matches=" + matches);
>>>>> 			System.out.println("secs=" + ((end-start) / 1000.0f));
>>>>> 			System.out.println("queries/sec=" + (nodes / ((end-start) / 1000.0f)));
>>>>> 			System.out.println();
>>>>> 		}
>>>>> 	}
>>>>> }
>
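> // Quick-and-dirty IndexReader over a sorted, in-memory term list: only
> // the paths exercised by TermQuery and WildcardQuery are really
> // implemented; everything else just traces its calls.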
> public class StringIndexReader extends IndexReader {
>   private List terms;
>   public StringIndexReader(String strings[]) {
>     super(null);
>     terms = Arrays.asList(strings);
>     Collections.sort(terms);
>   }
>
>   public TermFreqVector[] getTermFreqVectors(int docNumber) throws IOException {
>     System.out.println("StringIndexReader.getTermFreqVectors");
>     return new TermFreqVector[0];
>   }
>
>   public TermFreqVector getTermFreqVector(int docNumber, String field) throws IOException {
>     System.out.println("StringIndexReader.getTermFreqVector");
>     return null;
>   }
>
>   public int numDocs() {
>     System.out.println("StringIndexReader.numDocs");
>     return 1;
>   }
>
>   public int maxDoc() {
>     System.out.println("StringIndexReader.maxDoc");
>     return 1;
>   }
>
>   public Document document(int n) throws IOException {
>     System.out.println("StringIndexReader.document");
>     return null;
>   }
>
>   public boolean isDeleted(int n) {
>     System.out.println("StringIndexReader.isDeleted");
>     return false;
>   }
>
>   public boolean hasDeletions() {
>     System.out.println("StringIndexReader.hasDeletions");
>     return false;
>   }
>
>   public byte[] norms(String field) throws IOException {
>     // TODO: what value to use for this?
>     System.out.println("StringIndexReader.norms: " + field);
>     return new byte[] { 1 };
>   }
>
>   public void norms(String field, byte[] bytes, int offset) throws IOException {
>     System.out.println("StringIndexReader.norms: " + field + "*");
>   }
>
>   protected void doSetNorm(int doc, String field, byte value) throws IOException {
>     System.out.println("StringIndexReader.doSetNorm");
>
>   }
>
>   public TermEnum terms() throws IOException {
>     System.out.println("StringIndexReader.terms");
>     return terms(null);
>   }
>
>   public TermEnum terms(final Term term) throws IOException {
>     System.out.println("StringIndexReader.terms: " + term);
>
>     TermEnum termEnum = new TermEnum() {
>       private String currentTerm;
>       private Iterator iter;
>
>       public boolean next() {
>         System.out.println("TermEnum.next");
>         // advance first, then report whether a term was consumed;
>         // returning iter.hasNext() after advancing would skip the
>         // last term in the list
>         if (iter.hasNext()) {
>           currentTerm = (String) iter.next();
>           return true;
>         }
>         return false;
>       }
>
>       public Term term() {
>         if (iter == null) {
>           iter = terms.iterator();
>           while (next()) {
>             if (currentTerm.startsWith(term.text()))
>               break;
>           }
>         }
>         System.out.println("TermEnum.term: " + currentTerm);
>         return new Term(term.field(), currentTerm);
>       }
>
>       public int docFreq() {
>         System.out.println("TermEnum.docFreq");
>         return 1;
>       }
>
>       public void close() {
>         System.out.println("TermEnum.close");
>       }
>     };
>     return termEnum;
>   }
>
>   public int docFreq(Term term) throws IOException {
>     System.out.println("StringIndexReader.docFreq: " + term);
>     return terms.contains(term.text()) ? 1 : 0;
>   }
>
>   public TermDocs termDocs() throws IOException {
>     System.out.println("StringIndexReader.termDocs");
>
>     TermDocs td = new TermDocs() {
>       private boolean done = false;
>       String currentTerm;
>
>       public void seek(Term term) {
>         System.out.println(".seek: " + term);
>         currentTerm = term.text();
>         done = false;
>       }
>
>       public void seek(TermEnum termEnum) {
>         seek(termEnum.term());
>       }
>
>       public int doc() {
>         System.out.println(".doc");
>         return 0;
>       }
>
>       public int freq() {
>         System.out.println(".freq");
>         return 1;
>       }
>
>       public boolean next() {
>         System.out.println(".next");
>         return false;
>       }
>
>       public int read(int[] docs, int[] freqs) {
>         System.out.println(".read: " + docs.length);
>
>         if (done) return 0;
>
>         done = true;
>         docs[0] = 0;
>         freqs[0] = freq();
>         return 1;
>       }
>
>       public boolean skipTo(int target) {
>         System.out.println(".skipTo");
>         return false;
>       }
>
>       public void close() {
>         System.out.println(".close");
>
>       }
>     };
>     return td;
>   }
>
>   public TermPositions termPositions() throws IOException {
>     System.out.println("StringIndexReader.termPositions");
>     return null;
>   }
>
>   protected void doDelete(int docNum) throws IOException {
>     System.out.println("StringIndexReader.doDelete");
>
>   }
>
>   protected void doUndeleteAll() throws IOException {
>     System.out.println("StringIndexReader.doUndeleteAll");
>
>   }
>
>   protected void doCommit() throws IOException {
>     System.out.println("StringIndexReader.doCommit");
>
>   }
>
>   protected void doClose() throws IOException {
>     System.out.println("StringIndexReader.doClose");
>
>   }
>
>   public Collection getFieldNames() throws IOException {
>     System.out.println("StringIndexReader.getFieldNames");
>     return null;
>   }
>
>   public Collection getFieldNames(boolean indexed) throws IOException {
>     System.out.println("StringIndexReader.getFieldNames");
>     return null;
>   }
>
>   public Collection getIndexedFieldNames(Field.TermVector tvSpec) {
>     System.out.println("StringIndexReader.getIndexedFieldNames");
>     return null;
>   }
>
>   public Collection getFieldNames(FieldOption fldOption) {
>     System.out.println("StringIndexReader.getFieldNames");
>     return null;
>   }
>
>   public static void main(String[] args) {
>     IndexReader reader = new StringIndexReader(new String[] {"foo", "bar", "baz"});
>     IndexSearcher searcher = new IndexSearcher(reader);
>
>     try {
>       Hits hits = searcher.search(new WildcardQuery(new Term("field", "ba*")));
>       System.out.println("found " + hits.length());
>     } catch (IOException e) {
>       e.printStackTrace(); // hits would be unusable here, so just report
>     }
>   }
> }
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

