lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Khawaja Shams <kssh...@gmail.com>
Subject Re: Binary indexing / query efficiency
Date Tue, 14 Apr 2009 22:40:17 GMT
Hi,
  It is not a good idea to extract each document. You can be more efficient
by only looking at the fields you are interested in. Depending on the size
of your index, you can try:

String[] codes = FieldCache.DEFAULT.getStrings(indexReader, fieldName);


This returns a string [] with the length being the number of documents in
your index. If you are doing faceted searching, you may want to try:

StringIndex stringIndex = FieldCache.DEFAULT.getStringIndex(indexReader,
fieldName);

The StringIndex class has a lookup array and an order array. The order array
contains a value for each document id, and you can use this value to extract
the string from the lookup array once you are done counting.


Perhaps the Lucene experts can shed light on a better approach.

You may also want to look at SOLR for faceted searching support :). HTH.


Regards,
Khawaja Shams

On Tue, Apr 14, 2009 at 11:12 AM, Eger, Patrick <peger@automotive.com>wrote:

> Hi, was recently looking to incorporate Lucene for a simple
> "parametric"/"faceted" type search.  The documents are very small,
> roughly 15 fields of short length (5-15 characters, generally strings
> and padded integers). When profiling query performance of our
> application, which inserts 1 million documents then
>  1) filters on 1-3 fields with simple boolean/term matches
>  2) stores these docids in a BitSet
>  3) calls IndexSearcher.doc() to retrieve all matching documents (all
> fields, 100 - 1,000,000 results per call)
>
> It turns out that 98% of the query time was spent not actually doing the
> query, but within the IndexSearcher.doc() call.
>
> My first question is, is there any way to more efficiently get
> (all/most) of the fields for a set of documents, other than iterating
> and calling doc()?
>
> Additionally, is there any way (or planned feature) to index *binary*
> data? Using a profiler, I have determined that String decoding is a
> significant performance limiter for my use-case:
>
> 90% of the application time is spent in this method:
> ---------------------------------------
> org.apache.lucene.index.FieldsReader.addField(Document, FieldInfo,
> boolean, boolean, boolean)
>
>
> 46% of the application time is spent decoding strings (half of the above
> addField() time):
> ---------------------------------------org.apache.lucene.store.IndexInpu
> t.readString()
>        java.lang.String.<init>(byte[], int, int, String)
>                java.lang.StringCoding.decode(String, byte[], int, int)
>
> java.lang.StringCoding$StringDecoder.decode(byte[], int, int)
>
> (YJP profiler output available if needed)
>
> String.intern() was my top hot spot, but my patch was accepted and fixed
> this: https://issues.apache.org/jira/browse/LUCENE-1600. I'm not
> familiar enough with the lucene codebase to figure out the above though,
> so thought I would ask.
>
>
>
> //ideally i'd be able to do add a binary field as such:
> doc.add(new Field("f1",new
> byte[]{1,2,3,4},Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS));
>
> //then query like:
> Query q = new TermQuery(new Term("f1",byte[]{1,2,3,4}))
> searcher.search(q,...);
>
> Which would allow me to avoid the Integer -> String -> Padded String ->
> String -> Integer coding/decoding to index an integer, and avoid Object
> -> String -> Object conversion (which per above is quite expensive).
>
>
> Thanks for any help!
>
>
> Regards,
>
> Patrick
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message