lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: custom low-level indexer (to speed things up) when fields, terms and docids are in order
Date Fri, 26 Mar 2010 13:01:26 GMT
This sounds like fun :)

So you've already created a custom indexing chain, and plugged this
into DocumentsWriter?  And this chain directly interacts with the low
level classes for writing a segment (FormatPostingsTerms/DocsConsumer,
etc.)?

I'm not sure you're gonna do much better than that... these classes
already expect things "in order" and all they do (pretty much) is
write the index files.  I think they should be pretty lean...

Also, once flex lands, soon (note that it moves these low level
interfaces/classes around)... since you're using these classes for
writing, it'll mean you can freely swap in different codecs.

The only thing you can do further is to conflate your custom code with
the codec, ie, so that you make a single chain that directly writes
index files.  But I'm not sure you'll gain much performance by doing
so... (and then you can't [as easily] swap codecs).

Have you profiled to see where the time is being spent?

Mike

On Thu, Mar 25, 2010 at 7:40 PM, britske <gbrits@gmail.com> wrote:
>
> Hi,
>
> perhaps first some background:
>
> I need to speed-up indexing for an particular application which has a pretty
> unsual schema: besides the normal stored and indexed fields we have about
> 20.000 fields  per document which are all indexed/ non-stored sInts.
>
> Obviously indexing was really slow with such a number of fields. With
> indexing through Solr we got about 0.3 docs/ sec ( on a ec2 m1.large
> instance)
>
> Since these ~20.000 fields are all build/ calculated analogously, we figured
> it would be possible to possibly build a low-level indexer for these fields
> (of which we had domain knowledge which we could use to possibly speed
> indexing up) and later merge them with the other fields to construct the
> entire index. So we did and now achieve around 1.8 docs/ sec (6x speedup).
> Not bad, but still not enough.
>
> As part of calculating these fields, we keep track of all fields, terms per
> field, and docids per term ( per field) .
> all the stuff is then ordered: the fields, the terms available for each
> field, the docids per term and inserted in that order using lowel-level
> classes like: FormatPostingsFieldsWriter , FormatPostingsTermsConsumer,
> FormatPostingsDocsConsumer (pseudo-code below)
>
> This constructs the following files: .tis, .tii, .frq.  ( + some default
> values for the other required files, which don't need actual data bc. these
> fields are not stored..)
>
> I should also mention that each call to the indexer writes all available
> fields, terms, docids to a new fsdirectory. So basically each call results
> in a complete index. (containing about 100 docs each, bc. otherwise we run
> into mem-problems keeping the  ordered maps)
>
> Since we already have all fields, terms, docids in order it seems (a lot of
> ) overkill to me to be going through the methods that above-mentioned
> classes offer, which were meant for more 'non-sequential / non-ordered'
> inserts (AFAIK).
>
> What would be the best way to write .tis, .tii and .frq ni a more sequential
> matter?
> I'm looking for something that would construct a byte-array for each file
> that conforms to the index-file definition of that particular file (or
> something) . I could try to do it myself and completely bypass all
> indexing-classes altogether and just write the files to disk. (Possible bc.
> as mentioned we have all data needed to construct a complete index) .
>
> However, perhaps there are classes I'm not aware of that help in getting the
> format right (it seems like a lot of trial-and-error coding otherwise)
>
> Thanks for any help, pointers, etc.
>
> Geert-Jan
>
>
>
> PSEUDO-CODE of the current low-level (not-so) sequential indexer:
>
>        foreach(String sField: fieldsInOrder){
>                --> add field to FieldInfos and grab newly created fieldInfo
>                --> add fieldInfo to FormatPostingsFieldsWriter and grab
> formatPostingsTermsConsumer
>                List<String> termsInOrder = termsInOrderForFieldMap(sField);
>                foreach(String sTerm: termsInOrder){
>                        FormatPostingsDocsConsumer frq = formatPostingsTermsConsumer.add(sTerm);
>                        List<Integer> docidsInOrderPerFieldTerm =
> docidsInOrderPerFieldTermMap(sField+"-"+sTerm);
>                        for(Integer docid:docidsInOrderPerFieldTerm){
>                                frq.addDoc(docid);
>                        }
>                        //close relevant stuff
>                }
>                //close relevant stuff
>        }
>        //close relevant stuff
>
>
>
> --
> View this message in context: http://n3.nabble.com/custom-low-level-indexer-to-speed-things-up-when-fields-terms-and-docids-are-in-order-tp576998p576998.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message