lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From britske <gbr...@gmail.com>
Subject custom low-level indexer (to speed things up) when fields, terms and docids are in order
Date Thu, 25 Mar 2010 23:40:30 GMT

Hi, 

perhaps first some background: 

I need to speed-up indexing for an particular application which has a pretty
unsual schema: besides the normal stored and indexed fields we have about
20.000 fields  per document which are all indexed/ non-stored sInts.

Obviously indexing was really slow with such a number of fields. With
indexing through Solr we got about 0.3 docs/ sec ( on a ec2 m1.large
instance) 

Since these ~20.000 fields are all build/ calculated analogously, we figured
it would be possible to possibly build a low-level indexer for these fields
(of which we had domain knowledge which we could use to possibly speed
indexing up) and later merge them with the other fields to construct the
entire index. So we did and now achieve around 1.8 docs/ sec (6x speedup).
Not bad, but still not enough. 

As part of calculating these fields, we keep track of all fields, terms per
field, and docids per term ( per field) . 
all the stuff is then ordered: the fields, the terms available for each
field, the docids per term and inserted in that order using lowel-level
classes like: FormatPostingsFieldsWriter , FormatPostingsTermsConsumer,
FormatPostingsDocsConsumer (pseudo-code below) 

This constructs the following files: .tis, .tii, .frq.  ( + some default
values for the other required files, which don't need actual data bc. these
fields are not stored..)

I should also mention that each call to the indexer writes all available
fields, terms, docids to a new fsdirectory. So basically each call results
in a complete index. (containing about 100 docs each, bc. otherwise we run
into mem-problems keeping the  ordered maps) 

Since we already have all fields, terms, docids in order it seems (a lot of
) overkill to me to be going through the methods that above-mentioned
classes offer, which were meant for more 'non-sequential / non-ordered' 
inserts (AFAIK). 

What would be the best way to write .tis, .tii and .frq ni a more sequential
matter? 
I'm looking for something that would construct a byte-array for each file
that conforms to the index-file definition of that particular file (or
something) . I could try to do it myself and completely bypass all
indexing-classes altogether and just write the files to disk. (Possible bc.
as mentioned we have all data needed to construct a complete index) . 

However, perhaps there are classes I'm not aware of that help in getting the
format right (it seems like a lot of trial-and-error coding otherwise) 

Thanks for any help, pointers, etc. 

Geert-Jan



PSEUDO-CODE of the current low-level (not-so) sequential indexer: 

	foreach(String sField: fieldsInOrder){
		--> add field to FieldInfos and grab newly created fieldInfo
		--> add fieldInfo to FormatPostingsFieldsWriter and grab
formatPostingsTermsConsumer
		List<String> termsInOrder = termsInOrderForFieldMap(sField);
		foreach(String sTerm: termsInOrder){
			FormatPostingsDocsConsumer frq = formatPostingsTermsConsumer.add(sTerm);
			List<Integer> docidsInOrderPerFieldTerm =
docidsInOrderPerFieldTermMap(sField+"-"+sTerm);
			for(Integer docid:docidsInOrderPerFieldTerm){
				frq.addDoc(docid);
			}
			//close relevant stuff
		}
		//close relevant stuff
	}
	//close relevant stuff



-- 
View this message in context: http://n3.nabble.com/custom-low-level-indexer-to-speed-things-up-when-fields-terms-and-docids-are-in-order-tp576998p576998.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message