lucene-dev mailing list archives

From "Michael McCandless" <luc...@mikemccandless.com>
Subject Re: possible segment merge improvement?
Date Fri, 02 Nov 2007 18:23:28 GMT

OK, I got Robert's optimization working on the current trunk ... I
will open a Jira issue with the patch.

Mike

"robert engels" <rengels@ix.netcom.com> wrote:
> I have looked into modifying FieldInfos to keep the fields sorted by  
> field name, so the user would not be forced to add the fields in the  
> same order.
> 
> Sparse documents are really not a problem: after the first merge of
> such a document it will pick up the other fields from the other
> segments, after which it will merge "as the same".
> 
> I had to add getFieldInfos() to SegmentReader to make all of this
> work. I did not need to modify FieldInfos or FieldInfo - I do the
> equality checks in SegmentMerger, and only perform them once.
> 
> Code looks as follows:
> 
>    private final int mergeFields() throws IOException {
> 	fieldInfos = new FieldInfos(); // merge field names
> 	int docCount = 0;
> 	for (int i = 0; i < readers.size(); i++) {
> 	    IndexReader reader = (IndexReader) readers.elementAt(i);
> 	    if (reader instanceof SegmentReader) {
> 		SegmentReader sreader = (SegmentReader) reader;
> 		for (int j = 0; j < sreader.getFieldInfos().size(); j++) {
> 		    FieldInfo fi = sreader.getFieldInfos().fieldInfo(j);
> 		    fieldInfos.add(fi.name, fi.isIndexed, fi.storeTermVector,
> 			    fi.storePositionWithTermVector, fi.storeOffsetWithTermVector,
> 			    !reader.hasNorms(fi.name));
> 		}
> 	    } else {
> 		addIndexed(reader, fieldInfos,
> 			reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION_OFFSET),
> 			true, true, true);
> 		addIndexed(reader, fieldInfos,
> 			reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION),
> 			true, true, false);
> 		addIndexed(reader, fieldInfos,
> 			reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_OFFSET),
> 			true, false, true);
> 		addIndexed(reader, fieldInfos,
> 			reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR),
> 			true, false, false);
> 		addIndexed(reader, fieldInfos,
> 			reader.getFieldNames(IndexReader.FieldOption.INDEXED),
> 			false, false, false);
> 		fieldInfos.add(reader.getFieldNames(IndexReader.FieldOption.UNINDEXED), false);
> 	    }
> 	}
> 	fieldInfos.write(directory, segment + ".fnm");
> 
> 	SegmentReader[] sreaders = new SegmentReader[readers.size()];
> 	for (int i = 0; i < readers.size(); i++) {
> 	    IndexReader reader = (IndexReader) readers.elementAt(i);
> 	    boolean same = reader.getFieldNames().size() == fieldInfos.size()
> 		    && reader instanceof SegmentReader;
> 	    if(same) {
> 		SegmentReader sreader = (SegmentReader) reader;
> 		for (int j = 0; same && j < fieldInfos.size(); j++) {
> 		    same = fieldInfos.fieldName(j).equals(
> 			    sreader.getFieldInfos().fieldName(j));
> 		}
> 		if(same)
> 		    sreaders[i] = sreader;
> 	    }
> 	}
> 	
> 	byte[] buffer = new byte[1024];
> 
> 	// merge field values
> 	FieldsWriter fieldsWriter = new FieldsWriter(directory, segment, fieldInfos);
> 	
> 	try {
> 	    for (int i = 0; i < readers.size(); i++) {
> 		IndexReader reader = (IndexReader) readers.elementAt(i);
> 		SegmentReader sreader = sreaders[i];
> 		int maxDoc = reader.maxDoc();
> 		for (int j = 0; j < maxDoc; j++)
> 		    if (!reader.isDeleted(j)) { // skip deleted docs
> 			if (sreader!=null) {
> 			    int len = sreader.length(j);
> 			    if (len > buffer.length) {
> 				buffer = new byte[len * 2];
> 			    }
> 			    sreader.document(buffer, j, len);
> 			    fieldsWriter.addDocument(buffer, len);
> 			} else {
> 			    fieldsWriter.addDocument(reader.document(j));
> 			}
> 			docCount++;
> 		    }
> 	    }
> 	} finally {
> 	    fieldsWriter.close();
> 	}
> 	return docCount;
>      }
> 
> 
> On Nov 1, 2007, at 10:47 AM, Yonik Seeley wrote:
> 
> > On 11/1/07, Doron Cohen <DORONC@il.ibm.com> wrote:
> >> My reading of Robert's suggestion is that when we know that
> >> FieldInfos of the resulted segment is identical to the
> >> FieldInfos of a certain (sub) segment being merged then
> >> there is no need to parse+rewrite the field data for all
> >> docs of that (sub)segment, rather they can be written as is.
> >
> > Ah right... so for sparse fields it really depends on the order
> > documents were added to the segment I imagine.
> > If a document w/o all fields is added first, I guess the field numbers
> > would be different in the segments.  Also, people should take care to
> > add fields in the same order (first doc in the segment will define the
> > fieldname->fieldnumber ordering I think)
> >
> > -Yonik
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> 
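
To make the field-numbering concern in the quoted thread concrete, here is a
minimal standalone sketch. It is not Lucene code: the class
FieldNumberingSketch and its methods numberFields() and canCopyRaw() are
hypothetical names, and a segment is modelled only as the list of its field
names in field-number order. It assumes, as Yonik suggests above, that field
numbers are assigned in the order names are first seen, and it applies the
same name-by-name comparison as the "same" flag in the quoted mergeFields().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not Lucene code: a segment is reduced to the list of
// its field names, where the list index stands in for the field number.
class FieldNumberingSketch {

    // Assign field numbers in first-seen order (assumption taken from the
    // thread): the index of a name in the returned list is its field number.
    static List<String> numberFields(List<List<String>> docs) {
        List<String> names = new ArrayList<>();
        for (List<String> doc : docs) {
            for (String field : doc) {
                if (!names.contains(field)) {
                    names.add(field);
                }
            }
        }
        return names;
    }

    // Same idea as the "same" flag in the quoted mergeFields(): equal size
    // and the same field name at every field number.
    static boolean canCopyRaw(List<String> segmentFields, List<String> mergedFields) {
        if (segmentFields.size() != mergedFields.size()) {
            return false;
        }
        for (int j = 0; j < mergedFields.size(); j++) {
            if (!mergedFields.get(j).equals(segmentFields.get(j))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Segment A: every document carries both fields, in the same order.
        List<String> segA = numberFields(Arrays.asList(
                Arrays.asList("title", "body"),
                Arrays.asList("title", "body")));
        // Segment B: a sparse document comes first, so "body" gets number 0.
        List<String> segB = numberFields(Arrays.asList(
                Arrays.asList("body"),
                Arrays.asList("title", "body")));
        // The merged numbering is built by walking the segments in order.
        List<String> merged = numberFields(Arrays.asList(segA, segB));

        System.out.println("merged numbering:   " + merged);
        System.out.println("segment A raw copy: " + canCopyRaw(segA, merged));
        System.out.println("segment B raw copy: " + canCopyRaw(segB, merged));
    }
}
```

In this model segment A qualifies for the raw copy and segment B does not, so
B's documents would take the slower parse-and-rewrite path once. As Robert
notes, the rewritten documents then carry the merged numbering and should
match on later merges; keeping FieldInfos sorted by field name would remove
the dependence on insertion order altogether.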
