lucene-dev mailing list archives

From robert engels <reng...@ix.netcom.com>
Subject Re: possible segment merge improvement?
Date Thu, 01 Nov 2007 17:27:29 GMT
I have looked into modifying FieldInfos to keep the fields sorted by  
field name, so the user would not be forced to add the fields in the  
same order.
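
A minimal sketch of the sorted-numbering idea, not Lucene code: if field numbers are assigned from the sorted set of field names, two segments holding the same fields get the same name-to-number mapping, whatever order the documents added them in. The class `SortedFieldNumbers` and the method `assignNumbers` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration; not part of Lucene.
final class SortedFieldNumbers {

    // Assign field numbers in sorted-name order, ignoring the order of addition.
    static Map<String, Integer> assignNumbers(List<String> fieldNamesInAddOrder) {
        Map<String, Integer> numbers = new LinkedHashMap<String, Integer>();
        for (String name : new TreeSet<String>(fieldNamesInAddOrder)) {
            numbers.put(name, numbers.size());
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Same fields, added in a different order in each "segment".
        Map<String, Integer> a = assignNumbers(Arrays.asList("title", "body", "date"));
        Map<String, Integer> b = assignNumbers(Arrays.asList("date", "title", "body"));
        System.out.println(a);
        System.out.println(a.equals(b));
    }
}
```

With this scheme the numbering depends only on which names are present, so the order in which the user adds fields no longer matters.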

Sparse documents are really not a problem: after the first merge
involving such a document, the merged segment picks up the other
fields from the other segments, and from then on it will merge
"as the same".
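
To illustrate the sparse case, here is a simplified model that uses plain lists of field names rather than Lucene's FieldInfos. It assumes the merged field list is the union of the inputs in first-seen order, and that the already-merged segment comes first in the next merge. `SparseMergeSketch`, `mergeFieldNames` and `sameOrder` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Simplified model with hypothetical names; not Lucene's FieldInfos.
final class SparseMergeSketch {

    // Union of field names in first-seen order (assumed numbering rule).
    static List<String> mergeFieldNames(List<List<String>> segments) {
        LinkedHashSet<String> merged = new LinkedHashSet<String>();
        for (List<String> segment : segments) {
            merged.addAll(segment);
        }
        return new ArrayList<String>(merged);
    }

    // True when the segment has the same names at the same positions as the merged list.
    static boolean sameOrder(List<String> segment, List<String> merged) {
        return segment.equals(merged);
    }

    public static void main(String[] args) {
        List<String> sparse = Arrays.asList("id", "title");
        List<String> full = Arrays.asList("id", "title", "body");

        // First merge: the sparse segment differs from the result and must be re-parsed.
        List<List<String>> firstInputs = new ArrayList<List<String>>();
        firstInputs.add(sparse);
        firstInputs.add(full);
        List<String> firstMerge = mergeFieldNames(firstInputs);
        System.out.println(sameOrder(sparse, firstMerge));

        // Second merge: the previously merged segment now matches the result.
        List<List<String>> secondInputs = new ArrayList<List<String>>();
        secondInputs.add(firstMerge);
        secondInputs.add(sparse);
        List<String> secondMerge = mergeFieldNames(secondInputs);
        System.out.println(sameOrder(firstMerge, secondMerge));
    }
}
```

In this model the sparse data pays the parse-and-rewrite cost once; after that it lives in a segment whose field list already matches.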

I had to add getFieldInfos() to SegmentReader to make all of this
work. I did not need to modify FieldInfos or FieldInfo - I do the
equality checks in SegmentMerger, and only perform them once.

Code looks as follows:

   private final int mergeFields() throws IOException {
	fieldInfos = new FieldInfos(); // merge field names
	int docCount = 0;
	for (int i = 0; i < readers.size(); i++) {
	    IndexReader reader = (IndexReader) readers.elementAt(i);
	    if (reader instanceof SegmentReader) {
		SegmentReader sreader = (SegmentReader) reader;
		for (int j = 0; j < sreader.getFieldInfos().size(); j++) {
		    FieldInfo fi = sreader.getFieldInfos().fieldInfo(j);
		    fieldInfos.add(fi.name, fi.isIndexed, fi.storeTermVector,
			    fi.storePositionWithTermVector, fi.storeOffsetWithTermVector,
			    !reader.hasNorms(fi.name));
		}
	    } else {
		addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION_OFFSET), true, true, true);
		addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION), true, true, false);
		addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_OFFSET), true, false, true);
		addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR), true, false, false);
		addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.INDEXED), false, false, false);
		fieldInfos.add(reader.getFieldNames(IndexReader.FieldOption.UNINDEXED), false);
	    }
	}
	fieldInfos.write(directory, segment + ".fnm");

	SegmentReader[] sreaders = new SegmentReader[readers.size()];
	for (int i = 0; i < readers.size(); i++) {
	    IndexReader reader = (IndexReader) readers.elementAt(i);
	    boolean same = reader.getFieldNames().size() == fieldInfos.size()
		    && reader instanceof SegmentReader;
	    if(same) {
		SegmentReader sreader = (SegmentReader) reader;
		for (int j = 0; same && j < fieldInfos.size(); j++) {
		    same = fieldInfos.fieldName(j).equals(sreader.getFieldInfos().fieldName(j));
		}
		if(same)
		    sreaders[i] = sreader;
	    }
	}
	
	byte[] buffer = new byte[1024];

	// merge field values
	FieldsWriter fieldsWriter = new FieldsWriter(directory, segment, fieldInfos);
	
	try {
	    for (int i = 0; i < readers.size(); i++) {
		IndexReader reader = (IndexReader) readers.elementAt(i);
		SegmentReader sreader = sreaders[i];
		int maxDoc = reader.maxDoc();
		for (int j = 0; j < maxDoc; j++)
		    if (!reader.isDeleted(j)) { // skip deleted docs
			if (sreader!=null) {
			    int len = sreader.length(j);
			    if (len > buffer.length) {
				buffer = new byte[len * 2];
			    }
			    sreader.document(buffer, j, len);
			    fieldsWriter.addDocument(buffer, len);
			} else {
			    fieldsWriter.addDocument(reader.document(j));
			}
			docCount++;
		    }
	    }
	} finally {
	    fieldsWriter.close();
	}
	return docCount;
     }
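
The buffer handling in the inner loop can be shown in isolation. The sketch below assumes a made-up format of length-prefixed records, not Lucene's stored-fields format, and copies each record as raw bytes through a reusable buffer that grows only when a record does not fit. `RawCopySketch` and `copyRecords` are hypothetical names; in the code above, `sreader.length(j)` and `sreader.document(buffer, j, len)` appear to play the roles that `readInt` and `readFully` play here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical stand-in for the raw document copy; not Lucene's file format.
final class RawCopySketch {

    // Copy `count` length-prefixed records without decoding their contents.
    static void copyRecords(DataInputStream in, DataOutputStream out, int count)
            throws IOException {
        byte[] buffer = new byte[1024];
        for (int i = 0; i < count; i++) {
            int len = in.readInt();
            if (len > buffer.length) {
                // Grow with headroom so later records can reuse the buffer.
                buffer = new byte[len * 2];
            }
            in.readFully(buffer, 0, len);
            out.writeInt(len);
            out.write(buffer, 0, len);
        }
    }

    public static void main(String[] args) throws IOException {
        // Build two records: one small, one larger than the initial buffer.
        ByteArrayOutputStream source = new ByteArrayOutputStream();
        DataOutputStream writer = new DataOutputStream(source);
        byte[] large = new byte[3000];
        Arrays.fill(large, (byte) 7);
        writer.writeInt(3);
        writer.write(new byte[] {1, 2, 3});
        writer.writeInt(large.length);
        writer.write(large);
        writer.flush();

        ByteArrayOutputStream target = new ByteArrayOutputStream();
        copyRecords(new DataInputStream(new ByteArrayInputStream(source.toByteArray())),
                new DataOutputStream(target), 2);
        System.out.println(Arrays.equals(source.toByteArray(), target.toByteArray()));
    }
}
```

The saving comes from never turning the bytes back into field objects: the copy only needs each record's length, which is why it is safe only when the field numbering of the source matches the target.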


On Nov 1, 2007, at 10:47 AM, Yonik Seeley wrote:

> On 11/1/07, Doron Cohen <DORONC@il.ibm.com> wrote:
>> My reading of Robert's suggestion is that when we know that
>> FieldInfos of the resulted segment is identical to the
>> FieldInfos of a certain (sub) segment being merged then
>> there is no need to parse+rewrite the field data for all
>> docs of that (sub)segment, rather they can be written as is.
>
> Ah right... so for sparse fields it really depends on the order
> documents were added to the segment I imagine.
> If a document w/o all fields is added first, I guess the field numbers
> would be different in the segments.  Also, people should take care to
> add fields in the same order (first doc in the segment will define the
> fieldname->fieldnumber ordering I think)
>
> -Yonik

