From: robert engels
Subject: Re: [jira] Commented: (LUCENE-1043) Speedup merging of stored fields when field mapping "matches"
Date: Fri, 2 Nov 2007 16:49:12 -0500
To: java-dev@lucene.apache.org

Here is the fixed code - it needs special handling when determining the
length of a document in a segment.

FieldsReader.java

    final int docs(byte[] buffer, int[] lengths, int start, int max) throws IOException {
        final IndexInput indexStream = (IndexInput) indexLocal.get();
        indexStream.seek(start * 8L);
        long startoffset = indexStream.readLong();
        long lastoffset = startoffset;
        int count = 0;
        while (count < max) {
            long offset = indexStream.readLong();
            if (offset - startoffset > buffer.length)
                break;
            lengths[count] = (int) (offset - lastoffset);
            lastoffset = offset;
            count++;
        }
        if (count == 0)
            throw new IllegalStateException("buffer too small for a single document");

        final IndexInput fieldsStream = (IndexInput) fieldsLocal.get();
        fieldsStream.seek(startoffset);
        fieldsStream.readBytes(buffer, 0, (int) (lastoffset - startoffset));

        return count;
    }

On Nov 2, 2007, at 4:33 PM, robert engels wrote:

> Oops. Be careful with that code, merging works but indexing
> fails... I have a bug lurking...
>
> On Nov 2, 2007, at 4:28 PM, robert engels wrote:
>
>> As expected it did not make much difference. Here is the code:
>>
>> FieldsReader.java
>>     final int docs(byte[] buffer, int[] lengths, int start, int max) throws IOException {
>>         final IndexInput indexStream = (IndexInput) indexLocal.get();
>>         indexStream.seek(start * 8L);
>>         long startoffset = indexStream.readLong();
>>         long lastoffset = startoffset;
>>         int count = 0;
>>         while (count < max) {
>>             long offset = indexStream.readLong();
>>             if (offset - startoffset > buffer.length)
>>                 break;
>>             lengths[count] = (int) (offset - lastoffset);
>>             lastoffset = offset;
>>             count++;
>>         }
>>         if (count == 0)
>>             throw new IllegalStateException("buffer too small for a single document");
>>
>>         final IndexInput fieldsStream = (IndexInput) fieldsLocal.get();
>>         fieldsStream.seek(startoffset);
>>         fieldsStream.readBytes(buffer, 0, (int) (lastoffset - startoffset));
>>
>>         return count;
>>     }
>>
>> FieldsWriter.java
>>     final void addDocuments(byte[] buffer, int[] lengths, int ndocs) throws IOException {
>>         long position = fieldsStream.getFilePointer();
>>         long start = position;
>>         for (int i = 0; i < ndocs; i++) {
>>             indexStream.writeLong(position);
>>             position += lengths[i];
>>         }
>>         fieldsStream.writeBytes(buffer, (int) (position - start));
>>     }
>>
>> SegmentReader.java
>>     public int documents(byte[] buffer, int[] lengths, int start, int max) throws IOException {
>>         return fieldsReader.docs(buffer, lengths, start, max);
>>     }
>>
>> SegmentMerger.java
>>     private final int mergeFields() throws IOException {
>>         fieldInfos = new FieldInfos();  // merge field names
>>         int docCount = 0;
>>         for (int i = 0; i < readers.size(); i++) {
>>             IndexReader reader = (IndexReader) readers.elementAt(i);
>>             if (reader instanceof SegmentReader) {
>>                 SegmentReader sreader = (SegmentReader) reader;
>>                 for (int j = 0; j < sreader.getFieldInfos().size(); j++) {
>>                     FieldInfo fi = sreader.getFieldInfos().fieldInfo(j);
>>                     fieldInfos.add(fi.name, fi.isIndexed, fi.storeTermVector,
>>                         fi.storePositionWithTermVector, fi.storeOffsetWithTermVector,
>>                         !reader.hasNorms(fi.name));
>>                 }
>>             } else {
>>                 addIndexed(reader, fieldInfos, reader.getFieldNames(
>>                     IndexReader.FieldOption.TERMVECTOR_WITH_POSITION_OFFSET), true, true, true);
>>                 addIndexed(reader, fieldInfos, reader.getFieldNames(
>>                     IndexReader.FieldOption.TERMVECTOR_WITH_POSITION), true, true, false);
>>                 addIndexed(reader, fieldInfos, reader.getFieldNames(
>>                     IndexReader.FieldOption.TERMVECTOR_WITH_OFFSET), true, false, true);
>>                 addIndexed(reader, fieldInfos, reader.getFieldNames(
>>                     IndexReader.FieldOption.TERMVECTOR), true, false, false);
>>                 addIndexed(reader, fieldInfos, reader.getFieldNames(
>>                     IndexReader.FieldOption.INDEXED), false, false, false);
>>                 fieldInfos.add(reader.getFieldNames(
>>                     IndexReader.FieldOption.UNINDEXED), false);
>>             }
>>         }
>>         fieldInfos.write(directory, segment + ".fnm");
>>
>>         SegmentReader[] sreaders = new SegmentReader[readers.size()];
>>         for (int i = 0; i < readers.size(); i++) {
>>             IndexReader reader = (IndexReader) readers.elementAt(i);
>>             boolean same = reader.getFieldNames().size() == fieldInfos.size()
>>                 && reader instanceof SegmentReader;
>>             if (same) {
>>                 SegmentReader sreader = (SegmentReader) reader;
>>                 for (int j = 0; same && j < fieldInfos.size(); j++) {
>>                     same = fieldInfos.fieldName(j).equals(sreader.getFieldInfos().fieldName(j));
>>                 }
>>                 if (same)
>>                     sreaders[i] = sreader;
>>             }
>>         }
>>
>>         byte[] buffer = new byte[512*1024];
>>         int[] lengths = new int[10000];
>>
>>         // merge field values
>>         FieldsWriter fieldsWriter = new FieldsWriter(directory, segment, fieldInfos);
>>
>>         try {
>>             for (int i = 0; i < readers.size(); i++) {
>>                 IndexReader reader = (IndexReader) readers.elementAt(i);
>>                 SegmentReader sreader = sreaders[i];
>>                 int maxDoc = reader.maxDoc();
>>                 for (int j = 0; j < maxDoc;)
>>                     if (!reader.isDeleted(j)) {  // skip deleted docs
>>                         if (sreader != null) {
>>                             int start = j;
>>                             int ndocs = 1;
>>                             for (j++; j < maxDoc && !reader.isDeleted(j) && ndocs < lengths.length; j++, ndocs++);
>>                             ndocs = sreader.documents(buffer, lengths, start, j);
>>                             fieldsWriter.addDocuments(buffer, lengths, ndocs);
>>                         } else {
>>                             fieldsWriter.addDocument(reader.document(j));
>>                             j++;
>>                         }
>>                         docCount++;
>>                     }
>>             }
>>         } finally {
>>             fieldsWriter.close();
>>         }
>>         return docCount;
>>     }
>>
>>
>> On Nov 2, 2007, at 3:49 PM, robert engels wrote:
>>
>>> I am working on a bulk document copy right now - I will let you
>>> know if it improves things much.
>>>
>>> I doubt it, because I already configure the streams to use fairly
>>> large input and output buffers during a merge - but the memory
>>> index merge may see additional benefits due to fewer CPU calls.
>>>
>>>
>>> On Nov 2, 2007, at 3:39 PM, robert engels (JIRA) wrote:
>>>
>>>> [ https://issues.apache.org/jira/browse/LUCENE-1043?
>>>> page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539696 ]
>>>>
>>>> robert engels commented on LUCENE-1043:
>>>> ---------------------------------------
>>>>
>>>> When bulk copying the documents, I think you need to:
>>>>
>>>> read an array of (ndocs+1) longs (8 * (ndocs+1) bytes) from the index into long[ndocs+1] offsets
>>>> calculate length = offsets[ndocs] - offsets[0]
>>>> read bytes of length from the document file
>>>> startoffset = current output document stream position
>>>> write the bytes to the output document stream
>>>> modify offsets[], adding startoffset - offsets[0] to each entry
>>>> write offsets[] in bulk to the index output
>>>>
>>>>> Speedup merging of stored fields when field mapping "matches"
>>>>> -------------------------------------------------------------
>>>>>
>>>>>                 Key: LUCENE-1043
>>>>>                 URL: https://issues.apache.org/jira/browse/LUCENE-1043
>>>>>             Project: Lucene - Java
>>>>>          Issue Type: Improvement
>>>>>          Components: Index
>>>>>    Affects Versions: 2.2
>>>>>            Reporter: Michael McCandless
>>>>>            Assignee: Michael McCandless
>>>>>            Priority: Minor
>>>>>             Fix For: 2.3
>>>>>
>>>>>         Attachments: LUCENE-1043.patch
>>>>>
>>>>>
>>>>> Robert Engels suggested the following idea, here:
>>>>> http://www.gossamer-threads.com/lists/lucene/java-dev/54217
>>>>> When merging in the stored fields from a segment, if the field
>>>>> name -> number mapping is identical then we can simply bulk copy
>>>>> the entire entry for the document rather than re-interpreting and
>>>>> then re-writing the actual stored fields.
>>>>> I've pulled the code from the above thread and got it working on
>>>>> the current trunk.
>>>>
>>>> --
>>>> This message is automatically generated by JIRA.
>>>> You can reply to this email to add a comment to the issue online.
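
[Editor's note: the following is a minimal sketch, for illustration only, of the
bulk-copy steps Robert lists in the JIRA comment above, written against the
Lucene 2.x store API (IndexInput/IndexOutput). The class and method names
(StoredFieldsBulkCopier, copyStoredFields) are hypothetical and are not part of
the LUCENE-1043 patch. It assumes the 2.2-era layout in which the .fdx index
holds one 8-byte pointer per document into the .fdt fields data, and that a
pointer for document start+ndocs still exists; the last document of a segment
would instead need its length taken from the length of the fields file, which is
the special handling mentioned at the top of this thread.]

    import java.io.IOException;

    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    // Hypothetical helper sketching the bulk copy of raw stored-fields bytes.
    class StoredFieldsBulkCopier {

        // Copies the stored fields of documents [start, start+ndocs) from the
        // source segment's .fdx/.fdt pair to the destination pair without
        // re-parsing individual fields. Returns the number of bytes copied.
        static long copyStoredFields(IndexInput srcIndex, IndexInput srcFields,
                                     IndexOutput dstIndex, IndexOutput dstFields,
                                     int start, int ndocs) throws IOException {
            // 1. Read ndocs+1 pointers so the length of the last copied doc is known.
            long[] offsets = new long[ndocs + 1];
            srcIndex.seek(start * 8L);
            for (int i = 0; i <= ndocs; i++)
                offsets[i] = srcIndex.readLong();

            // 2. Total byte length of the run of documents.
            long length = offsets[ndocs] - offsets[0];

            // 3. Where the copied block will start in the destination fields file.
            long startoffset = dstFields.getFilePointer();

            // 4. Copy the raw document bytes in one read/write.
            byte[] buffer = new byte[(int) length];
            srcFields.seek(offsets[0]);
            srcFields.readBytes(buffer, 0, (int) length);
            dstFields.writeBytes(buffer, (int) length);

            // 5. Rebase each pointer by (startoffset - offsets[0]) and write the
            //    ndocs rebased entries to the destination index in bulk.
            long delta = startoffset - offsets[0];
            for (int i = 0; i < ndocs; i++)
                dstIndex.writeLong(offsets[i] + delta);

            return length;
        }
    }

[Unlike the buffer-based docs()/addDocuments() pair earlier in the thread, this
variant sizes its buffer to the run being copied, so it needs no "buffer too
small" check; a production version would bound the run length rather than
allocate arbitrarily large buffers.]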