Date: 6 Feb 2004 16:55:14 -0000
Message-ID: <20040206165514.28855.qmail@nagoya.betaversion.org>
From: bugzilla@apache.org
To: lucene-dev@jakarta.apache.org
List-Id: "Lucene Developers List"
Subject: DO NOT REPLY [Bug 18927] - [PATCH] Term Vector support

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927

[PATCH] Term Vector support

grant_ingersoll@yahoo.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Term Vector support         |[PATCH] Term Vector support

------- Additional Comments From grant_ingersoll@yahoo.com  2004-02-06 16:55 -------
Attached is Dmitry's code updated for 1.3.
Here are my notes on the implementation (which are also included in the
attachment). The patch is in the zip, is named termVector1.3Patch.txt, and
was generated using cvs diff -Nu at the root of the tree. If there are any
questions, I would be more than happy to help via the mailing list.

-----------------------------------------------

Notes on the re-implementation of Dmitry's Term Vector enhancements for
Lucene 1.3. Please see
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114748
for the original patch.

General Notes
-----------------------
I used Dmitry's code as a template by getting it working against 1.2 and
then going through by hand and applying it against the HEAD. Thanks to
Dmitry's great notes, it was relatively painless. All of the tests against
HEAD pass.

Differences from 1.2 Version
----------------------------
The most significant change I had to make is that in the TermFreqVector
interface the getTermNumbers() method has been replaced by a getTerms()
method, which returns an array of Strings. These strings are the equivalent
of Term.text() and store the unique string that has been indexed.

While the numbering scheme worked to save space, it presented a problem in
1.3 when it comes to merging, because the 1.3 code could support up to
Long.MAX_VALUE positions (see TermEnum and SegmentTermEnum) versus
Integer.MAX_VALUE in 1.2 (at least in my understanding). This prevented me
from using the termMaps array technique used in 1.2 for remapping the term
numbers from the old segment to the new segment.

To solve this, we needed some globally unique identifier for a term. For
this, I use the term text plus the number of the field that the terms came
from (which is why there are new accessor methods on TermFreqVector called
get/setFieldNum).

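To illustrate the identifier described above, here is a minimal sketch. It
is not code from the patch: the class TermKeySketch, the nested TermKey
type, and the localTermNumber helper are hypothetical names, and it assumes
that a segment-local term number is simply a term's position in that
segment's sorted term list.

```java
import java.util.Arrays;
import java.util.Objects;

/** Hypothetical illustration only; none of these names come from the patch. */
public class TermKeySketch {

    /** A term identified by field number plus term text, independent of any segment. */
    static final class TermKey {
        final int fieldNum;
        final String text;

        TermKey(int fieldNum, String text) {
            this.fieldNum = fieldNum;
            this.text = Objects.requireNonNull(text);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof TermKey)) {
                return false;
            }
            TermKey other = (TermKey) o;
            return fieldNum == other.fieldNum && text.equals(other.text);
        }

        @Override
        public int hashCode() {
            return 31 * fieldNum + text.hashCode();
        }
    }

    /**
     * Assumed model of a segment-local term number: the index of the text in
     * that segment's sorted term list, or -1 if the segment lacks the term.
     */
    static int localTermNumber(String[] sortedTerms, String text) {
        int i = Arrays.binarySearch(sortedTerms, text);
        return i >= 0 ? i : -1;
    }

    public static void main(String[] args) {
        String[] segmentA = {"apple", "zebra"};
        String[] segmentB = {"apple", "mango", "zebra"};

        // The same term gets a different number in each segment, so numbers
        // would have to be remapped when the segments are merged.
        System.out.println(localTermNumber(segmentA, "zebra")); // prints 1
        System.out.println(localTermNumber(segmentB, "zebra")); // prints 2

        // A (field number, text) key is the same in every segment.
        System.out.println(new TermKey(0, "zebra").equals(new TermKey(0, "zebra"))); // prints true
    }
}
```

The point of the sketch is only that a key built from field number and text
compares equal across segments, while a per-segment ordinal does not.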
The side benefit of this is that merging is much simpler: we can just
iterate over the readers and vectors and add the terms from the old
TermVector to the new TermVectorWriter, without doing any remapping. The
downside is that the term vector files are going to take up more space on
disk.

I believe I have overcome the limitation that you can only retrieve term
vectors on optimized indices. The SegmentsReader, which previously threw
runtime exceptions for the getTermVector methods, now properly implements
them.

Compatibility
----------------------
Similar to Dmitry's, I believe the index files should be backward
compatible.

Performance
----------------------
I have not run thorough performance tests, but I did do the following runs,
one with term vectors and one without:

Index Size: 12598 documents with 88362 terms. The documents in question are
XML files where all of the TEXT was extracted and indexed.

Without TVs:
  Drive Space Used: 42 MB
  Time to index: 5 minutes, 30 seconds

With TVs:
  Drive Space Used: 71.3 MB
  Time to index: 6 minutes, 2 seconds

That is roughly 70% more disk space and 10% more indexing time for this
collection. Your mileage may vary.

Limitations
------------------------
Not sure what they are yet. I am sure there are places that could be
optimized. The numbering scheme could probably be reinstituted by using
some type of paging array or array-of-arrays scheme that allows you to
store a really large number of values.

FilterIndexReader throws an UnsupportedOperationException for the new Term
Vector methods.

I did not test with compound files. I do not know if they are compatible.

Other limitations are probably those of omission. That is, are the new
methods sufficient for doing what people need to do? I can think of a few:

1. Since only terms and frequencies are stored, something to quickly
   calculate the actual weight of the term as it was scored for the query
   would be useful. I looked into this, but, frankly, I am fairly confused
   by the whole Scorer/Similarity interactions, especially when it comes to
   nested queries.

2.
   Perhaps the Document object itself should have a method similar to those
   on IndexReader.

New File Notes
----------------------------------
src/java/org/apache/lucene/index/SegmentTermVector.java
  Implementation of TermFreqVector and TermPositionVector.

src/java/org/apache/lucene/index/TermFreqVector.java
  Interface for describing a Document term vector. See the notes above for
  what was changed from 1.2.

src/java/org/apache/lucene/index/TermPositionVector.java
  No change from 1.2 version.

src/java/org/apache/lucene/index/TermVectorsReader.java
  Changed get methods to return the TermFreqVector interface instead of
  the explicit SegmentTermVector.
  Added getTermPositions method to retrieve TermPositionVector(s).
  Changed reading slightly to match the writing of the Term text instead
  of the term number.

src/java/org/apache/lucene/index/TermVectorsWriter.java
  Added documentation.
  Changed the writing to write the term string instead of the term number.
  It would be nice if there were a way to turn on or off the writing of
  positional information. See the TODO comment.

src/test/org/apache/lucene/index/DocHelper.java
  Package-local class to help set up documents for testing.

src/test/org/apache/lucene/index/TestDocumentWriter.java
  New test class for the DocumentWriter object. Probably needs to be
  fleshed out more to fully test.

src/test/org/apache/lucene/index/TestFieldInfos.java
  Test for the new FieldInfos return values, etc.

src/test/org/apache/lucene/index/TestFieldsReader.java
  Basic test for FieldsReader. Needs to be expanded to fully test
  functionality.

src/test/org/apache/lucene/index/TestSegmentMerger.java
  Sets up two segments, including term vectors, then merges them and
  asserts that items were properly merged.

src/test/org/apache/lucene/index/TestSegmentReader.java
  Various tests for the SegmentReader. Tests retrieving a document,
  deleting a document, retrieving field names and retrieving terms.
  Has a placeholder for retrieving norms, but I did not implement it, as I
  didn't fully understand how norms work.

src/test/org/apache/lucene/index/TestSegmentsReader.java
  Sets up a SegmentsReader made up of two segments and does various tests
  on them. Needs to be filled in more completely.

src/test/org/apache/lucene/index/TestSegmentTermDocs.java
  Has positive and negative tests for the SegmentTermDocs.

src/test/org/apache/lucene/index/TestTermVectorsReader.java
  Writes out some term vectors and then asserts that they can be read back
  in.

src/test/org/apache/lucene/index/TestTermVectorsWriter.java
  Writes out some term vectors and then asserts that the proper files were
  created with the proper information in them.

src/test/org/apache/lucene/search/TestTermVectors.java
  Searches over an indexed set of documents and then retrieves the term
  vectors for the documents. Also sets up a small collection of documents
  and maps containing term and frequency information and checks that the
  term vectors are properly constructed. This is a fairly decent example
  of end-to-end use of the vectors.

Existing File Changes
----------------------------------
org/apache/lucene/analysis/PorterStemmer.java
  Made public. Please, please, please apply this patch! I think several
  people have submitted this one and I vote for it as well! I use the
  implementation in other parts of my code and it is annoying to have to
  change it in my local copy every time there is a new release.

org/apache/lucene/document/Document.java
  Added a getNumFields() method that returns the number of fields that a
  document has.

org/apache/lucene/document/Field.java
  Same as 1.2 patch.

org/apache/lucene/index/DocumentWriter.java
  Same as 1.2 patch. Updated some formatting.

org/apache/lucene/index/FieldInfo.java
  Added constructor for indicating that the term vector is stored.

org/apache/lucene/index/FieldInfos.java
  Added support for term vector storage. Similar to 1.2 patch.
  The add methods now return a Map of pairs.

org/apache/lucene/index/FieldsReader.java
  Added comment. Now constructs the Field object with the termVector
  information.

org/apache/lucene/index/FilterIndexReader.java
  Formatted code. Added the Term Vector methods, but they are not
  implemented.

org/apache/lucene/index/IndexReader.java
  Same as 1.2 patch, plus added a getTermVectorReader method which returns
  the TermVectorReader for the IndexReader.
  Added new getIndexedFieldNames(boolean) methods which retrieve all
  indexed field names based on whether the field stores term vectors or
  not.
  Added a package-local method named getFieldInfos which returns the field
  infos object for the reader. This is needed in merging.
  Formatted code.

org/apache/lucene/index/SegmentMerger.java
  Added comments and a mergeVectors() method that merges the terms from
  the various readers into the new segment. Formatted code.

org/apache/lucene/index/SegmentReader.java
  Added new TV files to the list of segments. Implemented new IndexReader
  methods for TVs.

org/apache/lucene/index/SegmentTermDocs.java
  Formatted. Added the isValid() method, but it is commented out, as I am
  not sure it is needed. It was in the 1.2 version.

org/apache/lucene/index/SegmentTermEnum.java
  Same as 1.2 patch. Formatted.

org/apache/lucene/index/SegmentTermPositions.java
  Same as 1.2 patch.

org/apache/lucene/index/SegmentsReader.java
  Added a fieldInfos variable that is the summation of all of the
  fieldInfos from the other segments. This is used to implement the
  getFieldInfos() method, but is probably not all that useful.
  Implements the new term vector methods.

org/apache/lucene/index/TermDocs.java
  Added isValid method per 1.2, but it is commented out, as I am not sure
  we need it. Formatted code.

org/apache/lucene/index/TermEnum.java
  Same as 1.2 patch.

org/apache/lucene/index/TermInfosWriter.java
  Same as 1.2 patch.

org/apache/lucene/search/FilteredTermEnum.java
  Implements size() method, but throws UnsupportedOperationException.

org/apache/lucene/search/FuzzyTermEnum.java
  Implements termNumber() and isValid(), but both throw
  UnsupportedOperationException.

org/apache/lucene/search/MultiSearcher.java
  Implements new count() methods as per 1.2 patch.

org/apache/lucene/search/RemoteSearchable.java
  Same as MultiSearcher.

org/apache/lucene/search/Searchable.java
  Added count() methods to the interface.

org/apache/lucene/search/Searcher.java
  Added support for the count() methods.

org/apache/lucene/search/WildcardTermEnum.java
  Implements termNumber() and isValid(), but both throw
  UnsupportedOperationException.

org/apache/lucene/index/TestFilterIndexReader.java
  Implements the necessary TV methods.

org/apache/lucene/search/TestBasics.java
  Tests the count methods for the searcher.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org