Return-Path: Delivered-To: apmail-incubator-lucene-net-commits-archive@minotaur.apache.org Received: (qmail 13831 invoked from network); 21 Mar 2009 12:52:22 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 21 Mar 2009 12:52:22 -0000 Received: (qmail 11026 invoked by uid 500); 21 Mar 2009 12:52:22 -0000 Delivered-To: apmail-incubator-lucene-net-commits-archive@incubator.apache.org Received: (qmail 11011 invoked by uid 500); 21 Mar 2009 12:52:22 -0000 Mailing-List: contact lucene-net-commits-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: lucene-net-dev@incubator.apache.org Delivered-To: mailing list lucene-net-commits@incubator.apache.org Received: (qmail 11002 invoked by uid 99); 21 Mar 2009 12:52:21 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 21 Mar 2009 05:52:21 -0700 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.4] (HELO eris.apache.org) (140.211.11.4) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 21 Mar 2009 12:52:12 +0000 Received: by eris.apache.org (Postfix, from userid 65534) id F3D8F23889F5; Sat, 21 Mar 2009 12:51:51 +0000 (UTC) Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: svn commit: r756927 [3/4] - in /incubator/lucene.net/trunk/C#/src/Lucene.Net: ./ Analysis/Standard/ Index/ Store/ Date: Sat, 21 Mar 2009 12:51:45 -0000 To: lucene-net-commits@incubator.apache.org From: digy@apache.org X-Mailer: svnmailer-1.0.8 Message-Id: <20090321125151.F3D8F23889F5@eris.apache.org> X-Virus-Checked: Checked by ClamAV on apache.org Modified: incubator/lucene.net/trunk/C#/src/Lucene.Net/Lucene.Net.xml URL: http://svn.apache.org/viewvc/incubator/lucene.net/trunk/C%23/src/Lucene.Net/Lucene.Net.xml?rev=756927&r1=756926&r2=756927&view=diff ============================================================================== --- incubator/lucene.net/trunk/C#/src/Lucene.Net/Lucene.Net.xml (original) +++ incubator/lucene.net/trunk/C#/src/Lucene.Net/Lucene.Net.xml Sat Mar 21 12:51:41 2009 @@ -4,314 +4,505 @@ Lucene.Net - - This interface describes a character stream that maintains line and - column number positions of the characters. It also has the capability - to backup the stream to some extent. An implementation of this - interface is used in the TokenManager implementation generated by - JavaCCParser. + + +

Base class for Locking implementation. {@link Directory} uses instances of this class to implement locking.

+ +

Note that there are some useful tools to verify that your LockFactory is working correctly: {@link VerifyingLockFactory}, {@link LockStressTest}, {@link LockVerifyServer}.

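A minimal usage sketch (not part of this commit) of the locking API described above, assuming the Directory.MakeLock, Lock.Obtain and Lock.Release members named in these comments; the lock name and timeout value are illustrative only:

using System.IO;
using Lucene.Net.Store;

public class LockSketch
{
    // Runs an index-modifying action while holding the named lock.
    public static void WithWriteLock(Directory dir)
    {
        Lock writeLock = dir.MakeLock("write.lock");   // the Directory's LockFactory creates the Lock
        if (!writeLock.Obtain(5000))                   // polls (LOCK_POLL_INTERVAL apart) for up to ~5 seconds
        {
            throw new IOException("could not obtain write.lock");
        }
        try
        {
            // ... code that needs exclusive access to the index ...
        }
        finally
        {
            writeLock.Release();                       // always release exclusive access
        }
    }
}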
- All the methods except backup can be implemented in any fashion. backup needs to be implemented correctly for the correct operation of the lexer. The rest of the methods are used only to get information such as the line number, the column number, and the String that constitutes a token, and are not used by the lexer; hence their implementation won't affect the generated lexer's operation.
-
- - Returns the next character from the selected input. The method - of selecting the input is the responsibility of the class - implementing this interface. Can throw any java.io.IOException. - - - - Returns the column number of the last character for current token (being - matched after the last call to BeginTOken). - - - - Returns the line number of the last character for current token (being - matched after the last call to BeginTOken). + + + + + + - - Returns the column number of the first character for current token (being - matched after the last call to BeginTOken). + + Set the prefix in use for all locks created in this + LockFactory. This is normally called once, when a + Directory gets this LockFactory instance. However, you + can also call this (after this instance is assigned to + a Directory) to override the prefix in use. This + is helpful if you're running Lucene on machines that + have different mount points for the same shared + directory. - - Returns the line number of the first character for current token (being - matched after the last call to BeginTOken). - + + Get the prefix in use for all locks created in this LockFactory. - - Backs up the input stream by amount steps. Lexer calls this method if it - had already read some characters, but could not use them to match a - (longer) token. So, they will be used again as the prefix of the next - token and it is the implemetation's responsibility to do this right. - + + Return a new Lock instance identified by lockName. + name of the lock to be created. + - - Returns the next character that marks the beginning of the next token. - All characters must remain in the buffer between two successive calls - to this method to implement backup correctly. + + Attempt to clear (forcefully unlock and remove) the + specified lock. Only call this at a time when you are + certain this lock is no longer in use. + name of the lock to be cleared. + - - Returns a string made up of characters from the marked token beginning - to the current buffer position. Implementations have the choice of returning - anything that they want to. For example, for efficiency, one might decide - to just return null, which is a valid implementation. - + + should be a unique id across all clients + + the LockFactory that we are testing + + host or IP where {@link LockVerifyServer} + is running + + the port {@link LockVerifyServer} is + listening on + - - Returns an array of characters that make up the suffix of length 'len' for - the currently matched token. This is used to build up the matched string - for use in actions in the case of MORE. A simple and inefficient - implementation of this is as follows : - - { - String t = GetImage(); - return t.substring(t.length() - len, t.length()).toCharArray(); - } + + + Pass this value to {@link #Obtain(long)} to try + forever to obtain the lock. - - The lexer calls this function to indicate that it is done with the stream - and hence implementations can free any resources held by this class. - Again, the body of this function can be just empty and it will not - affect the lexer's operation. + + How long {@link #Obtain(long)} waits, in milliseconds, + in between attempts to acquire the lock. - - - Constructs from a Reader. - - - This exception is thrown when parse errors are encountered. - You can explicitly create objects of this exception type by - calling the method generateParseException in the generated - parser. 
- - You can modify this class to customize your error reporting - mechanisms so long as you retain the public fields. + + Attempts to obtain exclusive access and immediately return + upon success or failure. + true iff exclusive access is obtained + - - This constructor is used by the method "generateParseException" - in the generated parser. Calling this constructor generates - a new object of this type with the fields "currentToken", - "expectedTokenSequences", and "tokenImage" set. The boolean - flag "specialConstructor" is also set to true to indicate that - this constructor was used to create this object. - This constructor calls its super class with the empty string - to force the "toString" method of parent class "Throwable" to - print the error message in the form: - ParseException: <result of getMessage> + + If a lock obtain called, this failureReason may be set + with the "root cause" Exception as to why the lock was + not obtained. - - The following constructors are for use by you for whatever - purpose you can think of. Constructing the exception in this - manner makes the exception behave in the normal way - i.e., as - documented in the class "Throwable". The fields "errorToken", - "expectedTokenSequences", and "tokenImage" do not contain - relevant information. The JavaCC generated code does not use - these constructors. + + Attempts to obtain an exclusive lock within amount of + time given. Polls once per {@link #LOCK_POLL_INTERVAL} + (currently 1000) milliseconds until lockWaitTimeout is + passed. + + length of time to wait in + milliseconds or {@link + #LOCK_OBTAIN_WAIT_FOREVER} to retry forever + + true if lock was obtained + + LockObtainFailedException if lock wait times out + IllegalArgumentException if lockWaitTimeout is + out of bounds + IOException if obtain() throws IOException - - This variable determines which constructor was used to create - this object and thereby affects the semantics of the - "getMessage" method (see below). - + + Releases exclusive access. - - This is the last token that has been consumed successfully. If - this object has been created due to a parse error, the token - followng this token will (therefore) be the first error token. + + Returns true if the resource is currently locked. Note that one must + still call {@link #Obtain()} before using the resource. - - Each entry in this array is an array of integers. Each array - of integers represents a sequence of tokens (by their ordinal - values) that is expected at this point of the parse. - + + Utility class for executing code with exclusive access. - - This is a reference to the "tokenImage" array of the generated - parser within which the parse error occurred. This array is - defined in the generated ...Constants interface. - + + Constructs an executor that will grab the named lock. - - The end of line string for this machine. + + Code to execute with exclusive access. - - Used to convert raw characters to their escaped version - when these raw version cannot be used as part of an ASCII - string literal. + + Calls {@link #doBody} while lock is obtained. Blocks if lock + cannot be obtained immediately. Retries to obtain lock once per second + until it is obtained, or until it has tried ten times. Lock is released when + {@link #doBody} exits. - - - This method has the standard behavior when this object has been - created using the standard constructors. Otherwise, it uses - "currentToken" and "expectedTokenSequences" to generate a parse - error message and returns it. 
If this object has been created - due to a parse error, and you do not catch it (it gets thrown - from the parser), then this method is called during the printing - of the final stack trace, and hence the correct error message - gets displayed. + LockObtainFailedException if lock could not + be obtained + IOException if {@link Lock#obtain} throws IOException - - Filters {@link StandardTokenizer} with {@link StandardFilter}, {@link - LowerCaseFilter} and {@link StopFilter}, using a list of English stop words. + + A memory-resident {@link IndexInput} implementation. - $Id: StandardAnalyzer.java 219090 2005-07-14 20:36:28Z dnaber $ + $Id: RAMInputStream.java 598693 2007-11-27 17:01:21Z mikemccand $ - - - Creates a TokenStream which tokenizes all the text in the provided - Reader. Default implementation forwards to tokenStream(Reader) for - compatibility with older version. Override to allow Analyzer to choose - strategy based on document and/or field. Must be able to handle null - field name for backward compatibility. + + Abstract base class for input from a file in a {@link Directory}. A + random-access input stream. Used for all Lucene index input operations. + + - - Invoked before indexing a Field instance if - terms have already been added to that field. This allows custom - analyzers to place an automatic position increment gap between - Field instances using the same field name. The default value - position increment gap is 0. With a 0 position increment gap and - the typical default token position increment of 1, all terms in a field, - including across Field instances, are in successive positions, allowing - exact PhraseQuery matches, for instance, across Field instance boundaries. - - - Field name being indexed. + + Reads and returns a single byte. + + + + + Reads a specified number of bytes into an array at the specified offset. + the array to read bytes into - position increment gap, added to the next token emitted from {@link #TokenStream(String,Reader)} - + the offset in the array to start storing bytes + + the number of bytes to read + + + - - An array containing some common English words that are usually not - useful for searching. + + Reads a specified number of bytes into an array at the + specified offset with control over whether the read + should be buffered (callers who have their own buffer + should pass in "false" for useBuffer). Currently only + {@link BufferedIndexInput} respects this parameter. + the array to read bytes into + + the offset in the array to start storing bytes + + the number of bytes to read + + set to false if the caller will handle + buffering. + + + - - Builds an analyzer with the default stop words ({@link #STOP_WORDS}). - - - Builds an analyzer with the given stop words. - - - Builds an analyzer with the given stop words. + + Reads four bytes and returns an int. + + - - Builds an analyzer with the stop words from the given file. - + + Reads an int stored in variable-length format. Reads between one and + five bytes. Smaller values take fewer bytes. Negative numbers are not + supported. + + - - Builds an analyzer with the stop words from the given reader. - + + Reads eight bytes and returns a long. + - - Constructs a {@link StandardTokenizer} filtered by a {@link - StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. + + Reads a long stored in variable-length format. Reads between one and + nine bytes. Smaller values take fewer bytes. Negative numbers are not + supported. 
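As a rough illustration of the IndexInput read methods documented above (ReadVInt, ReadString, ReadVLong, Close), the following sketch reads a hypothetical count-prefixed file; the file name and record layout are invented and this code is not part of the commit:

using Lucene.Net.Store;

public class IndexInputSketch
{
    // Dumps name/value pairs from a hypothetical file written as:
    // VInt count, then count * (String name, VLong value).
    public static void Dump(Directory dir)
    {
        IndexInput input = dir.OpenInput("example.dat");   // hypothetical file name
        try
        {
            int count = input.ReadVInt();          // 1-5 bytes, smaller values take fewer bytes
            for (int i = 0; i < count; i++)
            {
                string name = input.ReadString();  // length-prefixed, UTF-8 encoded
                long value = input.ReadVLong();    // 1-9 bytes, non-negative values only
                System.Console.WriteLine(name + " = " + value);
            }
        }
        finally
        {
            input.Close();                         // closes the stream to further operations
        }
    }
}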
- - Normalizes tokens extracted with {@link StandardTokenizer}. + + Reads a string. + + - - - - Returns the next token in the stream, or null at EOS. + + Reads UTF-8 encoded characters into an array. + the array to read characters into + + the offset in the array to start storing characters + + the number of characters to read + + + - - Releases resources associated with this stream. + + Expert + + Similar to {@link #ReadChars(char[], int, int)} but does not do any conversion operations on the bytes it is reading in. It still + has to invoke {@link #ReadByte()} just as {@link #ReadChars(char[], int, int)} does, but it does not need a buffer to store anything + and it does not have to do any of the bitwise operations, since we don't actually care what is in the byte except to determine + how many more bytes to read + + The number of chars to read + - - The source of tokens for this filter. + + Closes the stream to futher operations. - - Construct a token stream filtering the given input. + + Returns the current position in this file, where the next read will + occur. + + + - - Close the input TokenStream. + + Sets current position in this file, where the next read will occur. + + - - Construct filtering in. + + The number of bytes in the file. - - - - - The text source for this Tokenizer. + + + Matches spans which are near one another. One can specify slop, the + maximum number of intervening unmatched positions, as well as whether + matches are required to be in-order. + - - Construct a tokenizer with null input. + + Base class for span-based queries. - - Construct a token stream processing the given input. + + + Sets the boost for this query clause to b. Documents + matching this clause will (in addition to the normal weightings) have + their score multiplied by b. + - - By default, closes the input Reader. + + Gets the boost for this clause. Documents matching + this clause will (in addition to the normal weightings) have their score + multiplied by b. The boost is 1.0 by default. + - - Constructs a tokenizer for this Reader. + + + Prints a query to a string. - - - By default, closes the input Reader. + + + Expert: Constructs and initializes a Weight for a top-level query. - - By default, closes the input Reader. + + Expert: called to re-write queries into primitive queries. For example, + a PrefixQuery will be rewritten into a BooleanQuery that consists + of TermQuerys. + + + + Expert: called when re-writing queries under MultiSearcher. + + Create a single query suitable for use by all subsearchers (in 1-1 + correspondence with queries). This is an optimization of the OR of + all queries. We handle the common optimization cases of equal + queries and overlapping clauses of boolean OR queries (as generated + by MultiTermQuery.rewrite() and RangeQuery.rewrite()). + Be careful overriding this method as queries[0] determines which + method will be called and is not necessarily of the same type as + the other queries. + + + + Expert: adds all terms occuring in this query to the terms set. Only + works if this query is in its {@link #rewrite rewritten} form. + + + UnsupportedOperationException if this query is not yet rewritten + + + + Expert: Returns the Similarity implementation to be used for this query. + Subclasses may override this method to specify their own Similarity + implementation, perhaps one that delegates through that of the Searcher. + By default the Searcher's Similarity implementation is returned. + + + + Returns a clone of this query. 
+ + + Expert: Returns the matches for this query in an index. Used internally + to search for spans. + + + + Returns the name of the field matched by this query. + + + Returns a collection of all terms matched by this query. + use extractTerms instead + + + + + + Construct a SpanNearQuery. Matches spans matching a span from each + clause, with up to slop total unmatched positions between + them. * When inOrder is true, the spans from each clause + must be * ordered as in clauses. + + + + Return the clauses whose spans are matched. + + + Return the maximum number of intervening unmatched positions permitted. + + + Return true if matches are required to be in-order. + + + Returns a collection of all terms matched by this query. + use extractTerms instead + + + + + + Returns true iff o is equal to this. + + + + Represents sorting by computed relevance. Using this sort criteria returns + the same results as calling + {@link Searcher#Search(Query) Searcher#search()}without a sort criteria, + only with slightly more overhead. + + + + Represents sorting by index order. + + + Sorts by computed relevance. This is the same sort criteria as calling + {@link Searcher#Search(Query) Searcher#search()}without a sort criteria, + only with slightly more overhead. + + + + Sorts by the terms in field then by index order (document + number). The type of value in field is determined + automatically. + + + + + + + Sorts possibly in reverse by the terms in field then by + index order (document number). The type of value in field is + determined automatically. + + + + + + + Sorts in succession by the terms in each field. The type of value in + field is determined automatically. + + + + + + + Sorts by the criteria in the given SortField. + + + Sorts in succession by the criteria in each SortField. + + + Sets the sort to the terms in field then by index order + (document number). + + + + Sets the sort to the terms in field possibly in reverse, + then by index order (document number). + + + + Sets the sort to the terms in each field in succession. + + + Sets the sort to the given criteria. + + + Sets the sort to the given criteria in succession. + + + Representation of the sort criteria. + Array of SortField objects used in this sort criteria + + + + Abstract base class providing a mechanism to restrict searches to a subset + of an index. + + + + Returns a BitSet with true for documents which should be permitted in + search results, and false for those that should not. + + + + + Returns the field name for this query + + + Returns the value of the lower endpoint of this range query, null if open ended + + + Returns the value of the upper endpoint of this range query, null if open ended + + + Returns true if the lower endpoint is inclusive + + + Returns true if the upper endpoint is inclusive + + + Prints a user-readable version of this query. + + + Returns true if o is equal to this. + + + Returns a hash code value for this object. - + Describes the input token stream. - + An integer that describes the kind of this token. This numbering system is determined by JavaCCParser, and a table of these numbers is stored in the file ...Constants.java. - + beginLine and beginColumn describe the position of the first character of this token; endLine and endColumn describe the position of the last character of this token. - + beginLine and beginColumn describe the position of the first character of this token; endLine and endColumn describe the position of the last character of this token. 
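By way of illustration (not part of this commit), the SpanNearQuery constructor and the field Sort described above can be used roughly as follows; the field and term names are invented:

using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Spans;

public class SpanAndSortSketch
{
    // Matches "lock" followed by "factory" in the "body" field with at most
    // three unmatched positions between them, in order.
    public static Query NearQuery()
    {
        SpanQuery[] clauses = new SpanQuery[]
        {
            new SpanTermQuery(new Term("body", "lock")),
            new SpanTermQuery(new Term("body", "factory"))
        };
        return new SpanNearQuery(clauses, 3, true);   // slop = 3, inOrder = true
    }

    // Sorts by the terms in "date", then by index order (document number).
    public static Sort ByDate()
    {
        return new Sort("date");
    }
}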
- + beginLine and beginColumn describe the position of the first character of this token; endLine and endColumn describe the position of the last character of this token. - + beginLine and beginColumn describe the position of the first character of this token; endLine and endColumn describe the position of the last character of this token. - + The string image of the token. - + A reference to the next regular (non-special) token from the input stream. If this is the last token from the input stream, or if the token manager has not read tokens beyond this one, this field is @@ -320,7 +511,7 @@ this field. - + This field is used to access special tokens that occur prior to this token, but after the immediately preceding regular (non-special) token. If there are no such special tokens, this field is set to null. @@ -333,10 +524,10 @@ is no such token, this field is null. - + Returns the image. - + Returns a new Token object, by default. However, if you want, you can create and return subclass objects based on the value of ofKind. Simply add the cases to the switch for all those special cases. @@ -349,913 +540,965 @@ variable to the appropriate type and use it in your lexical actions. - - Lexical error occured. - - - An attempt wass made to create a second instance of a static token manager. - - - Tried to change to an invalid lexical state. - - - Detected (and bailed out of) an infinite loop in the token manager. - - - Indicates the reason why the exception is thrown. It will have - one of the above 4 values. + + Store a sorted collection of {@link Lucene.Net.Index.TermVectorEntry}s. Collects all term information + into a single, SortedSet. +
NOTE: This Mapper ignores all Field information for the Document. This means that if you are using offsets/positions you will not know which Fields they correlate with.
+ This is not thread-safe
- - Replaces unprintable characters by their espaced (or unicode escaped) - equivalents in the given string + + The TermVectorMapper can be used to map Term Vectors into your own + structure instead of the parallel array structure used by + {@link Lucene.Net.Index.IndexReader#GetTermFreqVector(int,String)}. +

It is up to the implementation to make sure it is thread-safe.

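A minimal sketch (not from this commit) of the TermVectorMapper extension point described above: it collects term frequencies and ignores offsets and positions. The class name is invented and the exact member signatures should be checked against the generated API:

using System.Collections.Generic;
using Lucene.Net.Index;

public class FrequencyOnlyMapper : TermVectorMapper
{
    public IDictionary<string, int> Frequencies = new Dictionary<string, int>();

    public override void SetExpectations(string field, int numTerms,
                                         bool storeOffsets, bool storePositions)
    {
        // Called once per field before mapping starts; nothing to set up here.
    }

    public override void Map(string term, int frequency,
                             TermVectorOffsetInfo[] offsets, int[] positions)
    {
        // offsets and positions may be null and are deliberately ignored.
        Frequencies[term] = frequency;
    }

    public override bool IsIgnoringPositions()
    {
        return true;   // tells Lucene it may skip position data for this mapper
    }
}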
- - Returns a detailed message for the Error when it is thrown by the - token manager to indicate a lexical error. - Parameters : - EOFSeen : indicates if EOF caused the lexicl error - curLexState : lexical state in which this error occured - errorLine : line number when the error occured - errorColumn : column number when the error occured - errorAfter : prefix that was seen before this error occured - curchar : the offending character - Note: You can customize the lexical error message by modifying this method. - + + + true if this mapper should tell Lucene to ignore positions even if they are stored + + similar to ignoringPositions + - - You can also modify the body of this method to customize your error messages. - For example, cases like LOOP_DETECTED and INVALID_LEXICAL_STATE are not - of end-users concern, so you can return something like : - - "Internal Error : Please file a bug report .... " + + Tell the mapper what to expect in regards to field, number of terms, offset and position storage. + This method will be called once before retrieving the vector for a field. - from this method for such cases in the release version of your parser. + This method will be called before {@link #Map(String,int,TermVectorOffsetInfo[],int[])}. + The field the vector is for + + The number of terms that need to be mapped + + true if the mapper should expect offset information + + true if the mapper should expect positions info + - - An abstract base class for simple, character-oriented tokenizers. + + Map the Term Vector information into your own structure + The term to add to the vector + + The frequency of the term in the document + + null if the offset is not specified, otherwise the offset into the field of the term + + null if the position is not specified, otherwise the position in the field of the term + - - Returns true iff a character should be included in a token. This - tokenizer generates as tokens adjacent sequences of characters which - satisfy this predicate. Characters for which this is false are used to - define token boundaries and are not included in tokens. + + Indicate to Lucene that even if there are positions stored, this mapper is not interested in them and they + can be skipped over. Derived classes should set this to true if they want to ignore positions. The default + is false, meaning positions will be loaded if they are stored. + false + - - Called on each token character to normalize it before it is added to the - token. The default implementation does nothing. Subclasses may use this - to, e.g., lowercase tokens. - + + + + + false + - - Returns the next token in the stream, or null at EOS. + + Passes down the index of the document whose term vector is currently being mapped, + once for each top level call to a term vector reader. +

+ Default implementation IGNORES the document number. Override if your implementation needs the document number. +

+ NOTE: Document numbers are internal to Lucene and subject to change depending on indexing operations. + +

+ index of document currently being mapped +
- - - To replace accented characters in a String by unaccented equivalents. + + Stand-in name for the field in {@link TermVectorEntry}. - - "Tokenizes" the entire stream as a single token. This is useful - for data like zip codes, ids, and some product names. - + + + A Comparator for sorting {@link TermVectorEntry}s + - - Emits the entire input as a single token. + + + The term to map + + The frequency of the term + + Offset information, may be null + + Position information, may be null + - - Removes words that are too long and too short from the stream. + + The TermVectorEntrySet. A SortedSet of {@link TermVectorEntry} objects. Sort is by the comparator passed into the constructor. +
+ This set will be empty until after the mapping process takes place.
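As a hypothetical usage sketch (not part of this commit), a mapper such as the FrequencyOnlyMapper outlined earlier can be fed a document's term vector through an IndexReader.GetTermFreqVector overload that accepts a mapper (assumed here, alongside an invented field name):

using System.Collections.Generic;
using Lucene.Net.Index;

public class MapperUsageSketch
{
    public static void PrintBodyVector(IndexReader reader, int docNumber)
    {
        FrequencyOnlyMapper mapper = new FrequencyOnlyMapper();
        // Pushes each term of the "body" field's vector through the mapper's
        // SetExpectations(...) and Map(...) calls instead of building parallel arrays.
        reader.GetTermFreqVector(docNumber, "body", mapper);
        foreach (KeyValuePair<string, int> entry in mapper.Frequencies)
        {
            System.Console.WriteLine(entry.Key + ": " + entry.Value);
        }
    }
}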
- David Spencer - - $Id: LengthFilter.java 347992 2005-11-21 21:41:43Z dnaber $ - -
- - Build a filter that removes words that are too long or too - short from the text. - + The SortedSet of {@link TermVectorEntry}. + - - Returns the next input Token whose termText() is the right len + + The file format version, a negative number. - - A LetterTokenizer is a tokenizer that divides text at non-letters. That's - to say, it defines tokens as maximal strings of adjacent letters, as defined - by java.lang.Character.isLetter() predicate. - Note: this does a decent job for most European languages, but does a terrible - job for some Asian languages, where words are not separated by spaces. + + This format adds details used for lockless commits. It differs + slightly from the previous format in that file names + are never re-used (write once). Instead, each file is + written to the next generation. For example, + segments_1, segments_2, etc. This allows us to not use + a commit lock. See file + formats for details. + + + + This format adds a "hasSingleNormFile" flag into each segment info. + See LUCENE-756 + for details. + + + + This format allows multiple segments to share a single + vectors and stored fields file. - - Construct a new LetterTokenizer. - - - Collects only characters which satisfy - {@link Character#isLetter(char)}. + + counts how often the index has been changed by adding or deleting docs. + starting with the current time in milliseconds forces to create unique version numbers. - - Normalizes token text to lower case. + + If non-null, information about loading segments_N files + + + + + Get the generation (N) of the current segments_N file + from a list of files. - $Id: LowerCaseFilter.java 150259 2004-03-29 22:48:07Z cutting $ - - - - - Construct a new LowerCaseTokenizer. + -- array of file names to check + - - Collects only characters which satisfy - {@link Character#isLetter(char)}. + + Get the generation (N) of the current segments_N file + in the directory. + + -- directory to search for the latest segments_N file + - - - Constructs with default analyzer. + + Get the filename of the current segments_N file + from a list of files. - Any fields not specifically - defined to use a different analyzer will use the one provided here. + -- array of file names to check - - Defines an analyzer to use for the specified field. + + Get the filename of the current segments_N file + in the directory. - field name requiring a non-default analyzer - - non-default analyzer to use for field + -- directory to search for the latest segments_N file - - - Returns the next input Token, after being stemmed + + Get the segments_N filename in use by this segment infos. - - - Stemmer, implementing the Porter Stemming Algorithm - - The Stemmer class transforms a word into its root form. The input - word can be provided a character at time (by calling add()), or at once - by calling one of the various stem(something) methods. + + Parse the generation off the segments file name and + return it. - - reset() resets the stemmer so it can stem another word. If you invoke - the stemmer by calling add(char) and then Stem(), you must call reset() - before starting another word. - + + Get the next segments_N filename that will be written. - - Add a character to the word being stemmed. When you are finished - adding characters, you can call Stem(void) to process the word. + + Read a particular segmentFileName. Note that this may + throw an IOException if a commit is in process. 
+ + -- directory containing the segments file + + -- segment file to load + + CorruptIndexException if the index is corrupt + IOException if there is a low-level IO error - - After a word has been stemmed, it can be retrieved by toString(), - or a reference to the internal buffer can be retrieved by getResultBuffer - and getResultLength (which is generally more efficient.) + + This version of read uses the retry logic (for lock-less + commits) to find the right segments file to load. + CorruptIndexException if the index is corrupt + IOException if there is a low-level IO error - - Returns the length of the word resulting from the stemming process. - - - Returns a reference to a character buffer containing the results of - the stemming process. You also need to consult getResultLength() - to determine the length of the result. + + Returns a copy of this instance, also copying each + SegmentInfo. - - Stem a word provided as a String. Returns the result as a String. + + version number when this SegmentInfos was generated. - - Stem a word contained in a char[]. Returns true if the stemming process - resulted in a word different from the input. You can retrieve the - result with getResultLength()/getResultBuffer() or toString(). - + + Current version number from segments file. + CorruptIndexException if the index is corrupt + IOException if there is a low-level IO error - - Stem a word contained in a portion of a char[] array. Returns - true if the stemming process resulted in a word different from - the input. You can retrieve the result with - getResultLength()/getResultBuffer() or toString(). + + If non-null, information about retries when loading + the segments file will be printed to this. - - Stem a word contained in a leading portion of a char[] array. - Returns true if the stemming process resulted in a word different - from the input. You can retrieve the result with - getResultLength()/getResultBuffer() or toString(). + + Advanced: set how many times to try loading the + segments.gen file contents to determine current segment + generation. This file is only referenced when the + primary method (listing the directory) fails. - - Stem the word placed into the Stemmer buffer through calls to add(). - Returns true if the stemming process resulted in a word different - from the input. You can retrieve the result with - getResultLength()/getResultBuffer() or toString(). - + + + - - Test program for demonstrating the Stemmer. It reads a file and - stems each word, writing the result to standard out. - Usage: Stemmer file-name + + Advanced: set how many milliseconds to pause in between + attempts to load the segments.gen file. - - An Analyzer that filters LetterTokenizer with LowerCaseFilter. - - - Filters LetterTokenizer with LowerCaseFilter and StopFilter. + + + - - An array containing some common English words that are not usually useful - for searching. + + Advanced: set how many times to try incrementing the + gen when loading the segments file. This only runs if + the primary (listing directory) and secondary (opening + segments.gen file) methods fail to find the segments + file. - - Builds an analyzer which removes words in ENGLISH_STOP_WORDS. - - - Builds an analyzer with the stop words from the given set. - - - Builds an analyzer which removes words in the provided array. - - - Builds an analyzer with the stop words from the given file. - + + - - Builds an analyzer with the stop words from the given reader. - + + - - Filters LowerCaseTokenizer with StopFilter. 
+ + Returns a new SegmentInfos containg the SegmentInfo + instances in the specified range first (inclusive) to + last (exclusive), so total number of segments returned + is last-first. + + + + Utility class for executing code that needs to do + something with the current segments file. This is + necessary with lock-less commits because from the time + you locate the current segments file name, until you + actually open it, read its contents, or check modified + time, etc., it could have been deleted due to a writer + commit finishing. + + + + Subclass must implement this. The assumption is an + IOException will be thrown if something goes wrong + during the processing that could have been caused by + a writer committing. + - - Removes stop words from a token stream. + + Filename filter that accept filenames and extensions only created by Lucene. + + + Daniel Naber / Bernhard Messer + + $rcs = ' $Id: Exp $ ' ; + - - Construct a token stream filtering the given input. + + Returns true if this is a file that would be contained + in a CFS file. This function should only be called on + files that pass the above "accept" (ie, are already + known to be a Lucene index file). + - - Constructs a filter which removes words from the input - TokenStream that are named in the array of words. + + Access to the Fieldable Info file that describes document fields and whether or + not they are indexed. Each segment has a separate Fieldable Info file. Objects + of this class are thread-safe for multiple readers, but only one thread can + be adding documents at a time, with no other reader or writer threads + accessing this object. - - Construct a token stream filtering the given input. - - - The set of Stop Words, as Strings. If ignoreCase is true, all strings should be lower cased + + Construct a FieldInfos object using the directory and the name of the file + IndexInput + + The directory to open the IndexInput from - -Ignore case when stopping. The stopWords set must be setup to contain only lower case words + The name of the file to open the IndexInput from in the Directory + IOException - - Constructs a filter which removes words from the input - TokenStream that are named in the Set. - It is crucial that an efficient Set implementation is used - for maximum performance. - - - - + + Returns a deep clone of this FieldInfos instance. - - Builds a Set from an array of stop words, - appropriate for passing into the StopFilter constructor. - This permits this stopWords construction to be cached once when - an Analyzer is constructed. + + Adds field info for a Document. + + + Add fields that are indexed. Whether they have termvectors has to be specified. - - - - - - + The names of the fields - If true, all words are lower cased first. + Whether the fields store term vectors or not + + treu if positions should be stored. + + true if offsets should be stored - a Set containing the words - - - - Returns the next input Token whose termText() is not a stop word. - - - A Token is an occurence of a term from the text of a field. It consists of - a term's text, the start and end offset of the term in the text of the field, - and a type string. - The start and end offsets permit applications to re-associate a token with - its source text, e.g., to display highlighted query terms in a document - browser, or to show matching text fragments in a KWIC (KeyWord In Context) - display, etc. - The type is an interned string, assigned by a lexical analyzer - (a.k.a. 
tokenizer), naming the lexical or syntactic class that the token - belongs to. For example an end of sentence marker token might be implemented - with type "eos". The default token type is "word". - - - - - - - Returns the position increment of this Token. - - - - - Returns the Token's term text. - - - Returns this Token's starting offset, the position of the first character - corresponding to this token in the source text. - Note that the difference between endOffset() and startOffset() may not be - equal to termText.length(), as the term text may have been altered by a - stemmer or some other filter. - - - - Returns this Token's ending offset, one greater than the position of the - last character corresponding to this token in the source text. - - - - Returns this Token's lexical type. Defaults to "word". - - - An Analyzer that uses WhitespaceTokenizer. - - - A WhitespaceTokenizer is a tokenizer that divides text at whitespace. - Adjacent sequences of non-Whitespace characters form tokens. - - - - Construct a new WhitespaceTokenizer. - - - Collects only characters which do not satisfy - {@link Character#isWhitespace(char)}. - - - Loader for text files that represent a list of stopwords. + + Assumes the fields are not storing term vectors. - Gerhard Schwarz - - $Id: WordlistLoader.java 192989 2005-06-22 19:59:03Z dnaber $ - - - - Loads a text file and adds every line as an entry to a HashSet (omitting - leading and trailing whitespace). Every line of the file should contain only - one word. The words need to be in lowercase if you make use of an - Analyzer which uses LowerCaseFilter (like StandardAnalyzer). + The names of the fields + + Whether the fields are indexed or not - - File containing the wordlist - A HashSet with the file's words - + + - - Reads lines from a Reader and adds every line as an entry to a HashSet (omitting - leading and trailing whitespace). Every line of the Reader should contain only - one word. The words need to be in lowercase if you make use of an - Analyzer which uses LowerCaseFilter (like StandardAnalyzer). + + Calls 5 parameter add with false for all TermVector parameters. - Reader containing the wordlist + The name of the Fieldable - A HashSet with the reader's words - + true if the field is indexed + + + - - Builds a wordlist table, using words as both keys and values - for backward compatibility. + + Calls 5 parameter add with false for term vector positions and offsets. - stopword set + The name of the field + + true if the field is indexed + + true if the term vector should be stored - - - Converts a Date to a string suitable for indexing. - RuntimeException if the date specified in the - method argument is before 1970 - - - - Converts a millisecond time to a string suitable for indexing. - RuntimeException if the time specified in the - method argument is negative, that is, before 1970 - - - - Converts a string-encoded date into a millisecond time. - - - Converts a string-encoded date into a Date object. - - - - Converts a Date to a string suitable for indexing. + + If the field is not yet known, adds it. If it is known, checks to make + sure that the isIndexed flag is the same as was given previously for this + field. If not - marks it as being indexed. Same goes for the TermVector + parameters. - the date to be converted + The name of the field - the desired resolution, see [... 15438 lines stripped ...]