lucene-java-commits mailing list archives

From rm...@apache.org
Subject svn commit: r916666 [2/16] - in /lucene/java/branches/flex_1458: ./ contrib/ contrib/analyzers/common/ contrib/analyzers/common/src/java/org/apache/lucene/analysis/ar/ contrib/analyzers/common/src/java/org/apache/lucene/analysis/bg/ contrib/analyzers/c...
Date Fri, 26 Feb 2010 13:10:08 GMT
Modified: lucene/java/branches/flex_1458/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/CHANGES.txt?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/CHANGES.txt (original)
+++ lucene/java/branches/flex_1458/CHANGES.txt Fri Feb 26 13:09:54 2010
@@ -1,12 +1,26 @@
 Lucene Change Log
-$Id$
 
-Bug fixes
+======================= Flexible Indexing Branch =======================
+
+Changes in backwards compatibility policy
+
+* LUCENE-2111: UnicodeUtil now uses BytesRef for UTF-8 output, and
+  some method signatures have changed to CharSequence.  These are
+  advanced APIs and subject to change suddenly.
+  (Robert Muir, Mike McCandless)
+
+* LUCENE-1458: Flex API changes:
+    Directory.copy now copies all files (not just index files), since
+    what is and isn't an index file is now dependent on the codec
+    used. (Mike McCandless)
+
+Bug Fixes
 
- * LUCENE-2222: FixedIntBlockIndexInput incorrectly read one block of
-   0s before the actual data.  (Renaud Delbru via Mike McCandless)
+* LUCENE-2222: FixedIntBlockIndexInput incorrectly read one block of
+  0s before the actual data.  (Renaud Delbru via Mike McCandless)
 
 ======================= Trunk (not yet released) =======================
+
 Changes in backwards compatibility policy
 
 * LUCENE-1483: Removed utility class oal.util.SorterTemplate; this
@@ -25,15 +39,8 @@
   toString.  These are advanced APIs and subject to change suddenly.
   (Tim Smith via Mike McCandless)
 
-* LUCENE-2111: UnicodeUtil now uses BytesRef for UTF-8 output, and
-  some method signatures have changed to CharSequence.  These are
-  advanced APIs and subject to change suddenly.
-  (Robert Muir, Mike McCandless)
-
-* LUCENE-1458: Flex API changes:
-  - Directory.copy now copies all files (not just index files), since
-    what is and isn't and index file is now dependent on the codec
-    used.
+* LUCENE-2190: Removed deprecated customScore() and customExplain()
+  methods from experimental CustomScoreQuery.  (Uwe Schindler)
 
 Changes in runtime behavior
 
@@ -58,58 +65,52 @@
   until Lucene 4.0 the default one will be deprecated.
   (Shai Erera via Uwe Schindler) 
 
-* LUCENE-1609: Restore IndexReader.getTermInfosIndexDivisor (it was
-  accidentally removed in 3.0.0)  (Mike McCandless)
-
-* LUCENE-1972: Restore SortField.getComparatorSource (it was
-  accidentally removed in 3.0.0)  (John Wang via Uwe Schindler)
-
 * LUCENE-2177: Deprecate the Field ctors that take byte[] and Store.
   Since the removal of compressed fields, Store can only be YES, so
   it's not necessary to specify.  (Erik Hatcher via Mike McCandless)
 
-* LUCENE-2190: Added setNextReader method to CustomScoreQuery, which
-  is necessary with per-segment searching to notify the subclass
-  which reader the int doc, passed to customScore, refers to.  (Paul
-  chez Jamespot via Mike McCandless)
-
 * LUCENE-2200: Several final classes had non-overriding protected
   members. These were converted to private and unused protected
   constructors removed.  (Steven Rowe via Robert Muir)
 
-Bug fixes
+* LUCENE-2240: SimpleAnalyzer and WhitespaceAnalyzer now have
+  Version ctors.  (Simon Willnauer via Uwe Schindler)
+
+* LUCENE-2259: Add IndexWriter.deleteUnusedFiles, to attempt removing
+  unused files.  This is only useful on Windows, which prevents
+  deletion of open files. IndexWriter will eventually remove these
+  files itself; this method just lets you do so when you know the
+  files are no longer open by IndexReaders. (luocanrao via Mike
+  McCandless)
+
+* LUCENE-2281: added doBeforeFlush to IndexWriter to allow extensions to perform
+  operations before flush starts. Also exposed doAfterFlush as protected instead
+  of package-private. (Shai Erera via Mike McCandless)
 
-* LUCENE-2092: BooleanQuery was ignoring disableCoord in its hashCode
-  and equals methods, cause bad things to happen when caching
-  BooleanQueries.  (Chris Hostetter, Mike McCandless)
-
-* LUCENE-2095: Fixes: when two threads call IndexWriter.commit() at
-  the same time, it's possible for commit to return control back to
-  one of the threads before all changes are actually committed.
-  (Sanne Grinovero via Mike McCandless)
+Bug fixes
 
 * LUCENE-2119: Don't throw NegativeArraySizeException if you pass
   Integer.MAX_VALUE as nDocs to IndexSearcher search methods.  (Paul
   Taylor via Mike McCandless)
 
-* LUCENE-2132: Fix the demo result.jsp to use QueryParser with a 
-  Version argument.  (Brian Li via Robert Muir)
-
 * LUCENE-2142: FieldCacheImpl.getStringIndex no longer throws an
   exception when term count exceeds doc count.  (Mike McCandless)
 
-* LUCENE-2166: Don't incorrectly keep warning about the same immense
-  term, when IndexWriter.infoStream is on.  (Mike McCandless)
-
 * LUCENE-2104: NativeFSLock.release() would silently fail if the lock is held by 
   another thread/process.  (Shai Erera via Uwe Schindler)
 
-* LUCENE-2158: At high indexing rates, NRT reader could temporarily
-  lose deletions.  (Mike McCandless)
+* LUCENE-2216: OpenBitSet.hashCode returned different hash codes for
+  sets that only differed by trailing zeros. (Dawid Weiss, yonik)
+
+* LUCENE-2235: Implement missing PerFieldAnalyzerWrapper.getOffsetGap().
+  (Javier Godoy via Uwe Schindler)
+
+* LUCENE-2249: ParallelMultiSearcher should shut down thread pool on
+  close.  (Martin Traverso via Uwe Schindler)
   
-* LUCENE-2182: DEFAULT_ATTRIBUTE_FACTORY was failing to load
-  implementation class when interface was loaded by a different
-  class loader.  (Uwe Schindler, reported on java-user by Ahmed El-dawy)
+* LUCENE-2273: FieldCacheImpl.getCacheEntries() used WeakHashMap
+  incorrectly and led to ConcurrentModificationException.
+  (Uwe Schindler, Robert Muir)
   
 New features
 
@@ -135,10 +136,19 @@
   stopwords, and implement many analyzers in contrib with it.  
   (Simon Willnauer via Robert Muir)
   
-Optimizations
+* LUCENE-2198: Support protected words in stemming TokenFilters using a
+  new KeywordAttribute.  (Simon Willnauer via Uwe Schindler)
+  
+* LUCENE-2183, LUCENE-2240, LUCENE-2241: Added Unicode 4 support
+  to CharTokenizer and its subclasses. CharTokenizer now has new
+  int-API which is conditionally preferred to the old char-API depending
+  on the provided Version. Version < 3.1 will use the char-API.
+  (Simon Willnauer via Uwe Schindler)
 
-* LUCENE-2086: When resolving deleted terms, do so in term sort order
-  for better performance. (Bogdan Ghidireac via Mike McCandless)
+* LUCENE-2247: Added a CharArrayMap<V> for performance improvements
+  in some stemmers and synonym filters. (Uwe Schindler)
+
+Optimizations
 
 * LUCENE-2075: Terms dict cache is now shared across threads instead
   of being stored separately in thread local storage.  Also fixed
@@ -155,13 +165,12 @@
 * LUCENE-2137: Switch to AtomicInteger for some ref counting (Earwin
   Burrfoot via Mike McCandless)
 
-* LUCENE-2123: Move FuzzyQuery rewrite as separate RewriteMode into
-  MTQ. This also fixes a slowdown / memory issue added by LUCENE-504.
+* LUCENE-2123, LUCENE-2261: Move FuzzyQuery rewrite to separate RewriteMode 
+  into MultiTermQuery. The number of fuzzy expansions can be specified with
+  the maxExpansions parameter to FuzzyQuery, but the default is limited to
+  BooleanQuery.maxClauseCount() as before. 
   (Uwe Schindler, Robert Muir, Mike McCandless)
 
-* LUCENE-2137: Switch to AtomicInteger for some ref counting (Earwin
-  Burrfoot via Mike McCandless)
-
 * LUCENE-2135: On IndexReader.close, forcefully evict any entries from
   the FieldCache rather than waiting for the WeakHashMap to release
   the reference (Mike McCandless)
@@ -188,6 +197,9 @@
 * LUCENE-2188: Add a utility class for tracking deprecated overridden
   methods in non-final subclasses.
   (Uwe Schindler, Robert Muir)
+
+* LUCENE-2195: Speedup CharArraySet if set is empty.
+  (Simon Willnauer via Robert Muir)
    
 Build
 
@@ -206,15 +218,118 @@
 * LUCENE-2065: Use Java 5 generics throughout our unit tests.  (Kay
   Kay via Mike McCandless)
 
-* LUCENE-2114: Change TestFilteredSearch to test on multi-segment
-  index as well; improve javadocs of Filter to call out that the
-  provided reader is per-segment (Simon Willnauer via Mike McCandless)
-
 * LUCENE-2155: Fix time and zone dependent localization test failures
   in queryparser tests. (Uwe Schindler, Chris Male, Robert Muir)
 
 * LUCENE-2170: Fix thread starvation problems.  (Uwe Schindler)
 
+* LUCENE-2248, LUCENE-2251: Refactor tests to not use Version.LUCENE_CURRENT,
+  but instead use a global static value from LuceneTestCase(J4), that
+  contains the release version.  (Uwe Schindler, Simon Willnauer)
+  
+================== Release 2.9.2 / 3.0.1 2010-02-26 ====================
+
+Changes in backwards compatibility policy
+
+* LUCENE-2123 (3.0.1 only): Removed the protected inner class ScoreTerm
+  from FuzzyQuery. The change was needed because the comparator of this
+  class had to be changed in an incompatible way. The class was never
+  intended to be public.  (Uwe Schindler, Mike McCandless)
+  
+Bug fixes
+
+ * LUCENE-2092: BooleanQuery was ignoring disableCoord in its hashCode
+   and equals methods, causing bad things to happen when caching
+   BooleanQueries.  (Chris Hostetter, Mike McCandless)
+
+ * LUCENE-2095: Fixes: when two threads call IndexWriter.commit() at
+   the same time, it's possible for commit to return control back to
+   one of the threads before all changes are actually committed.
+   (Sanne Grinovero via Mike McCandless)
+
+ * LUCENE-2132 (3.0.1 only): Fix the demo result.jsp to use QueryParser
+   with a Version argument.  (Brian Li via Robert Muir)
+
+ * LUCENE-2166: Don't incorrectly keep warning about the same immense
+   term, when IndexWriter.infoStream is on.  (Mike McCandless)
+
+ * LUCENE-2158: At high indexing rates, NRT reader could temporarily
+   lose deletions.  (Mike McCandless)
+  
+ * LUCENE-2182: DEFAULT_ATTRIBUTE_FACTORY was failing to load
+   implementation class when interface was loaded by a different
+   class loader.  (Uwe Schindler, reported on java-user by Ahmed El-dawy)
+
+ * LUCENE-2257: Increase max number of unique terms in one segment to
+   termIndexInterval (default 128) * ~2.1 billion = ~274 billion.
+   (Tom Burton-West via Mike McCandless)
+  
+ * LUCENE-2260: Fixed AttributeSource to not hold a strong
+   reference to the Attribute/AttributeImpl classes which prevents
+   unloading of custom attributes loaded by other classloaders
+   (e.g. in Solr plugins).  (Uwe Schindler)
+ 
+ * LUCENE-1941: Fix Min/MaxPayloadFunction returns 0 when
+   only one payload is present.  (Erik Hatcher, Mike McCandless
+   via Uwe Schindler)
+
+ * LUCENE-2270: Queries consisting of all zero-boost clauses
+   (for example, text:foo^0) sorted incorrectly and produced
+   invalid docids. (yonik)
+
+API Changes
+
+ * LUCENE-1609 (3.0.1 only): Restore IndexReader.getTermInfosIndexDivisor
+   (it was accidentally removed in 3.0.0)  (Mike McCandless)
+
+ * LUCENE-1972 (3.0.1 only): Restore SortField.getComparatorSource
+   (it was accidentally removed in 3.0.0)  (John Wang via Uwe Schindler)
+
+ * LUCENE-2190: Added a new class CustomScoreProvider to function package
+   that can be subclassed to provide custom scoring to CustomScoreQuery.
+   The methods in CustomScoreQuery that did this before were deprecated
+   and replaced by a method getCustomScoreProvider(IndexReader) that
+   returns a custom score implementation using the above class. The change
+   is necessary with per-segment searching, as CustomScoreQuery is
+   a stateless class (like all other Queries) and does not know about
+   the currently searched segment. This API works similar to Filter's
+   getDocIdSet(IndexReader).  (Paul chez Jamespot via Mike McCandless,
+   Uwe Schindler)
+
+ * LUCENE-2080: Deprecate Version.LUCENE_CURRENT, as using this constant
+   will cause backwards compatibility problems when upgrading Lucene. See
+   the Version javadocs for additional information.
+   (Robert Muir)
+
+Optimizations
+
+ * LUCENE-2086: When resolving deleted terms, do so in term sort order
+   for better performance (Bogdan Ghidireac via Mike McCandless)
+
+ * LUCENE-2123 (partly, 3.0.1 only): Fixes a slowdown / memory issue
+   added by LUCENE-504.  (Uwe Schindler, Robert Muir, Mike McCandless)
+
+ * LUCENE-2258: Remove unneeded synchronization in FuzzyTermEnum.
+   (Uwe Schindler, Robert Muir)
+
+Test Cases
+
+ * LUCENE-2114: Change TestFilteredSearch to test on multi-segment
+   index as well. (Simon Willnauer via Mike McCandless)
+
+ * LUCENE-2211: Improves BaseTokenStreamTestCase to use a fake attribute
+   that checks if clearAttributes() was called correctly.
+   (Uwe Schindler, Robert Muir)
+
+ * LUCENE-2207, LUCENE-2219: Improve BaseTokenStreamTestCase to check if 
+   end() is implemented correctly.  (Koji Sekiguchi, Robert Muir)
+
+Documentation
+
+ * LUCENE-2114: Improve javadocs of Filter to call out that the
+   provided reader is per-segment (Simon Willnauer via Mike
+   McCandless)
+ 
 ======================= Release 3.0.0 2009-11-25 =======================
 
 Changes in backwards compatibility policy
@@ -520,10 +635,10 @@
     code to implement this method.  If you already extend
     IndexSearcher, no further changes are needed to use Collector.
     
-    Finally, the values Float.NaN, Float.NEGATIVE_INFINITY and
-    Float.POSITIVE_INFINITY are not valid scores.  Lucene uses these
-    values internally in certain places, so if you have hits with such
-    scores, it will cause problems. (Shai Erera via Mike McCandless)
+    Finally, the values Float.NaN and Float.NEGATIVE_INFINITY are not
+    valid scores.  Lucene uses these values internally in certain
+    places, so if you have hits with such scores, it will cause
+    problems. (Shai Erera via Mike McCandless)
 
  * LUCENE-1687: All methods and parsers from the interface ExtendedFieldCache
     have been moved into FieldCache. ExtendedFieldCache is now deprecated and
@@ -601,7 +716,7 @@
     
  * LUCENE-1575: As of 2.9, the core collectors as well as
     IndexSearcher's search methods that return top N results, no
-    longer filter out zero scoring documents. If you rely on this
+    longer filter documents with scores <= 0.0. If you rely on this
     functionality you can use PositiveScoresOnlyCollector like this:
 
     <code>
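
The LUCENE-2216 entry in this file describes a hash code that differed for bit sets differing only by trailing zeros. A minimal Java sketch of that class of bug, assuming a bit set backed by a long[] of words. The class and method names are hypothetical, and this is not the OpenBitSet source:

```java
import java.util.Arrays;

/** Hypothetical illustration of a length-dependent hash; not Lucene code. */
final class TrailingZeroHashSketch {

    /** Naive hash: mixes every word, so extra trailing zero words change the result. */
    static int naiveHash(long[] words) {
        return Arrays.hashCode(words);
    }

    /** Length-independent hash: ignores trailing zero words before mixing. */
    static int stableHash(long[] words) {
        int end = words.length;
        while (end > 0 && words[end - 1] == 0L) {
            end--;
        }
        long h = 0L;
        for (int i = 0; i < end; i++) {
            h = 31L * h + words[i];
        }
        // Fold the 64-bit accumulator into 32 bits.
        return (int) (h ^ (h >>> 32));
    }

    public static void main(String[] args) {
        long[] compact = {0b1011L};          // bits 0, 1 and 3 set
        long[] padded = {0b1011L, 0L, 0L};   // same bits, larger backing array
        System.out.println("naive equal:  " + (naiveHash(compact) == naiveHash(padded)));
        System.out.println("stable equal: " + (stableHash(compact) == stableHash(padded)));
    }
}
```

For these two arrays the naive hash differs while the stable hash agrees, which is the property two logically equal sets need in order to work as keys in hash-based collections.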

Modified: lucene/java/branches/flex_1458/NOTICE.txt
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/NOTICE.txt?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/NOTICE.txt (original)
+++ lucene/java/branches/flex_1458/NOTICE.txt Fri Feb 26 13:09:54 2010
@@ -5,7 +5,10 @@
 The Apache Software Foundation (http://www.apache.org/).
 
 The snowball stemmers in
-  contrib/snowball/src/java/net/sf/snowball
+  contrib/analyzers/common/src/java/net/sf/snowball
+were developed by Martin Porter and Richard Boulton.
+The snowball stopword lists in
+  contrib/analyzers/common/src/resources/org/apache/lucene/analysis/snowball
 were developed by Martin Porter and Richard Boulton.
 The full snowball package is available from
   http://snowball.tartarus.org/
@@ -20,11 +23,21 @@
 contrib/analyzers/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt.
 See http://members.unine.ch/jacques.savoy/clef/index.html.
 
+The Romanian analyzer (contrib/analyzers) comes with a default
+stopword list that is BSD-licensed created by Jacques Savoy.  The file resides in
+contrib/analyzers/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt.
+See http://members.unine.ch/jacques.savoy/clef/index.html.
+
 The Bulgarian analyzer (contrib/analyzers) comes with a default
 stopword list that is BSD-licensed created by Jacques Savoy.  The file resides in
 contrib/analyzers/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt.
 See http://members.unine.ch/jacques.savoy/clef/index.html.
 
+The Hindi analyzer (contrib/analyzers) comes with a default
+stopword list that is BSD-licensed created by Jacques Savoy.  The file resides in
+contrib/analyzers/common/src/resources/org/apache/lucene/analysis/hi/stopwords.txt.
+See http://members.unine.ch/jacques.savoy/clef/index.html.
+
 Includes lib/servlet-api-2.4.jar from  Apache Tomcat
 
 The SmartChineseAnalyzer source code (under contrib/analyzers) was

Modified: lucene/java/branches/flex_1458/README.txt
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/README.txt?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/README.txt (original)
+++ lucene/java/branches/flex_1458/README.txt Fri Feb 26 13:09:54 2010
@@ -1,7 +1,5 @@
 Lucene README file
 
-$Id$
-
 INTRODUCTION
 
 Lucene is a Java full-text search engine.  Lucene is not a complete
@@ -27,7 +25,7 @@
 
 contrib/*
   Contributed code which extends and enhances Lucene, but is not
-  part of the core library.  Of special note are the JAR files in the analyzers and snowball directory which
+  part of the core library.  Of special note are the JAR files in the analyzers directory which
   contain various analyzers that people may find useful in place of the StandardAnalyzer.
 
 

Modified: lucene/java/branches/flex_1458/build.xml
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/build.xml?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/build.xml (original)
+++ lucene/java/branches/flex_1458/build.xml Fri Feb 26 13:09:54 2010
@@ -99,10 +99,10 @@
       <echo>Initial SVN checkout for '${backwards.branch}'...</echo>
       <mkdir dir="${backwards.dir}"/>
       <exec dir="${backwards.dir}" executable="${svn.exe}" failifexecutionfails="false" failonerror="true">
-        <arg line="checkout -r ${backwards.rev} --depth empty http://svn.apache.org/repos/asf/lucene/java/branches/${backwards.branch} ${backwards.branch}"/>
+        <arg line="checkout --trust-server-cert --non-interactive -r ${backwards.rev} --depth empty https://svn.apache.org/repos/asf/lucene/java/branches/${backwards.branch} ${backwards.branch}"/>
       </exec>
       <exec dir="${backwards.dir}" executable="${svn.exe}" failifexecutionfails="false" failonerror="true">
-        <arg line="update -r ${backwards.rev} --set-depth infinity ${backwards.branch}/src"/>
+        <arg line="update --trust-server-cert --non-interactive -r ${backwards.rev} --set-depth infinity ${backwards.branch}/src"/>
       </exec>
     </sequential>
   </target>
@@ -111,7 +111,7 @@
     <sequential>
       <echo>Update backwards branch '${backwards.branch}' to revision ${backwards.rev}...</echo>
       <exec dir="${backwards.dir}" executable="${svn.exe}" failifexecutionfails="false" failonerror="true">
-        <arg line="update -r ${backwards.rev} ${backwards.branch}"/>
+        <arg line="update --trust-server-cert --non-interactive -r ${backwards.rev} ${backwards.branch}"/>
       </exec>
     </sequential>
   </target>
@@ -333,7 +333,6 @@
           <packageset dir="contrib/queries/src/java"/>
           <packageset dir="contrib/regex/src/java"/>
           <packageset dir="contrib/remote/src/java"/>
-          <packageset dir="contrib/snowball/src/java"/>
           <packageset dir="contrib/spatial/src/java"/>
           <packageset dir="contrib/spellchecker/src/java"/>
           <packageset dir="contrib/surround/src/java"/>
@@ -352,7 +351,7 @@
   
           <group title="Demo" packages="org.apache.lucene.demo*"/>
   
-          <group title="contrib: Analysis" packages="org.apache.lucene.analysis.*"/>
+          <group title="contrib: Analysis" packages="org.apache.lucene.analysis.*:org.tartarus.snowball*"/>
           <group title="contrib: Ant" packages="org.apache.lucene.ant*"/>
           <group title="contrib: Benchmark" packages="org.apache.lucene.benchmark*"/>
           <group title="contrib: ICU" packages="org.apache.lucene.collation*"/>
@@ -366,7 +365,6 @@
           <group title="contrib: Queries" packages="org.apache.lucene.search.similar*"/>
           <group title="contrib: Query Parser" packages="org.apache.lucene.queryParser.*"/>
           <group title="contrib: RegEx" packages="org.apache.lucene.search.regex*:org.apache.regexp*"/>
-          <group title="contrib: Snowball" packages="org.apache.lucene.analysis.snowball*:net.sf.snowball*"/>
           <group title="contrib: Spatial" packages="org.apache.lucene.spatial*"/>
           <group title="contrib: SpellChecker" packages="org.apache.lucene.search.spell*"/>
           <group title="contrib: Surround Parser" packages="org.apache.lucene.queryParser.surround*"/>
@@ -552,6 +550,48 @@
   </target>
 	
   <!-- ================================================================== -->
+  <!-- support for signing the artifacts using gpg                        -->
+  <!-- ================================================================== -->
+  <target name="clean-dist-signatures">
+    <delete failonerror="false">
+      <fileset dir="${dist.dir}">
+        <include name="**/*.asc"/>
+      </fileset>
+    </delete>
+  </target>
+  
+  <target name="sign-artifacts" depends="clean-dist-signatures">
+    <available property="gpg.input.handler" classname="org.apache.tools.ant.input.SecureInputHandler"
+      value="org.apache.tools.ant.input.SecureInputHandler"/>
+    <!--else:--><property name="gpg.input.handler" value="org.apache.tools.ant.input.DefaultInputHandler"/>
+    <input message="Enter GPG keystore password: >" addproperty="gpg.passphrase">
+      <handler classname="${gpg.input.handler}" />
+    </input>
+    
+    <apply executable="${gpg.exe}" inputstring="${gpg.passphrase}"
+      dest="${dist.dir}" type="file" maxparallel="1" verbose="yes">
+      <arg value="--passphrase-fd"/>
+      <arg value="0"/>
+      <arg value="--batch"/>
+      <arg value="--armor"/>
+      <arg value="--default-key"/>
+      <arg value="${gpg.key}"/>
+      <arg value="--output"/>
+      <targetfile/>
+      <arg value="--detach-sig"/>
+      <srcfile/>
+      
+      <fileset dir="${dist.dir}">
+        <include name="**/*.jar"/>
+        <include name="**/*.zip"/>
+        <include name="**/*.tar.gz"/>
+        <include name="**/*.pom"/>
+      </fileset>
+      <globmapper from="*" to="*.asc"/>
+    </apply>
+  </target>
+
+  <!-- ================================================================== -->
   <!-- Build the JavaCC files into the source tree                        -->
   <!-- ================================================================== -->
   <target name="jjdoc">
@@ -708,6 +748,7 @@
   <macrodef name="lucene-checksum">
     <attribute name="file"/>
     <sequential>
+      <echo>Building checksums for '@{file}'</echo>
       <checksum file="@{file}" algorithm="md5" format="MD5SUM" forceoverwrite="yes" readbuffersize="65536"/>
       <checksum file="@{file}" algorithm="sha1" format="MD5SUM" forceoverwrite="yes" readbuffersize="65536"/>
     </sequential>
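
The lucene-checksum macro above asks Ant for md5 and sha1 digests with a 65536-byte read buffer. As a rough Java sketch of the same computation, assuming only the JDK's MessageDigest (the class name ChecksumSketch and its methods are hypothetical, and the sketch prints bare hex digests rather than reproducing Ant's output file layout):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper; not part of the Lucene build. */
final class ChecksumSketch {

    private static MessageDigest newDigest(String algorithm) {
        try {
            return MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unsupported algorithm: " + algorithm, e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Digest of an in-memory byte array, as lowercase hex. */
    static String hexDigest(byte[] data, String algorithm) {
        return toHex(newDigest(algorithm).digest(data));
    }

    /** Digest of a stream, read in 64 KiB chunks to mirror readbuffersize="65536". */
    static String hexDigest(InputStream in, String algorithm) throws IOException {
        MessageDigest digest = newDigest(algorithm);
        byte[] buffer = new byte[65536];
        int n;
        while ((n = in.read(buffer)) != -1) {
            digest.update(buffer, 0, n);
        }
        return toHex(digest.digest());
    }

    public static void main(String[] args) throws IOException {
        // Each argument is treated as a path to an artifact to checksum.
        for (String name : args) {
            for (String algorithm : new String[] {"MD5", "SHA-1"}) {
                try (InputStream in = Files.newInputStream(Paths.get(name))) {
                    System.out.println(algorithm + " " + hexDigest(in, algorithm) + " " + name);
                }
            }
        }
    }
}
```

Streaming through a fixed buffer keeps memory use flat for large release archives, which is presumably why the macro sets an explicit buffer size.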

Propchange: lucene/java/branches/flex_1458/build.xml
------------------------------------------------------------------------------
    svn:mergeinfo = /lucene/java/branches/lucene_2_9/build.xml:909334

Modified: lucene/java/branches/flex_1458/common-build.xml
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/common-build.xml?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/common-build.xml (original)
+++ lucene/java/branches/flex_1458/common-build.xml Fri Feb 26 13:09:54 2010
@@ -43,7 +43,7 @@
   <property name="dev.version" value="3.1-dev"/>
   <property name="version" value="${dev.version}"/>
   <property name="backwards.branch" value="flex_1458_3_0_back_compat_tests"/>
-  <property name="backwards.rev" value="915852"/>
+  <property name="backwards.rev" value="916665"/>
   <property name="spec.version" value="${version}"/>	
   <property name="year" value="2000-${current.year}"/>
   <property name="final.name" value="lucene-${name}-${version}"/>
@@ -113,6 +113,9 @@
   <property name="svnversion.exe" value="svnversion" />
   <property name="svn.exe" value="svn" />
   
+  <property name="gpg.exe" value="gpg" />
+  <property name="gpg.key" value="CODE SIGNING KEY" />
+
   <condition property="build-1-5-contrib">
      <equals arg1="1.5" arg2="${ant.java.version}" />
   </condition>
@@ -633,8 +636,10 @@
           doctitle="@{title}"
           maxmemory="${javadoc.maxmemory}"
           bottom="Copyright &amp;copy; ${year} Apache Software Foundation.  All Rights Reserved.">
-        <tag name="todo" description="To Do:"/>
-        <tag name="uml.property" description="UML Property:"/>
+        <tag name="lucene.experimental" 
+      	description="WARNING: This API is experimental and might change in incompatible ways in the next release."/>
+        <tag name="lucene.internal"
+        description="NOTE: This API is for Lucene internal purposes only and might change in incompatible ways in the next release."/>
       	<link offline="true" packagelistLoc="${javadoc.dir}"/>
       	
       	<sources />
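
The two tag declarations above register custom Javadoc block tags named lucene.experimental and lucene.internal. A small Java sketch of how source code might carry them, assuming the usual block-tag syntax of an at-sign followed by the tag name (the class and method here are hypothetical and exist only to show the markup):

```java
/**
 * Hypothetical class used only to illustrate the custom Javadoc tags.
 *
 * @lucene.experimental
 */
final class ExperimentalFeatureSketch {

    /**
     * Adds one to the argument.
     *
     * @lucene.internal
     */
    static int internalHelper(int x) {
        return x + 1;
    }

    public static void main(String[] args) {
        System.out.println(internalHelper(41));
    }
}
```

With the declarations in place, the javadoc run should render each tag's configured description in the generated page for the tagged element.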

Propchange: lucene/java/branches/flex_1458/contrib/
------------------------------------------------------------------------------
--- svn:mergeinfo (original)
+++ svn:mergeinfo Fri Feb 26 13:09:54 2010
@@ -1,5 +1,5 @@
 /lucene/java/branches/lucene_2_4/contrib:748824
-/lucene/java/branches/lucene_2_9/contrib:817269-818600,825998,829134,829816,829881,831036,896850
+/lucene/java/branches/lucene_2_9/contrib:817269-818600,825998,829134,829816,829881,831036,896850,909334
 /lucene/java/branches/lucene_2_9_back_compat_tests/contrib:818601-821336
 /lucene/java/branches/lucene_3_0/contrib:880793,896906
-/lucene/java/trunk/contrib:824912-825292,827043-833960,880727-886190,889185,889622,889667,889866-899001
+/lucene/java/trunk/contrib:824912-825292,827043-833960,880727-886190,889185,889614-916543

Modified: lucene/java/branches/flex_1458/contrib/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/CHANGES.txt?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/CHANGES.txt (original)
+++ lucene/java/branches/flex_1458/contrib/CHANGES.txt Fri Feb 26 13:09:54 2010
@@ -15,6 +15,10 @@
    preserved, but some protected/public member variables changed type. This 
    does NOT affect java code/class files produced by the snowball compiler, 
    but technically is a backwards compatibility break.  (Robert Muir)
+   
+ * LUCENE-2226: Moved contrib/snowball functionality into contrib/analyzers.
+   Be sure to remove any old obsolete lucene-snowball jar files from your
+   classpath!  (Robert Muir)
     
 Changes in runtime behavior
 
@@ -23,10 +27,11 @@
    used with Version > 3.0 and the TurkishStemmer.
    (Robert Muir via Simon Willnauer)  
 
-Bug fixes
+ * LUCENE-2055: GermanAnalyzer now uses the Snowball German2 algorithm and 
+   stopwords list by default for Version > 3.0.
+   (Robert Muir, Uwe Schindler, Simon Willnauer)
 
- * LUCENE-2199: ShingleFilter skipped over tri-gram shingles if outputUnigram
-   was set to false. (Simon Willnauer)
+Bug fixes
 
  * LUCENE-2068: Fixed ReverseStringFilter which was not aware of supplementary
    characters. During reverse the filter created unpaired surrogates, which
@@ -34,27 +39,36 @@
    now reverses supplementary characters correctly if used with Version > 3.0.
    (Simon Willnauer, Robert Muir)
 
- * LUCENE-2144: Fix InstantiatedIndex to handle termDocs(null)
-   correctly (enumerate all non-deleted docs).  (Karl Wettin via Mike
-   McCandless)
-   
  * LUCENE-2035: TokenSources.getTokenStream() does not assign  positionIncrement. 
    (Christopher Morris via Mark Miller)
+  
+ * LUCENE-2055: Deprecated RussianTokenizer, RussianStemmer, RussianStemFilter,
+   FrenchStemmer, FrenchStemFilter, DutchStemmer, and DutchStemFilter. For
+   these Analyzers, SnowballFilter is used instead (for Version > 3.0), as
+   the previous code did not always implement the Snowball algorithm correctly.
+   Additionally, for Version > 3.0, the Snowball stopword lists are used by
+   default.  (Robert Muir, Uwe Schindler, Simon Willnauer)
+
+ * LUCENE-2278: FastVectorHighlighter: Highlighted term is out of alignment
+   in multi-valued NOT_ANALYZED field. (Koji Sekiguchi)
+ 
+ * LUCENE-2284: MatchAllDocsQueryNode toString() created an invalid XML tag.
+   (Frank Wesemann via Robert Muir)
    
 API Changes
 
- * LUCENE-2108: Add SpellChecker.close, to close the underlying
-   reader.  (Eirik Bjørsnøs via Mike McCandless)
- 
  * LUCENE-2147: Spatial GeoHashUtils now always decode GeoHash strings
    with full precision. GeoHash#decode_exactly(String) was merged into
    GeoHash#decode(String). (Chris Male, Simon Willnauer)
    
- * LUCENE-2165: Add a constructor to SnowballAnalyzer that takes a Set of 
-   stopwords, and deprecate the String[] one.  (Nick Burch via Robert Muir)
-
  * LUCENE-2204: Change some package private classes/members to publicly accessible to implement
    custom FragmentsBuilders. (Koji Sekiguchi)
+
+ * LUCENE-2055: Integrate snowball into contrib/analyzers. SnowballAnalyzer is
+   now deprecated in favor of language-specific analyzers which contain things
+   such as stopword lists and any language-specific processing in addition to
+   stemming. Add Turkish and Romanian stopwords lists to support this.
+   (Robert Muir, Uwe Schindler, Simon Willnauer)
    
 New features
 
@@ -67,27 +81,35 @@
    customizable field naming scheme.
    (Simon Willnauer)
 
- * LUCENE-2108: Spellchecker now safely supports concurrent modifications to
-   the spell-index. Threads can safely obtain term suggestions while the spell-
-   index is rebuild, cleared or reset. Internal IndexSearcher instances remain
-   open until the last thread accessing them releases the reference.
-   (Simon Willnauer)
-
  * LUCENE-2067: Add a Czech light stemmer. CzechAnalyzer will now stem words
    when Version is set to 3.1 or higher.  (Robert Muir)
    
  * LUCENE-2062: Add a Bulgarian analyzer.  (Robert Muir, Simon Willnauer)
 
-Build
+ * LUCENE-2206: Add Snowball's stopword lists for Danish, Dutch, English,
+   Finnish, French, German, Hungarian, Italian, Norwegian, Russian, Spanish, 
+   and Swedish. These can be loaded with WordListLoader.getSnowballWordSet.
+   (Robert Muir, Simon Willnauer)
+
+ * LUCENE-2243: Add DisjunctionMaxQuery support for FastVectorHighlighter.
+   (Koji Sekiguchi)
+
+ * LUCENE-2218: ShingleFilter supports minimum shingle size, and the separator
+   character is now configurable. It's also up to 20% faster.
+   (Steven Rowe via Robert Muir)
+
+ * LUCENE-2234: Add a Hindi analyzer.  (Robert Muir)
+
+ * LUCENE-2055: Add analyzers/misc/StemmerOverrideFilter. This filter provides
+   the ability to override any stemmer with a custom dictionary map.
+   (Robert Muir, Uwe Schindler, Simon Willnauer)
 
- * LUCENE-2117: SnowballAnalyzer now holds a runtime-dependency on
-   contrib-analyzers to correctly handle the unique Turkish casing behavior.
-   (Robert Muir via Simon Willnauer)  
+Build
 
  * LUCENE-2124: Moved the JDK-based collation support from contrib/collation 
    into core, and moved the ICU-based collation support into contrib/icu.  
    (Steven Rowe, Robert Muir)
-
+   
 Optimizations
 
  * LUCENE-2157: DelimitedPayloadTokenFilter no longer copies the buffer
@@ -104,6 +126,50 @@
  * LUCENE-2115: Cutover contrib tests to use Java5 generics.  (Kay Kay
    via Mike McCandless)
 
+Other
+
+ * LUCENE-1845: Updated bdb-je jar from version 3.3.69 to 3.3.93.
+   (Simon Willnauer via Mike McCandless)
+
+================== Release 2.9.2 / 3.0.1 2010-02-26 ====================
+
+New features
+
+ * LUCENE-2108: Spellchecker now safely supports concurrent modifications to
+   the spell-index. Threads can safely obtain term suggestions while the
+   spell-index is rebuilt, cleared or reset. Internal IndexSearcher instances
+   remain open until the last thread accessing them releases the reference.
+   (Simon Willnauer)
+
+Bug Fixes
+
+ * LUCENE-2144: Fix InstantiatedIndex to handle termDocs(null)
+   correctly (enumerate all non-deleted docs).  (Karl Wettin via Mike
+   McCandless)
+
+ * LUCENE-2199: ShingleFilter skipped over tri-gram shingles if outputUnigram
+   was set to false. (Simon Willnauer)
+  
+ * LUCENE-2211: Fix missing clearAttributes() calls:
+   ShingleMatrix, PrefixAware, compounds, NGramTokenFilter,
+   EdgeNGramTokenFilter, Highlighter, and MemoryIndex.
+   (Uwe Schindler, Robert Muir)
+
+ * LUCENE-2207, LUCENE-2219: Fix incorrect offset calculations in end() for 
+   CJKTokenizer, ChineseTokenizer, SmartChinese SentenceTokenizer, 
+   and WikipediaTokenizer.  (Koji Sekiguchi, Robert Muir)
+   
+ * LUCENE-2266: Fixed offset calculations in NGramTokenFilter and 
+   EdgeNGramTokenFilter.  (Joe Calderon, Robert Muir via Uwe Schindler)
+   
+API Changes
+
+ * LUCENE-2108: Add SpellChecker.close, to close the underlying
+   reader.  (Eirik Bjørsnøs via Mike McCandless)
+
+ * LUCENE-2165: Add a constructor to SnowballAnalyzer that takes a Set of 
+   stopwords, and deprecate the String[] one.  (Nick Burch via Robert Muir)
+   
 ======================= Release 3.0.0 2009-11-25 =======================
 
 Changes in backwards compatibility policy
@@ -124,6 +190,10 @@
    text exactly the same as LowerCaseFilter. Please use LowerCaseFilter
    instead, which has the same functionality.  (Robert Muir)
    
+ * LUCENE-2051: Contrib Analyzer setters were deprecated and replaced
+   with ctor arguments / Version number.  Also stop word lists
+   were unified.  (Simon Willnauer)
+
 Bug fixes
 
  * LUCENE-1781: Fixed various issues with the lat/lng bounding box
@@ -163,6 +233,7 @@
    Previous versions were loading the stopword files each time a new
    instance was created. This might improve performance for applications
    creating lots of instances of these Analyzers. (Simon Willnauer) 
+
 Documentation
 
  * LUCENE-1916: Translated documentation in the smartcn hhmm package.
@@ -176,7 +247,6 @@
  * LUCENE-2031: Moved PatternAnalyzer from contrib/memory into
    contrib/analyzers/common, under miscellaneous.  (Robert Muir)
    
-Test Cases
 ======================= Release 2.9.1 2009-11-06 =======================
 
 Changes in backwards compatibility policy

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/build.xml
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/build.xml?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/build.xml (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/build.xml Fri Feb 26 13:09:54 2010
@@ -35,4 +35,20 @@
     <path refid="junit-path"/>
     <pathelement location="${build.dir}/classes/java"/>
   </path>	
+
+  <target name="compile-test" depends="download-snowball-vocab-tests, common.compile-test" />
+  <property name="snowball.vocab.rev" value="500"/>
+  <property name="snowball.vocab.url" 
+            value="svn://svn.tartarus.org/snowball/trunk/data"/>
+  <property name="snowball.vocab.dir" value="src/test/org/apache/lucene/analysis/snowball"/>
+		
+  <target name="download-snowball-vocab-tests" depends="compile-core"
+	      description="Downloads Snowball vocabulary tests">
+	<sequential>
+	  <mkdir dir="${snowball.vocab.dir}"/>
+	    <exec dir="${snowball.vocab.dir}" executable="${svn.exe}" failifexecutionfails="false" failonerror="true">
+	      <arg line="checkout --trust-server-cert --non-interactive -r ${snowball.vocab.rev} ${snowball.vocab.url}"/>
+	    </exec>
+	</sequential>
+  </target>
 </project>

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/ar/ArabicAnalyzer.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/ar/ArabicAnalyzer.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/ar/ArabicAnalyzer.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/ar/ArabicAnalyzer.java Fri Feb 26 13:09:54 2010
@@ -26,6 +26,8 @@
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.LowerCaseFilter;
 import org.apache.lucene.analysis.ReusableAnalyzerBase.TokenStreamComponents; // javadoc @link
+import org.apache.lucene.analysis.CharArraySet;
+import org.apache.lucene.analysis.KeywordMarkerTokenFilter;
 import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.StopwordAnalyzerBase;
 import org.apache.lucene.analysis.TokenStream;
@@ -93,6 +95,8 @@
       }
     }
   }
+  
+  private final Set<?> stemExclusionSet;
 
   /**
    * Builds an analyzer with the default stop words: {@link #DEFAULT_STOPWORD_FILE}.
@@ -110,7 +114,25 @@
    *          a stopword set
    */
   public ArabicAnalyzer(Version matchVersion, Set<?> stopwords){
+    this(matchVersion, stopwords, CharArraySet.EMPTY_SET);
+  }
+
+  /**
+   * Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
+   * provided this analyzer will add a {@link KeywordMarkerTokenFilter} before
+   * {@link ArabicStemFilter}.
+   * 
+   * @param matchVersion
+   *          lucene compatibility version
+   * @param stopwords
+   *          a stopword set
+   * @param stemExclusionSet
+   *          a set of terms not to be stemmed
+   */
+  public ArabicAnalyzer(Version matchVersion, Set<?> stopwords, Set<?> stemExclusionSet){
     super(matchVersion, stopwords);
+    this.stemExclusionSet = CharArraySet.unmodifiableSet(CharArraySet.copy(
+        matchVersion, stemExclusionSet));
   }
 
   /**
@@ -145,17 +167,22 @@
    * Creates {@link TokenStreamComponents} used to tokenize all the text in the provided {@link Reader}.
    *
    * @return {@link TokenStreamComponents} built from an {@link ArabicLetterTokenizer} filtered with
-   * 			{@link LowerCaseFilter}, {@link StopFilter}, {@link ArabicNormalizationFilter}
+   * 			{@link LowerCaseFilter}, {@link StopFilter}, {@link ArabicNormalizationFilter},
+   *      {@link KeywordMarkerTokenFilter} if a stem exclusion set is provided
    *            and {@link ArabicStemFilter}.
    */
   @Override
   protected TokenStreamComponents createComponents(String fieldName,
       Reader reader) {
-    final Tokenizer source = new ArabicLetterTokenizer(reader);
+    final Tokenizer source = new ArabicLetterTokenizer(matchVersion, reader);
     TokenStream result = new LowerCaseFilter(matchVersion, source);
     // the order here is important: the stopword list is not normalized!
     result = new StopFilter( matchVersion, result, stopwords);
+    // TODO maybe we should make ArabicNormalization filter also KeywordAttribute aware?!
     result = new ArabicNormalizationFilter(result);
+    if(!stemExclusionSet.isEmpty()) {
+      result = new KeywordMarkerTokenFilter(result, stemExclusionSet);
+    }
     return new TokenStreamComponents(source, new ArabicStemFilter(result));
   }
 }
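
The new three-argument constructor above stores a private, read-only copy of the
exclusion set and only adds the marker filter when that set is non-empty. A
minimal sketch of that pattern in plain Java follows; it deliberately avoids
Lucene types, and the class and method names (ExclusionAwareAnalyzerSketch,
needsKeywordMarker, isExcluded) are hypothetical, not part of this commit.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for an analyzer; not a Lucene class.
final class ExclusionAwareAnalyzerSketch {
  private final Set<String> stemExclusionSet;

  // The shorter constructor delegates with an empty exclusion set.
  ExclusionAwareAnalyzerSketch() {
    this(Collections.<String>emptySet());
  }

  ExclusionAwareAnalyzerSketch(Set<String> stemExclusionSet) {
    // Copy first, then wrap: later changes to the caller's set are not seen here.
    this.stemExclusionSet =
        Collections.unmodifiableSet(new HashSet<String>(stemExclusionSet));
  }

  // Mirrors the isEmpty() check that decides whether a marker stage is added.
  boolean needsKeywordMarker() {
    return !stemExclusionSet.isEmpty();
  }

  boolean isExcluded(String term) {
    return stemExclusionSet.contains(term);
  }

  public static void main(String[] args) {
    Set<String> callerSet = new HashSet<String>();
    callerSet.add("lucene");
    ExclusionAwareAnalyzerSketch a = new ExclusionAwareAnalyzerSketch(callerSet);
    callerSet.clear(); // does not affect the analyzer's private copy
    System.out.println(a.needsKeywordMarker());
    System.out.println(new ExclusionAwareAnalyzerSketch().needsKeywordMarker());
  }
}
```

Copying before wrapping is what makes the stored set immune to later
modification by the caller; wrapping alone would only block writes made
through the analyzer's own reference.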

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/ar/ArabicLetterTokenizer.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/ar/ArabicLetterTokenizer.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/ar/ArabicLetterTokenizer.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/ar/ArabicLetterTokenizer.java Fri Feb 26 13:09:54 2010
@@ -18,8 +18,10 @@
 
 import java.io.Reader;
 
+import org.apache.lucene.analysis.CharTokenizer;
 import org.apache.lucene.analysis.LetterTokenizer;
 import org.apache.lucene.util.AttributeSource;
+import org.apache.lucene.util.Version;
 
 /**
  * Tokenizer that breaks text into runs of letters and diacritics.
@@ -27,28 +29,101 @@
  * The problem with the standard Letter tokenizer is that it fails on diacritics.
  * Handling similar to this is necessary for Indic Scripts, Hebrew, Thaana, etc.
  * </p>
- *
+ * <p>
+ * <a name="version"/>
+ * You must specify the required {@link Version} compatibility when creating
+ * {@link ArabicLetterTokenizer}:
+ * <ul>
+ * <li>As of 3.1, {@link CharTokenizer} uses an int based API to normalize and
+ * detect token characters. See {@link #isTokenChar(int)} and
+ * {@link #normalize(int)} for details.</li>
+ * </ul>
  */
 public class ArabicLetterTokenizer extends LetterTokenizer {
 
+  
+  /**
+   * Construct a new ArabicLetterTokenizer.
+   * @param matchVersion
+   *          Lucene version to match. See {@link <a href="#version">above</a>}
+   * 
+   * @param in
+   *          the input to split up into tokens
+   */
+  public ArabicLetterTokenizer(Version matchVersion, Reader in) {
+    super(matchVersion, in);
+  }
+
+  /**
+   * Construct a new ArabicLetterTokenizer using a given {@link AttributeSource}.
+   * 
+   * @param matchVersion
+   *          Lucene version to match. See {@link <a href="#version">above</a>}
+   * @param source
+   *          the attribute source to use for this Tokenizer
+   * @param in
+   *          the input to split up into tokens
+   */
+  public ArabicLetterTokenizer(Version matchVersion, AttributeSource source, Reader in) {
+    super(matchVersion, source, in);
+  }
+
+  /**
+   * Construct a new ArabicLetterTokenizer using a given
+   * {@link org.apache.lucene.util.AttributeSource.AttributeFactory}.
+   * 
+   * @param matchVersion
+   *          Lucene version to match. See {@link <a href="#version">above</a>}
+   * @param factory
+   *          the attribute factory to use for this Tokenizer
+   * @param in
+   *          the input to split up into tokens
+   */
+  public ArabicLetterTokenizer(Version matchVersion, AttributeFactory factory, Reader in) {
+    super(matchVersion, factory, in);
+  }
+  
+  /**
+   * Construct a new ArabicLetterTokenizer.
+   * 
+   * @deprecated use {@link #ArabicLetterTokenizer(Version, Reader)} instead. This will
+   *             be removed in Lucene 4.0.
+   */
+  @Deprecated
   public ArabicLetterTokenizer(Reader in) {
     super(in);
   }
 
+  /**
+   * Construct a new ArabicLetterTokenizer using a given {@link AttributeSource}.
+   * 
+   * @deprecated use {@link #ArabicLetterTokenizer(Version, AttributeSource, Reader)}
+   *             instead. This will be removed in Lucene 4.0.
+   */
+  @Deprecated
   public ArabicLetterTokenizer(AttributeSource source, Reader in) {
     super(source, in);
   }
 
+  /**
+   * Construct a new ArabicLetterTokenizer using a given
+   * {@link org.apache.lucene.util.AttributeSource.AttributeFactory}.
+   * 
+   * @deprecated use {@link #ArabicLetterTokenizer(Version, AttributeSource.AttributeFactory, Reader)}
+   *             instead. This will be removed in Lucene 4.0.
+   */
+  @Deprecated
   public ArabicLetterTokenizer(AttributeFactory factory, Reader in) {
     super(factory, in);
   }
   
+  
   /** 
    * Allows for Letter category or NonspacingMark category
-   * @see org.apache.lucene.analysis.LetterTokenizer#isTokenChar(char)
+   * @see org.apache.lucene.analysis.LetterTokenizer#isTokenChar(int)
    */
   @Override
-  protected boolean isTokenChar(char c) {
+  protected boolean isTokenChar(int c) {
     return super.isTokenChar(c) || Character.getType(c) == Character.NON_SPACING_MARK;
   }
 

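The isTokenChar(int) override above accepts letters and nonspacing marks, now
tested per code point rather than per char. A minimal sketch of the same
predicate driving a simple splitter, using only java.lang.Character, might look
like the following. The class LetterAndMarkSplitterSketch and its split method
are hypothetical and are not how CharTokenizer is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; illustrates code-point based token detection only.
final class LetterAndMarkSplitterSketch {
  // Same predicate shape as isTokenChar(int) in the diff:
  // a letter, or a character in the NonspacingMark category.
  static boolean isTokenChar(int codePoint) {
    return Character.isLetter(codePoint)
        || Character.getType(codePoint) == Character.NON_SPACING_MARK;
  }

  // Splits text into maximal runs of token characters, walking by code point.
  static List<String> split(String text) {
    List<String> tokens = new ArrayList<String>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      if (isTokenChar(cp)) {
        current.appendCodePoint(cp);
      } else if (current.length() > 0) {
        tokens.add(current.toString());
        current.setLength(0);
      }
      i += Character.charCount(cp); // advance by 1 or 2 chars
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // MEEM, DAL, SHADDA, space, "abc": the shadda is a nonspacing mark,
    // so it stays inside the first token instead of splitting it.
    System.out.println(split("\u0645\u062F\u0651 abc").size());
  }
}
```

Walking with codePointAt and charCount is what lets the predicate see
supplementary characters as single units, which a char-based loop cannot do.
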
Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/ar/ArabicStemFilter.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/ar/ArabicStemFilter.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/ar/ArabicStemFilter.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/ar/ArabicStemFilter.java Fri Feb 26 13:09:54 2010
@@ -19,31 +19,41 @@
 
 import java.io.IOException;
 
+import org.apache.lucene.analysis.KeywordMarkerTokenFilter;
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;
 
 /**
  * A {@link TokenFilter} that applies {@link ArabicStemmer} to stem Arabic words.
- * 
- */
+ * <p>
+ * To prevent terms from being stemmed use an instance of
+ * {@link KeywordMarkerTokenFilter} or a custom {@link TokenFilter} that sets
+ * the {@link KeywordAttribute} before this {@link TokenStream}.
+ * </p>
+ * @see KeywordMarkerTokenFilter */
 
 public final class ArabicStemFilter extends TokenFilter {
 
   private final ArabicStemmer stemmer;
   private final TermAttribute termAtt;
+  private final KeywordAttribute keywordAttr;
   
   public ArabicStemFilter(TokenStream input) {
     super(input);
     stemmer = new ArabicStemmer();
     termAtt = addAttribute(TermAttribute.class);
+    keywordAttr = addAttribute(KeywordAttribute.class);
   }
 
   @Override
   public boolean incrementToken() throws IOException {
     if (input.incrementToken()) {
-      int newlen = stemmer.stem(termAtt.termBuffer(), termAtt.termLength());
-      termAtt.setTermLength(newlen);
+      if(!keywordAttr.isKeyword()) {
+        final int newlen = stemmer.stem(termAtt.termBuffer(), termAtt.termLength());
+        termAtt.setTermLength(newlen);
+      }
       return true;
     } else {
       return false;
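
The stem filters in this commit skip any token whose KeywordAttribute is set,
and an upstream marker filter sets that flag for terms in the exclusion set.
A minimal two-stage sketch of this idea in plain Java is shown below. The Token
class, the toy stemmer that strips a trailing "s", and all method names are
hypothetical stand-ins, not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a marker stage followed by a keyword-aware stem stage.
final class KeywordAwareStemSketch {
  // A term plus the keyword flag that the marker stage can set.
  static final class Token {
    String term;
    boolean keyword;
    Token(String term) { this.term = term; }
  }

  // Marker stage: flags excluded terms, never rewrites them.
  static void markKeywords(List<Token> tokens, Set<String> exclusions) {
    for (Token t : tokens) {
      if (exclusions.contains(t.term)) {
        t.keyword = true;
      }
    }
  }

  // Toy stemmer (strips one trailing "s"); stands in for a real stemmer.
  static String toyStem(String term) {
    return term.length() > 1 && term.endsWith("s")
        ? term.substring(0, term.length() - 1)
        : term;
  }

  // Stem stage: same guard as in the diff, tokens flagged as keywords pass through.
  static void stem(List<Token> tokens) {
    for (Token t : tokens) {
      if (!t.keyword) {
        t.term = toyStem(t.term);
      }
    }
  }

  // Runs both stages in order and returns the resulting terms.
  static List<String> analyze(List<String> terms, Set<String> exclusions) {
    List<Token> tokens = new ArrayList<Token>();
    for (String term : terms) {
      tokens.add(new Token(term));
    }
    markKeywords(tokens, exclusions);
    stem(tokens);
    List<String> out = new ArrayList<String>();
    for (Token t : tokens) {
      out.add(t.term);
    }
    return out;
  }

  public static void main(String[] args) {
    Set<String> exclusions = new HashSet<String>(Arrays.asList("lucas"));
    System.out.println(analyze(Arrays.asList("cats", "lucas"), exclusions));
  }
}
```

Keeping the exclusion decision in a separate stage means the stemmer itself
needs no exclusion table, which is the direction the deprecations in this
commit point to.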

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/bg/BulgarianAnalyzer.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/bg/BulgarianAnalyzer.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/bg/BulgarianAnalyzer.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/bg/BulgarianAnalyzer.java Fri Feb 26 13:09:54 2010
@@ -25,6 +25,8 @@
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.LowerCaseFilter;
 import org.apache.lucene.analysis.ReusableAnalyzerBase.TokenStreamComponents; // javadoc @link
+import org.apache.lucene.analysis.CharArraySet;
+import org.apache.lucene.analysis.KeywordMarkerTokenFilter;
 import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.StopwordAnalyzerBase;
 import org.apache.lucene.analysis.TokenStream;
@@ -88,6 +90,8 @@
       }
     }
   }
+  
+  private final Set<?> stemExclusionSet;
    
   /**
    * Builds an analyzer with the default stop words:
@@ -101,16 +105,27 @@
    * Builds an analyzer with the given stop words.
    */
   public BulgarianAnalyzer(Version matchVersion, Set<?> stopwords) {
-    super(matchVersion, stopwords);
+    this(matchVersion, stopwords, CharArraySet.EMPTY_SET);
   }
   
   /**
+   * Builds an analyzer with the given stop words and a stem exclusion set.
+   * If a stem exclusion set is provided this analyzer will add a {@link KeywordMarkerTokenFilter} 
+   * before {@link BulgarianStemFilter}.
+   */
+  public BulgarianAnalyzer(Version matchVersion, Set<?> stopwords, Set<?> stemExclusionSet) {
+    super(matchVersion, stopwords);
+    this.stemExclusionSet = CharArraySet.unmodifiableSet(CharArraySet.copy(
+        matchVersion, stemExclusionSet));  }
+  
+  /**
    * Creates a {@link TokenStreamComponents} which tokenizes all the text in the provided
    * {@link Reader}.
    * 
    * @return A {@link TokenStreamComponents} built from an {@link StandardTokenizer}
    *         filtered with {@link StandardFilter}, {@link LowerCaseFilter},
-   *         {@link StopFilter}, and {@link BulgarianStemFilter}.
+   *         {@link StopFilter}, {@link KeywordMarkerTokenFilter} if a stem
+   *         exclusion set is provided and {@link BulgarianStemFilter}.
    */
   @Override
   public TokenStreamComponents createComponents(String fieldName, Reader reader) {
@@ -118,6 +133,8 @@
     TokenStream result = new StandardFilter(source);
     result = new LowerCaseFilter(matchVersion, result);
     result = new StopFilter(matchVersion, result, stopwords);
+    if(!stemExclusionSet.isEmpty())
+      result = new KeywordMarkerTokenFilter(result, stemExclusionSet);
     result = new BulgarianStemFilter(result);
     return new TokenStreamComponents(source, result);
   }

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/bg/BulgarianStemFilter.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/bg/BulgarianStemFilter.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/bg/BulgarianStemFilter.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/bg/BulgarianStemFilter.java Fri Feb 26 13:09:54 2010
@@ -19,29 +19,40 @@
 
 import java.io.IOException;
 
+import org.apache.lucene.analysis.KeywordMarkerTokenFilter; // for javadoc
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;
 
 /**
  * A {@link TokenFilter} that applies {@link BulgarianStemmer} to stem Bulgarian
  * words.
+ * <p>
+ * To prevent terms from being stemmed use an instance of
+ * {@link KeywordMarkerTokenFilter} or a custom {@link TokenFilter} that sets
+ * the {@link KeywordAttribute} before this {@link TokenStream}.
+ * </p>
  */
 public final class BulgarianStemFilter extends TokenFilter {
   private final BulgarianStemmer stemmer;
   private final TermAttribute termAtt;
+  private final KeywordAttribute keywordAttr;
   
   public BulgarianStemFilter(final TokenStream input) {
     super(input);
     stemmer = new BulgarianStemmer();
     termAtt = addAttribute(TermAttribute.class);
+    keywordAttr = addAttribute(KeywordAttribute.class);
   }
   
   @Override
   public boolean incrementToken() throws IOException {
     if (input.incrementToken()) {
-      final int newlen = stemmer.stem(termAtt.termBuffer(), termAtt.termLength());
-      termAtt.setTermLength(newlen);
+      if(!keywordAttr.isKeyword()) {
+        final int newlen = stemmer.stem(termAtt.termBuffer(), termAtt.termLength());
+        termAtt.setTermLength(newlen);
+      }
       return true;
     } else {
       return false;

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/br/BrazilianAnalyzer.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/br/BrazilianAnalyzer.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/br/BrazilianAnalyzer.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/br/BrazilianAnalyzer.java Fri Feb 26 13:09:54 2010
@@ -30,6 +30,7 @@
 import org.apache.lucene.analysis.CharArraySet;
 import org.apache.lucene.analysis.LowerCaseFilter;
 import org.apache.lucene.analysis.ReusableAnalyzerBase.TokenStreamComponents; // javadoc @link
+import org.apache.lucene.analysis.KeywordMarkerTokenFilter;
 import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.StopwordAnalyzerBase;
 import org.apache.lucene.analysis.TokenStream;
@@ -204,8 +205,9 @@
     TokenStream result = new LowerCaseFilter(matchVersion, source);
     result = new StandardFilter(result);
     result = new StopFilter(matchVersion, result, stopwords);
-    return new TokenStreamComponents(source, new BrazilianStemFilter(result,
-        excltable));
+    if(excltable != null && !excltable.isEmpty())
+      result = new KeywordMarkerTokenFilter(result, excltable);
+    return new TokenStreamComponents(source, new BrazilianStemFilter(result));
   }
 }
 

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/br/BrazilianStemFilter.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/br/BrazilianStemFilter.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/br/BrazilianStemFilter.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/br/BrazilianStemFilter.java Fri Feb 26 13:09:54 2010
@@ -20,13 +20,21 @@
 import java.io.IOException;
 import java.util.Set;
 
+import org.apache.lucene.analysis.KeywordMarkerTokenFilter; // for javadoc
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;
 
 /**
  * A {@link TokenFilter} that applies {@link BrazilianStemmer}.
- *
+ * <p>
+ * To prevent terms from being stemmed use an instance of
+ * {@link KeywordMarkerTokenFilter} or a custom {@link TokenFilter} that sets
+ * the {@link KeywordAttribute} before this {@link TokenStream}.
+ * </p>
+ * @see KeywordMarkerTokenFilter
+ * 
  */
 public final class BrazilianStemFilter extends TokenFilter {
 
@@ -34,16 +42,31 @@
    * {@link BrazilianStemmer} in use by this filter.
    */
   private BrazilianStemmer stemmer = null;
-  private Set exclusions = null;
-  private TermAttribute termAtt;
-  
+  private Set<?> exclusions = null;
+  private final TermAttribute termAtt;
+  private final KeywordAttribute keywordAttr;
+
+  /**
+   * Creates a new BrazilianStemFilter 
+   * 
+   * @param in the source {@link TokenStream} 
+   */
   public BrazilianStemFilter(TokenStream in) {
     super(in);
     stemmer = new BrazilianStemmer();
     termAtt = addAttribute(TermAttribute.class);
+    keywordAttr = addAttribute(KeywordAttribute.class);
   }
-
-  public BrazilianStemFilter(TokenStream in, Set exclusiontable) {
+  
+  /**
+   * Creates a new BrazilianStemFilter 
+   * 
+   * @param in the source {@link TokenStream} 
+   * @param exclusiontable a set of terms that should be prevented from being stemmed.
+   * @deprecated use {@link KeywordAttribute} with {@link KeywordMarkerTokenFilter} instead.
+   */
+  @Deprecated
+  public BrazilianStemFilter(TokenStream in, Set<?> exclusiontable) {
     this(in);
     this.exclusions = exclusiontable;
   }
@@ -51,10 +74,10 @@
   @Override
   public boolean incrementToken() throws IOException {
     if (input.incrementToken()) {
-      String term = termAtt.term();
+      final String term = termAtt.term();
       // Check the exclusion table.
-      if (exclusions == null || !exclusions.contains(term)) {
-        String s = stemmer.stem(term);
+      if (!keywordAttr.isKeyword() && (exclusions == null || !exclusions.contains(term))) {
+        final String s = stemmer.stem(term);
         // If not stemmed, don't waste the time adjusting the token.
         if ((s != null) && !s.equals(term))
           termAtt.setTermBuffer(s);

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cjk/CJKTokenizer.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cjk/CJKTokenizer.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cjk/CJKTokenizer.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cjk/CJKTokenizer.java Fri Feb 26 13:09:54 2010
@@ -175,9 +175,13 @@
                         length = 0;
                         preIsTokened = false;
                     }
+                    else{
+                      offset--;
+                    }
 
                     break;
                 } else {
+                    offset--;
                     return false;
                 }
             } else {
@@ -288,6 +292,7 @@
           typeAtt.setType(TOKEN_TYPE_NAMES[tokenType]);
           return true;
         } else if (dataLen == -1) {
+          offset--;
           return false;
         }
 
@@ -299,7 +304,7 @@
     @Override
     public final void end() {
       // set final offset
-      final int finalOffset = offset;
+      final int finalOffset = correctOffset(offset);
       this.offsetAtt.setOffset(finalOffset, finalOffset);
     }
     

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ChineseAnalyzer.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ChineseAnalyzer.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ChineseAnalyzer.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ChineseAnalyzer.java Fri Feb 26 13:09:54 2010
@@ -21,15 +21,17 @@
 
 import org.apache.lucene.analysis.ReusableAnalyzerBase;
 import org.apache.lucene.analysis.ReusableAnalyzerBase.TokenStreamComponents; // javadoc @link
+import org.apache.lucene.analysis.standard.StandardAnalyzer; // javadoc @link
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.Tokenizer;
 
 /**
  * An {@link Analyzer} that tokenizes text with {@link ChineseTokenizer} and
  * filters with {@link ChineseFilter}
- *
+ * @deprecated Use {@link StandardAnalyzer} instead, which has the same functionality.
+ * This analyzer will be removed in Lucene 4.0
  */
-
+@Deprecated
 public final class ChineseAnalyzer extends ReusableAnalyzerBase {
 
   /**

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ChineseFilter.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ChineseFilter.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ChineseFilter.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ChineseFilter.java Fri Feb 26 13:09:54 2010
@@ -23,6 +23,7 @@
 import org.apache.lucene.analysis.CharArraySet;
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;
 import org.apache.lucene.util.Version;
 
@@ -41,9 +42,10 @@
  * </ol>
  * 
  * @version 1.0
- *
+ * @deprecated Use {@link StopFilter} instead, which has the same functionality.
+ * This filter will be removed in Lucene 4.0
  */
-
+@Deprecated
 public final class ChineseFilter extends TokenFilter {
 
 

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java Fri Feb 26 13:09:54 2010
@@ -21,6 +21,7 @@
 import java.io.IOException;
 import java.io.Reader;
 
+import org.apache.lucene.analysis.standard.StandardTokenizer;
 import org.apache.lucene.analysis.Tokenizer;
 import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;
@@ -52,9 +53,10 @@
  * CJKTokenizer will not work.
  * </p>
  * @version 1.0
- *
+ * @deprecated Use {@link StandardTokenizer} instead, which has the same functionality.
+ * This tokenizer will be removed in Lucene 4.0
  */
-
+@Deprecated
 public final class ChineseTokenizer extends Tokenizer {
 
 
@@ -129,8 +131,10 @@
                 bufferIndex = 0;
             }
 
-            if (dataLen == -1) return flush();
-            else
+            if (dataLen == -1) {
+              offset--;
+              return flush();
+            } else
                 c = ioBuffer[bufferIndex++];
 
 
@@ -162,7 +166,7 @@
     @Override
     public final void end() {
       // set final offset
-      final int finalOffset = offset;
+      final int finalOffset = correctOffset(offset);
       this.offsetAtt.setOffset(finalOffset, finalOffset);
     }
 

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/package.html
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/package.html?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/package.html (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/package.html Fri Feb 26 13:09:54 2010
@@ -24,14 +24,14 @@
 <p>
 Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.
 <ul>
-	<li>ChineseAnalyzer (in this package): Index unigrams (individual Chinese characters) as a token.
+	<li>StandardAnalyzer: Index unigrams (individual Chinese characters) as a token.
 	<li>CJKAnalyzer (in the analyzers/cjk package): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
 	<li>SmartChineseAnalyzer (in the analyzers/smartcn package): Index words (attempt to segment Chinese text into words) as tokens.
 </ul>
 
 Example phrase: "我是中国人"
 <ol>
-	<li>ChineseAnalyzer: 我-是-中-国-人</li>
+	<li>StandardAnalyzer: 我-是-中-国-人</li>
 	<li>CJKAnalyzer: 我是-是中-中国-国人</li>
 	<li>SmartChineseAnalyzer: 我-是-中国-人</li>
 </ol>

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/compound/CompoundWordTokenFilterBase.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/compound/CompoundWordTokenFilterBase.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/compound/CompoundWordTokenFilterBase.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/compound/CompoundWordTokenFilterBase.java Fri Feb 26 13:09:54 2010
@@ -188,6 +188,7 @@
   }
   
   private final void setToken(final Token token) throws IOException {
+    clearAttributes();
     termAtt.setTermBuffer(token.termBuffer(), 0, token.termLength());
     flagsAtt.setFlags(token.getFlags());
     typeAtt.setType(token.type());

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/compound/hyphenation/PatternParser.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/compound/hyphenation/PatternParser.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/compound/hyphenation/PatternParser.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/compound/hyphenation/PatternParser.java Fri Feb 26 13:09:54 2010
@@ -15,8 +15,6 @@
  * limitations under the License.
  */
 
-/* $Id: PatternParser.java 426576 2006-07-28 15:44:37Z jeremias $ */
-
 package org.apache.lucene.analysis.compound.hyphenation;
 
 // SAX

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cz/CzechAnalyzer.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cz/CzechAnalyzer.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cz/CzechAnalyzer.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cz/CzechAnalyzer.java Fri Feb 26 13:09:54 2010
@@ -21,6 +21,7 @@
 import org.apache.lucene.analysis.ReusableAnalyzerBase.TokenStreamComponents; // javadoc @link
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.CharArraySet;
+import org.apache.lucene.analysis.KeywordMarkerTokenFilter;
 import org.apache.lucene.analysis.LowerCaseFilter;
 import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.TokenStream;
@@ -105,6 +106,7 @@
 	// TODO once loadStopWords is gone those member should be removed too in favor of StopwordAnalyzerBase
 	private Set<?> stoptable;
   private final Version matchVersion;
+  private final Set<?> stemExclusionTable;
 
   /**
    * Builds an analyzer with the default stop words ({@link #CZECH_STOP_WORDS}).
@@ -124,8 +126,22 @@
    * @param stopwords a stopword set
    */
   public CzechAnalyzer(Version matchVersion, Set<?> stopwords) {
+    this(matchVersion, stopwords, CharArraySet.EMPTY_SET);
+  }
+  
+  /**
+   * Builds an analyzer with the given stop words and a set of words to be
+   * excluded from the {@link CzechStemFilter}.
+   * 
+   * @param matchVersion Lucene version to match. See
+   *          {@link <a href="#version">above</a>}
+   * @param stopwords a stopword set
+   * @param stemExclusionTable a stemming exclusion set
+   */
+  public CzechAnalyzer(Version matchVersion, Set<?> stopwords, Set<?> stemExclusionTable) {
     this.matchVersion = matchVersion;
     this.stoptable = CharArraySet.unmodifiableSet(CharArraySet.copy(matchVersion, stopwords));
+    this.stemExclusionTable = CharArraySet.unmodifiableSet(CharArraySet.copy(matchVersion, stemExclusionTable));
   }
 
 
@@ -207,7 +223,9 @@
    * @return {@link TokenStreamComponents} built from a {@link StandardTokenizer}
    *         filtered with {@link StandardFilter}, {@link LowerCaseFilter},
    *         {@link StopFilter}, and {@link CzechStemFilter} (only if version is
-   *         >= LUCENE_31)
+   *         >= LUCENE_31). If a version is >= LUCENE_31 and a stem exclusion set
+   *         is provided via {@link #CzechAnalyzer(Version, Set, Set)} a 
+   *         {@link KeywordMarkerTokenFilter} is added before {@link CzechStemFilter}.
    */
   @Override
   protected TokenStreamComponents createComponents(String fieldName,
@@ -216,8 +234,11 @@
     TokenStream result = new StandardFilter(source);
     result = new LowerCaseFilter(matchVersion, result);
     result = new StopFilter( matchVersion, result, stoptable);
-    if (matchVersion.onOrAfter(Version.LUCENE_31))
+    if (matchVersion.onOrAfter(Version.LUCENE_31)) {
+      if(!this.stemExclusionTable.isEmpty())
+        result = new KeywordMarkerTokenFilter(result, stemExclusionTable);
       result = new CzechStemFilter(result);
+    }
     return new TokenStreamComponents(source, result);
   }
 }
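
Note on the CzechAnalyzer change above: when a non-empty stem exclusion set is given, a marker stage is placed in front of the stemmer, and the stemmer is expected to leave flagged tokens untouched. The following is a minimal, dependency-free sketch of that pattern, not the Lucene implementation. The class name KeywordMarkerSketch, the Tok type, and the toy stemmer (which only strips one trailing "s") are assumptions made for illustration; they stand in for KeywordMarkerTokenFilter, KeywordAttribute, and CzechStemmer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in types; none of these are Lucene classes.
class KeywordMarkerSketch {

  /** A token with its term text and a keyword flag (models KeywordAttribute). */
  static final class Tok {
    String term;
    boolean keyword;
    Tok(String term) { this.term = term; }
  }

  /** Marker stage: flag every token whose term is in the exclusion set. */
  static void markKeywords(List<Tok> tokens, Set<String> exclusions) {
    for (Tok t : tokens) {
      if (exclusions.contains(t.term)) {
        t.keyword = true;
      }
    }
  }

  /** Toy stemmer (assumption): strips one trailing "s". Not the Czech algorithm. */
  static String toyStem(String term) {
    if (term.length() > 1 && term.endsWith("s")) {
      return term.substring(0, term.length() - 1);
    }
    return term;
  }

  /** Stem stage: only tokens that are not flagged as keywords are stemmed. */
  static void stem(List<Tok> tokens) {
    for (Tok t : tokens) {
      if (!t.keyword) {
        t.term = toyStem(t.term);
      }
    }
  }

  /** Chains the stages; the marker stage is skipped for an empty exclusion set. */
  static List<String> analyze(List<String> terms, Set<String> exclusions) {
    List<Tok> tokens = new ArrayList<>();
    for (String s : terms) {
      tokens.add(new Tok(s));
    }
    if (!exclusions.isEmpty()) {
      markKeywords(tokens, exclusions);
    }
    stem(tokens);
    List<String> out = new ArrayList<>();
    for (Tok t : tokens) {
      out.add(t.term);
    }
    return out;
  }

  public static void main(String[] args) {
    Set<String> exclusions = new HashSet<>(Arrays.asList("lucas"));
    // prints [cat, lucas, dog]
    System.out.println(analyze(Arrays.asList("cats", "lucas", "dog"), exclusions));
  }
}
```

Keeping the flag on the token rather than passing the exclusion set to the stemmer means any earlier stage can protect a term, which is the motivation the javadoc in this commit gives for KeywordAttribute.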

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cz/CzechStemFilter.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cz/CzechStemFilter.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cz/CzechStemFilter.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/cz/CzechStemFilter.java Fri Feb 26 13:09:54 2010
@@ -2,8 +2,10 @@
 
 import java.io.IOException;
 
+import org.apache.lucene.analysis.KeywordMarkerTokenFilter;// for javadoc
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;
 
 /**
@@ -25,25 +27,34 @@
 
 /**
  * A {@link TokenFilter} that applies {@link CzechStemmer} to stem Czech words.
- * 
+ * <p>
+ * To prevent terms from being stemmed use an instance of
+ * {@link KeywordMarkerTokenFilter} or a custom {@link TokenFilter} that sets
+ * the {@link KeywordAttribute} before this {@link TokenStream}.
+ * </p>
  * <p><b>NOTE</b>: Input is expected to be in lowercase, 
  * but with diacritical marks</p>
+ * @see KeywordMarkerTokenFilter
  */
 public final class CzechStemFilter extends TokenFilter {
   private final CzechStemmer stemmer;
   private final TermAttribute termAtt;
+  private final KeywordAttribute keywordAttr;
   
   public CzechStemFilter(TokenStream input) {
     super(input);
     stemmer = new CzechStemmer();
     termAtt = addAttribute(TermAttribute.class);
+    keywordAttr = addAttribute(KeywordAttribute.class);
   }
 
   @Override
   public boolean incrementToken() throws IOException {
     if (input.incrementToken()) {
-      int newlen = stemmer.stem(termAtt.termBuffer(), termAtt.termLength());
-      termAtt.setTermLength(newlen);
+      if(!keywordAttr.isKeyword()) {
+        final int newlen = stemmer.stem(termAtt.termBuffer(), termAtt.termLength());
+        termAtt.setTermLength(newlen);
+      }
       return true;
     } else {
       return false;

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/de/GermanAnalyzer.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/de/GermanAnalyzer.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/de/GermanAnalyzer.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/de/GermanAnalyzer.java Fri Feb 26 13:09:54 2010
@@ -30,15 +30,18 @@
 import org.apache.lucene.analysis.CharArraySet;
 import org.apache.lucene.analysis.LowerCaseFilter;
 import org.apache.lucene.analysis.ReusableAnalyzerBase.TokenStreamComponents; // javadoc @link
+import org.apache.lucene.analysis.KeywordMarkerTokenFilter;
 import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.StopwordAnalyzerBase;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.Tokenizer;
 import org.apache.lucene.analysis.WordlistLoader;
+import org.apache.lucene.analysis.snowball.SnowballFilter;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.analysis.standard.StandardFilter;
 import org.apache.lucene.analysis.standard.StandardTokenizer;
 import org.apache.lucene.util.Version;
+import org.tartarus.snowball.ext.German2Stemmer;
 
 /**
  * {@link Analyzer} for German language. 
@@ -50,6 +53,16 @@
  * exclusion list is empty by default.
  * </p>
  * 
+ * <a name="version"/>
+ * <p>You must specify the required {@link Version}
+ * compatibility when creating GermanAnalyzer:
+ * <ul>
+ *   <li> As of 3.1, Snowball stemming is done with SnowballFilter, and 
+ *        Snowball stopwords are used by default.
+ *   <li> As of 2.9, StopFilter preserves position
+ *        increments
+ * </ul>
+ * 
  * <p><b>NOTE</b>: This class uses the same {@link Version}
  * dependent settings as {@link StandardAnalyzer}.</p>
  */
@@ -59,7 +72,7 @@
    * List of typical german stopwords.
    * @deprecated use {@link #getDefaultStopSet()} instead
    */
-  //TODO make this private in 3.1
+  //TODO make this private in 3.1, remove in 4.0
   @Deprecated
   public final static String[] GERMAN_STOP_WORDS = {
     "einer", "eine", "eines", "einem", "einen",
@@ -76,6 +89,9 @@
     "durch", "wegen", "wird"
   };
   
+  /** File containing default German stopwords. */
+  public final static String DEFAULT_STOPWORD_FILE = "german_stop.txt";
+  
   /**
    * Returns a set of default German-stopwords 
    * @return a set of default German-stopwords 
@@ -85,8 +101,21 @@
   }
   
   private static class DefaultSetHolder {
-    private static final Set<?> DEFAULT_SET = CharArraySet.unmodifiableSet(new CharArraySet(
+    /** @deprecated remove in Lucene 4.0 */
+    @Deprecated
+    private static final Set<?> DEFAULT_SET_30 = CharArraySet.unmodifiableSet(new CharArraySet(
         Version.LUCENE_CURRENT, Arrays.asList(GERMAN_STOP_WORDS), false));
+    private static final Set<?> DEFAULT_SET;
+    static {
+      try {
+        DEFAULT_SET = 
+          WordlistLoader.getSnowballWordSet(SnowballFilter.class, DEFAULT_STOPWORD_FILE);
+      } catch (IOException ex) {
+        // default set should always be present as it is part of the
+        // distribution (JAR)
+        throw new RuntimeException("Unable to load default stopword set");
+      }
+    }
   }
 
   /**
@@ -104,7 +133,9 @@
    * {@link #getDefaultStopSet()}.
    */
   public GermanAnalyzer(Version matchVersion) {
-    this(matchVersion, DefaultSetHolder.DEFAULT_SET);
+    this(matchVersion,
+        matchVersion.onOrAfter(Version.LUCENE_31) ? DefaultSetHolder.DEFAULT_SET
+            : DefaultSetHolder.DEFAULT_SET_30);
   }
   
   /**
@@ -198,8 +229,9 @@
    * 
    * @return {@link TokenStreamComponents} built from a
    *         {@link StandardTokenizer} filtered with {@link StandardFilter},
-   *         {@link LowerCaseFilter}, {@link StopFilter}, and
-   *         {@link GermanStemFilter}
+   *         {@link LowerCaseFilter}, {@link StopFilter}, 
+   *         {@link KeywordMarkerTokenFilter} if a stem exclusion set is provided, and
+   *         {@link SnowballFilter}
    */
   @Override
   protected TokenStreamComponents createComponents(String fieldName,
@@ -208,6 +240,11 @@
     TokenStream result = new StandardFilter(source);
     result = new LowerCaseFilter(matchVersion, result);
     result = new StopFilter( matchVersion, result, stopwords);
-    return new TokenStreamComponents(source, new GermanStemFilter(result, exclusionSet));
+    result = new KeywordMarkerTokenFilter(result, exclusionSet);
+    if (matchVersion.onOrAfter(Version.LUCENE_31))
+      result = new SnowballFilter(result, new German2Stemmer());
+    else
+      result = new GermanStemFilter(result);
+    return new TokenStreamComponents(source, result);
   }
 }
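
Note on the GermanAnalyzer change above: both the default stopword set and the stemmer are now selected by comparing matchVersion against LUCENE_31, so indexes built with an older version keep their previous analysis chain. The sketch below illustrates that version gate without any Lucene dependency. The enum Ver and the string labels are hypothetical placeholders; only the onOrAfter comparison mirrors what the diff shows.

```java
// Hypothetical stand-in for the version gate; not Lucene code.
class VersionGateSketch {

  /** Placeholder for Lucene's Version, declared oldest to newest. */
  enum Ver {
    V29, V30, V31;

    /** True if this version is the same as or newer than other. */
    boolean onOrAfter(Ver other) {
      return compareTo(other) >= 0;
    }
  }

  /** Label of the stemmer chosen for a version (labels are illustrative). */
  static String stemmerFor(Ver v) {
    return v.onOrAfter(Ver.V31) ? "snowball-german2" : "legacy-german";
  }

  /** Label of the default stopword set chosen for a version. */
  static String stopSetFor(Ver v) {
    return v.onOrAfter(Ver.V31) ? "snowball-stopwords" : "legacy-stopwords";
  }

  public static void main(String[] args) {
    for (Ver v : Ver.values()) {
      System.out.println(v + ": " + stemmerFor(v) + ", " + stopSetFor(v));
    }
  }
}
```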

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/de/GermanStemFilter.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/de/GermanStemFilter.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/de/GermanStemFilter.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/de/GermanStemFilter.java Fri Feb 26 13:09:54 2010
@@ -20,8 +20,10 @@
 import java.io.IOException;
 import java.util.Set;
 
+import org.apache.lucene.analysis.KeywordMarkerTokenFilter;// for javadoc
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;
 
 /**
@@ -31,6 +33,12 @@
  * not be stemmed at all. The stemmer used can be changed at runtime after the
  * filter object is created (as long as it is a {@link GermanStemmer}).
  * </p>
+ * <p>
+ * To prevent terms from being stemmed use an instance of
+ * {@link KeywordMarkerTokenFilter} or a custom {@link TokenFilter} that sets
+ * the {@link KeywordAttribute} before this {@link TokenStream}.
+ * </p>
+ * @see KeywordMarkerTokenFilter
  */
 public final class GermanStemFilter extends TokenFilter
 {
@@ -38,21 +46,29 @@
      * The actual token in the input stream.
      */
     private GermanStemmer stemmer = null;
-    private Set exclusionSet = null;
+    private Set<?> exclusionSet = null;
 
-    private TermAttribute termAtt;
+    private final TermAttribute termAtt;
+    private final KeywordAttribute keywordAttr;
 
+    /**
+     * Creates a {@link GermanStemFilter} instance
+     * @param in the source {@link TokenStream} 
+     */
     public GermanStemFilter( TokenStream in )
     {
       super(in);
       stemmer = new GermanStemmer();
       termAtt = addAttribute(TermAttribute.class);
+      keywordAttr = addAttribute(KeywordAttribute.class);
     }
 
     /**
      * Builds a GermanStemFilter that uses an exclusion table.
+     * @deprecated use {@link KeywordAttribute} with {@link KeywordMarkerTokenFilter} instead.
      */
-    public GermanStemFilter( TokenStream in, Set exclusionSet )
+    @Deprecated
+    public GermanStemFilter( TokenStream in, Set<?> exclusionSet )
     {
       this( in );
       this.exclusionSet = exclusionSet;
@@ -66,7 +82,7 @@
       if (input.incrementToken()) {
         String term = termAtt.term();
         // Check the exclusion table.
-        if (exclusionSet == null || !exclusionSet.contains(term)) {
+        if (!keywordAttr.isKeyword() && (exclusionSet == null || !exclusionSet.contains(term))) {
           String s = stemmer.stem(term);
           // If not stemmed, don't waste the time adjusting the token.
           if ((s != null) && !s.equals(term))
@@ -91,8 +107,10 @@
 
     /**
      * Set an alternative exclusion list for this filter.
+     * @deprecated use {@link KeywordAttribute} with {@link KeywordMarkerTokenFilter} instead.
      */
-    public void setExclusionSet( Set exclusionSet )
+    @Deprecated
+    public void setExclusionSet( Set<?> exclusionSet )
     {
       this.exclusionSet = exclusionSet;
     }

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/el/GreekAnalyzer.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/el/GreekAnalyzer.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/el/GreekAnalyzer.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/el/GreekAnalyzer.java Fri Feb 26 13:09:54 2010
@@ -24,6 +24,7 @@
 import org.apache.lucene.analysis.StopwordAnalyzerBase;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.Tokenizer;
+import org.apache.lucene.analysis.standard.StandardFilter;
 import org.apache.lucene.analysis.standard.StandardTokenizer;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;  // for javadoc
 import org.apache.lucene.util.Version;
@@ -41,6 +42,15 @@
  * A default set of stopwords is used unless an alternative list is specified.
  * </p>
  *
+ * <a name="version"/>
+ * <p>You must specify the required {@link Version}
+ * compatibility when creating GreekAnalyzer:
+ * <ul>
+ *   <li> As of 3.1, StandardFilter is used by default.
+ *   <li> As of 2.9, StopFilter preserves position
+ *        increments
+ * </ul>
+ * 
  * <p><b>NOTE</b>: This class uses the same {@link Version}
  * dependent settings as {@link StandardAnalyzer}.</p>
  */
@@ -117,13 +127,15 @@
     * 
     * @return {@link TokenStreamComponents} built from a
     *         {@link StandardTokenizer} filtered with
-    *         {@link GreekLowerCaseFilter} and {@link StopFilter}
+    *         {@link GreekLowerCaseFilter}, {@link StandardFilter} and {@link StopFilter}
     */
     @Override
     protected TokenStreamComponents createComponents(String fieldName,
         Reader reader) {
       final Tokenizer source = new StandardTokenizer(matchVersion, reader);
-      final TokenStream result = new GreekLowerCaseFilter(source);
+      TokenStream result = new GreekLowerCaseFilter(source);
+      if (matchVersion.onOrAfter(Version.LUCENE_31))
+        result = new StandardFilter(result);
       return new TokenStreamComponents(source, new StopFilter(matchVersion, result, stopwords));
     }
 }

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/fa/PersianAnalyzer.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/fa/PersianAnalyzer.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/fa/PersianAnalyzer.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/fa/PersianAnalyzer.java Fri Feb 26 13:09:54 2010
@@ -147,7 +147,7 @@
   @Override
   protected TokenStreamComponents createComponents(String fieldName,
       Reader reader) {
-    final Tokenizer source = new ArabicLetterTokenizer(reader);
+    final Tokenizer source = new ArabicLetterTokenizer(matchVersion, reader);
     TokenStream result = new LowerCaseFilter(matchVersion, source);
     result = new ArabicNormalizationFilter(result);
     /* additional persian-specific normalization */

Modified: lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/fr/ElisionFilter.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/fr/ElisionFilter.java?rev=916666&r1=916665&r2=916666&view=diff
==============================================================================
--- lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/fr/ElisionFilter.java (original)
+++ lucene/java/branches/flex_1458/contrib/analyzers/common/src/java/org/apache/lucene/analysis/fr/ElisionFilter.java Fri Feb 26 13:09:54 2010
@@ -68,7 +68,7 @@
   /**
    * Constructs an elision filter with standard stop words
    */
-  protected ElisionFilter(Version matchVersion, TokenStream input) {
+  public ElisionFilter(Version matchVersion, TokenStream input) {
     this(matchVersion, input, DEFAULT_ARTICLES);
   }
 
@@ -77,7 +77,7 @@
    * @deprecated use {@link #ElisionFilter(Version, TokenStream)} instead
    */
   @Deprecated
-  protected ElisionFilter(TokenStream input) {
+  public ElisionFilter(TokenStream input) {
     this(Version.LUCENE_30, input);
   }
 


