lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by SteveRowe
Date Thu, 16 Jun 2011 19:43:29 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "AnalyzersTokenizersTokenFilters" page has been changed by SteveRowe:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=120&rev2=121

Comment:
Added an example <analyzer> configuration for HTMLStripCharFilter; escaped some HTML
metacharacters in the HTMLStripCharFilter description; added an HTMLStripCharFilter named
character entity example

  Creates `org.apache.solr.analysis.PatternReplaceCharFilter`. Applies a regex pattern to
the string in the char stream, replacing each matching occurrence with the specified replacement string.
  
  === solr.HTMLStripCharFilterFactory ===
- Creates `org.apache.solr.analysis.HTMLStripCharFilter`. `HTMLStripCharFilter` strips HTML
from the input stream and passes the result to either `CharFilter` or `Tokenizer`.
+ Creates `org.apache.solr.analysis.HTMLStripCharFilter`. `HTMLStripCharFilter` strips HTML
from the input stream and passes the result to either `CharFilter` or `Tokenizer`.  Like other
CharFilters, it's specified using a <charFilter> tag, and must come before the <tokenizer>.
 An example:
+ {{{
+ <analyzer>
+   <charFilter class="solr.HTMLStripCharFilterFactory"/>
+   <tokenizer class="solr.StandardTokenizerFactory"/>
+   <filter class="solr.StandardFilterFactory"/>
+ </analyzer>
+ }}}
  
  HTML stripping features:
  
   * The input need not be an HTML document as only constructs that look like HTML will be
removed.
   * Removes HTML/XML tags while keeping the content
    * Attributes within tags are also removed, and attribute quoting is optional.
-  * Removes XML processing instructions: <?foo bar?>
+  * Removes XML processing instructions: {{{<?foo bar?>}}}
   * Removes XML comments
-  * Removes XML elements starting with <! and ending with >
+  * Removes XML elements starting with {{{<!}}} and ending with {{{>}}}
-  * Removes contents of <script> and <style> elements.
+  * Removes contents of {{{<script>}}} and {{{<style>}}} elements.
    * Handles XML comments inside these elements (normal comment processing won't always work)
    * Replaces numeric character entity references like {{{&#65;}}} or {{{&#x7f;}}}
-    * The terminating ';' is optional if the entity reference is followed by whitespace.
+    * The terminating '`;`' is optional if the entity reference is followed by whitespace.
    * Replaces all [[http://www.w3.org/TR/REC-html40/sgml/entities.html|named character entity
references]].
-    * is replaced with a space instead of 0xa0
+    * {{{&nbsp;}}} is replaced with a space instead of the non-breaking space character
{{{\u00A0}}}
-    * terminating ';' is mandatory to avoid false matches on something like "Alpha&Omega
Corp"
+    * terminating '`;`' is mandatory to avoid false matches on something like "`Alpha&Omega
Corp`"
  
  HTML stripping examples:
- ||my &lt;a href="www.foo.bar"&gt;link&lt;/a&gt; ||my link ||
+ ||{{{my <a href="www.foo.bar">link</a> }}}||`my link `||
- ||&lt;br&gt;hello&lt;!--comment--&gt; ||hello ||
- ||hello&lt;script&gt;&lt;!-- f('&lt;!--internal--&gt;&lt;/script&gt;');
--&gt;&lt;/script&gt; ||hello ||
+ ||{{{<br>hello<!--comment--> }}}||`hello `||
+ ||{{{hello<script><!-- f('<!--internal--></script>'); --></script>
}}}||`hello `||
- ||if a&lt;b then print a; ||if a&lt;b then print a; ||
+ ||{{{if a<b then print a; }}}||`if a<b then print a; `||
- ||hello &lt;td height=22 nowrap align="left"&gt; ||hello ||
+ ||{{{hello <td height=22 nowrap align="left"> }}}||`hello `||
- ||a&lt;b &amp;#65; Alpha&Omega O ||a&lt;b A Alpha&Omega O ||
+ ||{{{a<b &#65; Alpha&Omega O}}} ||`a<b A Alpha&Omega O `||
+ ||{{{M&eacute;xico}}}||`México`||
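
The entity-handling rules listed above can be illustrated with a short standalone sketch. This is not the `HTMLStripCharFilter` implementation and handles entities only, not tags, comments, or script/style content. The function name `decode_entities` and the regular expression are hypothetical, written only to mirror the rules as the page states them: numeric references may omit the terminating `;` when whitespace follows, named references require it, and `&nbsp;` becomes a plain space.

```python
import re
from html.entities import name2codepoint

# One pattern, applied in a single pass so decoded output is never re-decoded.
#   group 1: hex numeric reference, e.g. &#x7f;
#   group 2: decimal numeric reference, e.g. &#65;
#   group 3: named reference, e.g. &eacute;
# Numeric references may drop the ';' only when whitespace follows.
# Named references must end in ';' (so "Alpha&Omega Corp" is left alone).
_ENTITY = re.compile(
    r"&(?:#[xX]([0-9a-fA-F]+)(?:;|(?=\s))"
    r"|#([0-9]+)(?:;|(?=\s))"
    r"|([A-Za-z][A-Za-z0-9]*);)"
)


def decode_entities(text):
    """Illustrative approximation of the entity rules described above."""

    def replace(match):
        hex_digits, dec_digits, name = match.groups()
        if name is not None:
            if name == "nbsp":
                return " "  # plain space instead of U+00A0
            codepoint = name2codepoint.get(name)
        else:
            codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave unknown names and out-of-range numbers untouched.
        if codepoint is None or codepoint > 0x10FFFF:
            return match.group(0)
        return chr(codepoint)

    return _ENTITY.sub(replace, text)
```

Applied to the entity-related rows of the examples table, the sketch leaves `Alpha&Omega O` unchanged because the named reference lacks its `;`, while `&#65;` and `&eacute;` are decoded.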
  
  
  
