From "Cassandra Targett (Confluence)" <>
Subject [CONF] Apache Solr Reference Guide > Internal - TODO List
Date Fri, 27 Sep 2013 18:42:00 GMT
Space: Apache Solr Reference Guide
Page: Internal - TODO List

Edited by Cassandra Targett:
This page serves as a place to organize thoughts and collect lists of things to try and fix
/ clean up.  

If you are in the process of looking into something, or imminently plan to start looking into
something -- please put your name under it.  If you complete something on this list, please
delete it.

{note}A lot of these are from an email sarowe sent in response to the "VOTE RC0 Release apache-solr-ref-guide-4.5.pdf"
thread, so any page numbers mentioned are likely related to that PDF.{note}

* 1. Pg 2: The section links from the TOC all take you to the previous page, rather than to
the top of the page where the section starts.  (Same behavior on OS X Preview, and under Windows,
on Firefox's built-in PDF viewer and on Adobe Reader.)  This looks like a general problem
- see e.g. Pg 99&100: Under solr.HTMLStripCharFilterFactory, the links labeled "Major
Changes from Solr 3 to Solr 4." go one page previous to the start of this section in the guide.
** sarowe, ctargett, hoss, steffkes looked into it but couldn't figure out a good fix (shelved)

{panel:title=sarowe working on these}
* 3. Pg 69: The solr.KeywordTokenizerFactory example is missing one quotation mark from each
of the left and right hand sides.

* 4. Pg 70: Under "solr.TokenizerFactory", there is a bogus "StandardTokenizer" link in the
sentence "There aren't any filters that use StandardTokenizer's types" - the link is to the
non-existent "StandardTokenizer" page on the Solr wiki.  (It might be useful to systematically
link stuff like this to the corresponding Lucene or Solr javadocs, but this should probably be
templated or scripted, so that the version-specific links are handled properly.)

* 5. Pg 71: Under "Standard Tokenizer", the email address recognition claim is false, and
Internet domain name recognition isn't validation per se, e.g. "google.supercomputername"
will be tokenized as a single token along with
"".  The "Out" example output needs fixup accordingly.  I see that the "Classic
Tokenizer" section on pg72 has the verbatim email/domain text; for ClassicTokenizer, the email
claim is true, but it has the same issue with Internet domain names as StandardTokenizer.

* 6. Pg 74: The NGram Tokenizer example output should be ("bicy", "bicyc", "icyc", "icycl",
"cycl", "cycle", "ycle") instead of all of the 4grams before the 5grams (I think this class's
behavior was changed in 4.4 by LUCENE-5042).
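
To make the ordering difference in item 6 concrete, here is a rough Python sketch of the two emission orders for "bicycle" with a minimum gram size of 4 and a maximum of 5. It is an illustration only, not Lucene code; the function names are made up for this note, and the claim that the by-position order is what the tokenizer does now rests on the item above, not on a check against the source.

```python
def ngrams_by_position(text, min_gram, max_gram):
    """Emit n-grams grouped by start offset, shortest first at each offset."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # longer grams at this offset would run past the input
            grams.append(text[start:end])
    return grams


def ngrams_by_size(text, min_gram, max_gram):
    """Emit every n-gram of one size before moving on to the next size."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


# Order proposed in item 6 (grouped by start offset).
print(ngrams_by_position("bicycle", 4, 5))
# Order currently shown in the guide (all 4grams, then all 5grams).
print(ngrams_by_size("bicycle", 4, 5))
```

The first call reproduces the list given in item 6; the second gives all of the 4grams before the 5grams.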

* 9. Pg 75: Missing spaces in the Regular Expression Pattern Tokenizer's "group" attribute
description, at the boundaries between the first two sentences: "token(s).The" and "tokens.Non-negative".

* 10. Pg 72, 76, 77, etc.: Many analysis components' factory class names should be styled
with a fixed-width font.

* 11. Pg 77: UAX29 URL Email Tokenizer recognizes not only .com Internet domain names, but
also domain names including any other valid top-level domain (i.e., unlike StandardTokenizer
and ClassicTokenizer, domain names are validated against the white list drawn from the IANA
Root Zone database as of the last time "ant gen-tld" was performed and the tokenizer was
generated).

* 12. Pg 77: UAX29 tokenizer: "file:://" should be "file://"

* 13. Pg 77: UAX29 tokenizer's <URL> and <EMAIL> type names are missing angle brackets.

* 14. Pg 77: UAX29 tokenizer's maxTokenLength attribute name should be styled with a fixed-width font.

* 26. Pg 90: The "testsyns.txt" file contents are missing from Remove Duplicates Token Filter.
** Hoss is working on this.

* 29. Pg 94: Stop Filter: the "enablePositionIncrements" arg is no longer supported as of
Lucene/Solr 4.4 - this should be mentioned, and the example showing its use should be removed.
 All of the examples need to have their positions adjusted accordingly. Also, all language-specific
examples later in the guide should have this arg removed.
** pfft. position increments confuse the heck out of me. I made a note that the argument is
no longer supported and decided to remove the example that included it entirely. Someone check
it though - [Filter Descriptions#Stop Filter]. 
** I checked the guide and see only instances of 'positionIncrementGap' - not the same?
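
For whoever checks the Stop Filter positions in item 29, here is a small Python sketch of what "positions adjusted accordingly" would look like, assuming a removed stop word always leaves a gap in the position count once "enablePositionIncrements" can no longer be turned off. It is not Lucene code: the function name, the sample input, and the 1-based positions are all assumptions made for illustration.

```python
def remove_stop_words(tokens, stop_words):
    """Return (term, position) pairs, leaving gaps where stop words were dropped."""
    kept = []
    position = 0
    for term in tokens:
        position += 1  # every input token advances the position, kept or not
        if term.lower() in stop_words:
            continue  # dropped, but its position is still consumed
        kept.append((term, position))
    return kept


# Hypothetical input and stop list, not taken from the guide.
print(remove_stop_words("To be or what".split(), {"be", "or", "to"}))
```

With this input the only surviving token keeps position 4 rather than being renumbered to 1, which is the kind of adjustment the guide's example outputs would need.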

* 40. Pg 102-105: Under Unicode Collation: (ICU)CollationFilterFactory have been deprecated
(and will be removed in 5.0) in favor of (ICU)CollationField, which will need descriptions
and examples.
** i'm not sure where/how-much to say about the new (ICU)CollationField classes -- or if it
should be on [Language Analysis] or on [Field Types Included with Solr] ?

* 43. Pg 106: Language-Specific Factories: Catalan, Danish, Irish and Romanian are missing
from the covered languages; Catalan and Irish should include ElisionFilterFactory in their
examples - there are article lists in Lucene's (Catalan,Irish)Analyzer.

* 51. Pg 114: Lao, Myanmar, Khmer: these are no longer in analysis-extras.  There should either
be an example for these here, or a pointer to another ICUTokenizerFactory example elsewhere
in the guide.

