lucene-commits mailing list archives

From "Cassandra Targett (Confluence)" <>
Subject [CONF] Apache Solr Reference Guide > Internal - TODO List
Date Thu, 26 Sep 2013 18:44:00 GMT
Space: Apache Solr Reference Guide
Page: Internal - TODO List

Edited by Cassandra Targett:
This page serves as a place to organize thoughts and collect lists of things to try and fix
/ clean up.  

If you are in the process of looking into something, or imminently plan to start looking into
something -- please put your name under it.  If you complete something on this list, please
delete it.

{note}A lot of these are from an email sarowe sent in response to the "VOTE RC0 Release apache-solr-ref-guide-4.5.pdf"
thread, so any page numbers mentioned most likely refer to that PDF.{note}

* 0. All examples in the exported PDF have an extra blank line at the top.  I was able to
eliminate these from this page <>
("What is an analyzer?") by eliminating the newline between the initial \{code\} line and
the first line of the examples.  This doesn't have any apparent effect on the layout of the
page on the wiki, but the PDF export of that page no longer has the extra blank lines.  Any
objections to switching all \{code\} examples in the guide like this?
** sarowe - looking into css based fixes (done)

* 1. Pg 2: The section links from the TOC all take you to the previous page, rather than to
the top of the page where the section starts.  (Same behavior on OS X Preview, and under Windows,
on Firefox's built-in PDF viewer and on Adobe Reader.)  This looks like a general problem
- see e.g. #34.
** sarowe, ctargett, hoss, steffkes looked into it but couldn't figure out a good fix (shelved)

* 2. Pg 68: Stray asterisks in the <analyzer> tags in the <fieldType> example
under "Analysis Phases", apparently to make the surrounded text bold (which also didn't happen).
** sarowe working on it

* 3. Pg 69: The solr.KeywordTokenizerFactory example is missing one quotation mark from each
of the left and right hand sides.

* 4. Pg 70: Under "solr.TokenizerFactory", there is a bogus "StandardTokenizer" link in the
sentence "There aren't any filters that use StandardTokenizer's types" - the link is to the
non-existent "StandardTokenizer" page on the Solr wiki.  (It might be useful to systematically
link stuff like this to the corresponding Lucene or Solr javadocs,
but this should probably be templated or scripted, so that the version-specific links are
handled properly.)

* 5. Pg 71: Under "Standard Tokenizer", the email addresses recognition claim is false, and
Internet domain name recognition isn't validation per se, e.g. "google.supercomputername"
will be tokenized as a single token along with
"".  The "Out" example output needs fixup accordingly.  I see that the "Classic
Tokenizer" section on pg72 has the verbatim email/domain text; for ClassicTokenizer, the email
claim is true, but it has the same issue with
internet domain names as StandardTokenizer.

* 6. Pg 74: The NGram Tokenizer example output should be ("bicy", "bicyc", "icyc", "icycl",
"cycl", "cycle", "ycle") instead of all of the 4grams before the 5grams (I think this class's
behavior was changed in 4.4 by LUCENE-5042).

* 9. Pg 75: Missing spaces in the Regular Expression Pattern Tokenizer's "group" attribute
description, at the boundaries between the first two sentences: "token(s).The" and "tokens.Non-negative".

* 10. Pg 72, 76, 77, etc.: Many analysis components' factory class names should be styled
with a fixed-width font.

* 11. Pg 77: UAX29 URL Email Tokenizer recognizes not only .com Internet domain names, but
also domain names including any other valid top-level domain (i.e., unlike StandardTokenizer
and ClassicTokenizer, domain names are validated against the white list drawn from the IANA
Root Zone database <> as of the last time "ant
gen-tld" was performed and the tokenizer was generated.)

* 12. Pg 77: UAX29 tokenizer: "file:://" should be "file://"

* 13. Pg 77: UAX29 tokenizer's <URL> and <EMAIL> type names are missing angle brackets.

* 14. Pg 77: UAX29 tokenizer's maxTokenLength attribute name should be styled with a fixed-width font.

* 15. Pg 78: In the example demonstrating how arguments can be given to <filter> elements
via attributes, there is a stray asterisk, apparently intended to bold the surrounding text,
which also didn't work: *min="2" max="7"/>
** sarowe working on it

* 16. Pg 79: The ASCII Folding Filter's "Out" output should have the accent stripped from
the "á" -> "a" and the ASCII
character value adjusted -> (ASCII character 97)

* 17. Pg 81: The Edge N-gram Filter's 4-6 gram size example "Out" should be ("four", "scor",
"score", "twen", "twent", "twenty") - some of these are missing.

* 18. Pg 83: The ICU Normalizer 2 Filter example should include the "name" and "mode" attributes
in the <filter>

* 19. Pg 87: Stray asterisks in both of the N-Gram Filter examples: *minGramSize="...
** sarowe working on it

* 20. Pg 87: The N-Gram Filter 3-5 gram size example "Out" output should be ("fou", "four",
"our", "sco", "scor", "score", "cor", "core", "ore") - rather than ordering by gram size,
output is now ordered first by position and then by gram size.

* 21. Pg 88: Stray asterisk in the first occurrence only example of the Pattern Replace Filter:
** sarowe working on it

* 22. Pg 89: "encoder" argument to the Phonetic Filter has surrounding double curly brackets
instead of being styled with a fixed-width font. 

* 23. Pg 90: It should be mentioned on Porter Stem Filter that it's *four times faster* than
the English Snowball stemmer - I benchmarked it at <>

* 24. Pg 90: The Position Filter Factory is deprecated and will be removed in 5.0 - this should
be mentioned.

* 25. Pg 90: The Position Filter Factory example has the wrong token position on the second
token - it should be 2 instead of 3.

* 26. Pg 90: The "testsyns.txt" file contents are missing from Remove Duplicates Token Filter.

* 27. Pg 92: Shingle Filter is missing params "minShingleSize", "outputUnigramsIfNoShingles",
and "tokenSeparator".

* 28. Pg 93: Standard Filter: as of lucene match version 3.1, this filter is a no-op.

* 29. Pg 94: Stop Filter: the "enablePositionIncrements" arg is no longer supported as of
Lucene/Solr 4.4 - this should be mentioned, and the example showing its use should be removed.
 All of the examples need to have their positions adjusted accordingly.  Also, all language-specific
examples later in the guide should have this arg removed.

* 30. Pg 97: Word Delimiter Filter: "-hotspot" is crossed out - the leading hyphen needs to
be escaped or something.

* 31. Pg 97: WDF: Missing period+space in the "splitOnCaseChange" arg description: "XL"Example

* 32. Pg 97: WDF: "though" -> "through" in "protected" arg description.

* 33. Pg 98: CharFilterFactories: weird wording in "Char Filters can add, change, or remove
characters without worrying about fault of Token offsets." - better: "Char Filters can add,
change, or remove characters while preserving original character offsets to support e.g. highlighting."

* 34. Pg 99&100: Under solr.HTMLStripCharFilterFactory, the links labeled "Major Changes
from Solr 3 to Solr 4." go one page previous to the start of this section in the guide.

* 35. Pg 100: solr.HTMLStripCharFilterFactory: this is incorrect: "Inline tags, such as <b>,
<i>, or <span> will be replaced by a space."  It should be: "Inline tags, such
as <b>, <i>, or <span> will be removed - no space or newline will be substituted."

* 36. Pg 100: solr.PatternReplaceCharFilterFactory: All of the "replaceWith" column contents
are missing backslashes; some have commas that shouldn't be there; and some have curly brackets
that shouldn't be there.
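
For whoever picks up items 6, 17 and 20: a rough sketch of the expected orderings, which may help when rechecking the "Out" examples. This is plain Python that models the ordering described in those items (by position first, then by gram size); it is not Lucene's implementation, and the function names are made up for illustration.

```python
def ngrams_by_position(text, min_gram, max_gram):
    """Model of the ordering described in items 6 and 20:
    grams are emitted by start position, then by gram size."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            # Skip grams that would run past the end of the text.
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def edge_ngrams(token, min_gram, max_gram):
    """Model of item 17: only grams anchored at the start of the token."""
    return [token[:size]
            for size in range(min_gram, max_gram + 1)
            if size <= len(token)]


# Item 6: 4-5 grams of "bicycle"
print(ngrams_by_position("bicycle", 4, 5))
# ['bicy', 'bicyc', 'icyc', 'icycl', 'cycl', 'cycle', 'ycle']

# Item 20: 3-5 grams of the tokens "four" and "score"
print([g for t in ["four", "score"] for g in ngrams_by_position(t, 3, 5)])
# ['fou', 'four', 'our', 'sco', 'scor', 'score', 'cor', 'core', 'ore']

# Item 17: 4-6 edge grams of "four score and twenty" ("and" is too short)
print([g for t in "four score and twenty".split() for g in edge_ngrams(t, 4, 6)])
# ['four', 'scor', 'score', 'twen', 'twent', 'twenty']
```

The printed lists match the corrected outputs quoted in items 6, 17 and 20; the real examples should still be verified against the actual tokenizer and filters before the guide is updated.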

{panel:title=All these are on the language analysis page, hoss working on it}
* 37. Pg 101: Dictionary Compound Word Token Filter: the content of "germanwords.txt" ("dummkopfdonaudampfschiff")
is missing spaces or newlines between words - it should be "dumm kopf donau dampf schiff"

* 38. Pg 102: Under "Unicode Collation", s/that also be used/that also *can* be used/ in "Unicode
Collation is a language-sensitive method of sorting text that also be used for advanced search

* 39. Pg 102&103: Under "Sorting Text for a Specific Language", in the sentence "You can
see a list of supported Locales _here_", the link is to a list of supported locales under
Java 5.  The equivalent Java 6 link is <>.
 Similarly, the Collator javadocs link in the
sentence "For more information, see the _Collator javadocs_", the link is to the Java 5 javadocs
- the equivalent Java 6 link is <>.
 Similarly, under "Sorting Text with Custom Rules", the RuleBasedCollator javadocs link in
the sentence "For more information, see the
_RuleBasedCollator javadocs_" is to the Java 5 javadocs - the equivalent Java 6 link is

* 40. Pg 102-105: Under Unicode Collation: (ICU)CollationFilterFactory have been deprecated
(and will be removed in 5.0) in favor of (ICU)CollationField, which will need descriptions
and examples.

* 42. Pg 106: ISO Latin Accent Filter: this class is no longer present as of Solr 4.0 - this
section should be replaced with one about ASCIIFoldingFilter.  Also, the solr.MappingCharFilterFactory
section on Pg 99 should be changed to use "mapping-FoldToASCII.txt" instead of "mapping-ISOLatin1Accent.txt".

* 43. Pg 106: Language-Specific Factories: Catalan, Danish, Irish and Romanian are missing
from the covered languages; Catalan and Irish should include ElisionFilterFactory in their
examples - there are articles lists in Lucene's

* 44. Pg 107-120: Example analyzers for the following languages don't include a <tokenizer>
- they should include StandardTokenizer: Arabic, Bulgarian, Czech, Galician, Hindi, Indonesian,
Italian, Persian, Polish, Swedish, Spanish, and Turkish.

* 45. Pg 109-112: The Dutch, Finnish and German examples all include a stray trailing space
in their <tokenizer> class

* 46. Pg 110: Elision Filter: used for other languages besides French (e.g. Catalan, Italian,
and Irish); ElisionFilter class was moved from the package to o.a.l.analysis.util.

* 47. Pg 110: Elision Filter: "articles" arg is not required (defaults to FrenchAnalyzer.DEFAULT_ARTICLES)

* 48. Pg 110: Elision Filter: "ignoreCase" arg is missing. 

* 49. Pg 113: Italian: an example using ElisionFilterFactory should be included - there is
an articles list in
Lucene's ItalianAnalyzer.

* 50. Pg 113: Kuromoji: ", as in the following example:" should be removed from the following
sentence, since there is no following example: "You can also make discarding punctuation configurable
in the JapaneseTokenizerFactory, by setting discardPunctuation to false (to show punctuation)
or true (to discard punctuation), as in the following

* 51. Pg 114: Lao, Myanmar, Khmer: these are no longer in analysis-extras.  There should either
be an example for these here, or a pointer to another ICUTokenizerFactory example elsewhere
in the guide.

* 52. Pg 114-116: Norwegian: the Snowball stemmer isn't mentioned in the supported Norwegian
stemmers list, but the two examples erroneously include the Snowball stemmer *along with another

* 53. Pg 117: Russian: Russian Letter Tokenizer is deprecated, and it no longer supports the
"charset" arg.

* 54. Pg 117: Russian: Russian Lower Case Filter was removed in 4.0.  It should be replaced
by LowerCaseFilter in all examples.
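
A note on item 37, in case it helps explain why the missing separators in "germanwords.txt" matter. The sketch below is a deliberately simplified model of dictionary-based decompounding: it reports every dictionary word found as a substring of the token. It is not the actual filter's algorithm, and the names load_dictionary, find_subwords and min_subword_size are hypothetical, chosen only for this illustration.

```python
def load_dictionary(contents):
    """Split file contents on whitespace: one dictionary entry per word."""
    return set(contents.split())


def find_subwords(token, dictionary, min_subword_size=2):
    """Simplified model: return every dictionary word that occurs
    as a substring of the token, ordered by start position."""
    token = token.lower()
    found = []
    for start in range(len(token)):
        for end in range(start + min_subword_size, len(token) + 1):
            if token[start:end] in dictionary:
                found.append(token[start:end])
    return found


# With separators, the file yields five entries and subwords are found.
good = load_dictionary("dumm kopf donau dampf schiff")
print(find_subwords("Donaudampfschiff", good))
# ['donau', 'dampf', 'schiff']

# Without separators, the file is one long entry and nothing matches.
bad = load_dictionary("dummkopfdonaudampfschiff")
print(find_subwords("Donaudampfschiff", bad))
# []
```

Under this model, the run-together file contents behave as a single dictionary entry, which is why the example in the guide needs the spaces or newlines restored.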


* [The Well-Configured Solr Instance] page is out of date compared to the sub-sections that
page has -- some of the intra wiki links are broken.
** ctargett - *done*
