lucene-commits mailing list archives

From "Hoss Man (Confluence)" <>
Subject [CONF] Apache Solr Reference Guide > Internal - TODO List
Date Fri, 27 Sep 2013 01:26:00 GMT
Space: Apache Solr Reference Guide
Page: Internal - TODO List

Edited by Hoss Man:
This page serves as a place to organize thoughts and collect lists of things to try and fix
/ clean up.  

If you are in the process of looking into something, or imminently plan to start looking into
something -- please put your name under it.  If you complete something on this list, please
delete it.

{note}A lot of these are from an email sarowe sent in response to the "VOTE RC0 Release apache-solr-ref-guide-4.5.pdf"
thread, so any page numbers mentioned most likely refer to that PDF.{note}

* 1. Pg 2: The section links from the TOC all take you to the previous page, rather than to
the top of the page where the section starts.  (Same behavior on OS X Preview, and under Windows,
on Firefox's built-in PDF viewer and on Adobe Reader.)  This looks like a general problem
- see e.g. Pg 99&100: Under solr.HTMLStripCharFilterFactory, the links labeled "Major
Changes from Solr 3 to Solr 4." go one page previous to the start of this section in the guide.
** sarowe, ctargett, hoss, steffkes looked into it but couldn't figure out a good fix (shelved)

{panel:title=sarowe working on these}
* 3. Pg 69: The solr.KeywordTokenizerFactory example is missing one quotation mark from each
of the left and right hand sides.

* 4. Pg 70: Under "solr.TokenizerFactory", there is a bogus "StandardTokenizer" link in the
sentence "There aren't any filters that use StandardTokenizer's types" - the link is to the
non-existent "StandardTokenizer" page on the Solr wiki.  (It might be useful to systematically
link stuff like this to the corresponding Lucene or Solr javadocs,
but this should probably be templated or scripted, so that the version-specific links are
handled properly.)

* 5. Pg 71: Under "Standard Tokenizer", the email addresses recognition claim is false, and
Internet domain name recognition isn't validation per se, e.g. "google.supercomputername"
will be tokenized as a single token along with
"".  The "Out" example output needs fixup accordingly.  I see that the "Classic
Tokenizer" section on pg72 has the verbatim email/domain text; for ClassicTokenizer, the email
claim is true, but it has the same issue with
internet domain names as StandardTokenizer.

* 6. Pg 74: The NGram Tokenizer example output should be ("bicy", "bicyc", "icyc", "icycl",
"cycl", "cycle", "ycle") instead of all of the 4grams before the 5grams (I think this class's
behavior was changed in 4.4 by LUCENE-5042).
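
A rough Python sketch of the two orderings, for whoever fixes the example. This is not Lucene's NGramTokenizer code; the function names are made up for illustration, and the 4-5 gram sizes applied to "bicycle" are inferred from the expected output in the item above.

```python
def ngrams_by_position(text, min_gram, max_gram):
    """Order by start offset first, then by gram size (the newer ordering)."""
    out = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                out.append(text[start:start + size])
    return out


def ngrams_by_size(text, min_gram, max_gram):
    """Order by gram size first: all 4-grams, then all 5-grams (the older ordering)."""
    out = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            out.append(text[start:start + size])
    return out


print(ngrams_by_position("bicycle", 4, 5))
# ['bicy', 'bicyc', 'icyc', 'icycl', 'cycl', 'cycle', 'ycle']
print(ngrams_by_size("bicycle", 4, 5))
# ['bicy', 'icyc', 'cycl', 'ycle', 'bicyc', 'icycl', 'cycle']
```

The same position-then-size ordering, applied to each token separately, gives the N-Gram Filter output listed in item 20.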

* 9. Pg 75: Missing spaces in the Regular Expression Pattern Tokenizer's "group" attribute
description, at the boundaries between the first two sentences: "token(s).The" and "tokens.Non-negative".

* 10. Pg 72, 76, 77, etc.: Many analysis components' factory class names should be styled
with a fixed-width font.

* 11. Pg 77: UAX29 URL Email Tokenizer recognizes not only .com Internet domain names, but
also domain names including any other valid top-level domain (i.e., unlike StandardTokenizer
and ClassicTokenizer, domain names are validated against the white list drawn from the IANA
Root Zone database <> as of the last time "ant
gen-tld" was performed and the tokenizer was generated.)

* 12. Pg 77: UAX29 tokenizer: "file:://" should be "file://"

* 13. Pg 77: UAX29 tokenizer's <URL> and <EMAIL> type names are missing angle brackets.

* 14. Pg 77: UAX29 tokenizer's maxTokenLength attribute name should be styled with a fixed-width font.

* 16. Pg 79: The ASCII Folding Filter's "Out" output should have the accent stripped from
the "á" -> "a" and the ASCII
character value adjusted -> (ASCII character 97)
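
A quick way to sanity-check the corrected "Out" value, sketched in Python. This is only an approximation: it strips combining marks after Unicode decomposition, whereas Lucene's ASCIIFoldingFilter uses its own mapping table and also folds characters that have no decomposition. The helper name `strip_accents` is made up for illustration.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded = strip_accents("á")
print(folded, ord(folded))  # a 97
```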

* 17. Pg 81: The Edge N-gram Filter's 4-6 gram size example "Out" should be ("four", "scor",
"score", "twen", "twent", "twenty") - some of these are missing.

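A small Python sketch that reproduces the expected Edge N-gram output in item 17. It assumes the example input is "four score and twenty" split on whitespace (inferred from the expected tokens, not checked against the guide), and the function names are hypothetical, not Lucene's.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Prefixes of the token, from min_gram up to max_gram characters long."""
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


def edge_ngram_filter(tokens, min_gram, max_gram):
    """Apply edge_ngrams to each token in order; too-short tokens emit nothing."""
    out = []
    for token in tokens:
        out.extend(edge_ngrams(token, min_gram, max_gram))
    return out


print(edge_ngram_filter("four score and twenty".split(), 4, 6))
# ['four', 'scor', 'score', 'twen', 'twent', 'twenty']
```

Note that "and" produces no output here because it is shorter than the minimum gram size of 4.
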
* 18. Pg 83: The ICU Normalizer 2 Filter example should include the "name" and "mode" attributes
in the <filter> element.

* 20. Pg 87: The N-Gram Filter 3-5 gram size example "Out" output should be ("fou", "four",
"our", "sco", "scor", "score", "cor", "core", "ore") - rather than ordering by gram size,
output is now ordered first by position and then by gram size.

* 22. Pg 89: "encoder" argument to the Phonetic Filter has surrounding double curly brackets
instead of being styled with a fixed-width font. 

* 23. Pg 90: It should be mentioned on Porter Stem Filter that it's *four times faster* than
the English Snowball stemmer - I benchmarked it at <>

* 24. Pg 90: The Position Filter Factory is deprecated and will be removed in 5.0 - this should
be mentioned.

* 25. Pg 90: The Position Filter Factory example has the wrong token position on the second
token - it should be 2 instead of 3.

* 26. Pg 90: The "testsyns.txt" file contents are missing from Remove Duplicates Token Filter.

* 27. Pg 92: Shingle Filter is missing params "minShingleSize", "outputUnigramsIfNoShingles",
and "tokenSeparator".

* 28. Pg 93: Standard Filter: as of lucene match version 3.1, this filter is a no-op.

* 29. Pg 94: Stop Filter: the "enablePositionIncrements" arg is no longer supported as of
Lucene/Solr 4.4 - this should be mentioned, and the example showing its use should be removed.
 All of the examples need to have their positions adjusted accordingly.  Also, all language-specific
examples later in the guide should have this arg removed.

* 30. Pg 97: Word Delimiter Filter: "-hotspot" is crossed out - the leading hyphen needs to
be escaped or something.

* 31. Pg 97: WDF: Missing period+space in the "splitOnCaseChange" arg description: "XL"Example

* 32. Pg 97: WDF: "though" -> "through" in "protected" arg description.

* 40. Pg 102-105: Under Unicode Collation: (ICU)CollationFilterFactory have been deprecated
(and will be removed in 5.0) in favor of (ICU)CollationField, which will need descriptions
and examples.
** i'm not sure where/how-much to say about the new (ICU)CollationField classes -- or if it
should be on [Language Analysis] or on [Field Types Included with Solr] ?

* 43. Pg 106: Language-Specific Factories: Catalan, Danish, Irish and Romanian are missing
from the covered languages; Catalan and Irish should include ElisionFilterFactory in their
examples - there are article lists in Lucene's (Catalan,Irish)Analyzer.

* 51. Pg 114: Lao, Myanmar, Khmer: these are no longer in analysis-extras.  There should either
be an example for these here, or a pointer to another ICUTokenizerFactory example elsewhere
in the guide.

* figure out what the deal is with the last sentence of [Running Your Analyzer]...
Refer to the section [Running Field Analysis to Test Analyzers, Tokenizers, and
TokenFilters|Schema Screen#Running Field Analysis to Test Analyzers, Tokenizers, and 
TokenFilters] for more information about conducting field analysis through the Admin Web 
...there is no anchor remotely like that on [Schema Screen], and it doesn't even make sense
that there ever might have been one ... the current page is probably the closest thing to having
a "Running Field Analysis to Test Analyzers, Tokenizers, and TokenFilters" section.

* bad links: finding bad "intra-document" links is hard, because by the time the PDF is generated
the link is thrown away (it's styled like a link, but the metadata isn't there).  Likewise
for the xml export -- best i think we can do is the export, then crawl that and look
for links to files that don't exist, or to files that do exist but don't contain the anchors
listed.  I don't have anything that will do that, but by the time i figured out the metadata
for broken intra-document links wasn't in the PDF, i had already written up a little script
to dump the links, and discovered some malformed _external_ links, so i'll work on fixing those.
** hoss - in progress
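
A possible starting point for the "crawl the export" idea, sketched in Python. It assumes the export is a directory of .html files with relative links between them (that layout is an assumption, not something checked against an actual Confluence export). It reports links to files that don't exist and links whose anchor isn't defined in the target file. All names here are hypothetical; this is not the link-dumping script mentioned above.

```python
import os
from html.parser import HTMLParser
from urllib.parse import unquote, urldefrag, urlparse


class LinkAndAnchorParser(HTMLParser):
    """Collects the href targets and the anchors defined in one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        if attrs.get("id"):  # any element's id can be a link target
            self.anchors.add(attrs["id"])
        if tag == "a" and attrs.get("name"):  # legacy <a name="..."> anchors
            self.anchors.add(attrs["name"])


def parse_page(path):
    parser = LinkAndAnchorParser()
    with open(path, encoding="utf-8", errors="replace") as fh:
        parser.feed(fh.read())
    return parser


def find_broken_links(export_dir):
    """Return (source_file, href, reason) for each broken local link."""
    pages = {}
    for root, _dirs, files in os.walk(export_dir):
        for name in files:
            if name.endswith((".html", ".htm")):
                path = os.path.normpath(os.path.join(root, name))
                pages[path] = parse_page(path)

    broken = []
    for path, page in sorted(pages.items()):
        for href in page.links:
            if urlparse(href).scheme or href.startswith("//"):
                continue  # external link: not covered by this check
            target, fragment = urldefrag(href)
            if target:
                base = os.path.dirname(path)
                target_path = os.path.normpath(os.path.join(base, unquote(target)))
            else:
                target_path = path  # "#anchor" points into the same page
            if target_path not in pages:
                if not os.path.exists(target_path):
                    broken.append((path, href, "missing file"))
                continue  # exists but isn't HTML (image, PDF, ...): skip
            if fragment and unquote(fragment) not in pages[target_path].anchors:
                broken.append((path, href, "missing anchor"))
    return broken
```

Calling `find_broken_links` on the export directory returns a list that can be printed or sorted by source file. External links are skipped on purpose, since those need a separate (network) check.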
