lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
Date Thu, 12 Nov 2015 15:37:11 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002248#comment-15002248 ]

Uwe Schindler commented on LUCENE-6874:
---------------------------------------

Here is the output of the reuters test:

{noformat}
------------> Report Sum By (any) Name and Round (28 about 33 out of 34)
Operation                                                                           round  runCnt  recsPerRun       rec/s  elapsedSec   avgUsedMem   avgTotalMem
AnalyzerFactory(name:WhitespaceTokenizer,WhitespaceTokenizer(rule:java))                0       1           0        0.00        0.00    9,569,344   124,256,256
AnalyzerFactory(name:UnicodeWhitespaceTokenizer,WhitespaceTokenizer(rule:unicode))      0       1           0        0.00        0.00    9,569,344   124,256,256
Rounds_5                                                                                0       1    24493540  360,841.19       67.88   16,566,472   124,256,256
NewAnalyzer(WhitespaceTokenizer)                                                        0       1           0        0.00        0.00    9,569,344   124,256,256
[Character.isWhitespace()] WhitespaceTokenizer                                          0       1     2449354  331,038.53        7.40   22,121,256   124,256,256
Seq_20000                                                                               0       2     2449354  344,131.22       14.23   22,121,256   118,489,088
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 0       1           0        0.00        0.00   22,121,256   112,721,920
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              0       1     2449354  358,302.22        6.84   22,121,256   112,721,920
NewAnalyzer(WhitespaceTokenizer)                                                        1       1           0        0.00        0.00   12,138,024   112,721,920
[Character.isWhitespace()] WhitespaceTokenizer                                          1       1     2449354  366,724.66        6.68   22,374,536   112,721,920
Seq_20000                                                                               1       2     2449354  365,139.25       13.42   27,477,352   117,702,656
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 1       1           0        0.00        0.00   22,374,536   111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              1       1     2449354  363,567.47        6.74   32,580,168   122,683,392
NewAnalyzer(WhitespaceTokenizer)                                                        2       1           0        0.00        0.00   32,580,168   122,683,392
[Character.isWhitespace()] WhitespaceTokenizer                                          2       1     2449354  365,793.59        6.70   33,461,280   122,683,392
Seq_20000                                                                               2       2     2449354  365,112.03       13.42   33,461,280   117,178,368
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 2       1           0        0.00        0.00   33,461,280   111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              2       1     2449354  364,432.97        6.72   33,461,280   111,673,344
NewAnalyzer(WhitespaceTokenizer)                                                        3       1           0        0.00        0.00   10,836,464   111,673,344
[Character.isWhitespace()] WhitespaceTokenizer                                          3       1     2449354  367,660.47        6.66   12,451,400   111,673,344
Seq_20000                                                                               3       2     2449354  365,820.94       13.39   13,235,672   111,673,344
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 3       1           0        0.00        0.00   12,451,400   111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              3       1     2449354  363,999.69        6.73   14,019,944   111,673,344
NewAnalyzer(WhitespaceTokenizer)                                                        4       1           0        0.00        0.00   14,019,944   111,673,344
[Character.isWhitespace()] WhitespaceTokenizer                                          4       1     2449354  367,329.62        6.67   15,061,368   111,673,344
Seq_20000                                                                               4       2     2449354  365,057.59       13.42   15,813,920   111,673,344
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 4       1           0        0.00        0.00   15,061,368   111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              4       1     2449354  362,813.50        6.75   16,566,472   111,673,344
{noformat}

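As you can see, both tokenizers run at almost the same speed: in rounds 1 to 4 the Character.isWhitespace() variant processes about 366,000 to 368,000 rec/s and the UnicodeProps.WHITESPACE.get() variant about 363,000 to 364,000 rec/s.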

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: LUCENE-6874-chartokenizer.patch, LUCENE-6874-chartokenizer.patch, LUCENE-6874-jflex.patch, LUCENE-6874.patch, LUCENE_6874_jflex.patch, icu-datasucker.patch, unicode-ws-tokenizer.patch, unicode-ws-tokenizer.patch, unicode-ws-tokenizer.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to work around but why leave this trap in by default?
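
For illustration only, here is a minimal JDK-only sketch of the behavior described in the issue. It is not the code from any of the attached patches and does not use Lucene. The class name NbspWhitespaceDemo and the helpers isWhitespaceOrSpaceChar and split are hypothetical, and combining Character.isWhitespace with Character.isSpaceChar is only a rough stand-in for the Unicode whitespace property, not an exact match for it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class NbspWhitespaceDemo {

    // Hypothetical helper: accepts a code point if either JDK check does.
    // Character.isSpaceChar() includes the non-breaking spaces that
    // Character.isWhitespace() deliberately excludes.
    static boolean isWhitespaceOrSpaceChar(int cp) {
        return Character.isWhitespace(cp) || Character.isSpaceChar(cp);
    }

    // Hypothetical helper: splits text into tokens at every code point
    // accepted by isSeparator, dropping empty tokens.
    static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator.test(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "foo\u00A0bar baz"; // NBSP between "foo" and "bar"

        // Character.isWhitespace() does not split on NBSP: 2 tokens.
        System.out.println(split(text, Character::isWhitespace).size());

        // The combined check also splits on NBSP: 3 tokens.
        System.out.println(split(text, NbspWhitespaceDemo::isWhitespaceOrSpaceChar).size());
    }
}
```

With Character::isWhitespace as the separator test, "foo", the NBSP, and "bar" stay glued together as one token, which is the trap the issue describes.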



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

