lucene-java-user mailing list archives

From Parit Bansal <Parit.Ban...@sib.swiss>
Subject Dubious tokenizing with WordDelimiterGraphFilter
Date Mon, 22 Jan 2018 10:28:05 GMT
Hi,

I have a question about the tokenization performed by 
WordDelimiterGraphFilter. I am not sure whether this is a bug or whether 
I am missing some flags when setting up the filter. Please have a look. 
The Lucene version used is 6.6.1.

Here is a gist with the code: 
https://gist.github.com/parit/cecfd8f51c6d57a996d615ee82cb69a4#file-testanalyzer-java-L52

Input: cg7582pa

Expected tokens: cg7582pa <pos: 1> cg <pos: 0> 7582 <pos: 1> 
7582pa <pos: 1> pa <pos: 2>

Observed: cg7582pa <pos: 1> cg <pos: 0> 7582 <pos: 1> pa <pos: 1>

Questions:

1. Why is the token 7582pa missing when I have set all the concatenation 
flags?

2. Shouldn't the position of the first token, i.e. cg7582pa, be 0 instead 
of 1?

3. Why is the last token, i.e. pa, given a position of 2 and not 1?
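
To make questions 2 and 3 concrete, here is a minimal sketch in plain Java 
(no Lucene dependency). It assumes the <pos: N> values listed above are 
position increments, as Lucene's PositionIncrementAttribute reports them, 
and not absolute positions; that reading is an assumption, not something 
confirmed by the output itself. The class and method names are 
hypothetical, and the starting value of -1 follows the usual Lucene 
indexing convention, so a first token with increment 1 lands at 
position 0.

```java
import java.util.ArrayList;
import java.util.List;

class PositionSketch {

    // Hypothetical helper: converts per-token position increments into
    // absolute positions. Assumption: the running position starts at -1,
    // so an increment of 1 on the first token yields position 0.
    static List<Integer> absolutePositions(int[] increments) {
        List<Integer> positions = new ArrayList<>();
        int position = -1;
        for (int increment : increments) {
            position += increment;
            positions.add(position);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Observed tokens and their <pos: N> values, read as increments.
        String[] terms = {"cg7582pa", "cg", "7582", "pa"};
        int[] observedIncrements = {1, 0, 1, 1};

        List<Integer> positions = absolutePositions(observedIncrements);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " -> position " + positions.get(i));
        }
    }
}
```

Under that reading, the observed values 1, 0, 1, 1 would place cg7582pa 
and cg at position 0, 7582 at position 1, and pa at position 2. If the 
values are absolute positions instead, this sketch does not apply.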

Looking forward to your suggestions.

- Best

Parit Bansal



