opennlp-dev mailing list archives

From Damiano Porta
Subject Tokenizer for NER training
Date Thu, 02 Mar 2017 16:46:02 GMT
Hello everybody,

I have created a custom tokenizer that does not split specific "patterns"
such as emails, telephone numbers, dates, etc. It converts each of them into a single token.
The other parts of text are tokenized with the

The problem arises when I need to train a NER model. For example, if my data has
dates like 2017 03 02, each one will be split into three tokens (by the whitespace
tokenizer). I must avoid that.
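
To make the behaviour concrete, here is a minimal sketch in plain Java of a tokenizer that keeps a protected pattern as one token and splits everything else on whitespace. It is only an illustration: it does not use the OpenNLP API, and the class name `PatternTokenizerSketch`, the method `tokenize`, and the two regular expressions (dates written as "yyyy MM dd" and simple email addresses) are assumptions made for this example, not the actual tokenizer discussed in this message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not an OpenNLP class and not the tokenizer from this message.
final class PatternTokenizerSketch {

    // Assumed protected patterns: "yyyy MM dd" dates and simple email addresses.
    private static final Pattern PROTECTED = Pattern.compile(
            "\\b\\d{4} \\d{2} \\d{2}\\b|[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+");

    // Returns the tokens of the text, emitting each protected match as one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PROTECTED.matcher(text);
        int last = 0;
        while (m.find()) {
            // Text before the match is split on whitespace as usual.
            addWhitespaceTokens(text.substring(last, m.start()), tokens);
            // The match itself stays whole, even if it contains spaces.
            tokens.add(m.group());
            last = m.end();
        }
        addWhitespaceTokens(text.substring(last), tokens);
        return tokens;
    }

    // Splits a chunk on runs of whitespace and skips empty pieces.
    private static void addWhitespaceTokens(String chunk, List<String> out) {
        for (String t : chunk.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
    }
}
```

With this sketch, a date such as 2017 03 02 comes out as one token containing spaces, which is exactly what a purely whitespace-based split of the training data would break apart again.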

Can I specify the tokenizer when using the TokenNameFinderTrainer tool?

