opennlp-dev mailing list archives

From Damiano Porta <damianopo...@gmail.com>
Subject Tokenizer for NER training
Date Thu, 02 Mar 2017 16:46:02 GMT
Hello everybody,

I have created a custom tokenizer that does not split specific "patterns"
such as emails, telephone numbers, dates, etc. Each match is kept as ONE
single token. The remaining parts of the text are tokenized with the
SimpleTokenizer.
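
A minimal sketch of this kind of tokenizer, for illustration only. It is not
the actual implementation and does not use the OpenNLP API: the class name
PatternAwareTokenizer, the two protected patterns (email and "yyyy mm dd"
date), and the regex standing in for SimpleTokenizer-style character-class
splitting are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternAwareTokenizer {

    // Hypothetical protected patterns; a real tokenizer would carry more
    // (telephone numbers, other date layouts, and so on).
    private static final Pattern PROTECTED = Pattern.compile(
        "[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+"   // email address
        + "|\\b\\d{4} \\d{2} \\d{2}\\b"      // date written as "yyyy mm dd"
    );

    // Rough stand-in for character-class tokenization: a run of letters,
    // a run of digits, or one other non-whitespace character.
    private static final Pattern SIMPLE =
        Pattern.compile("\\p{L}+|\\p{N}+|[^\\p{L}\\p{N}\\s]");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PROTECTED.matcher(text);
        int last = 0;
        while (m.find()) {
            // Text before the protected match goes through the fallback splitter.
            simpleTokenize(text.substring(last, m.start()), tokens);
            // The protected match itself is emitted as one token, spaces included.
            tokens.add(m.group());
            last = m.end();
        }
        simpleTokenize(text.substring(last), tokens);
        return tokens;
    }

    private static void simpleTokenize(String chunk, List<String> out) {
        Matcher m = SIMPLE.matcher(chunk);
        while (m.find()) {
            out.add(m.group());
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Mail john@example.com on 2017 03 02."));
    }
}
```

With this sketch, the sample sentence yields five tokens, and the date
"2017 03 02" stays together as one token even though it contains spaces.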

The problem arises when I need to train a NER model. For example, if my data
contains dates like 2017 03 02, the whitespace tokenizer splits each of them
into three tokens, which I must avoid.

Can I specify the tokenizer when using the TokenNameFinderTrainer tool?

Thanks
Damiano
