opennlp-dev mailing list archives

From Damiano Porta
Subject How to train a Tokenizer for emails ?
Date Mon, 29 Aug 2016 13:11:07 GMT
I am creating a custom tokenizer. It works pretty well, but I have problems
with email addresses.
Email addresses can contain the characters _ - . which are split into
separate tokens in normal text, so the question is: how can I train it better?
After tokenization I need to apply different regexes to extract
emails/dates/telephone numbers, so such patterns must not be split into tokens.
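
To make the requirement concrete, here is a minimal sketch of one possible workaround, not a way of training the OpenNLP model itself: match email-like spans with a regex first, emit each as a single token, and tokenize only the text in between. It uses the Java standard library only. The class name ProtectedTokenizer, the EMAIL pattern, and the SIMPLE_TOKEN stand-in for the real trained tokenizer are all hypothetical, and the email pattern is deliberately simplistic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProtectedTokenizer {
    // Hypothetical, deliberately simple email pattern (not RFC 5322 compliant).
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    // Stand-in for the trained tokenizer (assumption): a run of word
    // characters, or a single non-space punctuation character.
    private static final Pattern SIMPLE_TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Emits each email-like span as one token and tokenizes the rest normally.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        int last = 0;
        while (m.find()) {
            tokenizePlain(text.substring(last, m.start()), tokens);
            tokens.add(m.group()); // protected span, kept whole
            last = m.end();
        }
        tokenizePlain(text.substring(last), tokens);
        return tokens;
    }

    // Tokenizes text that contains no protected span.
    private static void tokenizePlain(String chunk, List<String> out) {
        Matcher m = SIMPLE_TOKEN.matcher(chunk);
        while (m.find()) {
            out.add(m.group());
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Write to john_doe-1@example.co.uk, please."));
    }
}
```

In this sketch the address john_doe-1@example.co.uk comes out as one token, while the comma and the final period are split off as usual. The same pre-pass could be extended with further patterns for dates and telephone numbers.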

