opennlp-dev mailing list archives

From: Damiano Porta <damianopo...@gmail.com>
Subject: Re: Tokenizer for NER training
Date: Thu, 02 Mar 2017 17:24:09 GMT
Hello Daniel, yes, exactly, that is what I do: I am using regexes to find
those patterns.
Daniel, is this problem only related to the TokenNameFinderTrainer tool? If I
train the model via code, should I use my custom tokenizer instead?
If not, I will follow your solution and use underscores.
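
A minimal sketch of that underscore trick might look like this (the date
regex, class name, and sample sentence are only illustrative; whitespace
tokenization is assumed, since that is what the trainer's data format uses):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.WhitespaceTokenizer;

public class ProtectedTokenizerSketch {

    // Illustrative pattern only; a real setup would also cover phone
    // numbers, emails, etc.
    private static final Pattern DATE =
        Pattern.compile("\\b\\d{4} \\d{2} \\d{2}\\b");

    public static String[] tokenize(String sentence) {
        Matcher m = DATE.matcher(sentence);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Replace the internal spaces so the match survives
            // whitespace tokenization as one token.
            m.appendReplacement(sb, m.group().replace(' ', '_'));
        }
        m.appendTail(sb);
        // "2017 03 02" now comes back as the single token "2017_03_02".
        return WhitespaceTokenizer.INSTANCE.tokenize(sb.toString());
    }
}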

Thanks
Damiano
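
For reference, training through the API rather than the command line tool
would look roughly like the sketch below (OpenNLP 1.6+ assumed; the tokens,
span, and parameters are made up). The API consumes already tokenized
NameSample objects, so whatever custom tokenization produced the tokens is
what the model sees:

import java.io.IOException;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.ObjectStreamUtils;
import opennlp.tools.util.Span;
import opennlp.tools.util.TrainingParameters;

public class TrainWithCustomTokensSketch {

    public static void main(String[] args) throws IOException {
        // Tokens as produced by the custom tokenizer: the date is ONE token.
        String[] tokens =
            {"Contract", "signed", "on", "2017_03_02", "in", "Rome", "."};

        // Token 3 (the protected date) is labeled as a "date" entity.
        Span[] names = {new Span(3, 4, "date")};

        // A real model needs many samples; one is only enough for a sketch.
        ObjectStream<NameSample> samples =
            ObjectStreamUtils.createObjectStream(
                new NameSample(tokens, names, true));

        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.CUTOFF_PARAM, "1");

        TokenNameFinderModel model = NameFinderME.train(
            "en", "date", samples, params, new TokenNameFinderFactory());

        // At prediction time the same custom tokenization must be applied
        // before calling new NameFinderME(model).find(tokens).
    }
}

The same idea applies to the underscore route: as long as the training tokens
and the runtime tokens come from the same tokenization, the name finder never
needs to know how they were produced.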

2017-03-02 18:00 GMT+01:00 Russ, Daniel (NIH/CIT) [E] <druss@mail.nih.gov>:

> Hi Damiano,
>    In general this is a difficult problem (making n-grams from unigrams).
> Have you considered using regexes to find your dates/phone numbers etc. and
> protecting them from the tokenizer, i.e. replacing the whitespace inside them
> with a printable (though possibly not alphanumeric) character such as an
> underscore?
> Daniel
>
> On 3/2/17, 11:46 AM, "Damiano Porta" <damianoporta@gmail.com> wrote:
>
>     Hello everybody,
>
>     I have created a custom tokenizer that does not split specific
>     "patterns" such as emails, telephone numbers, dates, etc.; it converts
>     each of them into ONE single token. The other parts of the text are
>     tokenized with the SimpleTokenizer.
>
>     The problem arises when I need to train a NER model. For example, if my
>     data contains dates like 2017 03 02, these are split into three tokens
>     (by the whitespace tokenizer), which I must avoid.
>
>     Can I specify the tokenizer when using the TokenNameFinderTrainer tool?
>
>     Thanks
>     Damiano
>
>
>
