opennlp-dev mailing list archives

From "Manoj B. Narayanan" <manojb.narayanan2...@gmail.com>
Subject Re: Dictionary
Date Fri, 21 Jul 2017 10:54:06 GMT
Hi Jim,
Thanks for replying. Could you be more specific, please?

These are the things that I am aware of:
1. The training data can be of the form  <START:person> Pierre Vinken <END>
is a good example .
2. Currently I use a file in the format below and create a 'Dictionary'
from it:

<entry><token>vinayak</token></entry>
<entry><token>rakesh</token></entry>
<entry><token>sandeep</token></entry>
<entry><token>manoj</token></entry>

I then use this dictionary in the DictionaryNameFinder.
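
To make the idea concrete, here is a minimal sketch of a dictionary-based lookup over a file like the one above. It is not OpenNLP's implementation and uses no OpenNLP API, only the Python standard library. The enclosing <dictionary> root element, the lower-casing of tokens, and the helper names load_entries and find_names are all assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the <entry> elements are wrapped in a single root element,
# here called <dictionary>, so that the file is well-formed XML.
DICTIONARY_XML = """<dictionary>
<entry><token>vinayak</token></entry>
<entry><token>rakesh</token></entry>
<entry><token>sandeep</token></entry>
<entry><token>manoj</token></entry>
</dictionary>"""


def load_entries(xml_text):
    """Return a set of token tuples, one tuple per <entry> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    entries = set()
    for entry in root.iter("entry"):
        # An entry may hold several <token> children (multi-word names).
        tokens = tuple((t.text or "").lower() for t in entry.findall("token"))
        if tokens:
            entries.add(tokens)
    return entries


def find_names(tokens, entries):
    """Return (start, end) token spans, end exclusive, preferring the longest match."""
    max_len = max((len(e) for e in entries), default=0)
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate window first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in entries:
                match_end = i + n
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end))
            i = match_end
    return spans


if __name__ == "__main__":
    entries = load_entries(DICTIONARY_XML)
    print(find_names("Yesterday Manoj met Rakesh .".split(), entries))
```

Case is ignored here only because the sketch lower-cases both sides; whether the real tool does so depends on how the dictionary is configured.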

I would like to know the advantages of using this format. Are there any
other formats available?

Could you please explain more?

Thanks.
Manoj

On Fri, Jul 21, 2017 at 3:56 PM, Jim O'Regan <jaoregan@tcd.ie> wrote:

> 2017-07-19 10:48 GMT+01:00 Manoj B. Narayanan <
> manojb.narayanan2011@gmail.com>:
>
> > Hi all,
> >
> > I wanted to find out if there is any specific reason behind using XML
> > format for dictionaries for Name Finder.
> >
>
> It's not XML. There is a very superficial similarity in the use of <>, but,
> at a minimum
> <START:person> Pierre Vinken <END>
> would need to be something like
> <name type="person"> Pierre Vinken </name>
> and the whole document would need to be enclosed by a pair of tags.
>
>
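
The point above can be checked with a small sketch: the <START:type> ... <END> annotation is rejected by an XML parser, while a converted form with paired tags and an enclosing element is accepted. This uses only the Python standard library. The <doc> wrapper element and the helper names to_xml and is_well_formed are hypothetical, and the conversion assumes the sentence contains no other characters that need XML escaping.

```python
import re
import xml.etree.ElementTree as ET

# Matches one annotated span such as "<START:person> Pierre Vinken <END>".
SPAN = re.compile(r"<START:(\w+)>\s*(.*?)\s*<END>")


def to_xml(line):
    """Rewrite name-finder style annotations as paired <name type="..."> tags.

    Assumption: the rest of the line has no '&' or '<' that would need escaping.
    """
    body = SPAN.sub(
        lambda m: '<name type="{}"> {} </name>'.format(m.group(1), m.group(2)),
        line,
    )
    # The whole document needs a single enclosing element.
    return "<doc>{}</doc>".format(body)


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    sample = "<START:person> Pierre Vinken <END> is a good example ."
    print(is_well_formed("<doc>" + sample + "</doc>"))  # annotation as-is
    print(is_well_formed(to_xml(sample)))               # converted form
    print(to_xml(sample))
```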
> > Also, is there any source from where we can get the documentation regarding
> > the dictionary formats for various tools (tokenizer, pos, name finder).
> >
>
> The manual: https://opennlp.apache.org/docs/1.8.1/manual/opennlp.html
> More specifically,
> tokeniser:
> https://opennlp.apache.org/docs/1.8.1/manual/opennlp.html#tools.tokenizer.training
> pos:
> https://opennlp.apache.org/docs/1.8.1/manual/opennlp.html#tools.postagger.training
> name finder:
> https://opennlp.apache.org/docs/1.8.1/manual/opennlp.html#tools.namefind.training
>
