opennlp-dev mailing list archives

From Damiano Porta <damianopo...@gmail.com>
Subject Re: Is sentence detection process really needed?
Date Fri, 26 Aug 2016 16:15:47 GMT
Thanks Joern!
If I have understood you correctly:
if I do not need any relation between sentences, I can skip the sentence
detection step, right?

On 26 Aug 2016 16:33, "Joern Kottmann" <kottmann@gmail.com> wrote:

> The name finder has the concept of "adaptive data" in its feature
> generation. The feature generators can remember things from previous
> sentences and use them to generate features. Usually that can
> help with the recognition rate if you have names that are repeated. You
> can tweak this to your data, or just pass in the entire document.
>
> Jörn
>
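The "adaptive data" idea described above can be illustrated with a minimal sketch. This is not OpenNLP's implementation; the class name `PreviousOutcomeMemory` and the feature string `prev_tag=` are hypothetical, chosen only to show how a tag assigned in an earlier sentence can become a feature for a later one, and why the memory is reset at a document boundary.

```python
class PreviousOutcomeMemory:
    """Toy stand-in for adaptive data: remembers the last tag given to each token."""

    def __init__(self):
        self._last_tag = {}

    def features(self, tokens):
        # One feature per token: the tag this token received earlier, if any.
        return [f"prev_tag={self._last_tag.get(tok, 'none')}" for tok in tokens]

    def update(self, tokens, tags):
        # Called after a sentence has been tagged.
        for tok, tag in zip(tokens, tags):
            self._last_tag[tok] = tag

    def clear(self):
        # Called at a document boundary so names do not leak across documents.
        self._last_tag.clear()


memory = PreviousOutcomeMemory()
memory.update(["John", "Smith", "wrote", "this"], ["START", "IN", "OTHER", "OTHER"])
second_features = memory.features(["Smith", "lives", "here"])
memory.clear()
cleared_features = memory.features(["Smith"])
```

In this sketch "Smith" carries the feature `prev_tag=IN` in the second sentence because it was part of a name in the first, and loses it once the memory is cleared.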
> On Fri, Aug 26, 2016 at 3:25 PM, Damiano Porta <damianoporta@gmail.com>
> wrote:
>
> > Hi!
> > Yes, I can train a good model (though it will surely take a lot of time);
> > I have 30k resumes, so the "data" isn't a problem.
> > I have thought about many things. I am also creating a custom feature
> > generator, with a dictionary (for names) and a regex for birthdays; the
> > machine learning will then look at their contexts.
> > So now I need to separate the sentences to create a custom model.
> > At this point I will not try the one-CV-per-line approach.
> >
> > On 26 Aug 2016 15:10, "Russ, Daniel (NIH/CIT) [E]" <druss@mail.nih.gov> wrote:
> >
> > Hi Damiano,
> >    I am not sure that the NameFinder will be effective as-is for you.  Do
> > you have training data (and I mean a lot of training data)?  You need to
> > consider which features are useful in your case.  You might consider a
> > feature such as the line number on the page (since people tend to put their
> > name on the top or second line), or maybe the font size.  You can add a
> > dictionary of common names and have a feature “inDictionary”.  You will
> > have to use your domain knowledge to help you here.
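As a rough illustration of the features suggested above, here is a minimal sketch. The dictionary contents, the bucketing of line numbers, and the function name `token_features` are all assumptions made for the example; only the feature name “inDictionary” comes from the message.

```python
# Hypothetical dictionary of common first names (lower-cased).
COMMON_NAMES = {"john", "maria", "damiano"}


def token_features(token, line_number):
    """Return string features for one token, given its 1-based line number."""
    # Lines 1 to 3 get their own feature; everything lower on the page is lumped together.
    bucket = str(line_number) if line_number <= 3 else "4+"
    feats = [f"line={bucket}"]
    if token.lower() in COMMON_NAMES:
        feats.append("inDictionary")
    # Initial capital but not all caps.
    if token[:1].isupper() and not token.isupper():
        feats.append("initCap")
    return feats


name_feats = token_features("John", 1)
other_feats = token_features("experience", 12)
```

A classifier trained on such features can learn that a capitalized dictionary word on the first line is far more likely to be the author's name than the same word deep in the document.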
> >
> >   For the birthday, you may want to consider using a regex to pick out dates.
> > Then look at the context around the date (words before/after; discard it
> > after "graduated" or if another date comes just before) or maybe at the
> > years before the present year (if you are looking at resumes, you probably
> > won’t find any 5-year-olds or 200-year-olds).
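The regex-plus-context idea could look roughly like the sketch below. The date pattern (day, month, four-digit year), the 30-character context window, and the age limits are assumptions picked for the example, not values from the thread.

```python
import re
from datetime import date

# Assumed date layout: D/M/YYYY with "/", "." or "-" as separator.
DATE_RE = re.compile(r"\b(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})\b")


def birthday_candidates(text, today, min_age=14, max_age=100):
    """Return date strings that could plausibly be a birthday."""
    candidates = []
    for match in DATE_RE.finditer(text):
        age = today.year - int(match.group(3))
        if not (min_age <= age <= max_age):
            continue  # implausible age for a resume author
        before = text[max(0, match.start() - 30):match.start()].lower()
        if "graduated" in before:
            continue  # the preceding context points to a graduation date
        candidates.append(match.group(0))
    return candidates


sample = "Born: 12/03/1985. Graduated on 15/07/1999. Updated 01/02/2016."
found = birthday_candidates(sample, today=date(2016, 8, 26))
```

Passing `today` explicitly keeps the age check reproducible. In a real feature generator the surviving candidates would feed the classifier as features rather than being returned as the final answer.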
> >
> > Daniel Russ, Ph.D.
> > Staff Scientist, Office of Intramural Research
> > Center for Information Technology
> > National Institutes of Health
> > U.S. Department of Health and Human Services
> > 12 South Drive
> > Bethesda,  MD 20892-5624
> >
> > On Aug 26, 2016, at 5:57 AM, Damiano Porta <damianoporta@gmail.com> wrote:
> >
> > Hi Daniel!
> >
> > Thank you so much for your opinion.
> > It makes perfect sense, but I am still a bit confused about the length
> > of the sentences.
> > In a resume there are many names, dates, etc., so my doubt concerns
> > the structure of the sentences, because they sometimes follow specific
> > patterns.
> >
> > For example, I need to extract the personal name (of whoever wrote the
> > resume), the birthday, etc.
> >
> > As you know, there are many names and dates inside a resume, so I thought
> > about writing the entire resume as one sentence, in order to also train
> > on the approximate "position" of the entities. If I "decompose" the whole
> > resume into sentences, I will lose this information, no?
> >
> > Damiano
> >
> > On 25 Aug 2016 16:26, "Russ, Daniel (NIH/CIT) [E]" <druss@mail.nih.gov> wrote:
> >
> > Hi Damiano,
> >
> >     Everyone can feel free to correct my ignorance, but I view the
> > name finder as follows.
> >
> >     I look at it as walking down the sentence and classifying words as
> > “NOT IN NAME” until I hit the start of a name, then it is “START NAME”,
> > followed by “STILL IN NAME” until “NOT IN NAME”.  Take the sentence “Did
> > John eat the stew”.  Starting with the first word in the sentence, decide
> > what the odds are that it starts a person’s name (given that it is the
> > first word in the sentence, happens to be “Did”, and starts with a
> > capital but is not all caps).  Then go to the next word in the sentence.
> > If the first word was not in a name, what are the odds that the second
> > word starts a name (given that the previous word did not start a name, the
> > word starts with a capital (but is not all capitals), the word is “John”,
> > and the previous word is “Did”)?  If it decides that we are starting a
> > name at “John”, we are now looking for the end.  What are the odds that
> > “eat” is part of the name given that [“Did”: was not part of the name, was
> > capitalized] and that [“John”: was the first word in the name, was
> > capitalized]?  You are essentially classifying [Did <- OTHER] [John
> > <- START] [eat <- OTHER] [the <- OTHER] [stew <- OTHER].  If it were “Did
> > John Smith eat the stew”, you would have [Did <- OTHER] [John <- START]
> > [Smith <- IN] [eat <- OTHER] [the <- OTHER] [stew <- OTHER].  There are
> > features other than just the word, the previous word, and the shape (first
> > letter capitalized, all letters capitalized).  I think the name finder
> > uses part of speech also.
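The START / IN / OTHER labelling walked through above can be sketched in a few lines. The helper names `tag_tokens` and `context_features` are hypothetical, and the feature set is limited to the ones mentioned in the message (word, previous word, shape) plus the previous tag; it is an illustration, not the name finder's actual feature set.

```python
def tag_tokens(tokens, name_spans):
    """Label tokens given name spans as (start, end) token indices, end exclusive."""
    tags = ["OTHER"] * len(tokens)
    for start, end in name_spans:
        tags[start] = "START"
        for i in range(start + 1, end):
            tags[i] = "IN"
    return tags


def context_features(tokens, i, previous_tags):
    """Features for token i; depends on the word before it and on that word's tag."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "init_cap": tok[:1].isupper() and not tok.isupper(),
        "prev_tag": previous_tags[i - 1] if i > 0 else "<s>",
    }


tokens = "Did John Smith eat the stew".split()
tags = tag_tokens(tokens, [(1, 3)])
smith_context = context_features(tokens, 2, tags)
```

The decision for “Smith” sees that the previous word “John” was tagged START, which is the dependence on earlier classifications within the sentence that the message describes.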
> >
> >
> >    So you see that it is not a name lookup table, but dependent on the
> > previous classification of words earlier in the sentence.  Therefore, you
> > must have sentences. Does that help?
> > Daniel
> >
> > On Aug 25, 2016, at 9:55 AM, Damiano Porta <damianoporta@gmail.com> wrote:
> >
> > Hello everybody!
> >
> > Could someone explain why I should separate each sentence of my documents
> > to train my models?
> > My documents are resumes/CVs, and the sentences can be very different.
> > For example, a sentence could also be:
> >
> > 1. Name: John
> > 2. Surname: travolta
> >
> > Etc.
> > So my question is: what is the problem if I train my models
> > (name finder, tokenizer) with each complete resume/CV on one line?
> >
> > Could it be a problem?
> > In that case, when I want to tokenize the resume and run NER, I
> > will simply pass the complete resume text, skipping the "sentence
> > detection" process.
> >
> > Thanks for your opinion in advance!
> >
> > Best
> > Damiano
> >
>
