opennlp-dev mailing list archives

From Riccardo Tasso <riccardo.ta...@gmail.com>
Subject Re: [Name Finder] Wikipedia training set and parameters tuning
Date Fri, 27 Jan 2012 18:09:27 GMT
On 26/01/2012 20:39, Olivier Grisel wrote:
> You should use the DBpedia NTriples dumps instead of parsing the
> wikipedia template as done in https://github.com/ogrisel/pignlproc .
> The type information for person, places and organization is very good.
Ok, it will be my next step.
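
As a first rough sketch of that step (file name and ontology URIs are assumptions on my side, not pignlproc code), I would simply scan the DBpedia instance_types NTriples dump and keep the resources typed as Person, Place or Organisation:

import java.io.BufferedReader;
import java.io.FileReader;

// Rough sketch: filter the DBpedia instance_types NTriples dump down to the
// resources typed as Person, Place or Organisation. The file name and the
// exact type URIs are placeholders, to be adjusted to the actual dump.
public class DBpediaTypeFilter {

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("instance_types_en.nt"));
        String line;
        while ((line = in.readLine()) != null) {
            // An NTriples line looks like:
            // <http://dbpedia.org/resource/Milan> <...rdf-syntax-ns#type> <http://dbpedia.org/ontology/Place> .
            String[] parts = line.split(" ");
            if (parts.length < 3) {
                continue;
            }
            String subject = parts[0];
            String object = parts[2];
            if (object.endsWith("/Person>") || object.endsWith("/Place>")
                    || object.endsWith("/Organisation>")) {
                System.out.println(subject + "\t" + object);
            }
        }
        in.close();
    }
}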

> I don't think it's a huge problem for training, but it is indeed a
> problem for the performance evaluation: if you use some held-out
> folds from this dataset for performance evaluation (precision, recall,
> f1-score of the trained NameFinder model), then the fact that the dataset
> itself is missing annotations will artificially increase the false
> positive rate estimate, which will have a potentially great impact on
> the evaluation of the precision. The actual precision should be
> higher than what's measured.
My feeling is that if I train the model with sentences that are missing 
annotations, these will worsen the performance of my model. Isn't that so?

> I think the only way to fix this issue is to manually fix the
> annotations of a small portion of the automatically generated dataset
> to add the missing annotations. I think we probably need 1000
> sentences per type to get a non ridiculous validation set.
>
> Besides performance evaluation, the missing annotation issue will also
> bias the model towards negative responses, hence increasing the false
> negative rate and decreasing the true recall of the model.

That's exactly what I mean. The fact is that in our interpretation of 
Wikipedia, not all the sentences are annotated. That is because not all 
the sentences containing an entity require linking. So I'm thinking of 
using only a better subset of my sentences (since there are so many of them). 
From this comes the idea of sampling only featured pages: stubs or poor pages 
are more likely to be poorly annotated.

The idea may also be extended with the other proposal, which I'll try to 
explain with an example. Imagine a page about a vegetable. If a city 
appears in a sentence on this page, it may well appear unlinked 
(i.e. not annotated), since it is not closely related to the topic of the 
article. Conversely, I suspect that in a page about geography, places 
are tagged more frequently. This is obviously a hypothesis, which should 
be verified.

Another idea is to use only sentences containing links to the 
entities of interest (a rough sketch follows the examples). For example:
* "[[Milan|Milan]] is an industrial city" becomes "<place>Milan</place> 
is an industrial city"
* "[[Paris|Paris Hilton]] was drunk last Friday." becomes "Paris was 
drunk last Friday" (this sentence is kept because the link text is in 
the list of candidates to be tagged as places, but in this case the 
link target suggests it isn't one, hence it is a good negative example)
* "Paris is a very touristic city." is discarded because it doesn't 
contain any interesting link.
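
To make the filtering concrete, here is a minimal sketch. It follows the [[surface text|link target]] notation of the examples above, and the place page list and surface form list are hypothetical inputs (e.g. derived from the DBpedia types). I emit OpenNLP's <START:place> ... <END> training format directly rather than the <place> tags above:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkToAnnotation {

    // Matches piped links written as [[surface text|link target]].
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^|\\]]+)\\|([^\\]]+)\\]\\]");

    // Converts a wiki sentence to name finder format, or returns null if the
    // sentence contains no interesting link and should be discarded.
    static String convert(String sentence, Set<String> placePages, Set<String> placeSurfaceForms) {
        Matcher m = LINK.matcher(sentence);
        StringBuffer out = new StringBuffer();
        boolean keep = false;
        while (m.find()) {
            String surface = m.group(1);
            String target = m.group(2);
            String replacement;
            if (placePages.contains(target)) {
                keep = true; // positive example: the link points to a place page
                replacement = "<START:place> " + surface + " <END>";
            } else if (placeSurfaceForms.contains(surface)) {
                keep = true; // negative example: place-looking text, non-place target
                replacement = surface;
            } else {
                replacement = surface; // irrelevant link, keep the plain text
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return keep ? out.toString() : null;
    }

    public static void main(String[] args) {
        Set<String> placePages = new HashSet<String>(Arrays.asList("Milan", "Paris"));
        Set<String> placeSurfaces = new HashSet<String>(Arrays.asList("Milan", "Paris"));
        // "<START:place> Milan <END> is an industrial city."
        System.out.println(convert("[[Milan|Milan]] is an industrial city.", placePages, placeSurfaces));
        // "Paris was drunk last Friday." (kept as a negative example)
        System.out.println(convert("[[Paris|Paris Hilton]] was drunk last Friday.", placePages, placeSurfaces));
        // null (discarded: no interesting link)
        System.out.println(convert("Paris is a very touristic city.", placePages, placeSurfaces));
    }
}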



> In my first experiment reported in [1] I had not taken the wikipedia
> redirect links into account, which probably aggravated this problem
> even further. The current version of the pig script has been fixed
> w.r.t. redirect handling [2], but I have not found the time to rerun a
> complete performance evaluation. This will solve frequent
> classification errors such as "China", which is redirected to "People's
> Republic of China" in Wikipedia. So just handling the redirects may
> improve the quality of the data, and hence the trained model, by quite a
> bit.
>
> [1] http://dev.blogs.nuxeo.com/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
> [2] https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/02_dbpedia_article_types.pig#L22
>
> Also note that the perceptron model was not available when I ran this
> experiment. It's probably more scalable, especially memory-wise, and would
> be well worth trying again.

In my case I can handle redirects too, and I'll surely also try the 
perceptron model.
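
For the perceptron test, something like the following should work, assuming OpenNLP 1.5.x and a training file already in name finder format (the file name and the iteration/cutoff values are just placeholders to start from):

import java.io.FileInputStream;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

public class PerceptronPlaceTraining {

    public static void main(String[] args) throws Exception {
        // Training data in name finder format, e.g.
        // "<START:place> Milan <END> is an industrial city ."
        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("wikipedia-place.train"), Charset.forName("UTF-8"));
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        // Start from the defaults and switch the trainer to the perceptron.
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");
        params.put(TrainingParameters.ITERATIONS_PARAM, "300");
        params.put(TrainingParameters.CUTOFF_PARAM, "0");

        TokenNameFinderModel model;
        try {
            // null feature generator = use the default feature generation
            model = NameFinderME.train("en", "place", samples, params,
                    (AdaptiveFeatureGenerator) null,
                    Collections.<String, Object>emptyMap());
        } finally {
            samples.close();
        }
        // model can then be serialized with model.serialize(...)
    }
}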

> In my experience the DBpedia type links for Person, Place and 
> Organization are very good quality. No false positives, there might be 
> some missing links though. It might be interesting to do some manual 
> checking of the top 100 recurring false positive names after a first 
> round of DBpedia extraction => model training => model evaluation on 
> held out data. Then if a significant portion of those false positive 
> names are actually missing type info in DBpedia or in the redirect 
> links, add them manually and iterate. 

Ok, now I have a lot of ideas for customizing my experiments. Of course 
I will publish my results as soon as I run my tests. However, I'd also 
like to go more in depth on the training parameters, so the 
discussion goes on :)
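
For the held-out evaluation step you describe, a minimal sketch with OpenNLP's TokenNameFinderEvaluator (again, file names are placeholders):

import java.io.FileInputStream;
import java.nio.charset.Charset;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderEvaluator;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class HeldOutEvaluation {

    public static void main(String[] args) throws Exception {
        // Previously trained model and a held-out file in name finder format.
        TokenNameFinderModel model = new TokenNameFinderModel(
                new FileInputStream("en-ner-place.bin"));

        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("wikipedia-place.heldout"), Charset.forName("UTF-8"));
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        TokenNameFinderEvaluator evaluator =
                new TokenNameFinderEvaluator(new NameFinderME(model));
        try {
            evaluator.evaluate(samples);
        } finally {
            samples.close();
        }

        // Precision, recall and F1 on the held-out data; keep in mind that
        // missing annotations push the measured precision below the real one.
        System.out.println(evaluator.getFMeasure());
    }
}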


> Anyway if you are interested in reviving the annotation sub-project, 
> please feel free to do so: 
> https://cwiki.apache.org/OPENNLP/opennlp-annotations.html We need a 
> database of annotated open data text (wikipedia, wikinews, Project 
> Gutenberg...) with human validation metadata and a nice Web UI to 
> maintain it. 

I think it would be a great thing, and also a piece of work that requires 
a good design phase (a mistake here could lead to a lot of problems in the 
future). I'll think about contributing to the project, but it certainly 
won't be immediate.

Thanks
     Riccardo

