incubator-stanbol-dev mailing list archives

From Luca Dini <d...@celi.it>
Subject Re: Problem trying to create a new dbpedia index and site in Italian.
Date Thu, 01 Mar 2012 14:59:29 GMT
Dear Stefano,
I am also new to the list, and we are also working within the early
adoption program. If I understand correctly, the problem is the lack of
an appropriate named-entity extraction engine for Italian; without one,
I am afraid the results will always be disappointing. As part of our
project we will integrate NER enhancement services for Italian and
French (and possibly keyword extraction), so hopefully you will be able
to benefit from the power of Stanbol. Timing may be a problem, as it is
not clear whether, within the short project window, it will be possible
to feed our integration into yours. Is the unavailability of Italian NER
a blocking factor for you, or can you go on with development while
waiting for the integration?

Cheers,
Luca

On 01/03/2012 14:49, Stefano Norcia wrote:
> Hi all,
>
> My name is Stefano Norcia and I'm working on the early adoption project for
> Etcware.
>
> For our early adoption project (Etcware Early Adoption project) we need to
> use an Italian-language DBPedia index
> in the enhancement and enrichment process enabled by the Stanbol
> engines.
>
> The main problem is that the NLP module does not support the Italian
> language directly, so if you submit an Italian
> text to the enhancement engine, the dbpedia engine does not detect any
> concepts, places, or people.
>
> We have run some experiments to achieve this goal:
>
> Our first attempt was to rebuild the dbpedia index following the instructions
> found in the stanbol/
> entityhub/indexing/dbpedia folder. In this folder there is a shell script
> (fetch_prepare.sh) that
> describes how to prepare the dbpedia datasets before creating the index. We
> followed those
> instructions and tried to create a new index and "site" from the Italian
> dbpedia datasets, to replace the standard
> English dbpedia index. We are aware that the
> Italian datasets are not
> complete and that some packages are missing (like persondata_en.nt.bz2 and
> so on).
> These are the packages we used to create the index (
> http://downloads.dbpedia.org/3.7/it/):
>
> o dbpedia_3.7.owl.bz2
> o geo_coordinates_it.nt.bz2
> o instance_types_it.nt.bz2
> o labels_it.nt.bz2
> o long_abstracts_it.nt.bz2
> o short_abstracts_it.nt.bz2
>
> We were also able to create the incoming_links text file from the package
> page_links_it.nt.bz2.
> After rebuilding the index we replaced the English DBPedia index in Stanbol
> with our custom
> one (simply swapping the old one for the new one and restarting Stanbol).
>
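The incoming_links step described above amounts to counting how often each resource appears as the object of a page-link triple. A minimal sketch of that idea follows, assuming the dump is plain N-Triples with one statement per line. The function names are hypothetical, and the exact layout the indexer expects for incoming_links should be taken from fetch_prepare.sh rather than from this example.

```python
import bz2
import re
from collections import Counter

# One N-Triples statement whose subject, predicate and object are all IRIs.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')


def count_incoming_links(lines):
    """Count how often each resource IRI appears as the object of a triple."""
    counts = Counter()
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match:  # comments, blank lines and literal objects are skipped
            counts[match.group(3)] += 1
    return counts


def count_from_dump(path):
    """Stream a bz2-compressed dump without unpacking it to disk first."""
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_incoming_links(handle)
```

Calling count_from_dump("page_links_it.nt.bz2") would return a Counter whose most_common() list can then be written out in whatever format the indexing tool requires.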
> Sadly, after that, the results produced by the enhancement engines were
> exactly the same as before:
> no Italian concepts are detected, and none of the other
> enhancement engines return
> any enhancements.
>
> As a second attempt, we decided to use the generic RDF indexer (combined
> with the standard
> Keyword Linking Engine) to process the Italian DBPedia datasets; in this
> case the indexing process
> succeeded and we got a lot of results when testing the enhancement
> engines with Italian
> content. This time the problem is that there are simply too many results,
> and they also include stopwords.
>
> As an example, a sample text submitted for enhancement and the
> results returned by the
> Keyword Linking Engine are included in the PS below.
>
> The terms shown in bold are clearly stopwords. I don’t know if the problem
> is in dataset indexing,
> or if there is a way to eliminate them after the creation of the index.
>
> We have also tried to change the stopwords filter in the SolrYard
> base index zip
> (/stanbol/entityhub/yard/solr/src/main/resources/solr/core/default/default.solrindex.zip
> and simple.solrindex.zip)
>
> and rebuilt the content hub (and the dbpedia indexer
> too, with mvn
> assembly:single in contenthub/indexer/dbpedia) with the right stopwords.
>
> We've checked the generated JAR and the Italian stopwords are there, as a
> file inside the solr config
> folder, but the results were the same as before (still stopwords in
> the enhancement results).
>
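A stopwords file inside the Solr config folder only takes effect if an analyzer in schema.xml refers to it through a solr.StopFilterFactory filter, so that is worth checking as well. The sketch below lists which field types declare such a filter and which word file each one points at. The function name is hypothetical, and whether the Keyword Linking Engine actually applies these analyzers when matching is an open question that this check cannot answer.

```python
import xml.etree.ElementTree as ET


def find_stop_filters(schema_xml):
    """Return (field type name, words file) for every stop filter declared."""
    root = ET.fromstring(schema_xml)
    found = []
    for element in root.iter():
        # Older schemas spell the element "fieldtype", newer ones "fieldType".
        if element.tag.lower() != "fieldtype":
            continue
        for flt in element.iter("filter"):
            if flt.get("class", "").endswith("StopFilterFactory"):
                found.append((element.get("name"), flt.get("words")))
    return found
```

If the field type used for labels does not show up in the result, or shows up with a different words file, the Italian list is present in the archive but never applied to that field.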
> Do you have any suggestions on how to perform these tasks?
>
> Thanks in advance.
>
> -Stefano
>
> PS: An enrichment example follows, from the RDF index we built from DBpedia
> with simplerdfindexer and dblp:
>
> text:
>
> *Infermiera con tbc, troppi dettagli sui media. Il Garante apre
> un'istruttoria
>
> Il Garante Privacy ha aperto un'istruttoria in seguito alla pubblicazione
> di notizie da parte di agenzie di stampa e quotidiani - anche on line -
> che, nel riferire di un caso di una infermiera in servizio presso il
> reparto di neonatologia del Policlinico Gemelli, risultata positiva ai test
> sulla tubercolosi, hanno riportato il nome della donna, l'iniziale del
> cognome e l'età.
>
> Il diritto-dovere dei giornalisti di informare sugli sviluppi della
> vicenda, di sicura rilevanza per l'opinione pubblica, considerato l'elevato
> numero di neonati e di famiglie coinvolte, deve essere comunque bilanciato,
> secondo i principi stabiliti dal Codice deontologico con il rispetto delle
> persone.
>
> Il Garante ricorda che, anche quando questi dettagli fossero stati forniti
> in una sede pubblica, i mezzi di informazione sono tenuti a valutare con
> scrupolo l'interesse pubblico delle singole informazioni diffuse.
>
> I media evitino dunque di riportare informazioni non essenziali che possano
> ledere la riservatezza delle persone e nello stesso tempo possano indurre
> ulteriori stati di allarme e di preoccupazione in coloro che si sono
> avvalsi dei servizi sanitari dell'ospedale o sono altrimenti entrati in
> contatto con la persona.
>
> Roma, 24 agosto 2011*
>
> Enrichments:
>
> 2011 2011
>
> Agosto Agosto
>
> *Alla Alla*
>
> *Anché Anché*
>
> *Che? Che?*
>
> Cognome Cognome
>
> *CON CON*
>
> *Dal' Dal'*
>
> Problema dei servizi Problema dei servizi
>
> *Dell Dell*
>
> Diritto Diritto
>
> Donna Donna
>
> Essere Essere
>
> Il nome della rosa Il nome della rosa
>
> Informazione Informazione
>
> Interesse pubblico Interesse pubblico
>
> Media Media
>
> Mezzi di produzione Mezzi di produzione
>
> *Nello Nello*
>
> Neonatologia Neonatologia
>
> *NON NON*
>
> Numero di coordinazione (chimica) Numero di coordinazione (chimica)
>
> Opinione pubblica Opinione pubblica
>
> Ospedale Ospedale
>
> *PER PER*
>
> Persona Persona
>
> Privacy Privacy
>
> Pubblicazione di matrimonio Pubblicazione di matrimonio
>
> Secondo Secondo
>
> Servizio Servizio
>
> Stampa Stampa
>
> Stati di immaginazione Stati di immaginazione
>
> *SUI SUI*
>
> TBC TBC
>
> Tempo Tempo
>
> .test .test
>
> Tubercolosi Tubercolosi
>
> *UNA UNA*
>
>
> The ones in bold are stopwords; the other results are good, but
> the stopwords were not eliminated during dataset indexing. Maybe there is a
> way to eliminate them from the datasets, but I don't know how.
>
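As a stopgap until a proper Italian analysis chain is in place, labels that are nothing but a stopword could be dropped, either from the label dataset before indexing or from the enrichment list afterwards. A minimal sketch follows, assuming a plain list of label strings. The stopword set is illustrative only, built from the bold entries in the quoted output, and the function names are hypothetical.

```python
import unicodedata

# Illustrative only: the bold entries from the quoted enrichment list.
ITALIAN_STOPWORDS = {
    "alla", "anche", "che", "con", "dal", "dell",
    "nello", "non", "per", "sui", "una",
}


def normalize(label):
    """Lowercase a label, strip accents and drop punctuation."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch for ch in no_accents if ch.isalnum() or ch.isspace())
    return " ".join(kept.lower().split())


def drop_stopword_labels(labels, stopwords=ITALIAN_STOPWORDS):
    """Keep only labels whose normalized form is not a bare stopword."""
    return [label for label in labels if normalize(label) not in stopwords]
```

Multi-word labels such as "Il nome della rosa" are kept because only the whole normalized label is compared against the set. Stripping accents is a deliberate simplification so that variants like "Anché" match "anche".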

