stanbol-dev mailing list archives

From Stefano Norcia <stefano.nor...@gmail.com>
Subject Re: Problem trying to create a new dbpedia index and site in Italian.
Date Thu, 01 Mar 2012 15:33:38 GMT
Hi Luca,
thank you for your answer. I can go on with my integration: I can always
post-filter the results from the keywordLinkingEngine after the enhancement
is done, or maybe remove stopwords from the DBpedia datasets manually (I
still don't know how). Anyway, I'm really interested in testing your solution
even after the end of the early adoption project; we are a small company, so
we didn't even consider developing a NER engine ourselves. Do you know
approximately when your integration will be ready? I'm also available if you
want me to test a beta version of your NER.
Hope to hear from you soon,

Regards,
Stefano
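
A minimal sketch of the post-filtering idea mentioned above, assuming the
enhancement results have already been reduced to a plain list of label
strings (extracting them from the Stanbol response is not shown). The
stopword set and the function names are hypothetical and only for
illustration; a real list would be much longer:

```python
import unicodedata

# Hypothetical, deliberately tiny Italian stopword list (assumption).
ITALIAN_STOPWORDS = {
    "alla", "anche", "che", "con", "dal", "dell",
    "nello", "non", "per", "sui", "una",
}


def normalize(label: str) -> str:
    """Strip accents, punctuation and spaces, then lowercase the label."""
    decomposed = unicodedata.normalize("NFKD", label)
    # Combining accents and punctuation are not alphanumeric, so they drop out.
    return "".join(ch for ch in decomposed if ch.isalnum()).casefold()


def filter_stopword_labels(labels):
    """Keep only the labels whose normalized form is not a stopword."""
    return [label for label in labels if normalize(label) not in ITALIAN_STOPWORDS]
```

With this normalization, labels such as "Anché", "Che?" or "Dal'" are
treated as the stopwords "anche", "che" and "dal", while multi-word labels
are compared as a whole and therefore kept.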

On Thu, Mar 1, 2012 at 3:59 PM, Luca Dini <dini@celi.it> wrote:

> Dear Stefano,
> I am new on the list as well, and we are also working in the context of
> the early adoption program. If I understand correctly, the problem is the
> lack of an appropriate named entity extraction engine for Italian; without
> one, I am afraid the results will always be disappointing. In the context of
> our project we will integrate NER enhancement services for Italian and
> French (and possibly keyword extraction), so, hopefully, you will be able
> to profit from the power of Stanbol. There might be some problems in terms
> of timing, as it is not clear whether, within the short project window,
> there will be a chance to feed our integration into yours. Is the
> unavailability of Italian NER a blocking factor for you, or can you go on
> with development while waiting for the integration?
>
> Cheers,
> Luca
>
>
> On 01/03/2012 14:49, Stefano Norcia wrote:
>
>> Hi all,
>>
>> My name is Stefano Norcia and I'm working on the early adoption project
>> for Etcware.
>>
>> For our early adoption project we need to use an Italian DBpedia index
>> in the enhancement and enrichment process enabled by the Stanbol engines.
>>
>> The main problem is that the NLP module does not support Italian
>> directly, so if you submit an Italian text to the enhancement engine, the
>> DBpedia engine does not detect any concepts, places, or people.
>>
>> We have run some experiments to achieve this goal:
>>
>> Our first attempt was to rebuild the DBpedia index following the
>> instructions found in the stanbol/entityhub/indexing/dbpedia folder. In
>> this folder there is a shell script (fetch_prepare.sh) that describes how
>> to prepare the DBpedia datasets before creating the index. We followed
>> those instructions and tried to create a new index and "site" from the
>> Italian DBpedia datasets, to replace the standard English DBpedia index.
>> We are aware that the Italian datasets are not complete and that some
>> packages are missing (like persondata_en.nt.bz2 and so on).
>> These are the packages we used to create the index
>> (http://downloads.dbpedia.org/3.7/it/):
>>
>> o dbpedia_3.7.owl.bz2
>> o geo_coordinates_it.nt.bz2
>> o instance_types_it.nt.bz2
>> o labels_it.nt.bz2
>> o long_abstracts_it.nt.bz2
>> o short_abstracts_it.nt.bz2
>>
>> We were also able to create the incoming_links text file from the
>> page_links_it.nt.bz2 package.
>> After rebuilding the index, we replaced the English DBpedia index in
>> Stanbol with our custom one (simply swapping the old one for the new one
>> and restarting Stanbol).
>>
>> Sadly, after that, the results produced by the enhancement engines are
>> exactly the same as before: no Italian concepts are detected, and none of
>> the other enhancement engines return any enhancements.
>>
>> As a second attempt, we decided to use the generic RDF indexer (combined
>> with the standard Keyword Linking Engine) to process the Italian DBpedia
>> datasets. In this case the indexing process succeeded, and we got a lot of
>> results when testing the enhancement engines with Italian content. This
>> time the problem is that there are simply too many results, and they also
>> include stopwords.
>>
>> As an example, a sample text submitted for enhancement and the results
>> returned by the Keyword Linking Engine are attached below.
>>
>> The terms shown in bold are clearly stopwords. I don’t know if the problem
>> is in dataset indexing,
>> or if there is a way to eliminate them after the creation of the index.
>>
>> We also tried changing the stopword filter in the SolrYard base index zip
>> (/stanbol/entityhub/yard/solr/src/main/resources/solr/core/default/default.solrindex.zip
>> and simple.solrindex.zip) and rebuilding the content hub (and the DBpedia
>> indexer too, with mvn assembly:single in contenthub/indexer/dbpedia) with
>> the right stopwords.
>>
>> We checked the generated JAR and the Italian stopwords are there, as a
>> file inside the Solr config folder, but the results were the same as
>> before (still stopwords in the enhancement results).
>>
>> Do you have any suggestions on how to perform these tasks?
>>
>> Thanks in advance.
>>
>> -Stefano
>>
>> PS: here is an enrichment example from the RDF index we built from dbpedia
>> with simplerdfindexer and dblp:
>>
>> text:
>>
>> *Infermiera con tbc, troppi dettagli sui media. Il Garante apre
>> un'istruttoria
>>
>> Il Garante Privacy ha aperto un'istruttoria in seguito alla pubblicazione
>> di notizie da parte di agenzie di stampa e quotidiani - anche on line -
>> che, nel riferire di un caso di una infermiera in servizio presso il
>> reparto di neonatologia del Policlinico Gemelli, risultata positiva ai
>> test
>> sulla tubercolosi, hanno riportato il nome della donna, l'iniziale del
>> cognome e l'età.
>>
>> Il diritto-dovere dei giornalisti di informare sugli sviluppi della
>> vicenda, di sicura rilevanza per l'opinione pubblica, considerato
>> l'elevato
>> numero di neonati e di famiglie coinvolte, deve essere comunque
>> bilanciato,
>> secondo i principi stabiliti dal Codice deontologico con il rispetto delle
>> persone.
>>
>> Il Garante ricorda che, anche quando questi dettagli fossero stati forniti
>> in una sede pubblica, i mezzi di informazione sono tenuti a valutare con
>> scrupolo l'interesse pubblico delle singole informazioni diffuse.
>>
>> I media evitino dunque di riportare informazioni non essenziali che
>> possano
>> ledere la riservatezza delle persone e nello stesso tempo possano indurre
>> ulteriori stati di allarme e di preoccupazione in coloro che si sono
>> avvalsi dei servizi sanitari dell'ospedale o sono altrimenti entrati in
>> contatto con la persona.
>>
>> Roma, 24 agosto 2011*
>>
>>
>> Enrichments :
>>
>> 2011 2011
>>
>> Agosto Agosto
>>
>> *Alla Alla*
>>
>> *Anché Anché*
>>
>> *Che? Che?*
>>
>> Cognome Cognome
>>
>> *CON CON*
>>
>> *Dal' Dal'*
>>
>>
>> Problema dei servizi Problema dei servizi
>>
>> *Dell Dell*
>>
>>
>> Diritto Diritto
>>
>> Donna Donna
>>
>> Essere Essere
>>
>> Il nome della rosa Il nome della rosa
>>
>> Informazione Informazione
>>
>> Interesse pubblico Interesse pubblico
>>
>> Media Media
>>
>> Mezzi di produzione Mezzi di produzione
>>
>> *Nello Nello*
>>
>> Neonatologia Neonatologia
>>
>> *NON NON*
>>
>>
>> Numero di coordinazione (chimica) Numero di coordinazione (chimica)
>>
>> Opinione pubblica Opinione pubblica
>>
>> Ospedale Ospedale
>>
>> *PER PER*
>>
>>
>> Persona Persona
>>
>> Privacy Privacy
>>
>> Pubblicazione di matrimonio Pubblicazione di matrimonio
>>
>> Secondo Secondo
>>
>> Servizio Servizio
>>
>> Stampa Stampa
>>
>> Stati di immaginazione Stati di immaginazione
>>
>> *SUI SUI*
>>
>>
>> TBC TBC
>>
>> Tempo Tempo
>>
>> .test .test
>>
>> Tubercolosi Tubercolosi
>>
>> *UNA UNA*
>>
>>
>>
>> The ones in bold are stopwords; the other results are good, but the
>> stopwords were not eliminated during dataset indexing. Maybe there is a
>> way to eliminate them from the datasets, but I don't know how.
>>
>>
>
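
Regarding the last question in the quoted message, one possible way to drop
stopword labels from a dataset such as labels_it.nt before indexing is
sketched here. It assumes one N-Triples statement per line with the label
as a quoted literal, possibly using \uXXXX escapes; the stopword set and
the function names are hypothetical. It removes only the matching label
triples, so other triples about the same resources would remain.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny Italian stopword list (assumption).
STOPWORDS = {"alla", "anche", "che", "con", "non", "per", "una"}

# Matches a trailing literal object such as "Alla"@it followed by the final dot.
LITERAL_RE = re.compile(r'"((?:[^"\\]|\\.)*)"(?:@[A-Za-z0-9-]+)?\s*\.\s*$')


def normalize(literal: str) -> str:
    """Decode \\uXXXX escapes, strip accents and punctuation, lowercase."""
    decoded = literal.encode("ascii", "backslashreplace").decode("unicode_escape")
    decomposed = unicodedata.normalize("NFKD", decoded)
    return "".join(ch for ch in decomposed if ch.isalnum()).casefold()


def drop_stopword_labels(lines):
    """Yield every line except those whose literal is a stopword."""
    for line in lines:
        match = LITERAL_RE.search(line)
        if match and normalize(match.group(1)) in STOPWORDS:
            continue  # skip the triple whose label is a stopword
        yield line
```

Because it works on an iterable of lines, the same function could be fed an
open file (for example one read through the bz2 module) and its output
written to a new file before running the indexer.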
