incubator-stanbol-dev mailing list archives

From Rupert Westenthaler <rupert.westentha...@gmail.com>
Subject Re: Problem trying to create a new dbpedia index and site in Italian.
Date Sat, 03 Mar 2012 18:23:07 GMT
Looks good to me.

Do not forget to

1. also define the <analyzer type="query">

2. use the text_it type in the <field> and <dynamicField> definitions

this includes

changing the type value of
       
    <field name="@it/dbp-ont:abstract/"  type="textgen"

to

   <field name="@it/dbp-ont:abstract/"  type="text_it"

and adding

     <dynamicField name="@it*"  type="text_it" indexed="true" stored="true" multiValued="true"
omitNorms="false"/>
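A minimal sketch of what the matching query-time analyzer could look like, mirroring the index-time chain from the schema below (keeping synonym expansion at index time only is one common Solr pattern; adjust to taste):

```xml
<!-- sketch only: query-time analyzer mirroring the index-time chain
     (synonym expansion kept at index time only, a common Solr pattern) -->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0" catenateAll="0"
          splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="italian_stop.txt" enablePositionIncrements="true"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```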

Hope this improves results for Italian!

best
Rupert

On 03.03.2012, at 18:27, Stefano Norcia wrote:

> Hi Rupert,
> thank you very much for your answer, it was very helpful and let me
> understand more deeply how the Entityhub indexing works.
> I wrote a possible candidate for the text_it field in schema.xml for the
> various indexers in Stanbol:
> 
> <fieldType name="text_it" class="solr.TextField"
> positionIncrementGap="100" omitNorms="false">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms_it.txt" ignoreCase="true" expand="false"/>
>        <filter class="solr.HyphenatedWordsFilterFactory"/>
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="italian_stop.txt" enablePositionIncrements="true" />
>        <filter class="solr.SnowballPorterFilterFactory"
> language="Italian" />
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
> </fieldType>
> 
> you can see that I used the SnowballPorterFilterFactory as suggested in
> http://wiki.apache.org/solr/LanguageAnalysis#Italian,
> and the stopword list for Italian can be found at this link:
> 
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt
> 
> If someone more skilled than me in Solr is reading feel free to correct or
> suggest a better definition for the field.
> 
> I'm going to test indexing with this configuration; I'll let you know about
> my progress (if any).
> 
> Regards,
> Stefano
> 
> 
> On Thu, Mar 1, 2012 at 7:47 PM, Rupert Westenthaler <
> rupert.westenthaler@gmail.com> wrote:
> 
>> Hi Stefano, Luca
>> 
>> See my comments inline.
>> 
>> On 01.03.2012, at 15:59, Luca Dini wrote:
>> 
>>> Dear Stefano,
>>> I am new as well on the list, and we are also working in the context of
>> the early adoption program. If I understand correctly, the problem is that
>> without an appropriate Named Entity extraction engine for Italian the
>> results will always be disappointing. In the context of our project we
>> will integrate NER enhancement services for Italian and French (and
>> possibly keyword extraction), so hopefully you will be able to profit
>> from the power of Stanbol. There might be some problems in terms of
>> timing, as it is not clear whether, in the short project window, there
>> will be the possibility of feeding our integration into yours. Is the
>> unavailability of Italian NER a blocking factor for you, or can you go on
>> with development while waiting for the integration?
>>> 
>> 
>> That's true. For datasets such as DBpedia the combination of "NER +
>> NamedEntityTaggingEngine" is the way to go. That is simply because DBpedia
>> defines entities for nearly all natural-language words, so "keyword
>> extraction" (used by the KeywordLinkingEngine) does not really work.
>> 
>> However note that the KeywordLinkingEngine has support for POS (Part of
>> Speech) taggers. So if a POS tagger is available for a given language, it
>> will use this information to only look up nouns (see [1] for more
>> detailed information on the used algorithm). The bad news is that there is
>> no POS tagger available for Italian :(
>> 
>> [1]
>> http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
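The noun-only lookup idea can be sketched as follows (illustrative only, not the actual Stanbol KeywordLinkingEngine API; the token/tag representation and Penn-Treebank-style tag names are assumptions for this example):

```python
# Illustrative sketch only -- NOT the Stanbol KeywordLinkingEngine API.
# Tokens are (word, pos-tag) pairs; the tag names are assumptions.
LOOKUP_POS = {"NN", "NNS", "NNP", "NNPS"}  # nouns and proper nouns

def candidate_tokens(tagged_tokens, pos_available=True):
    """Return the tokens worth looking up in the entity index.

    Without a POS tagger every token becomes a lookup candidate,
    which is why stop words end up matching DBpedia labels."""
    if not pos_available:
        return [word for word, _ in tagged_tokens]
    return [word for word, pos in tagged_tokens if pos in LOOKUP_POS]

tokens = [("Il", "DT"), ("Garante", "NNP"), ("ha", "VBZ"),
          ("aperto", "VBN"), ("una", "DT"), ("istruttoria", "NN")]
print(candidate_tokens(tokens))         # nouns only
print(candidate_tokens(tokens, False))  # every token, stop words included
```

This is why the missing Italian POS tagger directly translates into stop-word false positives: the engine falls back to the second branch.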
>> 
>> 
>> The final possibility to improve results of the KeywordLinkingEngine with
>> DBpedia is to filter out all entities with types other than Persons,
>> Organizations and Places. However this also has a big disadvantage:
>> it will exclude all redirects, and such entities are very important
>> because they allow linking entities that are mentioned by alternate names.
>> However if you would like to try this you should have a look at the
>> 
>>   org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter
>> 
>> This filter is included in the default configuration of the DBpedia
>> indexer and can be activated by changing the configuration within the
>> 
>>   {indexing-dir}/indexing/config/entityTypes.properties
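A hedged sketch of the idea behind that file (the property keys below are illustrative assumptions, not verified syntax; check the comments in the shipped entityTypes.properties for the real format):

```
# ILLUSTRATIVE ONLY: the key names here are assumptions -- see the
# comments in the shipped entityTypes.properties for the actual syntax.
# The idea is to allow-list the rdf:type values kept during indexing:
field=rdf:type
values=dbp-ont:Person;dbp-ont:Organisation;dbp-ont:Place
```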
>> 
>> 
>> 
>> @ Luca
>>> we will integrate enhancement services of NER for Italian and French
>> 
>> That would be really great. Is the framework you integrate open source?
>> Can you provide a link?
>> 
>> 
>>> Cheers,
>>> Luca
>>> 
>>> On 01/03/2012 14:49, Stefano Norcia wrote:
>>>> Hi all,
>>>> 
>>>> My name is Stefano Norcia and I'm working on the early adoption project
>> for
>>>> Etcware.
>>>> 
>>>> For our early adoption project (Etcware Early Adoption project) we need
>> to
>>>> use a DBPedia index in Italian
>>>> language in the enhancement and enrichment process enabled by the
>> Stanbol
>>>> engines.
>>>> 
>>>> The main problem is that the NLP module does not support the Italian
>> language
>>>> directly, so if you put an Italian
>>>> text into the enhancement engine the DBpedia engine does not detect any
>>>> concepts/places/people.
>>>> 
>> 
>> The NER engine uses the language as detected by the LangID engine and
>> deactivates itself if no NER model is available for the detected language.
>> In that case the NamedEntityTaggingEngine will also not link any entities,
>> because no named entities are detected within the text.
>> 
>> However this does not mean that no Italian labels are present in the
>> DBpedia index. In fact Italian labels ARE present in all the DBpedia
>> indexes. There is no need to build your own index unless you have some
>> special requirement.
>> 
>> You can even try this on the test server. Simply send some Italian text
>> first to
>> 
>>   http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-ner
>> 
>> This engine uses "NER + NamedEntityTaggingEngine" so you will not get any
>> results - as expected. Then you can try the same text with
>> 
>>   http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-keyword
>> 
>> This will return linked entities. But as I mentioned above, and as you
>> already experienced yourself, it also gives a lot of false positives.
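A sketch of how the two chains could be exercised over HTTP (the URLs are from this thread, but the dev.iks-project.eu test server may no longer be reachable, so the requests are only built here rather than sent; the Accept media type is an assumption):

```python
# Build (but do not send) enhancement requests for the two chains
# discussed in the thread. The server may be offline today.
from urllib import request

text = "Il Garante Privacy ha aperto un'istruttoria."
for chain in ("dbpedia-ner", "dbpedia-keyword"):
    url = "http://dev.iks-project.eu:8081/enhancer/chain/" + chain
    req = request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain",
                 "Accept": "application/rdf+xml"},  # assumed media type
        method="POST",
    )
    print(req.method, req.full_url)
    # request.urlopen(req) would return the enhancement results as RDF
```

With the "dbpedia-ner" chain the (Italian) text yields no enhancements, while "dbpedia-keyword" returns entities plus the stop-word false positives described above.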
>> 
>> 
>>>> We have done some experiments to perform this goal:
>>>> 
>>>> Our first attempt was to rebuild the DBpedia index following the
>> instructions
>>>> found in the stanbol/
>>>> entityhub/indexing/dbpedia folder. In this folder there is a shell
>> script
>>>> (fetch_prepare.sh) that
>>>> describes how to prepare the DBpedia datasets before creating the index.
>> We
>>>> followed those
>>>> instructions and tried to create a new index to replace the standard
>>>> English dbpedia index and
>>>> "site" starting from the Italian DBpedia datasets. We are aware that the
>>>> Italian datasets are not
>>>> complete and that some packages are missing (like persondata_en.nt.bz2
>> and
>>>> so on).
>>>> These are the packages we used to create the index (
>>>> http://downloads.dbpedia.org/3.7/it/) :
>>>> 
>>>> o dbpedia_3.7.owl.bz2
>>>> o geo_coordinates_it.nt.bz2
>>>> o instance_types_it.nt.bz2
>>>> o labels_it.nt.bz2
>>>> o long_abstracts_it.nt.bz2
>>>> o short_abstracts_it.nt.bz2
>>>> 
>> 
>> You should always include the English versions, as they include a lot of
>> information that is also very useful for other languages.
>> 
>>>> We were also able to create the incoming_links text file from the package
>>>> page_links_it.nt.bz2.
>>>> After rebuilding the index we replaced the DBPedia english index in
>> stanbol
>>>> with our custom
>>>> one (simply replacing the old one with the new one and restarting
>> stanbol).
>>>> 
>>>> Sadly, after that, the results produced by the enhancement engines are
>>>> exactly the same as before,
>>>> neither are Italian concepts detected nor are any enhancements
>>>> returned by the other
>>>> enhancement engines.
>>>> 
>> 
>> I assume that this index was completely fine. The reason why you were not
>> getting any results is that the NER engine deactivates itself for
>> Italian texts.
>> 
>> Note also that the
>> 
>> * NamedEntityTaggingEngine and
>> * KeywordLinkingEngine
>> 
>> use the exact same DBpedia index, so you can/should use the same index
>> for both. This is also the case on http://dev.iks-project.eu:8081.
>> 
>> Also note that the DBpedia indexer and the generic RDF indexer create the
>> same type of indexes; the DBpedia indexer merely ships a configuration
>> that is optimized for DBpedia.
>> 
>>>> As a second attempt, we decided to use the generic RDF indexer (combined
>>>> with the standard
>>>> Keyword Linking Engine) to process the Italian DBpedia datasets; in this
>>>> case the indexing process
>>>> succeeded and we were able to get a lot of results testing the
>> enhancement
>>>> engines with Italian
>>>> content. This time the problem is that the results are simply too many
>> and
>>>> also contain stopwords.
>>>> 
>>>> For example you can find a sample text introduced for enhancement and
>> the
>>>> results shown by the
>>>> Keyword Linking Engine in attachment.
>>>> 
>>>> The terms shown in bold are clearly stopwords. I don’t know if the
>> problem
>>>> is in dataset indexing,
>>>> or if there is a way to eliminate them after the creation of the index.
>> 
>> Using stop words would in fact improve the performance of the
>> KeywordLinkingEngine. The current default Solr configuration includes
>> optimized Solr field configurations for English and German.
>> 
>> If you can provide such a configuration for Italian, it would be great if
>> you could contribute it to Stanbol! I would be happy to
>> work on that!
>> 
>>>> 
>>>> We have also made an attempt to change the stopwords filter in the
>> Solr Yard
>>>> base index zip
>>>> (/stanbol/entityhub/yard/solr/
>>>> src/main/resources/solr/core/default/default.solrindex.zip
>> 
>>>> 
>>>> and simple.solrindex.zip) and rebuild the content hub (and dbpedia
>> indexer
>>>> too with mvn
>>>> assembly:single in contenthub/indexer/dbpedia ) with the right
>> stopwords.
>>>> 
>> 
>> This would be the place where a Stanbol committer would change the
>> configuration. If you use the DBpedia indexer you can simply change the
>> Solr configuration in
>> 
>>   {indexing-root}/indexing/config/dbpedia/conf/schema.xml
>> 
>> If you use the generic RDF indexer you should extract the
>> "default.solrindex.zip" to
>> 
>>   {indexing-root}/indexing/config/
>> 
>> and then rename the directory to the name of your site
>> (this is the value of the "name" property in the
>> "/indexing/config/indexing.properties" file).
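The extract-and-rename step can be sketched like this (the paths use a temporary stand-in for {indexing-root}, the zip contents are simulated, and the site name "dbpediaIT" is a hypothetical example):

```python
# Sketch of "extract default.solrindex.zip and rename to the site name".
# All paths and the site name "dbpediaIT" are placeholders/assumptions.
import os
import shutil
import tempfile
import zipfile

indexing_root = tempfile.mkdtemp()  # stand-in for {indexing-root}
config_dir = os.path.join(indexing_root, "indexing", "config")
os.makedirs(config_dir)

# Build a tiny stand-in for default.solrindex.zip.
zip_path = os.path.join(indexing_root, "default.solrindex.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("default/conf/schema.xml", "<schema/>")

# Extract, then rename the directory to the value of the "name"
# property in indexing/config/indexing.properties:
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(config_dir)
shutil.move(os.path.join(config_dir, "default"),
            os.path.join(config_dir, "dbpediaIT"))

print(sorted(os.listdir(config_dir)))  # the renamed site directory
```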
>> 
>>>> We've checked the generated JAR and the Italian stopwords are there, as
>> a
>>>> file inside the Solr config
>>>> folder, but the results were always the same as before (still stopwords
>> in
>>>> the enhancement results).
>>>> 
>> 
>> If you use the RDF indexer the Solr Configuration is taken
>> 
>> * from the directory "{indexing-root}/indexing/config/{name}" or if not
>> present
>> * from the class path used by the indexer
>> 
>> So the reason why it had not worked for you is that you did not create a
>> new RDF indexer version after you changed the "default.solrindex.zip" and
>> rebuilt the Entityhub. For that you would also have needed to re-create
>> the indexer using "mvn assembly:single".
>> 
>> But as I mentioned above, there is a simpler solution for adding Italian
>> stop words: simply edit the Solr configuration contained in
>> 
>>   {indexing-root}/indexing/config/dbpedia/conf/
>> 
>> of the DBPedia Indexer.
>> 
>> 
>> Hopefully that answers all your questions. If you have additional
>> questions feel free to ask.
>> 
>> best
>> Rupert Westenthaler
>> 
>> 
>>>> Do you have any suggestions on how to perform these tasks?
>>>> 
>>>> Thanks in advance.
>>>> 
>>>> -Stefano
>>>> 
>>>> PS: here follows an enrichment example from the RDF index we built from
>>>> DBpedia with simplerdfindexer and dblp:
>>>> 
>>>> text:
>>>> 
>>>> *Infermiera con tbc, troppi dettagli sui media. Il Garante apre
>>>> un'istruttoria
>>>> 
>>>> Il Garante Privacy ha aperto un'istruttoria in seguito alla
>> pubblicazione
>>>> di notizie da parte di agenzie di stampa e quotidiani - anche on line -
>>>> che, nel riferire di un caso di una infermiera in servizio presso il
>>>> reparto di neonatologia del Policlinico Gemelli, risultata positiva ai
>> test
>>>> sulla tubercolosi, hanno riportato il nome della donna, l'iniziale del
>>>> cognome e l'età.
>>>> 
>>>> Il diritto-dovere dei giornalisti di informare sugli sviluppi della
>>>> vicenda, di sicura rilevanza per l'opinione pubblica, considerato
>> l'elevato
>>>> numero di neonati e di famiglie coinvolte, deve essere comunque
>> bilanciato,
>>>> secondo i principi stabiliti dal Codice deontologico con il rispetto
>> delle
>>>> persone.
>>>> 
>>>> Il Garante ricorda che, anche quando questi dettagli fossero stati
>> forniti
>>>> in una sede pubblica, i mezzi di informazione sono tenuti a valutare con
>>>> scrupolo l'interesse pubblico delle singole informazioni diffuse.
>>>> 
>>>> I media evitino dunque di riportare informazioni non essenziali che
>> possano
>>>> ledere la riservatezza delle persone e nello stesso tempo possano
>> indurre
>>>> ulteriori stati di allarme e di preoccupazione in coloro che si sono
>>>> avvalsi dei servizi sanitari dell'ospedale o sono altrimenti entrati in
>>>> contatto con la persona.
>>>> 
>>>> Roma, 24 agosto 2011*
>>>> 
>>>> Enrichments :
>>>> 
>>>> 2011 2011
>>>> 
>>>> Agosto Agosto
>>>> 
>>>> *Alla Alla*
>>>> 
>>>> *Anché Anché*
>>>> 
>>>> *Che? Che?*
>>>> 
>>>> Cognome Cognome
>>>> 
>>>> *CON CON*
>>>> 
>>>> *Dal' Dal'*
>>>> 
>>>> Problema dei servizi Problema dei servizi
>>>> 
>>>> *Dell Dell*
>>>> 
>>>> Diritto Diritto
>>>> 
>>>> Donna Donna
>>>> 
>>>> Essere Essere
>>>> 
>>>> Il nome della rosa Il nome della rosa
>>>> 
>>>> Informazione Informazione
>>>> 
>>>> Interesse pubblico Interesse pubblico
>>>> 
>>>> Media Media
>>>> 
>>>> Mezzi di produzione Mezzi di produzione
>>>> 
>>>> *Nello Nello*
>>>> 
>>>> Neonatologia Neonatologia
>>>> 
>>>> *NON NON*
>>>> 
>>>> Numero di coordinazione (chimica) Numero di coordinazione (chimica)
>>>> 
>>>> Opinione pubblica Opinione pubblica
>>>> 
>>>> Ospedale Ospedale
>>>> 
>>>> *PER PER*
>>>> 
>>>> Persona Persona
>>>> 
>>>> Privacy Privacy
>>>> 
>>>> Pubblicazione di matrimonio Pubblicazione di matrimonio
>>>> 
>>>> Secondo Secondo
>>>> 
>>>> Servizio Servizio
>>>> 
>>>> Stampa Stampa
>>>> 
>>>> Stati di immaginazione Stati di immaginazione
>>>> 
>>>> *SUI SUI*
>>>> 
>>>> TBC TBC
>>>> 
>>>> Tempo Tempo
>>>> 
>>>> .test .test
>>>> 
>>>> Tubercolosi Tubercolosi
>>>> 
>>>> *UNA UNA*
>>>> 
>>>> 
>>>> The ones in bold are stopwords; the other results are good ones. But
>> anyway,
>>>> the stopwords were not eliminated in dataset indexing; maybe there
>> is a
>>>> way to eliminate them from the datasets, but I don't know how.
>>>> 
>>> 
>> 
>> 

