Subject: Re: Problem trying to create a new dbpedia index and site in Italian.
From: Rupert Westenthaler <rupert.westenthaler@gmail.com>
Date: Sat, 3 Mar 2012 19:23:07 +0100
To: stanbol-dev@incubator.apache.org
Cc: Alessandra Donnini, Andrea Ciapetti

Looks good to me.

Do not forget to

1. also define the text_it <fieldType> in the schema.xml
2. use the *text_it* type in the <field> and <dynamicField> definitions;
   this includes changing the "type" value of the affected fields, for
   instance along the lines of the sketch below.
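A minimal sketch (the field names here are only placeholders; the actual
field and dynamicField names depend on the schema.xml of your index):

   <!-- illustrative only: point the Italian fields at the new type -->
   <field name="label_it" type="text_it" indexed="true" stored="true"/>
   <dynamicField name="*_it" type="text_it" indexed="true" stored="true"/>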
Hope this improves results for Italian!

best
Rupert

On 03.03.2012, at 18:27, Stefano Norcia wrote:

> Hi Rupert,
> thank you very much for your answer; it was very helpful and let me
> understand more deeply how the Entityhub indexing works.
> I wrote a possible candidate for the text_it field in the schema.xml of
> the various indexers in Stanbol:
>
> <fieldType name="text_it" class="solr.TextField"
>            positionIncrementGap="100" omitNorms="false">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory"
>             synonyms="synonyms_it.txt" ignoreCase="true" expand="false"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory"
>             words="italian_stop.txt" enablePositionIncrements="true"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
>   </analyzer>
> </fieldType>
>
> You can see that I used the SnowballPorterFilterFactory as suggested in
> http://wiki.apache.org/solr/LanguageAnalysis#Italian;
> the stopword list for Italian can be found at this link:
>
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt
>
> If someone more skilled than me in Solr is reading this, feel free to
> correct or suggest a better definition for the field.
>
> I'm going to test indexing with this configuration; I'll let you know
> about my progress (if any).
>
> Regards,
> Stefano
>
>
> On Thu, Mar 1, 2012 at 7:47 PM, Rupert Westenthaler <
> rupert.westenthaler@gmail.com> wrote:
>
>> Hi Stefano, Luca
>>
>> See my comments inline.
>>
>> On 01.03.2012, at 15:59, Luca Dini wrote:
>>
>>> Dear Stefano,
>>> I am new as well on the list, and we are also working in the context of
>>> the early adoption program. If I understand correctly, the problem is
>>> that without an appropriate Named Entity extraction engine for Italian,
>>> I am afraid that the results would always be disappointing. In the
>>> context of our project we will integrate enhancement services of NER
>>> for Italian and French (and possibly keyword extraction), so, hopefully,
>>> you will be able to profit from the power of Stanbol. There might be
>>> some problems in terms of timing, as it is not clear whether, in the
>>> short project window, there will be the possibility of feeding our
>>> integration into yours. Is the unavailability of Italian NER a blocking
>>> factor for you, or can you go on with development while waiting for the
>>> integration?
>>>
>>
>> That's true. For datasets such as DBpedia the combination of "NER +
>> NamedEntityTaggingEngine" is the way to go. That's simply because DBpedia
>> defines entities for nearly all natural-language words, so "keyword
>> extraction" (as used by the KeywordLinkingEngine) does not really work.
>>
>> However, note that the KeywordLinkingEngine has support for POS (Part of
>> Speech) taggers. So if a POS tagger is available for a given language, it
>> will use this information to only look up nouns (see [1] for more
>> detailed information on the algorithm used). The bad news is that there
>> is no POS tagger available for Italian :(
>>
>> [1]
>> http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
>>
>>
>> The final possibility to improve results of the KeywordLinkingEngine with
>> DBpedia is to filter out all entities with types other than Persons,
>> Organizations and Places. However, this also has a big disadvantage: it
>> will also exclude all redirects, and such entities are very important
>> because they allow linking entities that are mentioned by alternate
>> names. However, if you would like to try this you should have a look at
>>
>> org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter
>>
>> This filter is included in the default configuration of the DBpedia
>> indexer and can be activated by changing the configuration within
>>
>> {indexing-dir}/indexing/config/entityTypes.properties
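>>
>> Such a type filter could look something like the following sketch (the
>> property keys shown are illustrative from memory; please check the
>> comments in the shipped entityTypes.properties for the exact syntax):
>>
>>    # sketch: only index entities that have one of these rdf:type values
>>    field=rdf:type
>>    values=dbp-ont:Person;dbp-ont:Organisation;dbp-ont:Place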
>>
>> @ Luca
>>> we will integrate enhancement services of NER for Italian and French
>>
>> That would be really great. Is the framework you integrate open source?
>> Can you provide a link?
>>
>>
>>> Cheers,
>>> Luca
>>>
>>> On 01/03/2012 14:49, Stefano Norcia wrote:
>>>> Hi all,
>>>>
>>>> My name is Stefano Norcia and I'm working on the early adoption project
>>>> for Etcware.
>>>>
>>>> For our early adoption project (Etcware Early Adoption project) we need
>>>> to use a DBpedia index in the Italian language in the enhancement and
>>>> enrichment process enabled by the Stanbol engines.
>>>>
>>>> The main problem is that the NLP module does not support the Italian
>>>> language directly, so if you put an Italian text into the enhancement
>>>> engine the dbpedia engine does not detect any concept/place/person.
>>>>
>>
>> The NER engine uses the language as detected by the LangID engine and
>> deactivates itself if no NER model is available for the detected
>> language. In such a case the NamedEntityTaggingEngine will also link no
>> entities, because there are no named entities detected within the text.
>>
>> However, this does not mean that no Italian labels are present in the
>> DBpedia index. In fact Italian labels ARE present in all DBpedia
>> indexes. There is no need to build your own indexes unless you have some
>> special requirement.
>>
>> You can try this even on the test server. Simply send some Italian text
>> first to
>>
>> http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-ner
>>
>> This chain uses "NER + NamedEntityTaggingEngine", so you will not get any
>> results - as expected. Then you can try the same text with
>>
>> http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-keyword
>>
>> This will return linked entities. But as I mentioned above, and as you
>> already experienced yourself, it also produces a lot of false positives.
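>>
>> A quick way to test this is plain curl (the Accept header only selects
>> the RDF serialization of the returned enhancements):
>>
>>    curl -X POST \
>>         -H "Content-Type: text/plain" \
>>         -H "Accept: application/rdf+xml" \
>>         --data "Il Garante Privacy ha aperto un'istruttoria." \
>>         http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-keyword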
>>
>>>> We have done some experiments to reach this goal:
>>>>
>>>> Our first attempt was to rebuild the dbpedia index following the
>>>> instructions found in the stanbol/entityhub/indexing/dbpedia folder. In
>>>> this folder there is a shell script (fetch_prepare.sh) that describes
>>>> how to prepare the dbpedia datasets before creating the index. We
>>>> followed those instructions and tried to create a new index to replace
>>>> the standard English dbpedia index and "site", starting from the
>>>> Italian dbpedia datasets. We are aware that the Italian datasets are
>>>> not complete and that some packages are missing (like
>>>> persondata_en.nt.bz2 and so on).
>>>> These are the packages we used to create the index (
>>>> http://downloads.dbpedia.org/3.7/it/):
>>>>
>>>> o dbpedia_3.7.owl.bz2
>>>> o geo_coordinates_it.nt.bz2
>>>> o instance_types_it.nt.bz2
>>>> o labels_it.nt.bz2
>>>> o long_abstracts_it.nt.bz2
>>>> o short_abstracts_it.nt.bz2
>>>>
>>
>> You should always include the English versions as well, as they contain a
>> lot of information that is also very useful for other languages.
>>
>>>> We were also able to create the incoming_links text file from the
>>>> package page_links_it.nt.bz2.
>>>> After rebuilding the index we replaced the DBpedia English index in
>>>> stanbol with our custom one (simply replacing the old one with the new
>>>> one and restarting stanbol).
>>>>
>>>> Sadly, after that, the results produced by the enhancement engines were
>>>> exactly the same as before: neither are Italian concepts detected, nor
>>>> are possible enhancements returned from any of the other enhancement
>>>> engines.
>>>>
>>
>> I assume that this index was completely fine. The reason why you were not
>> getting any results is that the NER engine deactivates itself for
>> Italian texts.
>>
>> Note also that the
>>
>> * NamedEntityTaggingEngine and
>> * KeywordLinkingEngine
>>
>> use the exact same DBpedia index. So you can/should use the same index
>> for both. This is also the case on "http://dev.iks-project.eu:8081".
>>
>> Also note that the DBpedia indexer and the generic RDF indexer create the
>> same type of indexes. The DBpedia indexer merely contains a configuration
>> that is optimized for DBpedia.
>>
>>>> As a second attempt, we decided to use the generic RDF indexer (combined
>>>> with the standard KeywordLinkingEngine) to process the Italian DBpedia
>>>> datasets; in this case the indexing process succeeded and we were able
>>>> to get a lot of results when testing the enhancement engines with
>>>> Italian content. This time the problem is that there are simply too
>>>> many results, and they also contain stopwords.
>>>>
>>>> As an example you can find, in the attachment, a sample text submitted
>>>> for enhancement and the results returned by the KeywordLinkingEngine.
>>>>
>>>> The terms shown in bold are clearly stopwords. I don't know whether the
>>>> problem lies in the dataset indexing, or whether there is a way to
>>>> eliminate them after the creation of the index.
>>
>> Using stop words would in fact improve the performance of the
>> KeywordLinkingEngine. The current default Solr configuration includes
>> optimized Solr field configurations for English and German.
>>
>> If you can provide such a configuration for Italian it would be great if
>> you could contribute it to Stanbol! I would be happy to work on that!
>>
>>>>
>>>> We have also made an attempt to change the stopword filter in the
>>>> SolrYard base index zip (/stanbol/entityhub/yard/solr/
>>>> src/main/resources/solr/core/default/default.solrindex.zip
>>>> and simple.solrindex.zip) and to rebuild the content hub (and the
>>>> dbpedia indexer too, with mvn assembly:single in
>>>> contenthub/indexer/dbpedia) with the right stopwords.
>>>>
>>
>> This would be the place where a Stanbol committer would change the
>> configuration. If you use the DBpedia indexer you can simply change the
>> Solr configuration in
>>
>> {indexing-root}/indexing/config/dbpedia/conf/schema.xml
>>
>> If you use the generic RDF indexer you should extract the
>> "default.solrindex.zip" to
>>
>> {indexing-root}/indexing/config/
>>
>> and then rename the directory to the same name as the name of your site
>> (this is the value of the "name" property in the
>> "/indexing/config/indexing.properties" file).
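>>
>> For example, if your indexing.properties contains (the site name
>> "dbpedia-it" is only an example)
>>
>>    name=dbpedia-it
>>
>> then the extracted Solr configuration needs to be located at
>>
>>    {indexing-root}/indexing/config/dbpedia-it/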
>>
>>>> We've checked the generated JAR and the Italian stopwords are there, as
>>>> a file inside the solr config folder, but the results were always the
>>>> same as before (still stopwords in the enhancement results).
>>>>
>>
>> If you use the RDF indexer the Solr configuration is taken
>>
>> * from the directory "{indexing-root}/indexing/config/{name}" or, if not
>>   present,
>> * from the class path used by the indexer
>>
>> So the reason why it did not work for you is that you had not created a
>> new RDF indexer version after you changed the "default.solrindex.zip" and
>> rebuilt the Entityhub. For that you would also have needed to re-create
>> the indexer by using "mvn assembly:single", as sketched below.
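>>
>> Something along these lines (the directory names depend on your checkout;
>> the important part is to rebuild the Entityhub first and then re-assemble
>> the indexer):
>>
>>    # rebuild the Entityhub so the changed default.solrindex.zip is used
>>    cd {stanbol-checkout}/entityhub
>>    mvn clean install
>>
>>    # then re-create the runnable indexer jar
>>    cd indexing/dbpedia    # or the module of the generic RDF indexer
>>    mvn assembly:single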
>>
>> But as I mentioned above there is a simpler solution for adding Italian
>> stop words: simply edit the Solr configuration contained in
>>
>> {indexing-root}/indexing/config/dbpedia/conf/
>>
>> of the DBpedia indexer.
>>
>>
>> Hopefully that answers all your questions. If you have additional
>> questions feel free to ask.
>>
>> best
>> Rupert Westenthaler
>>
>>
>>>> Do you have any suggestions on how to perform these tasks?
>>>>
>>>> Thanks in advance.
>>>>
>>>> -Stefano
>>>>
>>>> PS: here follows an enrichment example from the RDF index we built from
>>>> dbpedia with the SimpleRdfIndexer and dblp:
>>>>
>>>> text:
>>>>
>>>> *Infermiera con tbc, troppi dettagli sui media. Il Garante apre
>>>> un'istruttoria
>>>>
>>>> Il Garante Privacy ha aperto un'istruttoria in seguito alla
>>>> pubblicazione di notizie da parte di agenzie di stampa e quotidiani -
>>>> anche on line - che, nel riferire di un caso di una infermiera in
>>>> servizio presso il reparto di neonatologia del Policlinico Gemelli,
>>>> risultata positiva ai test sulla tubercolosi, hanno riportato il nome
>>>> della donna, l'iniziale del cognome e l'età.
>>>>
>>>> Il diritto-dovere dei giornalisti di informare sugli sviluppi della
>>>> vicenda, di sicura rilevanza per l'opinione pubblica, considerato
>>>> l'elevato numero di neonati e di famiglie coinvolte, deve essere
>>>> comunque bilanciato, secondo i principi stabiliti dal Codice
>>>> deontologico, con il rispetto delle persone.
>>>>
>>>> Il Garante ricorda che, anche quando questi dettagli fossero stati
>>>> forniti in una sede pubblica, i mezzi di informazione sono tenuti a
>>>> valutare con scrupolo l'interesse pubblico delle singole informazioni
>>>> diffuse.
>>>>
>>>> I media evitino dunque di riportare informazioni non essenziali che
>>>> possano ledere la riservatezza delle persone e nello stesso tempo
>>>> possano indurre ulteriori stati di allarme e di preoccupazione in
>>>> coloro che si sono avvalsi dei servizi sanitari dell'ospedale o sono
>>>> altrimenti entrati in contatto con la persona.
>>>>
>>>> Roma, 24 agosto 2011*
>>>>
>>>> Enrichments:
>>>>
>>>> 2011
>>>> Agosto
>>>> *Alla*
>>>> *Anché*
>>>> *Che?*
>>>> Cognome
>>>> *CON*
>>>> *Dal'*
>>>> Problema dei servizi
>>>> *Dell*
>>>> Diritto
>>>> Donna
>>>> Essere
>>>> Il nome della rosa
>>>> Informazione
>>>> Interesse pubblico
>>>> Media
>>>> Mezzi di produzione
>>>> *Nello*
>>>> Neonatologia
>>>> *NON*
>>>> Numero di coordinazione (chimica)
>>>> Opinione pubblica
>>>> Ospedale
>>>> *PER*
>>>> Persona
>>>> Privacy
>>>> Pubblicazione di matrimonio
>>>> Secondo
>>>> Servizio
>>>> Stampa
>>>> Stati di immaginazione
>>>> *SUI*
>>>> TBC
>>>> Tempo
>>>> .test
>>>> Tubercolosi
>>>> *UNA*
>>>>
>>>> The ones in bold are stopwords; the other results are good ones, but in
>>>> any case the stopwords were not eliminated during dataset indexing - or
>>>> maybe there is a way to eliminate them from the datasets, but I don't
>>>> know how.