Subject: Re: Problem trying to create a new dbpedia index and site in Italian.
From: Rupert Westenthaler <rupert.westenthaler@gmail.com>
Date: Sat, 3 Mar 2012 19:23:07 +0100
To: stanbol-dev@incubator.apache.org
Cc: Alessandra Donnini, Andrea Ciapetti

Looks good to me.

Do not forget to

1. also define the text_it <fieldType> in the schema.xml
2. use the *text_it* type in the <field> and <dynamicField> definitions;
   this includes changing the "type" value of the affected fields, for
   instance along the lines of the sketch below.
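A minimal sketch (the field names here are only placeholders; the actual
field and dynamicField names depend on the schema.xml of your index):

   <!-- illustrative only: point the Italian fields at the new type -->
   <field name="label_it" type="text_it" indexed="true" stored="true"/>
   <dynamicField name="*_it" type="text_it" indexed="true" stored="true"/>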
Hope this improves results for Italian!

best
Rupert

On 03.03.2012, at 18:27, Stefano Norcia wrote:

> Hi Rupert,
> thank you very much for your answer; it was very helpful and let me
> understand more deeply how the Entityhub indexing works.
> I wrote a possible candidate for the text_it field in the schema.xml of
> the various indexers in Stanbol:
>
> <fieldType name="text_it" class="solr.TextField"
>            positionIncrementGap="100" omitNorms="false">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory"
>             synonyms="synonyms_it.txt" ignoreCase="true" expand="false"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory"
>             words="italian_stop.txt" enablePositionIncrements="true"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
>   </analyzer>
> </fieldType>
>
> You can see that I used the SnowballPorterFilterFactory as suggested in
> http://wiki.apache.org/solr/LanguageAnalysis#Italian;
> the stopword list for Italian can be found at this link:
>
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt
>
> If someone more skilled than me in Solr is reading this, feel free to
> correct or suggest a better definition for the field.
>
> I'm going to test indexing with this configuration; I'll let you know
> about my progress (if any).
>
> Regards,
> Stefano
>
>
> On Thu, Mar 1, 2012 at 7:47 PM, Rupert Westenthaler <
> rupert.westenthaler@gmail.com> wrote:
>
>> Hi Stefano, Luca
>>
>> See my comments inline.
>>
>> On 01.03.2012, at 15:59, Luca Dini wrote:
>>
>>> Dear Stefano,
>>> I am new as well on the list, and we are also working in the context of
>>> the early adoption program. If I understand correctly, the problem is
>>> that without an appropriate Named Entity extraction engine for Italian,
>>> I am afraid that the results would always be disappointing. In the
>>> context of our project we will integrate enhancement services of NER
>>> for Italian and French (and possibly keyword extraction), so, hopefully,
>>> you will be able to profit from the power of Stanbol. There might be
>>> some problems in terms of timing, as it is not clear whether, in the
>>> short project window, there will be the possibility of feeding our
>>> integration into yours. Is the unavailability of Italian NER a blocking
>>> factor for you, or can you go on with development while waiting for the
>>> integration?
>>>
>>
>> That's true. For datasets such as DBpedia the combination of "NER +
>> NamedEntityTaggingEngine" is the way to go. That's simply because DBpedia
>> defines entities for nearly all natural-language words, so "keyword
>> extraction" (as used by the KeywordLinkingEngine) does not really work.
>>
>> However, note that the KeywordLinkingEngine has support for POS (Part of
>> Speech) taggers. So if a POS tagger is available for a given language, it
>> will use this information to only look up nouns (see [1] for more
>> detailed information on the algorithm used). The bad news is that there
>> is no POS tagger available for Italian :(
>>
>> [1]
>> http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
>>
>>
>> The final possibility to improve results of the KeywordLinkingEngine with
>> DBpedia is to filter out all entities with types other than Persons,
>> Organizations and Places. However, this also has a big disadvantage: it
>> will also exclude all redirects, and such entities are very important
>> because they allow linking entities that are mentioned by alternate
>> names. However, if you would like to try this you should have a look at
>>
>> org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter
>>
>> This filter is included in the default configuration of the DBpedia
>> indexer and can be activated by changing the configuration within
>>
>> {indexing-dir}/indexing/config/entityTypes.properties
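>>
>> Such a type filter could look something like the following sketch (the
>> property keys shown are illustrative from memory; please check the
>> comments in the shipped entityTypes.properties for the exact syntax):
>>
>>    # sketch: only index entities that have one of these rdf:type values
>>    field=rdf:type
>>    values=dbp-ont:Person;dbp-ont:Organisation;dbp-ont:Place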
>>
>> @ Luca
>>> we will integrate enhancement services of NER for Italian and French
>>
>> That would be really great. Is the framework you integrate open source?
>> Can you provide a link?
>>
>>
>>> Cheers,
>>> Luca
>>>
>>> On 01/03/2012 14:49, Stefano Norcia wrote:
>>>> Hi all,
>>>>
>>>> My name is Stefano Norcia and I'm working on the early adoption project
>>>> for Etcware.
>>>>
>>>> For our early adoption project (Etcware Early Adoption project) we need
>>>> to use a DBpedia index in the Italian language in the enhancement and
>>>> enrichment process enabled by the Stanbol engines.
>>>>
>>>> The main problem is that the NLP module does not support the Italian
>>>> language directly, so if you put an Italian text into the enhancement
>>>> engine the dbpedia engine does not detect any concept/place/person.
>>>>
>>
>> The NER engine uses the language as detected by the LangID engine and
>> deactivates itself if no NER model is available for the detected
>> language. In such a case the NamedEntityTaggingEngine will also link no
>> entities, because there are no named entities detected within the text.
>>
>> However, this does not mean that no Italian labels are present in the
>> DBpedia index. In fact Italian labels ARE present in all DBpedia
>> indexes. There is no need to build your own indexes unless you have some
>> special requirement.
>>
>> You can try this even on the test server. Simply send some Italian text
>> first to
>>
>> http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-ner
>>
>> This chain uses "NER + NamedEntityTaggingEngine", so you will not get any
>> results - as expected. Then you can try the same text with
>>
>> http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-keyword
>>
>> This will return linked entities. But as I mentioned above, and as you
>> already experienced yourself, it also produces a lot of false positives.
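>>
>> A quick way to test this is plain curl (the Accept header only selects
>> the RDF serialization of the returned enhancements):
>>
>>    curl -X POST \
>>         -H "Content-Type: text/plain" \
>>         -H "Accept: application/rdf+xml" \
>>         --data "Il Garante Privacy ha aperto un'istruttoria." \
>>         http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-keyword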
>>
>>>> We have done some experiments to reach this goal:
>>>>
>>>> Our first attempt was to rebuild the dbpedia index following the
>>>> instructions found in the stanbol/entityhub/indexing/dbpedia folder. In
>>>> this folder there is a shell script (fetch_prepare.sh) that describes
>>>> how to prepare the dbpedia datasets before creating the index. We
>>>> followed those instructions and tried to create a new index to replace
>>>> the standard English dbpedia index and "site", starting from the
>>>> Italian dbpedia datasets. We are aware that the Italian datasets are
>>>> not complete and that some packages are missing (like
>>>> persondata_en.nt.bz2 and so on).
>>>> These are the packages we used to create the index (
>>>> http://downloads.dbpedia.org/3.7/it/):
>>>>
>>>> o dbpedia_3.7.owl.bz2
>>>> o geo_coordinates_it.nt.bz2
>>>> o instance_types_it.nt.bz2
>>>> o labels_it.nt.bz2
>>>> o long_abstracts_it.nt.bz2
>>>> o short_abstracts_it.nt.bz2
>>>>
>>
>> You should always include the English versions as well, as they contain a
>> lot of information that is also very useful for other languages.
>>
>>>> We were also able to create the incoming_links text file from the
>>>> package page_links_it.nt.bz2.
>>>> After rebuilding the index we replaced the DBpedia English index in
>>>> stanbol with our custom one (simply replacing the old one with the new
>>>> one and restarting stanbol).
>>>>
>>>> Sadly, after that, the results produced by the enhancement engines were
>>>> exactly the same as before: neither are Italian concepts detected, nor
>>>> are possible enhancements returned from any of the other enhancement
>>>> engines.
>>>>
>>
>> I assume that this index was completely fine. The reason why you were not
>> getting any results is that the NER engine deactivates itself for
>> Italian texts.
>>
>> Note also that the
>>
>> * NamedEntityTaggingEngine and
>> * KeywordLinkingEngine
>>
>> use the exact same DBpedia index. So you can/should use the same index
>> for both. This is also the case on "http://dev.iks-project.eu:8081".
>>
>> Also note that the DBpedia indexer and the generic RDF indexer create the
>> same type of indexes. The DBpedia indexer merely contains a configuration
>> that is optimized for DBpedia.
>>
>>>> As a second attempt, we decided to use the generic RDF indexer (combined
>>>> with the standard KeywordLinkingEngine) to process the Italian DBpedia
>>>> datasets; in this case the indexing process succeeded and we were able
>>>> to get a lot of results when testing the enhancement engines with
>>>> Italian content. This time the problem is that there are simply too
>>>> many results, and they also contain stopwords.
>>>>
>>>> As an example you can find, in the attachment, a sample text submitted
>>>> for enhancement and the results returned by the KeywordLinkingEngine.
>>>>
>>>> The terms shown in bold are clearly stopwords. I don't know whether the
>>>> problem lies in the dataset indexing, or whether there is a way to
>>>> eliminate them after the creation of the index.
>>
>> Using stop words would in fact improve the performance of the
>> KeywordLinkingEngine. The current default Solr configuration includes
>> optimized Solr field configurations for English and German.
>>
>> If you can provide such a configuration for Italian it would be great if
>> you could contribute it to Stanbol! I would be happy to work on that!
>>
>>>>
>>>> We have also made an attempt to change the stopword filter in the
>>>> SolrYard base index zip (/stanbol/entityhub/yard/solr/
>>>> src/main/resources/solr/core/default/default.solrindex.zip
>>>> and simple.solrindex.zip) and to rebuild the content hub (and the
>>>> dbpedia indexer too, with mvn assembly:single in
>>>> contenthub/indexer/dbpedia) with the right stopwords.
>>>>
>>
>> This would be the place where a Stanbol committer would change the
>> configuration. If you use the DBpedia indexer you can simply change the
>> Solr configuration in
>>
>> {indexing-root}/indexing/config/dbpedia/conf/schema.xml
>>
>> If you use the generic RDF indexer you should extract the
>> "default.solrindex.zip" to
>>
>> {indexing-root}/indexing/config/
>>
>> and then rename the directory to the same name as the name of your site
>> (this is the value of the "name" property in the
>> "/indexing/config/indexing.properties" file).
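>>
>> For example, if your indexing.properties contains (the site name
>> "dbpedia-it" is only an example)
>>
>>    name=dbpedia-it
>>
>> then the extracted Solr configuration needs to be located at
>>
>>    {indexing-root}/indexing/config/dbpedia-it/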
>>
>>>> We've checked the generated JAR and the Italian stopwords are there, as
>>>> a file inside the solr config folder, but the results were always the
>>>> same as before (still stopwords in the enhancement results).
>>>>
>>
>> If you use the RDF indexer the Solr configuration is taken
>>
>> * from the directory "{indexing-root}/indexing/config/{name}" or, if not
>>   present,
>> * from the class path used by the indexer
>>
>> So the reason why it did not work for you is that you had not created a
>> new RDF indexer version after you changed the "default.solrindex.zip" and
>> rebuilt the Entityhub. For that you would also have needed to re-create
>> the indexer by using "mvn assembly:single", as sketched below.
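>>
>> Something along these lines (the directory names depend on your checkout;
>> the important part is to rebuild the Entityhub first and then re-assemble
>> the indexer):
>>
>>    # rebuild the Entityhub so the changed default.solrindex.zip is used
>>    cd {stanbol-checkout}/entityhub
>>    mvn clean install
>>
>>    # then re-create the runnable indexer jar
>>    cd indexing/dbpedia    # or the module of the generic RDF indexer
>>    mvn assembly:single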
>>
>> But as I mentioned above there is a simpler solution for adding Italian
>> stop words: simply edit the Solr configuration contained in
>>
>> {indexing-root}/indexing/config/dbpedia/conf/
>>
>> of the DBpedia indexer.
>>
>>
>> Hopefully that answers all your questions. If you have additional
>> questions feel free to ask.
>>
>> best
>> Rupert Westenthaler
>>
>>
>>>> Do you have any suggestions on how to perform these tasks?
>>>>
>>>> Thanks in advance.
>>>>
>>>> -Stefano
>>>>
>>>> PS: here follows an enrichment example from the RDF index we built from
>>>> dbpedia with the SimpleRdfIndexer and dblp:
>>>>
>>>> text:
>>>>
>>>> *Infermiera con tbc, troppi dettagli sui media. Il Garante apre
>>>> un'istruttoria
>>>>
>>>> Il Garante Privacy ha aperto un'istruttoria in seguito alla
>>>> pubblicazione di notizie da parte di agenzie di stampa e quotidiani -
>>>> anche on line - che, nel riferire di un caso di una infermiera in
>>>> servizio presso il reparto di neonatologia del Policlinico Gemelli,
>>>> risultata positiva ai test sulla tubercolosi, hanno riportato il nome
>>>> della donna, l'iniziale del cognome e l'età.
>>>>
>>>> Il diritto-dovere dei giornalisti di informare sugli sviluppi della
>>>> vicenda, di sicura rilevanza per l'opinione pubblica, considerato
>>>> l'elevato numero di neonati e di famiglie coinvolte, deve essere
>>>> comunque bilanciato, secondo i principi stabiliti dal Codice
>>>> deontologico, con il rispetto delle persone.
>>>>
>>>> Il Garante ricorda che, anche quando questi dettagli fossero stati
>>>> forniti in una sede pubblica, i mezzi di informazione sono tenuti a
>>>> valutare con scrupolo l'interesse pubblico delle singole informazioni
>>>> diffuse.
>>>>
>>>> I media evitino dunque di riportare informazioni non essenziali che
>>>> possano ledere la riservatezza delle persone e nello stesso tempo
>>>> possano indurre ulteriori stati di allarme e di preoccupazione in
>>>> coloro che si sono avvalsi dei servizi sanitari dell'ospedale o sono
>>>> altrimenti entrati in contatto con la persona.
>>>>
>>>> Roma, 24 agosto 2011*
>>>>
>>>> Enrichments:
>>>>
>>>> 2011
>>>> Agosto
>>>> *Alla*
>>>> *Anché*
>>>> *Che?*
>>>> Cognome
>>>> *CON*
>>>> *Dal'*
>>>> Problema dei servizi
>>>> *Dell*
>>>> Diritto
>>>> Donna
>>>> Essere
>>>> Il nome della rosa
>>>> Informazione
>>>> Interesse pubblico
>>>> Media
>>>> Mezzi di produzione
>>>> *Nello*
>>>> Neonatologia
>>>> *NON*
>>>> Numero di coordinazione (chimica)
>>>> Opinione pubblica
>>>> Ospedale
>>>> *PER*
>>>> Persona
>>>> Privacy
>>>> Pubblicazione di matrimonio
>>>> Secondo
>>>> Servizio
>>>> Stampa
>>>> Stati di immaginazione
>>>> *SUI*
>>>> TBC
>>>> Tempo
>>>> .test
>>>> Tubercolosi
>>>> *UNA*
>>>>
>>>> The ones in bold are stopwords; the other results are good ones, but in
>>>> any case the stopwords were not eliminated during dataset indexing - or
>>>> maybe there is a way to eliminate them from the datasets, but I don't
>>>> know how.