Subject: Re: Apache Stanbol: technical documentation and disambiguation
From: Rupert Westenthaler <rupert.westenthaler@gmail.com>
To: dev@stanbol.apache.org, Jairo Sarabia, kritarth anand
Date: Sun, 11 Nov 2012 08:54:40 +0100

Hi Jairo,

thanks for your feedback regarding the disambiguation engine.

On Fri, Nov 9, 2012 at 6:51 PM, Jairo Sarabia wrote:
> I'm Jairo Sarabia, a web developer at Notedlinks S.L. from Barcelona
> (Spain).
> We're very interested in Apache Stanbol and we would like to know how
> Stanbol works internally: how the framework is used, the directory
> structure, and how the configuration files work.
> Is there any documentation about these? Could you send it to me?

For the Stanbol Enhancer there is developer-level documentation available.
http://stanbol.apache.org/docs/trunk/components/enhancer/ is the starting
point. The section "Main Interfaces and Utility Classes" links to the
descriptions of the different components.

> Meanwhile, I want to thank and congratulate you, because we tested the
> disambiguation engine and we liked the improved responses in English,
> although I understand that the quality is still mixed in some respects.
> Especially with Person and Organization topics, most of the time it only
> detects part of the name, especially in compound names, and this makes
> the disambiguation wrong.

This is probably because the disambiguation engine does not refine the
fise:selected-text of the fise:TextAnnotation based on disambiguation
results.
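To illustrate what I mean: the partial names would show up directly in the fise:selected-text values of the TextAnnotations. This is only a sketch; the JSON shown here is a hypothetical, simplified excerpt of an enhancement response, not the exact shape a real Stanbol instance returns.

```python
import json

# Hypothetical, simplified excerpt of Enhancer results for the text
# "Barack Obama visited Berlin". The real JSON-LD output has more
# properties; only the fields relevant here are shown.
enhancements = json.loads("""
[
  {"@type": "fise:TextAnnotation",
   "fise:selected-text": "Barack",
   "fise:start": 0, "fise:end": 6}
]
""")

# Collect the surface forms the engine selected. A partial match such as
# "Barack" instead of "Barack Obama" is exactly the symptom described.
selected = [e["fise:selected-text"] for e in enhancements
            if e.get("@type") == "fise:TextAnnotation"]
print(selected)  # -> ['Barack']
```

Checking the selected-text values of your own enhancement results against the expected full names would tell us whether this assumption holds.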
Can you provide some examples of this behavior so that I can validate
this assumption?

> We would like to know about future plans for the disambiguation engine,
> and whether it can be used for other languages.

Stanbol is a community-driven project. The engine itself was developed by
Kritarth Anand in a GSoC project [1] and contributed to Stanbol with
STANBOL-723 [2]. I was mentoring this project. I do not know Kritarth's
plans, but personally I plan to continue work on this engine as soon as I
have finished re-integrating the Stanbol NLP module with the trunk. This
work will mainly focus on making the MLT disambiguation engine
configurable and testing that it works well with the new Stanbol NLP
processing module (STANBOL-733).

[1] http://www.google-melange.com/gsoc/project/google/gsoc2012/kritarth/12001
[2] https://issues.apache.org/jira/browse/STANBOL-723

> Finally, we would like to know if it is possible to create multilingual
> DBpedia indexes so that the responses link to the DBpedia in the
> language of the text. For example, if the text is in Spanish, then the
> literals found would have relations to resources of the Spanish DBpedia
> (not English DBpedia resources).
> And if it's possible, could you explain to me how to do it?

The disambiguation-mlt engine is not language specific. In principle it
works with any Entityhub Site and any language for which a disambiguation
context is available. AFAIK the currently hard-coded configuration uses
the full-text field (which contains text in all languages) for the Solr
MLT query. The 1 GByte Solr index you probably use for disambiguation
includes short abstracts only for English. Long abstracts are not
included for any language. This is also the reason why you are not
getting disambiguation results for languages other than English. A
better-suited environment would provide short (or even long) abstracts
for the language you want to disambiguate.
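Just to make the idea concrete, here is a sketch of how a Solr MoreLikeThis query restricted to a language-specific field could look. The field names are assumptions for illustration ("text_all" standing for the all-language full-text field, "text_es" for a Spanish-specific one); they are not the actual field names of the distributed index.

```python
from urllib.parse import urlencode

# Disambiguation context extracted from a Spanish document (example text).
context = "El presidente visito Barcelona la semana pasada"

# Query parameters for Solr's MoreLikeThis component. Pointing mlt.fl at
# a language-specific field (instead of a field mixing all languages)
# keeps the term statistics meaningful for that language.
params = {
    "q": context,
    "mlt": "true",
    "mlt.fl": "text_es",   # assumed name of the Spanish full-text field
    "mlt.mintf": 1,        # minimum term frequency
    "mlt.mindf": 1,        # minimum document frequency
    "rows": 5,
}
query = "select?" + urlencode(params)
print(query)
```

With only English short abstracts in the index, such a query over a Spanish context has little to match against, which is consistent with the behavior you are seeing.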
The configuration of the engine would then not use the all-language
full-text field for the MLT queries, but instead the language-specific
one.

The reason why such information is not included in the distributed index
is simply to reduce its size. In addition, when this index was created
there was not yet an engine such as the disambiguation-mlt one that would
have consumed that information.

I have already created a DBpedia 3.8 based index that includes a lot of
information useful for disambiguation in several languages. However, this
index in its current form is not easily shared, as it is about ~100 GByte
(45 GByte compressed) in size. In addition, I have not yet had time to
validate the index (as indexing only completed shortly before I left for
ApacheCon last week). Anyway, I will use this index as the base for
further work on the disambiguation-mlt engine. I will also share the
Entityhub indexing tool configuration that was used and try to come up
with a modified configuration that is about 10 GByte in size but still
useful for disambiguation with the MLT-based engine.

best
Rupert

> That's all! And thank you very much again!
>
> Best,
>
> Jairo Sarabia

--
| Rupert Westenthaler rupert.westenthaler@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen