Subject: Re: Apache Stanbol: technical documentation and disambiguation
From: Rupert Westenthaler <rupert.westenthaler@gmail.com>
To: dev@stanbol.apache.org, Jairo Sarabia, kritarth anand
Date: Sun, 11 Nov 2012 08:54:40 +0100

Hi Jairo,

thanks for your feedback regarding the disambiguation engine.

On Fri, Nov 9, 2012 at 6:51 PM, Jairo Sarabia wrote:
> I'm Jairo Sarabia, a web developer at Notedlinks S.L. from Barcelona
> (Spain).
> We're very interested in Apache Stanbol and we would like to know how
> Stanbol works internally: how the framework is used, the directory
> structure, and how the configuration files work.
> Is there any documentation about these? Could you send it to me?

For the Stanbol Enhancer there is developer-level documentation available.
http://stanbol.apache.org/docs/trunk/components/enhancer/ is the starting
point. The section "Main Interfaces and Utility Classes" links to the
descriptions of the different components.

> Meanwhile, I want to thank and congratulate you, because we tested the
> disambiguation engine and we liked the improved responses in English,
> although I understand that the quality is still mixed in some respects.
> Especially with Person and Organization topics, most of the time it only
> detects part of the name, especially in compound names, and this makes
> the disambiguation wrong.

This is probably because the disambiguation engine does not refine the
fise:selected-text of the fise:TextAnnotation based on disambiguation
results.
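To illustrate what I mean: the partial names would show up directly in the fise:selected-text values of the TextAnnotations. This is only a sketch; the JSON shown here is a hypothetical, simplified excerpt of an enhancement response, not the exact shape a real Stanbol instance returns.

```python
import json

# Hypothetical, simplified excerpt of Enhancer results for the text
# "Barack Obama visited Berlin". The real JSON-LD output has more
# properties; only the fields relevant here are shown.
enhancements = json.loads("""
[
  {"@type": "fise:TextAnnotation",
   "fise:selected-text": "Barack",
   "fise:start": 0, "fise:end": 6}
]
""")

# Collect the surface forms the engine selected. A partial match such as
# "Barack" instead of "Barack Obama" is exactly the symptom described.
selected = [e["fise:selected-text"] for e in enhancements
            if e.get("@type") == "fise:TextAnnotation"]
print(selected)  # -> ['Barack']
```

Checking the selected-text values of your own enhancement results against the expected full names would tell us whether this assumption holds.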
Can you provide some examples of this behavior so that I can validate
this assumption?

> We would like to know about future plans for the disambiguation engine,
> and whether it can be used for other languages.

Stanbol is a community-driven project. The engine itself was developed by
Kritarth Anand in a GSoC project [1] and contributed to Stanbol with
STANBOL-723 [2]. I was mentoring this project. I do not know Kritarth's
plans, but personally I plan to continue work on this engine as soon as I
have finished re-integrating the Stanbol NLP module with the trunk. This
work will mainly focus on making the MLT disambiguation engine
configurable and testing that it works well with the new Stanbol NLP
processing module (STANBOL-733).

[1] http://www.google-melange.com/gsoc/project/google/gsoc2012/kritarth/12001
[2] https://issues.apache.org/jira/browse/STANBOL-723

> Finally, we would like to know if it is possible to create multilingual
> DBpedia indexes so that the responses link to the DBpedia in the
> language of the text. For example, if the text is in Spanish, then the
> literals found would have relations to resources of the Spanish DBpedia
> (not English DBpedia resources).
> And if it's possible, could you explain to me how to do it?

The disambiguation-mlt engine is not language specific. In principle it
works with any Entityhub Site and any language for which a disambiguation
context is available. AFAIK the currently hard-coded configuration uses
the full-text field (which contains text in all languages) for the Solr
MLT query. The 1 GByte Solr index you probably use for disambiguation
includes short abstracts only for English. Long abstracts are not
included for any language. This is also the reason why you are not
getting disambiguation results for languages other than English. A
better-suited environment would provide short (or even long) abstracts
for the language you want to disambiguate.
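Just to make the idea concrete, here is a sketch of how a Solr MoreLikeThis query restricted to a language-specific field could look. The field names are assumptions for illustration ("text_all" standing for the all-language full-text field, "text_es" for a Spanish-specific one); they are not the actual field names of the distributed index.

```python
from urllib.parse import urlencode

# Disambiguation context extracted from a Spanish document (example text).
context = "El presidente visito Barcelona la semana pasada"

# Query parameters for Solr's MoreLikeThis component. Pointing mlt.fl at
# a language-specific field (instead of a field mixing all languages)
# keeps the term statistics meaningful for that language.
params = {
    "q": context,
    "mlt": "true",
    "mlt.fl": "text_es",   # assumed name of the Spanish full-text field
    "mlt.mintf": 1,        # minimum term frequency
    "mlt.mindf": 1,        # minimum document frequency
    "rows": 5,
}
query = "select?" + urlencode(params)
print(query)
```

With only English short abstracts in the index, such a query over a Spanish context has little to match against, which is consistent with the behavior you are seeing.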
The configuration of the engine would then not use the all-language
full-text field for the MLT queries, but instead the language-specific
one.

The reason why such information is not included in the distributed index
is simply to reduce its size. In addition, when this index was created
there was not yet an engine such as the disambiguation-mlt one that would
have consumed that information.

I have already created a DBpedia 3.8 based index that includes a lot of
information useful for disambiguation in several languages. However, this
index in its current form is not easily shared, as it is about ~100 GByte
(45 GByte compressed) in size. In addition, I have not yet had time to
validate the index (as indexing only completed shortly before I left for
ApacheCon last week). Anyway, I will use this index as the base for
further work on the disambiguation-mlt engine. I will also share the
Entityhub indexing tool configuration that was used and try to come up
with a modified configuration that is about 10 GByte in size but still
useful for disambiguation with the MLT-based engine.

best
Rupert

> That's all! And thank you very much again!
>
> Best,
>
> Jairo Sarabia

--
| Rupert Westenthaler rupert.westenthaler@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen