Date: Fri, 27 Sep 2013 15:51:51 +0200
Subject: Re: FST Linking Engine (STANBOL-1128)
From: Rupert Westenthaler
To: "dev@stanbol.apache.org", dsmiley@apache.org

Hi all,

here is an update on the FST linking engine. In the last weeks I have
invested a lot of time in further improvements of the engine and in
contributions to the upstream project SolrTextTagger [1].

Since STANBOL-1153 the Solr schemas used by the Entityhub are compatible
with the FST linking engine. That means that Entityhub Sites using the
new schema can be used with the FST linking engine. With STANBOL-1155
the dbpedia default data index is also compatible with the FST linking
engine.

The FST linking engine now also supports recalculating FST models after
changes to the SolrCore (e.g. when you add/update/delete an Entity of an
Entityhub ManagedSite). Note, however, that this is an expensive
operation that can take some time. While the new model is being created
the engine keeps using the old one. Recreation is done in
lowest-priority background threads.

The engine depends on SolrTextTagger 1.2-SNAPSHOT. Until Pull Request 17
[4] is merged, the build requires my branch [5]. I hope that David
Smiley agrees with me that version 1.2 can soon be released and added to
Maven Central. When this is done I will include the FST linking engine
in the default build and the launchers, and I will also add integration
tests for it. Unit tests are already present.
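To illustrate the model recreation described above (the old model keeps
serving requests while a lowest-priority background thread builds the
replacement), here is a minimal sketch. It is not the engine's actual
code: the class FstModelHolder, its methods and the generic model type M
are hypothetical names chosen for this example, and the real engine may
organize this quite differently.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/**
 * Hypothetical holder for an in-memory model (illustration only).
 * Readers always see a complete model; a rebuild swaps in the new
 * one only after it has been fully built.
 */
final class FstModelHolder<M> {

    // The model currently used to answer linking requests.
    private final AtomicReference<M> current;

    // Single daemon worker running at the lowest thread priority,
    // so rebuilds compete as little as possible with request threads.
    private final ExecutorService rebuilder =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "fst-model-rebuild");
            t.setPriority(Thread.MIN_PRIORITY);
            t.setDaemon(true);
            return t;
        });

    FstModelHolder(M initial) {
        this.current = new AtomicReference<>(initial);
    }

    /** Returns the model to use right now (the old one during a rebuild). */
    M get() {
        return current.get();
    }

    /**
     * Schedules an expensive rebuild in the background. The builder runs
     * to completion first; only then is the reference replaced atomically.
     */
    Future<?> rebuild(Supplier<M> builder) {
        return rebuilder.submit(() -> current.set(builder.get()));
    }
}
```

A caller would invoke rebuild(...) after a change to the underlying
index and simply keep calling get() for every request. Requests that
arrive during the rebuild are answered with the old model, which matches
the behaviour described above.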
Because the engine is now also easier to use for "custom vocabularies"
(vocabularies with typically 10k-500k entities), I have run some
benchmark tests that compare the FST linking engine with the current
EntityLinkingEngine. While the tests with Freebase (36 million entities)
showed about 5 times better performance, the gains for such smaller
vocabularies were in the range of 50-100 times. The reason is that FST
linking can be done fully in memory for vocabularies of that size, while
the Solr query based EntityLinkingEngine sees only minimal performance
gains for smaller vocabularies. See the detailed test results below:

On Fri, Aug 23, 2013 at 6:18 PM, Rupert Westenthaler wrote:
> Initial Performance Tests:
> ----
>
> I performed a Test on my MacBook Pro Core i7 2.6GHz, SSD with sending
> 5k dbpedia long abstracts with 10 concurrent threads with the Enhancer
> Stress Test Tool [3] to chains that included Language detection,
> OpenNLP Token, Sentence and POS tagging and
>
> (A) FST linking engine configured for Freebase with a Document Cache
> size of 1 million vs.
> (B) EntityLinking engine also configured for freebase.
>
> with
>
> (A) average of 70ms for FST linking (with 100% CPU)
> (B) average of 390ms for EntityLinking
>
> when doing the test with ProperNoun linking deactivated (basically
> also linking Common Nouns to simulate longer texts) it gives the
> following results:
>
> (A) average of 267ms for FST linking (with 100% CPU)
> (B) average of 1417ms for EntityLinking
>
> In both cases the FST linking engine is about 5 times faster as the
> currently used EntityLinking engine.
>

I made some additional tests with smaller vocabularies, especially those
where all Entities can be cached in the LRU cache for SolrDocuments.

Setup:

* Hardware setup was exactly the same as for the initial tests.
* 10k dbpedia long abstracts with 10 concurrent threads
* Vocabulary: the dbpedia default data index (~25k entities).
* Label: Instead of "rdfs:label", linking was done against
  "dbpedia-ont:surfaceForm". This property contains the label of the
  Entity as well as all labels of Redirects to that Entity.

(A) FST linking with a cache that can hold all Entities (a feasible
    config for vocabularies with fewer than 1 million entities)
(B) EntityLinking engine

With the ProperNoun linking configuration:

(A) average of 5ms for FST linking
(B) average of 339ms for EntityLinking

When configured to link all Nouns:

(A) average of 7ms for FST linking
(B) average of 994ms for EntityLinking

best
Rupert

> [1] https://github.com/OpenSextant/SolrTextTagger/
> [2] http://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/lucenefstlinking/README.md
> [3] http://stanbol.apache.org/docs/trunk/utils/enhancerstresstest
[4] https://github.com/OpenSextant/SolrTextTagger/pull/17
[5] https://github.com/westei/SolrTextTagger

--
| Rupert Westenthaler rupert.westenthaler@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen