Reply-To: jbaldrid@mail.utexas.edu
In-Reply-To: <4F8D6A36.5010504@gmail.com>
References: <4F8D6455.5060200@gmail.com> <4F8D6791.1030308@gmail.com> <4F8D6A36.5010504@gmail.com>
Date: Tue, 17 Apr 2012 08:20:10 -0500
Subject: Re: Merging the output of multiple name finders
From: Jason Baldridge
To: dev@opennlp.apache.org
Content-Type: multipart/alternative; boundary=e89a8ff24e95575efd04bddfcc97

--e89a8ff24e95575efd04bddfcc97
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

I haven't followed this in detail, but I do wonder why we don't have a
single model that just predicts all the types? That is the standard thing
to do...

FWIW, integrating the output of multiple classifiers and incorporating
their probabilities is something that can be done quite cleanly with
approaches like Integer Linear Programming.

Jason

On Tue, Apr 17, 2012 at 8:03 AM, Jörn Kottmann wrote:

> I propose that we make a simple baseline implementation
> which takes all output spans, orders them and then resolves
> the ambiguities based on the order. This will prefer longer
> names over shorter names, but ignores the type.
>
> There are more sophisticated ways of handling this,
> e.g. taking probabilities from the statistical name finders into
> account, but these might be a bit more restrictive as well.
>
> It's always good to have some simple baseline, to see how much
> something more complicated improves it.
>
> Any opinions?
>
> Jörn
>
>
> On 04/17/2012 02:52 PM, Jörn Kottmann wrote:
>
>> If you don't want to handle these cases, you can simply copy all names
>> together into a list, and then do evaluation on this list.
>> This approach works with our evaluation, but will usually be an issue for
>> applications which expect output
>> where the ambiguities mentioned earlier are resolved.
>>
>> Jörn
>>
>> On 04/17/2012 02:38 PM, Jim - FooBar(); wrote:
>>
>>> Ok first of all you're referring to the final merging
>>> (AggregateNameFinder) and not the multiple dictionaries where no merging
>>> occurs...anyway let's deal with this at the moment. let's see...
>>>
>>>> - Two names can be identical and have the same type or a different type
>>>>
>>> Well if the type is different the spans are not identical (equal) so
>>> you keep both and do some reasoning over them (see below).
>>> If the type is the same and the spans cover the same text then they are
>>> equal so you only keep one of them.
>>>
>>>> - Two names have intersecting spans
>>>>
>>> It is very unlikely that both are correct so in the simplest case of
>>> keeping them both you may lose some precision. However considering how
>>> often that could happen it becomes unimportant. Or you could do some
>>> reasoning (see below) again if they have the same type. If they don't have
>>> the same type then why not keep them both again?
>>>
>>>> - One name is contained in another like this:
>>>> a b c d
>>>>
>>> well, this is exactly the same case as before conceptually. If they have
>>> the same type it's very likely that one is wrong. You can do the same sort
>>> of reasoning as above. If they don't there is no way to know with
>>> confidence what to do so i say keep them both.
>>>
>>> the reasoning i'm referring to is simply to *trust the dictionary* (if
>>> one exists). If one doesn't exist and one is trying to merge results from
>>> several maxent models for example, then we cannot make an informed
>>> decision. It is only the dictionary that can provide facts. all the rest
>>> are probabilities...
>>>
>>> Jim
>>>
>>>
>>>> Hi all,
>>>>
>>>> in one of the jiras we started a discussion about merging the output
>>>> of multiple name finders and which conflicts exist.
>>>> Let's move it back to the dev list.
>>>>
>>>> The merging code needs to handle these cases:
>>>>
>>>> - Two names can be identical and have the same type or a different type.
>>>>
>>>> - Two names have intersecting spans like this:
>>>> a b c d
>>>>
>>>> - One name is contained in another like this:
>>>> a b c d
>>>>
>>>> Depending on the use case and merging logic it might be resolved
>>>> differently.
>>>>
>>>> Jörn
>>>>
>>>
>>>
>>>
>>
>
-- 
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

--e89a8ff24e95575efd04bddfcc97--
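
For reference, the simple baseline proposed in the quoted thread (collect all output spans, order them, then resolve ambiguities based on that order while ignoring the type) could look roughly like the sketch below. It is only an illustration under stated assumptions: `NameSpan` and `BaselineSpanMerger` are hypothetical names made up for this example and are not OpenNLP classes, spans are assumed to be half-open token ranges [start, end), and the exact ordering rule (start ascending, longer span first on a tie) is one plausible reading of "orders them", not something the thread spells out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a typed name span over tokens [start, end). */
final class NameSpan {
    final int start;
    final int end;
    final String type;

    NameSpan(int start, int end, String type) {
        if (start < 0 || end <= start) {
            throw new IllegalArgumentException("need 0 <= start < end");
        }
        this.start = start;
        this.end = end;
        this.type = type;
    }

    /** True if the two half-open ranges share at least one token. */
    boolean overlaps(NameSpan other) {
        return start < other.end && other.start < end;
    }

    @Override
    public String toString() {
        return type + "[" + start + ".." + end + ")";
    }
}

/** Hypothetical baseline merger; assumed behavior, not an OpenNLP API. */
final class BaselineSpanMerger {

    /** Merges the outputs of several name finders into non-overlapping spans. */
    static List<NameSpan> merge(List<List<NameSpan>> finderOutputs) {
        // 1. Copy all names together into one list.
        List<NameSpan> all = new ArrayList<>();
        for (List<NameSpan> output : finderOutputs) {
            all.addAll(output);
        }

        // 2. Order by start ascending; on equal start the longer span comes first.
        //    List.sort is stable, so identical spans keep finder order.
        all.sort((a, b) -> a.start != b.start
                ? Integer.compare(a.start, b.start)
                : Integer.compare(b.end, a.end));

        // 3. Keep a span only if it does not overlap the last kept span.
        //    Kept spans are disjoint and sorted, so checking the last one suffices.
        //    The type is deliberately ignored.
        List<NameSpan> kept = new ArrayList<>();
        NameSpan last = null;
        for (NameSpan span : all) {
            if (last == null || !span.overlaps(last)) {
                kept.add(span);
                last = span;
            }
        }
        return kept;
    }
}
```

With this ordering, a name contained in another is dropped in favor of the containing (longer) name, and of two identical spans with different types only the one from the earlier finder survives. Two merely intersecting spans are resolved in favor of the one that starts first, which is not necessarily the longer one; a dictionary-first rule or a probability-based resolution, as discussed in the thread, would replace step 3.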