From: Benson Margulies
To: java-user@lucene.apache.org
Subject: Re: PositionLengthAttribute
Date: Sat, 7 Sep 2013 07:44:01 -0400

In Japanese, compounds are just decompositions of the input string. In
other languages, compounds can manufacture entire tokens from thin air.
In those cases, it's something of a question how to decide on the
offsets. I think that you're right, eventually, insofar as there's some
offset in the original that might as well be blamed for any given
component.

On Fri, Sep 6, 2013 at 9:37 PM, Robert Muir wrote:
> On Fri, Sep 6, 2013 at 9:32 PM, Benson Margulies wrote:
>> On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir wrote:
>>> it's the latter. the way it's designed to work i think is illustrated
>>> best in kuromoji analyzer where it heuristically decompounds nouns:
>>>
>>> if it decompounds ABCD into AB + CD, then the tokens are AB and CD.
>>> these both have posinc=1.
>>> however (to compensate for precision issue you mentioned on the other
>>> thread), it keeps the full compound as a synonym too (there are some
>>> papers benchmarking this approach for decompounding, just think of IDF
>>> etc sorting things out).
>>> so that ABCD synonym has position increment 0, and it "sits" at the
>>> same position as the first token (AB). but it has positionLength=2,
>>> which basically keeps the information in the chain that this "synonym"
>>> spans across both AB and CD.
>>>
>>> so the output is like this: AB(posinc=1,posLength=1),
>>> ABCD(posinc=0,posLength=2), CD(posinc=1, posLength=1)
>>
>> I suppose this works best if you actually know the offsets of the
>> pieces. In disassembling German, this is not always straightforward.
>>
>
> i don't really see how it has anything to do with natural languages?
> it's just the way you represent the compound components in the
> tokenstream.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
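
The positions in the quoted AB / ABCD / CD example can be worked through
with a minimal sketch. This is plain Java, not Lucene's API: the Tok
class and the spans helper are hypothetical names made up for
illustration. It assumes the usual convention that a token's start
position is the running sum of position increments and that its end
position is start plus positionLength.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a token stream; illustrative only, not Lucene's API.
class TokenGraphSketch {

    // Hypothetical holder for the three values discussed in the thread.
    static final class Tok {
        final String term;
        final int posInc;     // position increment relative to the previous token
        final int posLength;  // number of positions the token spans

        Tok(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    // Returns one "term[start->end]" string per token, where start is the
    // running sum of position increments and end is start + posLength.
    static List<String> spans(List<Tok> stream) {
        List<String> out = new ArrayList<>();
        int pos = -1; // assumption: the first posinc=1 lands on position 0
        for (Tok t : stream) {
            pos += t.posInc;
            out.add(t.term + "[" + pos + "->" + (pos + t.posLength) + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> stream = new ArrayList<>();
        stream.add(new Tok("AB", 1, 1));
        stream.add(new Tok("ABCD", 0, 2)); // full compound kept as a synonym
        stream.add(new Tok("CD", 1, 1));
        // prints [AB[0->1], ABCD[0->2], CD[1->2]]
        System.out.println(spans(stream));
    }
}
```

ABCD starts where AB starts and ends where CD ends, which is the
"spans across both AB and CD" information that positionLength=2 carries.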