From: Michael Sokolov
Date: Sat, 4 Aug 2018 16:03:56 -0700
Subject: Re: offsets
To: java-user@lucene.apache.org

OK, so I thought some
more concrete evidence might be helpful to make the case here, and did a
quick POC. To get access to precise within-token offsets we do need to make
some changes to the public API, but the footprint could be kept small. In
the version I worked up, I extracted the character offset mapping
implementation from BaseCharFilter into a separate CharOffsetMap
interface/class and added these new public methods to existing classes:

  TokenStream.getCharOffsetMap()
  CharFilter.uncorrect(int correctOffset)
    (pseudo-inverse of correct -- returns the left-most offset in the
    current character coordinates that corresponds to the given original
    character offset)

The CharOffsetMap interface has just two methods, correctOffset and
uncorrectOffset, that support the offset mapping in both CharFilter and
TokenStream.

To fully support setting offsets in TokenFilters we need (at least
something like) this inverse offset-correction method (uncorrect), because
OffsetAttribute's offsets are in the original "correct" character
coordinates, while token lengths in incrementToken() are in filtered ("not
correct") character space and are not anchored to the origin, so they
cannot be converted directly.

I recognize the impact is not huge here, but we do have TokenFilters that
split tokens, and a currently trappy OffsetAttribute API. Personally I
think it makes sense to acknowledge that and make it a first-class
citizen, but I guess another alternative (for fixing the trappiness) would
be to make OffsetAttribute unmodifiable. I know that either approach would
have saved me hours of confusion as I tried to correctly implement
offsets.

On Wed, Aug 1, 2018 at 8:57 AM Michael Sokolov wrote:

> Given that character transformations do happen in TokenFilters,
> shouldn't we strive to have an API that supports correct offsets (i.e.,
> highlighting) for any combination of token filters? Currently we can't
> do that.
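[Editor's note: as a rough, Lucene-free sketch of what the two-int-array CharOffsetMap described above might look like. All names here (addOffCorrectMap, correctOffset, uncorrectOffset) are the hypothetical API under discussion, modeled on BaseCharFilter, not shipped Lucene code.]

```java
import java.util.Arrays;

// Sketch of the proposed CharOffsetMap: two parallel int arrays mapping
// filtered-text offsets back to original-text offsets.
public class CharOffsetMapSketch {
    private int[] offsets = new int[0]; // positions in the filtered text
    private int[] diffs = new int[0];   // cumulative correction at that position

    // Record that from filtered offset `off` on, original = filtered + cumulativeDiff.
    public void addOffCorrectMap(int off, int cumulativeDiff) {
        offsets = Arrays.copyOf(offsets, offsets.length + 1);
        diffs = Arrays.copyOf(diffs, diffs.length + 1);
        offsets[offsets.length - 1] = off;
        diffs[diffs.length - 1] = cumulativeDiff;
    }

    // Map a filtered-text offset to an original-text offset
    // (what CharFilter.correctOffset does today).
    public int correctOffset(int currentOff) {
        int idx = Arrays.binarySearch(offsets, currentOff);
        if (idx < 0) idx = -idx - 2; // last entry at or before currentOff
        return idx < 0 ? currentOff : currentOff + diffs[idx];
    }

    // Pseudo-inverse (the proposed "uncorrect"): the left-most filtered
    // offset whose correction reaches the given original offset. Assumes the
    // map is monotone non-decreasing, as maps built left-to-right are.
    public int uncorrectOffset(int originalOff) {
        int cur = 0;
        while (correctOffset(cur) < originalOff) cur++;
        return cur;
    }
}
```

For instance, a CharFilter expanding the ellipsis in "ab…cd" to "ab...cd" could record addOffCorrectMap(3, -1) and addOffCorrectMap(4, -2); then correctOffset(5) == 3 (the 'c'), and uncorrectOffset(3) == 5 maps it back.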
> For example, because of the current situation, WordDelimiterGraphFilter,
> decompounding filters, and the like cannot assign offsets correctly, so
> e.g. it becomes impossible to highlight exactly the text that corresponds
> to the user query.
>
> Just one example: if I have URLs in some document text, and the analysis
> chain is a Whitespace tokenizer followed by WordDelimiterGraphFilter,
> then a query for "http" will end up highlighting the entire URL.
>
> Do you have an idea how we can address this without making our APIs
> crazy? Or are you just saying we should live with it as it is?
>
> -Mike
>
> On Tue, Jul 31, 2018 at 6:36 AM Robert Muir wrote:
>
>> The problem is not a performance one, it's a complexity thing. Really I
>> think only the tokenizer should be messing with the offsets...
>> They are the ones actually parsing the original content, so it makes
>> sense that they would produce the pointers back to it.
>> I know there are some tokenfilters out there trying to be tokenizers,
>> but we don't need to make our APIs crazy to support that.
>>
>> On Mon, Jul 30, 2018 at 11:53 PM, Michael Sokolov wrote:
>>
>> > Yes, in fact Tokenizer already provides correctOffset, which just
>> > delegates to CharFilter. We could expand on this, moving correctOffset
>> > up to TokenStream, and also adding correct() so that TokenFilters can
>> > add to the character offset data structure (two int arrays) and share
>> > it across the analysis chain.
>> >
>> > Implementation-wise this could continue to delegate to CharFilter, I
>> > guess, but I think it would be better to add a character-offset-map
>> > abstraction that wraps the two int arrays and provides the
>> > correct/correctOffset methods to both TokenStream and CharFilter.
>> > This would let us preserve correct offsets in the face of
>> > manipulations like replacing ellipses, ligatures (like AE, OE), and
>> > trademark symbols (replaced by "tm"), so that we can have the
>> > invariant that, roughly speaking,
>> > correctOffset(OffsetAttribute.startOffset) + CharTermAttribute.length()
>> > == correctOffset(OffsetAttribute.endOffset), and enable
>> > token-splitting with correct offsets.
>> >
>> > I can work up a proof of concept; I don't think it would be too
>> > API-intrusive or change performance in a significant way. Only
>> > TokenFilters that actually care about this (i.e., that insert or
>> > remove characters, or split tokens) would need to change; others
>> > would continue to work as-is.

>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
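[Editor's note: to make the highlighting problem quoted above concrete, here is a Lucene-free simulation of why a token-splitting filter cannot simply add a filtered-space delta to an original-space start offset. The text, the mapping, and the helper names are illustrative assumptions, not Lucene API.]

```java
// Assume a MappingCharFilter replaced "æ" with "ae", so the tokenizer sees
// "daemon-threads" while the original text was "dæmon-threads". A splitting
// filter (a la WordDelimiterGraphFilter) holds the token's offsets in
// *original* coordinates but its text in *filtered* coordinates.
public class SplitOffsetDemo {
    // correctOffset for this one mapping: "ae" at filtered [1,3) replaced
    // "æ" at original [1,2), so filtered offsets >= 3 shift back by one.
    static int correctOffset(int filteredOff) {
        return filteredOff >= 3 ? filteredOff - 1 : filteredOff;
    }

    public static void main(String[] args) {
        String original = "dæmon-threads";   // 13 chars
        String filtered = "daemon-threads";  // 14 chars
        int origEnd = correctOffset(filtered.length()); // 13, token end

        // Split at '-': the part "threads" starts at filtered offset 7.
        int partStartFiltered = filtered.indexOf('-') + 1; // 7

        // Naive: original start (0) + filtered-space delta (7) is off by one.
        int naiveStart = 0 + partStartFiltered;
        // With a correct/uncorrect round trip (uncorrect(0) == 0 here), the
        // delta is applied in filtered space, then corrected back.
        int fixedStart = correctOffset(0 + partStartFiltered); // 6

        System.out.println(original.substring(naiveStart, origEnd)); // hreads (wrong)
        System.out.println(original.substring(fixedStart, origEnd)); // threads
    }
}
```

The same arithmetic goes wrong for "http" in a URL token whenever a CharFilter has changed lengths anywhere inside the token, which is why the thread argues a shared offset map across the analysis chain is needed.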