Date: Tue, 25 Aug 2015 15:29:33 -0700
Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
From: Erick Erickson
To: solr-user@lucene.apache.org

Well, you're going down a path that hasn't been trodden before ;).
If you can treat your primitive types as text types you might get
some traction, but that makes a lot of operations like numeric
comparison difficult.

Hmmmm. another idea from left field. For single-valued types,
what about a sidecar field that has the auth token? And even
for a multiValued field, two parallel fields are guaranteed to
maintain order so perhaps you could do something here. Yes,
I'm waving my hands a LOT here.....

I suspect that trying to have a custom type that incorporates
payloads for, say, trie fields will be "interesting" to say the least.
Numeric types are packed to save storage etc. so it'll be
an adventure..

Best,
Erick

On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson wrote:
> We were originally using this approach, i.e.
> run things through the
> KeywordTokenizer -> DelimitedPayloadFilter -> WordDelimiterFilter. Again
> this works fine for text, though I had wanted to use the StandardTokenizer
> in the chain. Is there an equivalent filter that does what the
> StandardTokenizer does?
>
> All of this said, this doesn't address the issue of the primitive field
> types, which at this point is the bigger issue. Given this use case, should
> there be another way to provide payloads?
>
> My current thinking is that I will need to provide custom implementations
> for all of the field types I would like to support payloads on, which will
> essentially be copies of the standard versions with some extra "sugar" to
> read/write the payloads (I don't see a way to wrap/delegate these at this
> point because AttributeSource has the attribute retrieval related methods
> as final so I can't simply wrap another tokenizer and return my added
> attributes + the wrapped attributes). I know my use case is a bit strange,
> but I had not expected to need to do this given that Lucene/Solr supports
> payloads on these field types; they just aren't exposed.
>
> As always I appreciate any ideas if I'm barking up the wrong tree here.
>
> On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma wrote:
>
>> Well, if I remember correctly (I have no testing facility at hand)
>> WordDelimiterFilter maintains payloads on emitted sub terms. So if you use
>> a KeywordTokenizer, input 'some text^PAYLOAD', and have a
>> DelimitedPayloadFilter, the entire string gets a payload. You can then
>> split that string up again into individual tokens. It is possible to abuse
>> WordDelimiterFilter for it because it has a types parameter that you can
>> use to split it on whitespace if its input is not trimmed. Otherwise you
>> can use any other character instead of a space as your input.
>>
>> This is a crazy idea, but it might work.
>>
>> -----Original message-----
>> > From: Jamie Johnson
>> > Sent: Tuesday 25th August 2015 19:37
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
>> >
>> > To be clear, we are using payloads as a way to attach authorizations to
>> > individual tokens within Solr. The payloads are normal Solr Payloads
>> > though we are not using floats, we are using the identity payload encoder
>> > (org.apache.lucene.analysis.payloads.IdentityEncoder) which allows for
>> > storing a byte[] of our choosing into the payload field.
>> >
>> > This works great for text, but now that I'm indexing more than just text I
>> > need a way to specify the payload on the other field types. Does that make
>> > more sense?
>> >
>> > On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson <erickerickson@gmail.com>
>> > wrote:
>> >
>> > > This really sounds like an XY problem. Or when you use
>> > > "payload" it's not the Solr payload.
>> > >
>> > > So Solr Payloads are a float value that you can attach to
>> > > individual terms to influence the scoring. Attaching the
>> > > _same_ payload to all terms in a field is much the same
>> > > thing as boosting on any matches in the field at query time
>> > > or boosting on the field at index time (this latter assuming
>> > > that different docs would have different boosts).
>> > >
>> > > So can you back up a bit and tell us what you're trying to
>> > > accomplish maybe we can be sure we're both talking about
>> > > the same thing ;)
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > > On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson wrote:
>> > > > I would like to specify a particular payload for all tokens emitted from a
>> > > > tokenizer, but don't see a clear way to do this.
>> > > > Ideally I could specify
>> > > > that something like the DelimitedPayloadTokenFilter be run on the entire
>> > > > field and then standard analysis be done on the rest of the field, so in
>> > > > the case that I had the following text
>> > > >
>> > > > this is a test\Foo
>> > > >
>> > > > I would like to create tokens "this", "is", "a", "test" each with a payload
>> > > > of Foo. From what I'm seeing, though, only "test" gets the payload. Is there
>> > > > any way to accomplish this or will I need to implement a custom tokenizer?
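
The behaviour asked for in the original question can be sketched outside of Lucene. The following is a minimal illustration in plain Python, not Lucene or Solr API: it splits the trailing payload off the whole field value first, tokenizes the remainder, and attaches the same payload bytes to every token. The function name `tokens_with_payload`, the whitespace split (a crude stand-in for StandardTokenizer), and the choice to treat the last backslash as the delimiter are all assumptions made for illustration.

```python
from typing import List, Tuple


def tokens_with_payload(value: str, delimiter: str = "\\") -> List[Tuple[str, bytes]]:
    """Return (token, payload) pairs where every token carries the same payload.

    Hypothetical helper: the payload is whatever follows the LAST occurrence
    of `delimiter` in the field value; the text before it is split on
    whitespace as a stand-in for real analysis.
    """
    text, sep, payload = value.rpartition(delimiter)
    if not sep:
        # No delimiter found: tokenize the whole value with empty payloads.
        return [(token, b"") for token in value.split()]
    # Keep the payload as raw bytes, mirroring the byte[] stored via an
    # identity-style encoder rather than a float.
    payload_bytes = payload.encode("utf-8")
    return [(token, payload_bytes) for token in text.split()]


if __name__ == "__main__":
    for token, payload in tokens_with_payload("this is a test\\Foo"):
        print(token, payload)
```

In an actual analysis chain the same two steps (extract the payload once, then set it on each emitted token) would presumably have to live in a custom tokenizer or token filter, as the thread suggests.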