lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Tokenizers and DelimitedPayloadTokenFilterFactory
Date Tue, 25 Aug 2015 22:29:33 GMT
Well, you're going down a path that hasn't been trodden before ;).

If you can treat your primitive types as text types you might get
some traction, but that makes a lot of operations like numeric
comparison difficult.
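
To illustrate the comparison problem: once numbers are indexed as text, they order character by character rather than by value. A minimal sketch in plain Java (no Solr or Lucene involved; the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Solr or Lucene.
class TextVsNumericOrder {
    // Orders the values the way a plain string comparison would:
    // character by character, left to right.
    static List<String> sortAsText(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Orders the same values by their numeric meaning.
    static List<String> sortAsNumbers(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Comparator.comparingLong(Long::parseLong));
        return copy;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("9", "10", "100", "25");
        System.out.println(sortAsText(values));    // [10, 100, 25, 9]
        System.out.println(sortAsNumbers(values)); // [9, 10, 25, 100]
    }
}
```

Range queries and sorting on a text-typed field would see the first ordering, which is presumably the difficulty being pointed at here.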

Hmmmm. Another idea from left field. For single-valued types,
what about a sidecar field that has the auth token? And even
for a multiValued field, two parallel fields are guaranteed to
maintain order, so perhaps you could do something here. Yes,
I'm waving my hands a LOT here.....
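
A rough sketch of the parallel-field idea, modelled in plain Java rather than against the Solr API. It assumes the values and their auth tokens arrive as two lists in matching order, the way two multiValued fields would if order is preserved; `visibleValues` and the token strings are hypothetical names chosen for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a value field plus a "sidecar" auth field.
class SidecarAuthSketch {
    // Keeps each value whose same-position auth token is in the granted set.
    static List<Long> visibleValues(List<Long> values,
                                    List<String> authTokens,
                                    Set<String> granted) {
        if (values.size() != authTokens.size()) {
            throw new IllegalArgumentException("fields must be parallel");
        }
        List<Long> visible = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            if (granted.contains(authTokens.get(i))) {
                visible.add(values.get(i));
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(10L, 20L, 30L);
        List<String> auths = Arrays.asList("A", "B", "A");
        Set<String> granted = new HashSet<>(Arrays.asList("A"));
        System.out.println(visibleValues(values, auths, granted)); // [10, 30]
    }
}
```

This only covers pairing values with tokens after retrieval. Whether the same pairing can be enforced at match time inside Solr is exactly the hand-waving part.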

I suspect that trying to have a custom type that incorporates
payloads for, say, trie fields will be "interesting" to say the least.
Numeric types are packed to save storage etc. so it'll be
an adventure..

Best,
Erick

On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson <jej2003@gmail.com> wrote:
> We were originally using this approach, i.e. run things through the
> KeywordTokenizer -> DelimitedPayloadFilter -> WordDelimiterFilter.  Again
> this works fine for text, though I had wanted to use the StandardTokenizer
> in the chain.  Is there an equivalent filter that does what the
> StandardTokenizer does?
>
> All of this said, this doesn't address the issue of the primitive field
> types, which at this point is the bigger issue.  Given this use case, should
> there be another way to provide payloads?
>
> My current thinking is that I will need to provide custom implementations
> for all of the field types I would like to support payloads on. These will
> essentially be copies of the standard versions with some extra "sugar" to
> read/write the payloads. I don't see a way to wrap/delegate these at this
> point, because AttributeSource declares its attribute retrieval methods
> as final, so I can't simply wrap another tokenizer and return my added
> attributes + the wrapped attributes.  I know my use case is a bit strange,
> but I had not expected to need to do this, given that Lucene/Solr supports
> payloads on these field types; they just aren't exposed.
>
> As always I appreciate any ideas if I'm barking up the wrong tree here.
>
> On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma <markus.jelsma@openindex.io>
> wrote:
>
>> Well, if I remember correctly (I have no testing facility at hand),
>> WordDelimiterFilter maintains payloads on emitted sub terms. So if you use
>> a KeywordTokenizer, input 'some text^PAYLOAD', and have a
>> DelimitedPayloadFilter, the entire string gets a payload. You can then
>> split that string up again into individual tokens. It is possible to abuse
>> WordDelimiterFilter for this because it has a types parameter that you can
>> use to split on whitespace if its input is not trimmed. Otherwise you
>> can use any other character instead of a space as the separator in your input.
>>
>> This is a crazy idea, but it might work.
>>
>> -----Original message-----
>> > From:Jamie Johnson <jej2003@gmail.com>
>> > Sent: Tuesday 25th August 2015 19:37
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
>> >
>> > To be clear, we are using payloads as a way to attach authorizations to
>> > individual tokens within Solr.  The payloads are normal Solr payloads,
>> > though we are not using floats; we are using the identity payload encoder
>> > (org.apache.lucene.analysis.payloads.IdentityEncoder), which allows for
>> > storing a byte[] of our choosing into the payload field.
>> >
>> > This works great for text, but now that I'm indexing more than just
>> > text I need a way to specify the payload on the other field types.
>> > Does that make more sense?
>> >
>> > On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>> >
>> > > This really sounds like an XY problem. Or when you use
>> > > "payload" it's not the Solr payload.
>> > >
>> > > So Solr Payloads are a float value that you can attach to
>> > > individual terms to influence the scoring. Attaching the
>> > > _same_ payload to all terms in a field is much the same
>> > > thing as boosting on any matches in the field at query time
>> > > or boosting on the field at index time (this latter assuming
>> > > that different docs would have different boosts).
>> > >
>> > > So can you back up a bit and tell us what you're trying to
>> > > accomplish? Then we can be sure we're both talking about
>> > > the same thing ;)
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > > On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson <jej2003@gmail.com> wrote:
>> > > > I would like to specify a particular payload for all tokens emitted
>> > > > from a tokenizer, but don't see a clear way to do this.  Ideally I
>> > > > could specify that something like the DelimitedPayloadTokenFilter be
>> > > > run on the entire field and then standard analysis be done on the
>> > > > rest of the field, so in the case that I had the following text
>> > > >
>> > > > this is a test\Foo
>> > > >
>> > > > I would like to create tokens "this", "is", "a", "test" each with a
>> > > > payload of Foo.  From what I'm seeing though, only "test" gets the
>> > > > payload.  Is there any way to accomplish this or will I need to
>> > > > implement a custom tokenizer?
>> > >
>> >
>>
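
For reference, the behaviour asked for in the original question (one payload at the end of the field, applied to every token) can be modelled in a few lines of plain Java. This is only a sketch of the intended result, not a Lucene TokenFilter: it assumes whitespace tokenization stands in for "standard analysis", that the payload follows the last delimiter in the field, and that the payload bytes are the UTF-8 text as-is (similar in spirit to what the thread says about IdentityEncoder). `FieldPayloadSketch`, `Token`, and `analyze` are hypothetical names:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of "split off the payload, then tokenize the rest".
class FieldPayloadSketch {
    static final class Token {
        final String term;
        final byte[] payload; // null when the field carried no payload

        Token(String term, byte[] payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    // Splits the field at the last delimiter, tokenizes the text before it
    // on whitespace, and gives every token the same payload bytes.
    static List<Token> analyze(String field, char delimiter) {
        int cut = field.lastIndexOf(delimiter);
        String text = cut < 0 ? field : field.substring(0, cut);
        byte[] payload = cut < 0
                ? null
                : field.substring(cut + 1).getBytes(StandardCharsets.UTF_8);

        List<Token> tokens = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                tokens.add(new Token(term, payload));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The example field from the original question.
        for (Token t : analyze("this is a test\\Foo", '\\')) {
            System.out.println(t.term + " -> "
                    + new String(t.payload, StandardCharsets.UTF_8));
        }
    }
}
```

In an actual analysis chain the same effect would have to come from a custom filter or from the KeywordTokenizer, DelimitedPayloadFilter, WordDelimiterFilter arrangement discussed above; the sketch only pins down what the output should look like.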
