From: Tommaso Teofili
Date: Wed, 14 Jun 2017 21:48:50 +0000
Subject: Re: Using POS payloads for chunking
To: java-user@lucene.apache.org

I think it'd be interesting to also investigate using TypeAttribute [1]
together with TypeTokenFilter [2].

Regards,
Tommaso

[1] : https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/tokenattributes/TypeAttribute.html
[2] : https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html

On Wed, 14 Jun 2017 at 23:33 Markus Jelsma <markus.jelsma@openindex.io> wrote:

> Hello Erick, no worries, I recognize you two.
>
> I will take a look at your references tomorrow. Although I am still fine
> with eight bits, I cannot spare more than one. If Lucene allows us to
> pass longer bitsets to the BytesRef, it would be awesome and easy to
> encode.
>
> Thanks!
> Markus
>
> -----Original message-----
> > From: Erick Erickson
> > Sent: Wednesday 14th June 2017 23:29
> > To: java-user
> > Subject: Re: Using POS payloads for chunking
> >
> > Markus:
> >
> > I don't believe that payloads are limited in size at all. LUCENE-7705
> > was done in part because there _was_ a hard-coded 256 limit for some
> > of the tokenizers. A payload (at least in recent versions) is just
> > some bytes attached to the position, and (with LUCENE-7705) can be
> > arbitrarily long.
> >
> > Of course, if you put anything other than a number in there, you have
> > to provide your own decoders and the like to make sense of your
> > payload....
> >
> > Best,
> > Erick (Erickson, not Hatcher)
> >
> > On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma wrote:
> > > Hello Erik,
> > >
> > > Using Solr (though most of the relevant parts are Lucene), we have a
> > > CharFilter that appends treebank tags to whitespace-delimited words
> > > using a delimiter; further down the chain a TokenFilter receives
> > > these tokens with the delimiter and the POS tag. It won't work with
> > > some Tokenizers, and if you put it before WDF it'll split, as you
> > > know. That TokenFilter is configured with a tab-delimited mapping,
> > > and there the bitset is encoded as the payload.
> > >
> > > Our edismax extension rewrites queries to payload-supported
> > > equivalents; this is quite trivial, except for all those API changes
> > > in Lucene you have to put up with. Finally there is a BM25 extension
> > > that has, amongst others, a mapping of bitset to score. Nouns get a
> > > bonus, prepositions and other useless pieces get a penalty, etc.
> > >
> > > Payloads are really great things to use! We also use them to
> > > distinguish between compounds and their subwords (among others, we
> > > serve Dutch- and German-speaking countries), and between stemmed and
> > > non-stemmed words. Although the latter also benefit from IDF
> > > statistics, payloads just help to control boosting more precisely,
> > > regardless of your corpus.
> > >
> > > I still need to take a look at your recent payload QParsers for Solr
> > > and see how different, and probably better, they are compared to our
> > > older implementations. Although we don't use a PayloadTermQParser
> > > equivalent for regular search, we do use it for scoring
> > > recommendations via delimited multi-valued fields. Payloads are
> > > versatile!
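As a side note, a minimal sketch of the kind of TokenFilter Markus describes
above (one that strips a delimiter-appended POS tag from each token and stores
a one-byte bitset as the payload) might look roughly like this. The class name,
the '|' delimiter and the tag-to-bitset map are assumptions for illustration,
not the actual OpenIndex code:

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    /** Strips a delimiter-appended POS tag (e.g. "walk|VB") from each token
     *  and stores a one-byte bitset for that tag as the token's payload. */
    public final class PosPayloadFilter extends TokenFilter {

      private static final char DELIMITER = '|'; // assumed delimiter

      private final Map<String, Byte> tagToBits; // e.g. "NN" -> 0x01, "VB" -> 0x02
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

      public PosPayloadFilter(TokenStream input, Map<String, Byte> tagToBits) {
        super(input);
        this.tagToBits = tagToBits;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        final String term = termAtt.toString();
        final int sep = term.lastIndexOf(DELIMITER);
        if (sep >= 0) {
          // keep only the surface form as the indexed term
          termAtt.setLength(sep);
          final Byte bits = tagToBits.get(term.substring(sep + 1));
          if (bits != null) {
            payloadAtt.setPayload(new BytesRef(new byte[] { bits }));
          }
        }
        return true;
      }
    }

At query time, something like PayloadScoreQuery from the queries module (or a
custom similarity) can map that byte back to a boost, which is roughly the
bitset-to-score mapping described above.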
> > > The downside of payloads is that they are limited to 8 bits. Although
> > > we can easily fit our reduced treebank in there, we also use single
> > > bits to signal compound/subword, stemmed/unstemmed and some others.
> > >
> > > Hope this helps.
> > >
> > > Regards,
> > > Markus
> > >
> > > -----Original message-----
> > >> From: Erik Hatcher
> > >> Sent: Wednesday 14th June 2017 23:03
> > >> To: java-user@lucene.apache.org
> > >> Subject: Re: Using POS payloads for chunking
> > >>
> > >> Markus - how are you encoding payloads as bitsets and using them for
> > >> scoring? Curious to see how folks are leveraging them.
> > >>
> > >> Erik
> > >>
> > >> > On Jun 14, 2017, at 4:45 PM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> > >> >
> > >> > Hello,
> > >> >
> > >> > We use POS-tagging too, and encode the tags as payload bitsets for
> > >> > scoring, which is, as far as I know, the only possibility with
> > >> > payloads.
> > >> >
> > >> > So, instead of encoding them as payloads, why not index your
> > >> > treebank POS tags as tokens on the same position, like synonyms?
> > >> > If you do that, you can use span and phrase queries to find chunks
> > >> > of multiple POS tags.
> > >> >
> > >> > This would be the first approach I can think of. Treating them as
> > >> > regular tokens enables you to use regular search for them.
> > >> >
> > >> > Regards,
> > >> > Markus
> > >> >
> > >> > -----Original message-----
> > >> >> From: José Tomás Atria
> > >> >> Sent: Wednesday 14th June 2017 22:29
> > >> >> To: java-user@lucene.apache.org
> > >> >> Subject: Using POS payloads for chunking
> > >> >>
> > >> >> Hello!
> > >> >>
> > >> >> I'm not particularly familiar with Lucene's search API (as I've
> > >> >> been using the library mostly as a dumb index rather than a
> > >> >> search engine), but I am almost certain that, using its payload
> > >> >> capabilities, it would be trivial to implement a regular chunker
> > >> >> to look for patterns in sequences of payloads.
> > >> >>
> > >> >> (Trying not to be too pedantic: a regular chunker looks for
> > >> >> 'chunks' based on part-of-speech tags, e.g. noun phrases can be
> > >> >> searched for with patterns like "(DT)?(JJ)*(NN|NP)+", that is, an
> > >> >> optional determiner and zero or more adjectives preceding a bunch
> > >> >> of nouns, etc.)
> > >> >>
> > >> >> Assuming my index has POS tags encoded as payloads for each
> > >> >> position, how would one search for such patterns, irrespective of
> > >> >> terms? I started studying the spans search API, as this seemed
> > >> >> like the natural place to start, but I quickly got lost.
> > >> >>
> > >> >> Any tips would be extremely appreciated. (Or references to this
> > >> >> kind of thing, I'm sure someone must have tried something similar
> > >> >> before...)
> > >> >>
> > >> >> thanks!
> > >> >> ~jta
> > >> >> --
> > >> >>
> > >> >> sent from a phone. please excuse terseness and tpyos.
> > >> >>
> > >> >> enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
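On Markus's alternative of indexing the POS tags as extra tokens at the same
positions as the words they tag (synonym-style), a fixed chunk pattern can then
be matched with ordinary span queries. A rough sketch, assuming a field named
"body" and Penn Treebank tags indexed as those extra tokens:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanOrQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public final class PosChunkQueries {

      /** Matches a tiny "JJ (NN|NNS)" chunk: an adjective immediately
       *  followed by a singular or plural noun. */
      public static SpanQuery adjectiveNounChunk(String field) {
        SpanQuery adjective = new SpanTermQuery(new Term(field, "JJ"));
        SpanQuery noun = new SpanOrQuery(
            new SpanTermQuery(new Term(field, "NN")),
            new SpanTermQuery(new Term(field, "NNS")));
        // slop 0 and inOrder=true: the two tags must sit on adjacent positions
        return new SpanNearQuery(new SpanQuery[] { adjective, noun }, 0, true);
      }
    }

Optional or repeated elements of a pattern like "(DT)?(JJ)*(NN|NP)+" are not
directly expressible as a single span query; they would have to be expanded
into SpanOrQuery combinations over bounded variants (or handled with a custom
Spans), so this only covers fixed-length chunk patterns.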
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org