lucene-java-user mailing list archives

From: Markus Jelsma <markus.jel...@openindex.io>
Subject: RE: Using POS payloads for chunking
Date: Wed, 14 Jun 2017 22:05:35 GMT
Hello Tommaso,

These don't propagate to search, right? But they can be used in the analyzer chain! That
would be a better solution than using delimiters on words. The only problem is that TypeTokenFilter
only works on tokens, after the tokenizer. The bonus of a CharFilter is that it sees the whole
text, so OpenNLP can digest it all at once. The downside is that a CharFilter cannot set
TypeAttribute, because there are no tokens yet.

If we tried that option, we would have to build a TokenFilter that understands the whole
text at once, because that is what OpenNLP needs, not single tokens. That is difficult, so
we chose the option of a CharFilter plus a TokenFilter. This is not ideal, but I find it very
hard to digest whole text in a TokenFilter; see Shingle and CommonGrams, which are very
complicated filters.

How would you overcome this problem? For NLP you need all the text at once, which a CharFilter
provides, but a CharFilter won't allow you to set TypeAttribute. Perhaps I am missing something
completely and am just being stupid, probably :)
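
For what it's worth, here is a rough, untested sketch of the buffering TokenFilter I mean: it
drains the entire stream first, hands all the terms to OpenNLP in one call, and then replays
the buffered tokens with TypeAttribute set. The class name is made up, and it holds everything
in memory, which is exactly the part I find hard to do well:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.AttributeSource;

/** Drains the whole stream, tags it with OpenNLP in one call, replays with types set. */
public final class BufferedPosTypeFilter extends TokenFilter {
  private final POSTaggerME tagger; // not thread-safe: one instance per stream
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  private final List<AttributeSource.State> buffered = new ArrayList<>();
  private String[] tags;
  private int pos = -1; // -1 means we have not drained the input yet

  public BufferedPosTypeFilter(TokenStream input, POSModel model) {
    super(input);
    this.tagger = new POSTaggerME(model);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pos == -1) { // first call: buffer every token and its term text
      List<String> terms = new ArrayList<>();
      while (input.incrementToken()) {
        buffered.add(captureState());
        terms.add(termAtt.toString());
      }
      tags = tagger.tag(terms.toArray(new String[0])); // one OpenNLP pass over everything
      pos = 0;
    }
    if (pos >= buffered.size()) {
      return false;
    }
    restoreState(buffered.get(pos)); // replay the buffered token as-is
    typeAtt.setType(tags[pos]);      // attach its POS tag as the token type
    pos++;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    buffered.clear();
    tags = null;
    pos = -1;
  }
}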

Thanks,
Markus
 
-----Original message-----
> From: Tommaso Teofili <tommaso.teofili@gmail.com>
> Sent: Wednesday 14th June 2017 23:49
> To: java-user@lucene.apache.org
> Subject: Re: Using POS payloads for chunking
> 
> I think it'd be interesting to also investigate using TypeAttribute [1]
> together with TypeTokenFilter [2].
> 
> Regards,
> Tommaso
> 
> [1] https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/tokenattributes/TypeAttribute.html
> [2] https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html
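>
> For instance, assuming an upstream filter `tagged` that sets the token type to the POS tag,
> something like this should keep only nouns (a sketch from memory; if I recall correctly
> TypeTokenFilter has a whitelist mode):
>
> import java.util.Arrays;
> import java.util.HashSet;
> import java.util.Set;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.core.TypeTokenFilter;
>
> // keep only tokens whose TypeAttribute is one of the noun tags
> Set<String> nounTags = new HashSet<>(Arrays.asList("NN", "NNS", "NNP", "NNPS"));
> TokenStream nouns = new TypeTokenFilter(tagged, nounTags, true); // true = whitelist mode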
> 
> On Wed, 14 Jun 2017 at 23:49 Markus Jelsma <markus.jelsma@openindex.io> wrote:
> 
> > Hello Erick, no worries, I can tell the two of you apart.
> >
> > I will take a look at your references tomorrow. Although I am still fine
> > with eight bits, I have only one bit left to spare. If Lucene allows us to
> > pass longer bitsets to the BytesRef, that would be awesome and easy to encode.
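> >
> > For example, something like this is what I'd hope for (untested; the flag
> > constants and posBits are made up, payloadAtt is the filter's PayloadAttribute):
> >
> > // pack a 16-bit bitset into a two-byte payload via PayloadAttribute
> > int bits = posBits | COMPOUND_BIT | STEMMED_BIT;
> > byte[] packed = { (byte) (bits >>> 8), (byte) bits };
> > payloadAtt.setPayload(new BytesRef(packed));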
> >
> > Thanks!
> > Markus
> >
> > -----Original message-----
> > > From: Erick Erickson <erickerickson@gmail.com>
> > > Sent: Wednesday 14th June 2017 23:29
> > > To: java-user <java-user@lucene.apache.org>
> > > Subject: Re: Using POS payloads for chunking
> > >
> > > Markus:
> > >
> > > I don't believe that payloads are limited in size at all. LUCENE-7705
> > > was done in part because there _was_ a hard-coded 256 limit for some
> > > of the tokenizers. A payload (at least in recent versions) is just
> > > some bytes, and (with LUCENE-7705) can be arbitrarily long.
> > >
> > > Of course, if you put anything other than a number in there, you have to
> > > provide your own decoders and the like to make sense of your
> > > payload....
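> > >
> > > For example (just a sketch), two payload bytes written at index time come
> > > back out of a PostingsEnum like this:
> > >
> > > // decode two payload bytes back into an int, mirroring the encoder
> > > BytesRef p = postingsEnum.getPayload(); // null if this position has no payload
> > > int bits = ((p.bytes[p.offset] & 0xFF) << 8) | (p.bytes[p.offset + 1] & 0xFF);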
> > >
> > > Best,
> > > Erick (Erickson, not Hatcher)
> > >
> > > On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma
> > > <markus.jelsma@openindex.io> wrote:
> > > > Hello Erik,
> > > >
> > > > We use Solr, though most of the parts involved are actually Lucene. A CharFilter
> > > > appends treebank tags to whitespace-delimited words using a delimiter, so further
> > > > on we receive these tokens carrying the delimiter and the POS tag. It won't work
> > > > with some tokenizers, and if you put it before WDF it'll split, as you know. That
> > > > TokenFilter is configured with a tab-delimited mapping file of <POS-tag>\t<bitset>
> > > > entries, and there the bitset is encoded as the payload.
> > > >
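> > > > In rough code, the TokenFilter side looks something like this (an untested
> > > > sketch; the class name, the '|' delimiter and the map are placeholders for
> > > > our actual setup):
> > > >
> > > > import java.io.IOException;
> > > > import java.util.Map;
> > > > import org.apache.lucene.analysis.TokenFilter;
> > > > import org.apache.lucene.analysis.TokenStream;
> > > > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> > > > import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
> > > > import org.apache.lucene.util.BytesRef;
> > > >
> > > > /** Splits "word|TAG" tokens and stores the mapped bitset as a one-byte payload. */
> > > > public final class PosPayloadFilter extends TokenFilter {
> > > >   private final Map<String, Byte> tagToBits; // parsed from the <POS-tag>\t<bitset> file
> > > >   private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
> > > >   private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
> > > >
> > > >   public PosPayloadFilter(TokenStream input, Map<String, Byte> tagToBits) {
> > > >     super(input);
> > > >     this.tagToBits = tagToBits;
> > > >   }
> > > >
> > > >   @Override
> > > >   public boolean incrementToken() throws IOException {
> > > >     if (!input.incrementToken()) {
> > > >       return false;
> > > >     }
> > > >     String term = termAtt.toString();
> > > >     int i = term.lastIndexOf('|'); // delimiter the CharFilter appended
> > > >     if (i >= 0) {
> > > >       Byte bits = tagToBits.get(term.substring(i + 1));
> > > >       if (bits != null) {
> > > >         payloadAtt.setPayload(new BytesRef(new byte[] { bits }));
> > > >       }
> > > >       termAtt.setLength(i); // strip the "|TAG" suffix, keep the word
> > > >     }
> > > >     return true;
> > > >   }
> > > > }
> > > >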
> > > > Our edismax extension rewrites queries to payload-supporting equivalents; this is
> > > > quite trivial, except for all those API changes in Lucene you have to put up with.
> > > > Finally, a BM25 extension has, amongst other things, a mapping of bitset to score:
> > > > nouns get a bonus, prepositions and other less useful pieces get a penalty, etc.
> > > >
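> > > > The scoring side then reduces to a small lookup; a sketch of the kind of hook
> > > > our BM25 extension uses (the bit masks and boost values here are made up):
> > > >
> > > > // inside the Similarity subclass: map the payload bitset to a score multiplier
> > > > static final int NOUN_BIT = 1;
> > > > static final int PREP_BIT = 1 << 1;
> > > >
> > > > static float payloadBoost(BytesRef payload) {
> > > >   int bits = payload.bytes[payload.offset] & 0xFF;
> > > >   if ((bits & NOUN_BIT) != 0) return 1.5f; // nouns get a bonus
> > > >   if ((bits & PREP_BIT) != 0) return 0.4f; // prepositions get punished
> > > >   return 1.0f;
> > > > }
> > > >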
> > > > Payloads are really great things to use! We also use them to distinguish between
> > > > compounds and their subwords (among other things, we serve Dutch- and German-speaking
> > > > countries), and between stemmed and non-stemmed words. Although the latter also
> > > > benefit from IDF statistics, payloads just help to control boosting more precisely,
> > > > regardless of your corpus.
> > > >
> > > > I still need to take a look at your recent payload QParsers for Solr and see how
> > > > different, and probably better, they are compared to our older implementations.
> > > > Although we don't use a PayloadTermQParser equivalent for regular search, we do use
> > > > it for scoring recommendations via delimited multi-valued fields. Payloads are
> > > > versatile!
> > > >
> > > > The downside of payloads is that they are limited to 8 bits. Although we can easily
> > > > fit our reduced treebank in there, we also use single bits to signal compound/subword,
> > > > stemmed/unstemmed and some others.
> > > >
> > > > Hope this helps.
> > > >
> > > > Regards,
> > > > Markus
> > > >
> > > > -----Original message-----
> > > >> From: Erik Hatcher <erik.hatcher@gmail.com>
> > > >> Sent: Wednesday 14th June 2017 23:03
> > > >> To: java-user@lucene.apache.org
> > > >> Subject: Re: Using POS payloads for chunking
> > > >>
> > > >> Markus - how are you encoding payloads as bitsets and using them for
> > > >> scoring? Curious to see how folks are leveraging them.
> > > >>
> > > >>       Erik
> > > >>
> > > >> > On Jun 14, 2017, at 4:45 PM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> > > >> >
> > > >> > Hello,
> > > >> >
> > > >> > We use POS-tagging too, and encode the tags as payload bitsets for
> > > >> > scoring, which is, as far as I know, the only possibility with payloads.
> > > >> >
> > > >> > So, instead of encoding them as payloads, why not index your treebank
> > > >> > POS tags as tokens at the same position, like synonyms? If you do that,
> > > >> > you can use span and phrase queries to find chunks of multiple POS tags.
> > > >> >
> > > >> > This would be the first approach I can think of. Treating them as
> > > >> > regular tokens enables you to use regular search for them.
> > > >> >
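> > > >> > For instance, an "adjective followed by noun" chunk then becomes an
> > > >> > ordered, zero-slop span query over the tag tokens (a sketch; the field
> > > >> > name is made up):
> > > >> >
> > > >> > import org.apache.lucene.index.Term;
> > > >> > import org.apache.lucene.search.spans.SpanNearQuery;
> > > >> > import org.apache.lucene.search.spans.SpanQuery;
> > > >> > import org.apache.lucene.search.spans.SpanTermQuery;
> > > >> >
> > > >> > // JJ immediately followed by NN, matching the tag tokens at word positions
> > > >> > SpanQuery jj = new SpanTermQuery(new Term("body", "JJ"));
> > > >> > SpanQuery nn = new SpanTermQuery(new Term("body", "NN"));
> > > >> > SpanQuery chunk = new SpanNearQuery(new SpanQuery[] { jj, nn }, 0, true);
> > > >> >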
> > > >> > Regards,
> > > >> > Markus
> > > >> >
> > > >> >
> > > >> >
> > > >> > -----Original message-----
> > > >> >> From: José Tomás Atria <jtatria@gmail.com>
> > > >> >> Sent: Wednesday 14th June 2017 22:29
> > > >> >> To: java-user@lucene.apache.org
> > > >> >> Subject: Using POS payloads for chunking
> > > >> >>
> > > >> >> Hello!
> > > >> >>
> > > >> >> I'm not particularly familiar with lucene's search api (as I've been using
> > > >> >> the library mostly as a dumb index rather than a search engine), but I am
> > > >> >> almost certain that, using its payload capabilities, it would be trivial to
> > > >> >> implement a regular chunker to look for patterns in sequences of payloads.
> > > >> >>
> > > >> >> (trying not to be too pedantic: a regular chunker looks for 'chunks' based
> > > >> >> on part-of-speech tags, e.g. noun phrases can be searched for with patterns
> > > >> >> like "(DT)?(JJ)*(NN|NP)+", that is, an optional determiner and zero or more
> > > >> >> adjectives preceding a bunch of nouns, etc)
> > > >> >>
> > > >> >> Assuming my index has POS tags encoded as payloads for each position, how
> > > >> >> would one search for such patterns, irrespective of terms? I started studying
> > > >> >> the spans search API, as this seemed like the natural place to start, but I
> > > >> >> quickly got lost.
> > > >> >>
> > > >> >> Any tips would be extremely appreciated. (or references to this kind of
> > > >> >> thing, I'm sure someone must have tried something similar before...)
> > > >> >>
> > > >> >> thanks!
> > > >> >> ~jta
> > > >> >> --
> > > >> >>
> > > >> >> sent from a phone. please excuse terseness and tpyos.
> > > >> >>
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

