lucene-java-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: Using POS payloads for chunking
Date Wed, 14 Jun 2017 21:22:12 GMT
Hello Erik,

We use Solr, although most of the parts are actually Lucene. A CharFilter adds treebank tags
to each whitespace-delimited word using a delimiter, so further down the chain we get tokens
that carry the delimiter and the POS tag. This won't work with some Tokenizers, and if you put
it before WDF, the token will be split, as you know. A TokenFilter then handles these tokens;
it is configured with a tab-delimited mapping file containing <POS-tag>\t<bitset>, and there
the bitset is encoded as the payload.
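
A minimal sketch of that mapping step, written in plain Python rather than as a Lucene
TokenFilter. The `|` delimiter, the function names, and the example bit values are assumptions
for illustration only, not the actual implementation:

```python
# Hypothetical illustration: the delimiter and all names are assumptions.
DELIMITER = "|"


def load_mapping(text):
    """Parse lines of '<POS-tag>\t<bitset>' into a dict of tag -> int."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        tag, bits = line.split("\t")
        mapping[tag] = int(bits, 2)  # bitset written as a binary string
    return mapping


def split_token(token, mapping, delimiter=DELIMITER):
    """Split 'word|TAG' into (term, payload); payload is a single byte or None."""
    term, sep, tag = token.rpartition(delimiter)
    if not sep:
        return token, None  # no delimiter, so no tag was attached
    bits = mapping.get(tag)
    if bits is None:
        return term, None  # unknown tag: keep the term, drop the tag
    return term, bytes([bits])


mapping = load_mapping("NN\t00000001\nIN\t00000010\n")
print(split_token("fiets|NN", mapping))
```

In a real analysis chain the payload would be set on the token's payload attribute instead of
being returned as a tuple.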

Our edismax extension rewrites queries to their payload-supporting equivalents. This is quite
trivial, except for all the API changes in Lucene you have to put up with. Finally, there is a
BM25 extension that has, amongst other things, a mapping of bitset to score. Nouns get a bonus,
prepositions and other useless pieces get a penalty, etc.
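
The bitset-to-score idea could look roughly like the sketch below. It is plain Python, not a
Lucene Similarity, and the tag values and factors are made-up assumptions; the original does
not state which values are used:

```python
# Hypothetical tag values and factors, for illustration only.
NOUN = 0b0001
PREPOSITION = 0b0010

PAYLOAD_FACTORS = {
    NOUN: 1.5,         # bonus
    PREPOSITION: 0.5,  # penalty
}


def payload_factor(payload):
    """Return the multiplier for a one-byte payload, or 1.0 if absent/unknown."""
    if not payload:
        return 1.0
    return PAYLOAD_FACTORS.get(payload[0], 1.0)


def adjusted_score(base_score, payload):
    """Scale a base (e.g. BM25) score by the factor mapped from the payload."""
    return base_score * payload_factor(payload)


print(adjusted_score(2.0, bytes([NOUN])))
```

Tokens without a payload, or with an unmapped one, keep their base score.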

Payloads are really great things to use! We also use them to distinguish between compounds and
their subwords (among others, we serve Dutch- and German-speaking countries), and between
stemmed and non-stemmed words. Although the latter also benefit from IDF statistics, payloads
just help to control boosting more precisely, regardless of your corpus.

I still need to take a look at your recent payload QParsers for Solr and see how different,
and probably better, they are compared to our older implementations. Although we don't use a
PayloadTermQParser equivalent for regular search, we do use one for scoring recommendations
via delimited multi-valued fields. Payloads are versatile!

The downside of payloads is that they are limited to 8 bits. Although we can easily fit our
reduced treebank in there, we also use single bits to signal compound/subword,
stemmed/unstemmed, and some others.
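
To make the 8-bit budget concrete, here is one possible layout, sketched in Python. The split
(5 bits for the tag id, one bit each for subword and stemmed) is purely an assumption; the
actual layout is not described here:

```python
# Assumed layout of the single payload byte (hypothetical):
#   bits 0-4: POS tag id (0..31)
#   bit 5:    token is a subword of a compound
#   bit 6:    token is a stemmed form
TAG_BITS = 5
TAG_MASK = (1 << TAG_BITS) - 1  # 0b00011111
FLAG_SUBWORD = 1 << 5
FLAG_STEMMED = 1 << 6


def pack(tag_id, subword=False, stemmed=False):
    """Combine a tag id and two flags into one integer that fits in a byte."""
    if not 0 <= tag_id <= TAG_MASK:
        raise ValueError("tag id does not fit in %d bits" % TAG_BITS)
    value = tag_id
    if subword:
        value |= FLAG_SUBWORD
    if stemmed:
        value |= FLAG_STEMMED
    return value


def unpack(value):
    """Return (tag_id, subword, stemmed) from a packed payload value."""
    return value & TAG_MASK, bool(value & FLAG_SUBWORD), bool(value & FLAG_STEMMED)


print(unpack(pack(3, stemmed=True)))
```

With this layout a reduced tag set of up to 32 tags leaves three bits for flags.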

Hope this helps.

Regards,
Markus

-----Original message-----
> From:Erik Hatcher <erik.hatcher@gmail.com>
> Sent: Wednesday 14th June 2017 23:03
> To: java-user@lucene.apache.org
> Subject: Re: Using POS payloads for chunking
> 
> Markus - how are you encoding payloads as bitsets and using them for scoring? Curious
> to see how folks are leveraging them.
> 
> 	Erik
> 
> > On Jun 14, 2017, at 4:45 PM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> > 
> > Hello,
> > 
> > We use POS-tagging too, and encode the tags as payload bitsets for scoring, which is,
> > as far as I know, the only possibility with payloads.
> > 
> > So, instead of encoding them as payloads, why not index your treebank POS tags
> > as tokens at the same position, like synonyms? If you do that, you can use span and phrase
> > queries to find chunks of multiple POS tags.
> > 
> > This would be the first approach I can think of. Treating them as regular tokens
> > enables you to use regular search for them.
> > 
> > Regards,
> > Markus
> > 
> > 
> > 
> > -----Original message-----
> >> From:José Tomás Atria <jtatria@gmail.com>
> >> Sent: Wednesday 14th June 2017 22:29
> >> To: java-user@lucene.apache.org
> >> Subject: Using POS payloads for chunking
> >> 
> >> Hello!
> >> 
> >> I'm not particularly familiar with lucene's search api (as I've been using
> >> the library mostly as a dumb index rather than a search engine), but I am
> >> almost certain that, using its payload capabilities, it would be trivial to
> >> implement a regular chunker to look for patterns in sequences of payloads.
> >> 
> >> (trying not to be too pedantic, a regular chunker looks for 'chunks' based
> >> on part-of-speech tags, e.g. noun phrases can be searched for with patterns
> >> like "(DT)?(JJ)*(NN|NP)+", that is, an optional determiner and zero or
> >> more adjectives preceding a bunch of nouns, etc)
> >> 
> >> Assuming my index has POS tags encoded as payloads for each position, how
> >> would one search for such patterns, irrespective of terms? I started
> >> studying the spans search API, as this seemed like the natural place to
> >> start, but I quickly got lost.
> >> 
> >> Any tips would be extremely appreciated. (or references to this kind of
> >> thing, I'm sure someone must have tried something similar before...)
> >> 
> >> thanks!
> >> ~jta
> >> -- 
> >> 
> >> sent from a phone. please excuse terseness and tpyos.
> >> 
> >> enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
> >> 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> > 