lucene-java-user mailing list archives

From José Tomás Atria <jtat...@gmail.com>
Subject Re: Using POS payloads for chunking
Date Thu, 15 Jun 2017 17:37:12 GMT
Ah, good to know!

I'm actually using lower level calls, as I'm building the TokenStream by
hand from UIMA annotations and not using any analyzer, but I'll keep that
in mind for future projects. Thanks!

On Thu, Jun 15, 2017 at 12:10 PM Erick Erickson <erickerickson@gmail.com>
wrote:

> José:
>
> Do note that, while the byte array isn't limited, prior to LUCENE-7705
> most of the tokenizers you would use limited the incoming token to 256
> characters at most. This is not a _Lucene_ limitation at a low level;
> rather, if you're indexing data with a delimited payload (say
> abc|your_payload_here), the tokenizer would chop it off when the whole
> thing reached 256 chars.
>
> Hmmm, still confusing. Say the input to the analysis chain was
> abc|512_bytes_of_payload_data
> The tokenizer would give you
>
> abc|first_252_bytes
>
> But if you're using lower-level Lucene calls directly that limit doesn't
> apply.
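
A minimal sketch of the truncation described above, in plain Java with no Lucene dependency. The 256-character cap is the figure quoted in this thread, and `truncate`, `splitTermAndPayload` and `repeat` are hypothetical helpers, not Lucene APIs; the real tokenizers may treat over-long tokens differently.

```java
public class DelimitedPayloadTruncation {
    // Assumed cap, taken from the number quoted in this thread (not read from Lucene).
    static final int MAX_TOKEN_CHARS = 256;

    // Keeps at most MAX_TOKEN_CHARS characters of the raw token.
    static String truncate(String token) {
        return token.length() <= MAX_TOKEN_CHARS ? token : token.substring(0, MAX_TOKEN_CHARS);
    }

    // Splits "term|payload" at the first delimiter; the payload is empty if none is found.
    static String[] splitTermAndPayload(String token, char delimiter) {
        int idx = token.indexOf(delimiter);
        if (idx < 0) {
            return new String[] { token, "" };
        }
        return new String[] { token.substring(0, idx), token.substring(idx + 1) };
    }

    // Builds a string of n copies of c (stand-in for real payload data).
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "abc|" + repeat('x', 512);
        String[] parts = splitTermAndPayload(truncate(raw), '|');
        System.out.println(parts[0] + " carries " + parts[1].length() + " payload chars");
    }
}
```

With a 512-character payload after `abc|`, only 252 payload characters survive, because the term and the delimiter use up the first four.
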
>
> Best,
> Erick
>
> On Thu, Jun 15, 2017 at 8:21 AM, José Tomás Atria <jtatria@gmail.com>
> wrote:
> > Hi Markus, thanks for your response!
> >
> > Now I feel stupid, that is clearly a much simpler approach and it has the
> > added benefits that it would not require me to meddle into the scoring
> > process, which I'm still a bit terrified of. Thanks for the tip.
> >
> > I guess the question is still valid though, i.e. how would one take
> > payloads into account when scoring entire spans? Does this make sense at
> > all? Any links to a more-or-less straightforward example?
> >
> > On the length of payloads: I understood that you have other restrictions,
> > but payloads take a BytesRef as value, so you can encode arbitrary data in
> > them as long as you encode and decode properly. E.g. you could encode the
> > long array that backs a fixed bitset as a BytesRef and pass that, though
> > I'm not sure it would be efficient unless you have at least 64 flags.
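
One way to picture the encoding mentioned above, as a sketch only: it uses `java.util.BitSet` as a stand-in for Lucene's fixed bitset, and the class and method names are hypothetical. The resulting byte array is what one would wrap in a `BytesRef` before attaching it as a payload.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

public class BitSetPayloadCodec {
    // Serializes the long words backing the bitset, 8 bytes per word.
    static byte[] encode(BitSet flags) {
        long[] words = flags.toLongArray();
        ByteBuffer buf = ByteBuffer.allocate(words.length * Long.BYTES);
        for (long word : words) {
            buf.putLong(word);
        }
        return buf.array();
    }

    // Rebuilds the bitset from bytes produced by encode().
    static BitSet decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long[] words = new long[bytes.length / Long.BYTES];
        for (int i = 0; i < words.length; i++) {
            words[i] = buf.getLong();
        }
        return BitSet.valueOf(words);
    }

    public static void main(String[] args) {
        BitSet flags = new BitSet();
        flags.set(3);   // hypothetical flag, e.g. "is a noun"
        flags.set(70);  // a flag beyond the first 64 forces a second word
        byte[] payload = encode(flags);
        System.out.println(payload.length + " bytes, round trip ok: " + decode(payload).equals(flags));
    }
}
```

Even a single flag costs a full 8-byte word in this layout, which is the efficiency concern raised above.
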
> >
> > thanks!
> > jta
> >
> >
> >
> > On Wed, Jun 14, 2017 at 4:45 PM Markus Jelsma <markus.jelsma@openindex.io>
> > wrote:
> >
> >> Hello,
> >>
> >> We use POS-tagging too, and encode the tags as payload bitsets for scoring,
> >> which is, as far as I know, the only possibility with payloads.
> >>
> >> So, instead of encoding them as payloads, why not index your treebank's
> >> POS tags as tokens on the same position, like synonyms? If you do that, you
> >> can use spans and phrase queries to find chunks of multiple POS tags.
> >>
> >> This would be the first approach I can think of. Treating them as regular
> >> tokens enables you to use regular search for them.
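
To make the idea of stacked tokens concrete, here is a toy model in plain Java. It is not Lucene code: the positional map, the `index` and `phrase` methods, and the `POS_` prefix used to keep tags apart from words are all assumptions for illustration. In Lucene itself, stacking is done by emitting the extra token with a position increment of 0.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class StackedPosTokens {
    // tokens[pos] lists every term stored at that position (the word and its POS tag).
    static Map<String, Set<Integer>> index(String[][] tokens) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            for (String term : tokens[pos]) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(pos);
            }
        }
        return postings;
    }

    // Returns the start positions where the given terms occur at consecutive positions.
    static List<Integer> phrase(Map<String, Set<Integer>> postings, String... terms) {
        List<Integer> hits = new ArrayList<>();
        Set<Integer> starts = postings.getOrDefault(terms[0], Collections.<Integer>emptySet());
        for (int start : starts) {
            boolean matches = true;
            for (int i = 1; i < terms.length && matches; i++) {
                Set<Integer> positions = postings.getOrDefault(terms[i], Collections.<Integer>emptySet());
                matches = positions.contains(start + i);
            }
            if (matches) {
                hits.add(start);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[][] sentence = {
            { "the", "POS_DT" }, { "quick", "POS_JJ" }, { "fox", "POS_NN" }, { "runs", "POS_VBZ" }
        };
        Map<String, Set<Integer>> postings = index(sentence);
        System.out.println(phrase(postings, "POS_DT", "POS_JJ", "POS_NN"));
        System.out.println(phrase(postings, "quick", "POS_NN"));
    }
}
```

Because words and tags share positions, a phrase can mix the two, as the second query in `main` does.
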
> >>
> >> Regards,
> >> Markus
> >>
> >>
> >>
> >> -----Original message-----
> >> > From: José Tomás Atria <jtatria@gmail.com>
> >> > Sent: Wednesday 14th June 2017 22:29
> >> > To: java-user@lucene.apache.org
> >> > Subject: Using POS payloads for chunking
> >> >
> >> > Hello!
> >> >
> >> > I'm not particularly familiar with Lucene's search API (as I've been using
> >> > the library mostly as a dumb index rather than a search engine), but I am
> >> > almost certain that, using its payload capabilities, it would be trivial to
> >> > implement a regular chunker to look for patterns in sequences of payloads.
> >> >
> >> > (trying not to be too pedantic, a regular chunker looks for 'chunks' based
> >> > on part-of-speech tags, e.g. noun phrases can be searched for with patterns
> >> > like "(DT)?(JJ)*(NN|NP)+", that is, an optional determiner and zero or
> >> > more adjectives preceding a bunch of nouns, etc)
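
For readers unfamiliar with regular chunkers, the pattern above can be tried on an in-memory tag sequence with `java.util.regex`. This is only a sketch of what the pattern matches, not a way to query a Lucene index; the class name, the space-separated encoding of the tags, and the `chunk` helper are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexChunker {
    // Every tag is followed by one space, and the lookbehind forces a match to
    // start at a tag boundary, so only whole tags can match.
    static final Pattern NOUN_PHRASE =
        Pattern.compile("(?<!\\S)(DT )?(JJ )*((NN|NP) )+");

    // Returns {firstToken, endTokenExclusive} for each noun-phrase chunk.
    static List<int[]> chunk(String[] tags) {
        StringBuilder sb = new StringBuilder();
        int[] offsets = new int[tags.length + 1]; // char offset where each tag starts
        for (int i = 0; i < tags.length; i++) {
            offsets[i] = sb.length();
            sb.append(tags[i]).append(' ');
        }
        offsets[tags.length] = sb.length();

        List<int[]> chunks = new ArrayList<>();
        Matcher m = NOUN_PHRASE.matcher(sb);
        while (m.find()) {
            // Matches begin and end on tag boundaries, so both offsets are present.
            int start = Arrays.binarySearch(offsets, m.start());
            int end = Arrays.binarySearch(offsets, m.end());
            chunks.add(new int[] { start, end });
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] tags = { "DT", "JJ", "NN", "VBZ", "DT", "NN", "NN" };
        for (int[] c : chunk(tags)) {
            System.out.println("tokens " + c[0] + " to " + (c[1] - 1));
        }
    }
}
```
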
> >> >
> >> > Assuming my index has POS tags encoded as payloads for each position, how
> >> > would one search for such patterns, irrespective of terms? I started
> >> > studying the spans search API, as this seemed like the natural place to
> >> > start, but I quickly got lost.
> >> >
> >> > Any tips would be extremely appreciated. (or references to this kind of
> >> > thing, I'm sure someone must have tried something similar before...)
> >> >
> >> > thanks!
> >> > ~jta
> >> > --
> >> >
> >> > sent from a phone. please excuse terseness and tpyos.
> >> >
> >> > enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>

sent from a phone. please excuse terseness and tpyos.

enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
