From: Tommaso Teofili
Date: Wed, 14 Jun 2017 21:48:50 +0000
Subject: Re: Using POS payloads for chunking
To: java-user@lucene.apache.org

I think it'd be interesting to also investigate using TypeAttribute [1]
together with TypeTokenFilter [2].

Regards,
Tommaso

[1] : https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/tokenattributes/TypeAttribute.html
[2] : https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html

On Wed, 14 Jun 2017 at 23:33 Markus Jelsma <markus.jelsma@openindex.io> wrote:

> Hello Erick, no worries, I recognize you two.
>
> I will take a look at your references tomorrow. Although I am still fine
> with eight bits, I cannot spare more than one. If Lucene allows us to
> pass longer bitsets to the BytesRef, it would be awesome and easy to
> encode.
>
> Thanks!
> Markus
>
> -----Original message-----
> > From: Erick Erickson
> > Sent: Wednesday 14th June 2017 23:29
> > To: java-user
> > Subject: Re: Using POS payloads for chunking
> >
> > Markus:
> >
> > I don't believe that payloads are limited in size at all. LUCENE-7705
> > was done in part because there _was_ a hard-coded 256 limit for some
> > of the tokenizers. A payload (at least in recent versions) is just
> > some bytes attached to the position, and (with LUCENE-7705) can be
> > arbitrarily long.
> >
> > Of course, if you put anything other than a number in there, you have
> > to provide your own decoders and the like to make sense of your
> > payload....
> >
> > Best,
> > Erick (Erickson, not Hatcher)
> >
> > On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma wrote:
> > > Hello Erik,
> > >
> > > Using Solr (though most of the relevant parts are Lucene), we have a
> > > CharFilter that appends treebank tags to whitespace-delimited words
> > > using a delimiter; further down the chain a TokenFilter receives
> > > these tokens with the delimiter and the POS tag. It won't work with
> > > some Tokenizers, and if you put it before WDF it'll split, as you
> > > know. That TokenFilter is configured with a tab-delimited mapping,
> > > and there the bitset is encoded as the payload.
> > >
> > > Our edismax extension rewrites queries to payload-supported
> > > equivalents; this is quite trivial, except for all those API changes
> > > in Lucene you have to put up with. Finally there is a BM25 extension
> > > that has, amongst others, a mapping of bitset to score. Nouns get a
> > > bonus, prepositions and other useless pieces get a penalty, etc.
> > >
> > > Payloads are really great things to use! We also use them to
> > > distinguish between compounds and their subwords (among others, we
> > > serve Dutch- and German-speaking countries), and between stemmed and
> > > non-stemmed words. Although the latter also benefit from IDF
> > > statistics, payloads just help to control boosting more precisely,
> > > regardless of your corpus.
> > >
> > > I still need to take a look at your recent payload QParsers for Solr
> > > and see how different, and probably better, they are compared to our
> > > older implementations. Although we don't use a PayloadTermQParser
> > > equivalent for regular search, we do use it for scoring
> > > recommendations via delimited multi-valued fields. Payloads are
> > > versatile!
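As a side note, a minimal sketch of the kind of TokenFilter Markus describes
above (one that strips a delimiter-appended POS tag from each token and stores
a one-byte bitset as the payload) might look roughly like this. The class name,
the '|' delimiter and the tag-to-bitset map are assumptions for illustration,
not the actual OpenIndex code:

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    /** Strips a delimiter-appended POS tag (e.g. "walk|VB") from each token
     *  and stores a one-byte bitset for that tag as the token's payload. */
    public final class PosPayloadFilter extends TokenFilter {

      private static final char DELIMITER = '|'; // assumed delimiter

      private final Map<String, Byte> tagToBits; // e.g. "NN" -> 0x01, "VB" -> 0x02
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

      public PosPayloadFilter(TokenStream input, Map<String, Byte> tagToBits) {
        super(input);
        this.tagToBits = tagToBits;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        final String term = termAtt.toString();
        final int sep = term.lastIndexOf(DELIMITER);
        if (sep >= 0) {
          // keep only the surface form as the indexed term
          termAtt.setLength(sep);
          final Byte bits = tagToBits.get(term.substring(sep + 1));
          if (bits != null) {
            payloadAtt.setPayload(new BytesRef(new byte[] { bits }));
          }
        }
        return true;
      }
    }

At query time, something like PayloadScoreQuery from the queries module (or a
custom similarity) can map that byte back to a boost, which is roughly the
bitset-to-score mapping described above.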
> > > The downside of payloads is that they are limited to 8 bits. Although
> > > we can easily fit our reduced treebank in there, we also use single
> > > bits to signal compound/subword, stemmed/unstemmed and some others.
> > >
> > > Hope this helps.
> > >
> > > Regards,
> > > Markus
> > >
> > > -----Original message-----
> > >> From: Erik Hatcher
> > >> Sent: Wednesday 14th June 2017 23:03
> > >> To: java-user@lucene.apache.org
> > >> Subject: Re: Using POS payloads for chunking
> > >>
> > >> Markus - how are you encoding payloads as bitsets and using them for
> > >> scoring? Curious to see how folks are leveraging them.
> > >>
> > >> Erik
> > >>
> > >> > On Jun 14, 2017, at 4:45 PM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> > >> >
> > >> > Hello,
> > >> >
> > >> > We use POS-tagging too, and encode the tags as payload bitsets for
> > >> > scoring, which is, as far as I know, the only possibility with
> > >> > payloads.
> > >> >
> > >> > So, instead of encoding them as payloads, why not index your
> > >> > treebank POS tags as tokens on the same position, like synonyms?
> > >> > If you do that, you can use span and phrase queries to find chunks
> > >> > of multiple POS tags.
> > >> >
> > >> > This would be the first approach I can think of. Treating them as
> > >> > regular tokens enables you to use regular search for them.
> > >> >
> > >> > Regards,
> > >> > Markus
> > >> >
> > >> > -----Original message-----
> > >> >> From: José Tomás Atria
> > >> >> Sent: Wednesday 14th June 2017 22:29
> > >> >> To: java-user@lucene.apache.org
> > >> >> Subject: Using POS payloads for chunking
> > >> >>
> > >> >> Hello!
> > >> >>
> > >> >> I'm not particularly familiar with Lucene's search API (as I've
> > >> >> been using the library mostly as a dumb index rather than a
> > >> >> search engine), but I am almost certain that, using its payload
> > >> >> capabilities, it would be trivial to implement a regular chunker
> > >> >> to look for patterns in sequences of payloads.
> > >> >>
> > >> >> (Trying not to be too pedantic: a regular chunker looks for
> > >> >> 'chunks' based on part-of-speech tags, e.g. noun phrases can be
> > >> >> searched for with patterns like "(DT)?(JJ)*(NN|NP)+", that is, an
> > >> >> optional determiner and zero or more adjectives preceding a bunch
> > >> >> of nouns, etc.)
> > >> >>
> > >> >> Assuming my index has POS tags encoded as payloads for each
> > >> >> position, how would one search for such patterns, irrespective of
> > >> >> terms? I started studying the spans search API, as this seemed
> > >> >> like the natural place to start, but I quickly got lost.
> > >> >>
> > >> >> Any tips would be extremely appreciated. (Or references to this
> > >> >> kind of thing, I'm sure someone must have tried something similar
> > >> >> before...)
> > >> >>
> > >> >> thanks!
> > >> >> ~jta
> > >> >> --
> > >> >>
> > >> >> sent from a phone. please excuse terseness and tpyos.
> > >> >>
> > >> >> enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
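On Markus's alternative of indexing the POS tags as extra tokens at the same
positions as the words they tag (synonym-style), a fixed chunk pattern can then
be matched with ordinary span queries. A rough sketch, assuming a field named
"body" and Penn Treebank tags indexed as those extra tokens:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanOrQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public final class PosChunkQueries {

      /** Matches a tiny "JJ (NN|NNS)" chunk: an adjective immediately
       *  followed by a singular or plural noun. */
      public static SpanQuery adjectiveNounChunk(String field) {
        SpanQuery adjective = new SpanTermQuery(new Term(field, "JJ"));
        SpanQuery noun = new SpanOrQuery(
            new SpanTermQuery(new Term(field, "NN")),
            new SpanTermQuery(new Term(field, "NNS")));
        // slop 0 and inOrder=true: the two tags must sit on adjacent positions
        return new SpanNearQuery(new SpanQuery[] { adjective, noun }, 0, true);
      }
    }

Optional or repeated elements of a pattern like "(DT)?(JJ)*(NN|NP)+" are not
directly expressible as a single span query; they would have to be expanded
into SpanOrQuery combinations over bounded variants (or handled with a custom
Spans), so this only covers fixed-length chunk patterns.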
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org