From: "Larry Ogrodnek"
To: "Nader Akhnoukh"
Subject: RE: Lucene and SIPs
Date: Thu, 22 Jun 2006 10:49:02 -0400
Reply-To: java-user@lucene.apache.org

I didn't make too much progress, and kind of ended up dropping it.

One thing that I played with was creating multiple phrase indexes, one each for 2, 3, 4, and 5 words. I wrote a tokenizer that would batch up the words, so, for the input string:

    The quick brown fox jumps over the slow lazy dog.

the tokenizer for 3 words would return:

    The quick brown
    quick brown fox
    brown fox jumps
    fox jumps over
    ...

This seemed like a reasonable start... the problem is resolving the overlap for display, and figuring out which words are the most important. For example, if the above sentence itself was pretty rare, and you're looking at the phrase-index-3, each one of its sub-phrases would end up being significant. Which one do you show? Or do you combine them into a longer phrase? If so, where do you stop?

It seemed like an easy first approach to try out, but I'm not sure it's even in the right direction...

________________________________
From: Nader Akhnoukh [mailto:iamnader@gmail.com]
Sent: Wednesday, June 21, 2006 8:14 PM
To: Larry Ogrodnek
Subject: Lucene and SIPs

Hi Lawrence,

I saw a posting you made to the Lucene group in February concerning using Lucene to find SIPs. Did you make any progress with this? I'm able to find significant single terms, but am stumped by phrases.

Thanks,
Nader
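
For anyone reading this in the archive: the word-batching tokenizer described in the reply can be sketched roughly as below. This is only an illustration in plain Python, not the code that was actually written and not a Lucene API. The function name `word_shingles`, the lowercasing, and the simple word pattern are all assumptions made for the example.

```python
import re


def word_shingles(text, n):
    """Return every run of n consecutive words in text, in order.

    Assumptions (not from the original post): words are lowercased and
    punctuation is dropped before batching.
    """
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    # Slide a window of n words across the sentence, one word at a time.
    # If the text has fewer than n words, the range is empty and so is the result.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


if __name__ == "__main__":
    sentence = "The quick brown fox jumps over the slow lazy dog."
    # One "phrase index" per phrase length, as described in the reply.
    for n in (2, 3, 4, 5):
        print(n, word_shingles(sentence, n))
```

Each phrase length would feed its own index. The sketch only covers the tokenizing step; it does nothing about the overlap problem raised in the reply, since every window of a rare sentence still comes out as a separate phrase.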