From: "Ghinwa Choueiter"
Subject: How to index word-pairs and phrases
Date: Mon, 18 Feb 2008 19:36:21 -0500
Mailing-List: java-user@lucene.apache.org

Hi,

I am new to Lucene and have been reading the documentation. I would like to use Lucene to query a song database by lyrics. The query could potentially contain typos, wrong words, word contractions ("can't" versus "cannot"), etc.

I would like to create an inverted list keyed by word pairs, and possibly phrases, and not just by isolated words. For example:

< d1, d10, d27> ... OR even <...> ...

It seems to me that, by default, the Lucene index stores statistics for isolated words. The Lucene documentation refers to "Term" all the time and seems to imply that a "Term" can be a word or a phrase, but I can't see how IndexWriter can read a document and index it by word pairs.

Thank you in advance for the answers, and my apologies if I did not get the terminology quite right.

-Ghinwa
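
To make the word-pair inverted list described in the message concrete, here is a minimal sketch in plain Java. It does not use the Lucene API at all; the class name WordPairIndex, its methods, the tokenization rule, and the sample lyrics are all assumptions made up for illustration. It only shows the data structure being asked about: each adjacent word pair is a key that maps to the set of documents containing it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only: a tiny in-memory inverted index keyed by word pairs.
public class WordPairIndex {
    // Maps each word pair ("w1 w2") to the sorted set of document ids that contain it.
    private final Map<String, SortedSet<String>> postings = new TreeMap<>();

    // Assumed tokenization: lowercase, split on anything that is not a letter, digit, or apostrophe.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Produces every pair of adjacent tokens, in order of appearance.
    static List<String> wordPairs(String text) {
        List<String> tokens = tokenize(text);
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            pairs.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return pairs;
    }

    // Records the document id under each word pair found in the text.
    void addDocument(String docId, String text) {
        for (String pair : wordPairs(text)) {
            postings.computeIfAbsent(pair, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the ids of documents containing the pair, or an empty set if there are none.
    SortedSet<String> lookup(String pair) {
        return postings.getOrDefault(pair, new TreeSet<>());
    }

    public static void main(String[] args) {
        WordPairIndex index = new WordPairIndex();
        index.addDocument("d1", "I can't get no satisfaction");
        index.addDocument("d10", "You can't always get what you want");
        System.out.println(index.lookup("can't get"));
        System.out.println(index.lookup("you can't"));
    }
}
```

Exact pair matching like this would not tolerate the typos or contractions mentioned in the message; those would need separate handling, such as normalizing contractions before pairing.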