Date: Fri, 3 Jul 2009 15:03:23 -0400
Subject: Re: How to use RegexTermEnum
From: Erick Erickson
To: java-user@lucene.apache.org

WARNING: I haven't actually tried using RegexTermEnum in a long time, but...

I *think* that the constructor positions you at the first term that matches,
without calling next(). At least there's nothing I saw in the documentation
that indicates you need to call next() before calling term().

Assuming that's true, I think you're skipping the first term by calling
next() before incrementing your count.

At least it's worth a try ....

Best
Erick

On Fri, Jul 3, 2009 at 12:27 PM, Raf wrote:

> Hi,
> I am trying to solve the following problem:
> In my index I have a "url" field added as Field.Store.YES,
> Field.Index.NOT_ANALYZED and I must use this field as a "key" to identify a
> document.
>
> The problem is that sometimes two urls can differ only because they contain
> a different session id:
> i.e.
> I would like to identify that
> http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab02505591827a90fe5010f45c#3432879
> and
> http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab505d98a8229c10fe5010f45c#3432879
> are the same document!
>
> So I have tried using a regular expression, to ignore the sid and match both
> documents:
> "http://digiland\\.libero\\.it/forum/viewtopic\\.php\\?p=3432879\\&.*#3432879".
>
> At this point, I would like to retrieve all terms that satisfy my regex so I
> tried to use a RegexTermEnum, but it returns to me only one of the two
> documents.
> Actually, it seems to me that it does not return the "first" match.
> So, if I have only one match in my index, RegexTermEnum returns nothing, if
> I have two matches, it returns one doc, and so on.
>
> Here you can find a simple test that shows the problem (both assert fail):
>
> package it.celi.search;
>
> import static org.junit.Assert.assertEquals;
>
> import java.io.IOException;
>
> import org.apache.lucene.analysis.KeywordAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.IndexWriter.MaxFieldLength;
> import org.apache.lucene.search.regex.JakartaRegexpCapabilities;
> import org.apache.lucene.search.regex.RegexTermEnum;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
> import org.junit.After;
> import org.junit.Before;
> import org.junit.Test;
>
> public class RegexLuceneTest {
>
>     private Directory directory;
>
>     @Before
>     public void setUp() throws Exception {
>
>         this.directory = new RAMDirectory();
>         this.addDocsToIndex();
>     }
>
>     @After
>     public void tearDown() throws Exception {
>     }
>
>     @Test
>     public void test() throws IOException {
>
>         IndexReader reader = IndexReader.open(this.directory);
>         System.out.println("Num docs: " + reader.numDocs());
>
>         JakartaRegexpCapabilities regexpCapabilities =
>             new JakartaRegexpCapabilities();
>
>         String urlToSearch =
>             "http://digiland\\.libero\\.it/forum/viewtopic\\.php\\?p=3432889\\&.*#3432889";
>         RegexTermEnum rte = new RegexTermEnum(reader,
>             new Term("url", urlToSearch), regexpCapabilities);
>         int count = 0;
>         while (rte.next()) {
>             System.out.println(rte.term() + " " + rte.docFreq());
>             count++;
>         }
>         assertEquals(1, count);
>
>         urlToSearch =
>             "http://digiland\\.libero\\.it/forum/viewtopic\\.php\\?p=3432879\\&.*#3432879";
>         rte = new RegexTermEnum(reader, new Term("url", urlToSearch),
>             regexpCapabilities);
>         count = 0;
>         while (rte.next()) {
>             System.out.println(rte.term() + " " + rte.docFreq());
>             count++;
>         }
>         assertEquals(2, count);
>
>     }
>
>     private void addDocsToIndex() throws IOException {
>
>         IndexWriter writer = new IndexWriter(directory,
>             new KeywordAnalyzer(), true, MaxFieldLength.UNLIMITED);
>
>         Document doc = new Document();
>         doc.add(new Field("url",
>             "http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab02505591827a90fe5010f45c#3432879",
>             Field.Store.YES, Field.Index.NOT_ANALYZED));
>         doc.add(new Field("contents", "contenuto documento 1",
>             Field.Store.YES, Field.Index.NOT_ANALYZED));
>         writer.addDocument(doc);
>
>         doc = new Document();
>         doc.add(new Field("url",
>             "http://digiland.libero.it/forum/viewtopic.php?p=3432889&sid=16c7ea74d98a8229c1ddd4800a2738ec#3432889",
>             Field.Store.YES, Field.Index.NOT_ANALYZED));
>         doc.add(new Field("contents", "contenuto documento 2",
>             Field.Store.YES, Field.Index.NOT_ANALYZED));
>         writer.addDocument(doc);
>
>         doc = new Document();
>         doc.add(new Field("url",
>             "http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab505d98a8229c10fe5010f45c#3432879",
>             Field.Store.YES, Field.Index.NOT_ANALYZED));
>         doc.add(new Field("contents", "contenuto documento 3",
>             Field.Store.YES, Field.Index.NOT_ANALYZED));
>         writer.addDocument(doc);
>
>         writer.optimize();
>         writer.close();
>     }
>
> }
>
>
> What am I missing?
> Thanks.
>
> Bye,
> Raf
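
The suggestion in the reply can be sketched without Lucene. This is only an
illustration under the reply's own (unverified) assumption that the
constructor leaves the enum positioned on its first match. MatchingTermEnum,
countSkippingFirst and countAll are hypothetical names written for this
sketch with java.util.regex; they are not part of the Lucene API. If the
assumption holds, the same do/while shape (check term() for null first, then
loop on next()) would apply to the RegexTermEnum loops in the test.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PrePositionedEnumDemo {

    // The two patterns from the quoted test, one per topic id.
    static final String REGEX_ONE_MATCH =
        "http://digiland\\.libero\\.it/forum/viewtopic\\.php\\?p=3432889\\&.*#3432889";
    static final String REGEX_TWO_MATCHES =
        "http://digiland\\.libero\\.it/forum/viewtopic\\.php\\?p=3432879\\&.*#3432879";

    /** The three url terms from the quoted test, in sorted order. */
    static List<String> sampleTerms() {
        return Arrays.asList(
            "http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab02505591827a90fe5010f45c#3432879",
            "http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab505d98a8229c10fe5010f45c#3432879",
            "http://digiland.libero.it/forum/viewtopic.php?p=3432889&sid=16c7ea74d98a8229c1ddd4800a2738ec#3432889");
    }

    /** Hypothetical stand-in: an enum whose constructor already points at the first match. */
    static final class MatchingTermEnum {
        private final List<String> terms;
        private final Pattern pattern;
        private int pos = -1;

        MatchingTermEnum(List<String> terms, String regex) {
            this.terms = terms;
            this.pattern = Pattern.compile(regex);
            advance(); // assumption from the reply: positioned on the first match here
        }

        /** Current matching term, or null once the enum is exhausted. */
        String term() {
            return pos < terms.size() ? terms.get(pos) : null;
        }

        /** Moves to the next matching term; false when there is none. */
        boolean next() {
            advance();
            return pos < terms.size();
        }

        private void advance() {
            do {
                pos++;
            } while (pos < terms.size() && !pattern.matcher(terms.get(pos)).matches());
        }
    }

    /** The loop shape from the quoted test: next() runs before the first term is read. */
    static int countSkippingFirst(List<String> terms, String regex) {
        MatchingTermEnum e = new MatchingTermEnum(terms, regex);
        int count = 0;
        while (e.next()) {
            count++;
        }
        return count;
    }

    /** Reads the current term first, then advances. */
    static int countAll(List<String> terms, String regex) {
        MatchingTermEnum e = new MatchingTermEnum(terms, regex);
        if (e.term() == null) {
            return 0; // no match at all
        }
        int count = 0;
        do {
            count++;
        } while (e.next());
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSkippingFirst(sampleTerms(), REGEX_TWO_MATCHES));
        System.out.println(countAll(sampleTerms(), REGEX_TWO_MATCHES));
    }
}
```

With the stand-in, the while loop reports one match fewer than the do/while
loop for both patterns, which is the same off-by-one described in the
question (no result for a single match, one result for two matches).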