lucene-java-user mailing list archives

From Raf <r.ventag...@gmail.com>
Subject Re: How to use RegexTermEnum
Date Sat, 04 Jul 2009 16:01:46 GMT
Yes, I thought about this solution too, but the problem is that the "sid"
part can differ from one domain to another.
So, sometimes we have sid=..., other times we have s=..., and so on.

If we decide to solve the problem by removing the sid from the url in the
index, when we discover a new "pattern" (while we are using our system) we
will have to reindex the documents...

Using the regex approach, instead, we can configure the pattern we want to
identify for each domain and simply change the configuration when we find
a new pattern.
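
A minimal sketch of that per-domain configuration, under stated assumptions:
the class name SessionIdRegexConfig and its methods are hypothetical, the
session id is assumed to be a single query parameter, and the generated
pattern uses java.util.regex syntax (\Q...\E quoting), so another regex
engine may need different escaping.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: maps a host to the name of its
// session-id query parameter and turns a URL into a regex that ignores it.
final class SessionIdRegexConfig {
    private final Map<String, String> sessionParamByHost = new HashMap<>();

    // Register (or change) the session parameter name for one host.
    void configure(String host, String paramName) {
        sessionParamByHost.put(host, paramName);
    }

    // Returns a java.util.regex pattern that matches the URL with any value
    // for the host's configured session parameter.
    String toRegex(String url) {
        String host = URI.create(url).getHost();
        String param = sessionParamByHost.get(host);
        if (param == null) {
            return Pattern.quote(url); // unknown host: exact match only
        }
        // Locate "?param=value" or "&param=value" in the URL.
        Matcher m = Pattern
                .compile("([?&]" + Pattern.quote(param) + "=)[^&#]*")
                .matcher(url);
        if (!m.find()) {
            return Pattern.quote(url); // parameter absent: exact match only
        }
        String before = url.substring(0, m.end(1)); // up to and including "param="
        String after = url.substring(m.end());      // everything after the value
        return Pattern.quote(before) + "[^&#]*" + Pattern.quote(after);
    }
}
```

Supporting a newly discovered pattern (for example s= on another domain)
would then be one more configure(...) call rather than a reindex.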

Anyway, thank you for your suggestion.

Bye
Raf

On Sat, Jul 4, 2009 at 2:58 PM, Shayak Sen <shayaksen@gmail.com> wrote:

> I might be skirting the issue here, but wouldn't it be easier and
> faster if you removed the sid before you add the url to the index?
>
> Cheers,
> Shayak
>
> On Sat, Jul 4, 2009 at 3:03 AM, Erick Erickson<erickerickson@gmail.com>
> wrote:
> > WARNING: I haven't actually tried using RegexTermEnum in a
> > long time, but...
> >
> > I *think* that the constructor positions you at the first term that
> > matches, without calling next(). At least there's nothing I saw
> > in the documentation that indicates you need to call next() before
> > calling term().
> >
> > Assuming that's true, I think you're skipping the first term by calling
> > next() before incrementing your count.
> >
> > At least it's worth a try <G>....
> >
> > Best
> > Erick
> >
> > On Fri, Jul 3, 2009 at 12:27 PM, Raf <r.ventaglio@gmail.com> wrote:
> >
> >> Hi,
> >> I am trying to solve the following problem:
> >> In my index I have a "url" field added as Field.Store.YES,
> >> Field.Index.NOT_ANALYZED and I must use this field as a "key" to
> >> identify a document.
> >>
> >> The problem is that sometimes two urls can differ only because they
> >> contain a different session id.
> >> For example, I would like to recognize that
> >>
> >> http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab02505591827a90fe5010f45c#3432879
> >> and
> >>
> >> http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab505d98a8229c10fe5010f45c#3432879
> >> are the same document!
> >>
> >> So I have tried using a regular expression, to ignore the sid and match
> >> both documents:
> >> "http://digiland\\.libero\\.it/forum/viewtopic\\.php\\?p=3432879\\&.*#3432879".
> >>
> >> At this point, I would like to retrieve all terms that satisfy my regex,
> >> so I tried to use a RegexTermEnum, but it returns only one of the two
> >> documents.
> >> Actually, it seems to me that it does not return the "first" match.
> >> So, if I have only one match in my index, RegexTermEnum returns nothing;
> >> if I have two matches, it returns one doc, and so on.
> >>
> >> Here you can find a simple test that shows the problem (both asserts
> >> fail):
> >>
> >> <code>
> >> package it.celi.search;
> >>
> >> import static org.junit.Assert.assertEquals;
> >>
> >> import java.io.IOException;
> >>
> >> import org.apache.lucene.analysis.KeywordAnalyzer;
> >> import org.apache.lucene.document.Document;
> >> import org.apache.lucene.document.Field;
> >> import org.apache.lucene.index.IndexReader;
> >> import org.apache.lucene.index.IndexWriter;
> >> import org.apache.lucene.index.Term;
> >> import org.apache.lucene.index.IndexWriter.MaxFieldLength;
> >> import org.apache.lucene.search.regex.JakartaRegexpCapabilities;
> >> import org.apache.lucene.search.regex.RegexTermEnum;
> >> import org.apache.lucene.store.Directory;
> >> import org.apache.lucene.store.RAMDirectory;
> >> import org.junit.After;
> >> import org.junit.Before;
> >> import org.junit.Test;
> >>
> >> public class RegexLuceneTest {
> >>
> >>    private Directory directory;
> >>
> >>    @Before
> >>    public void setUp() throws Exception {
> >>
> >>        this.directory = new RAMDirectory();
> >>        this.addDocsToIndex();
> >>    }
> >>
> >>    @After
> >>    public void tearDown() throws Exception {
> >>    }
> >>
> >>    @Test
> >>    public void test() throws IOException {
> >>
> >>        IndexReader reader = IndexReader.open(this.directory);
> >>        System.out.println("Num docs: " + reader.numDocs());
> >>
> >>        JakartaRegexpCapabilities regexpCapabilities = new JakartaRegexpCapabilities();
> >>
> >>        String urlToSearch = "http://digiland\\.libero\\.it/forum/viewtopic\\.php\\?p=3432889\\&.*#3432889";
> >>        RegexTermEnum rte = new RegexTermEnum(reader, new Term("url", urlToSearch), regexpCapabilities);
> >>        int count = 0;
> >>        while (rte.next()) {
> >>            System.out.println(rte.term() + " " + rte.docFreq());
> >>            count++;
> >>        }
> >>        assertEquals(1, count);
> >>
> >>        urlToSearch = "http://digiland\\.libero\\.it/forum/viewtopic\\.php\\?p=3432879\\&.*#3432879";
> >>        rte = new RegexTermEnum(reader, new Term("url", urlToSearch), regexpCapabilities);
> >>        count = 0;
> >>        while (rte.next()) {
> >>            System.out.println(rte.term() + " " + rte.docFreq());
> >>            count++;
> >>        }
> >>        assertEquals(2, count);
> >>
> >>    }
> >>
> >>    private void addDocsToIndex() throws IOException {
> >>
> >>        IndexWriter writer = new IndexWriter(directory, new KeywordAnalyzer(), true, MaxFieldLength.UNLIMITED);
> >>
> >>        Document doc = new Document();
> >>        doc.add(new Field("url", "http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab02505591827a90fe5010f45c#3432879",
> >>                Field.Store.YES, Field.Index.NOT_ANALYZED));
> >>        doc.add(new Field("contents", "contenuto documento 1",
> >>                Field.Store.YES, Field.Index.NOT_ANALYZED));
> >>        writer.addDocument(doc);
> >>
> >>        doc = new Document();
> >>        doc.add(new Field("url", "http://digiland.libero.it/forum/viewtopic.php?p=3432889&sid=16c7ea74d98a8229c1ddd4800a2738ec#3432889",
> >>                Field.Store.YES, Field.Index.NOT_ANALYZED));
> >>        doc.add(new Field("contents", "contenuto documento 2",
> >>                Field.Store.YES, Field.Index.NOT_ANALYZED));
> >>        writer.addDocument(doc);
> >>
> >>        doc = new Document();
> >>        doc.add(new Field("url", "http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab505d98a8229c10fe5010f45c#3432879",
> >>                Field.Store.YES, Field.Index.NOT_ANALYZED));
> >>        doc.add(new Field("contents", "contenuto documento 3",
> >>                Field.Store.YES, Field.Index.NOT_ANALYZED));
> >>        writer.addDocument(doc);
> >>
> >>        writer.optimize();
> >>        writer.close();
> >>    }
> >>
> >> }
> >> </code>
> >>
> >> What am I missing?
> >> Thanks.
> >>
> >> Bye,
> >> Raf
> >>
> >
>
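
Erick's guess in the quoted reply can be illustrated without Lucene. This is
only a sketch: PrePositionedEnum is a hypothetical stand-in, not the Lucene
class, and it assumes (as Erick suspects, but the thread does not confirm)
that the constructor already positions the enum on the first match and that
term() returns null when nothing matches.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in, NOT RegexTermEnum: it only mimics an enum whose
// constructor is assumed to position it on the first matching term.
final class PrePositionedEnum {
    private final Iterator<String> rest;
    private String current; // null when there is no current term

    PrePositionedEnum(List<String> matches) {
        this.rest = matches.iterator();
        this.current = rest.hasNext() ? rest.next() : null;
    }

    String term() {
        return current;
    }

    boolean next() {
        current = rest.hasNext() ? rest.next() : null;
        return current != null;
    }
}

final class EnumLoops {
    // Loop shape used in the quoted test: next() runs before the first
    // term is read, so the term the constructor landed on is skipped.
    static int countWithWhile(PrePositionedEnum e) {
        int count = 0;
        while (e.next()) {
            count++;
        }
        return count;
    }

    // Read the current term first, then advance.
    static int countWithDoWhile(PrePositionedEnum e) {
        if (e.term() == null) {
            return 0; // nothing matched at all
        }
        int count = 0;
        do {
            count++;
        } while (e.next());
        return count;
    }
}
```

Under that assumption, two matching terms give a count of 1 with the first
loop and 2 with the second, and a single match gives 0 and 1, which is the
symptom described in the original question.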
