lucene-java-user mailing list archives

Subject: Re: Lucene search in URL
Date: Sun, 20 Sep 2009 10:14:11 GMT
> Is there a possibility in Lucene to do an exact search with
> tokenized text?
> Like: "" is tokenized
> into
> ""
> "wiki"
> "production"
> "code"
> with StandardAnalyzer.
> And a search will match if (and only if) all the tokens
> match?
> Like "" matches,
> "" does not match.
> The purpose of this is the following:
> I have a blacklist of URLs.
> If I want to access a URL, the domain is searched in Lucene
> (fast).
> If there is a match, the following will be searched (a bit
> more slowly):
> "" -> does not match
> "" -> does not match
> * "" -> matches, so
> the URL and all sub-URLs are blocked.
> So my question is: is there a possibility to specify a
> query to search only for exact document matches?

Document : ""

Query 1 : ""  should match
Query 2 : ""  should not match
Query 3 : ""  should not match

In my proposed solution, Query 3 would also match, and you don't want that.

Am I correct?

So we cannot use letter-based n-grams. We need token-based n-grams (a.k.a. shingles).

Regarding your question "a search will match if (and only if) all the tokens match?":
1-) all tokens in the query: yes, by setting the default operator to AND (see the sketch after this list).
2-) all tokens in the document: AFAIK there is no such mechanism.
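
For 1-), a minimal sketch of forcing the AND operator, assuming the Lucene 2.9-era QueryParser API; the field name "url" is just a placeholder:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class AndOperatorExample {
        public static void main(String[] args) throws ParseException {
            QueryParser parser = new QueryParser(Version.LUCENE_29, "url",
                    new StandardAnalyzer(Version.LUCENE_29));
            // Require every query token to match (AND instead of the default OR).
            parser.setDefaultOperator(QueryParser.AND_OPERATOR);
            Query q = parser.parse("wiki production code");
            System.out.println(q); // +url:wiki +url:production +url:code
        }
    }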

You want a document to match only if all tokens in the document match query terms.
IMO, to simulate this you need to index the documents with KeywordAnalyzer and manipulate the queries.
Since each document is indexed as a single string (one token), an exact match is guaranteed.
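
A minimal sketch of that index side, assuming a Lucene 2.9-era API; the field name "url" and the example path are placeholders, not from the original mails:

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class BlacklistIndexExample {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            // KeywordAnalyzer keeps the whole field value as a single token,
            // so each blacklisted path is indexed exactly as given.
            IndexWriter writer = new IndexWriter(dir, new KeywordAnalyzer(),
                    IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            doc.add(new Field("url", "example.org/wiki/production_code",
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // Only a term equal to the full indexed string matches.
            IndexSearcher searcher = new IndexSearcher(dir);
            TopDocs hits = searcher.search(
                    new TermQuery(new Term("url", "example.org/wiki/production_code")), 10);
            System.out.println("hits: " + hits.totalHits); // 1
            searcher.close();
        }
    }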

Query 1:
*     [match]

Query 2:

Query 3:

In this scenario only Q1 matches. The index-time analyzer is the same KeywordAnalyzer.


1-) An extension of CharTokenizer that breaks only at the '/' character:

protected boolean isTokenChar(char c) {
        // every character except '/' is a token character, so tokens break only at '/'
        return c != '/';
}

2-) A modified copy of ShingleFilter that uses '/' as the token separator, with maxShingleSize=512:

public static final String TOKEN_SEPARATOR = "/";
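
For reference, a sketch of how these two pieces could be wired into the query-time analyzer, assuming the Lucene 2.9/3.0-era Analyzer/CharTokenizer API. It uses the stock ShingleFilter, so in practice that line would use the modified copy with the '/' separator described above; the class names are just placeholders:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharTokenizer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.shingle.ShingleFilter;

    public class UrlShingleAnalyzer extends Analyzer {

        // Tokenizer that breaks the URL only at '/'.
        private static class SlashTokenizer extends CharTokenizer {
            SlashTokenizer(Reader in) { super(in); }
            protected boolean isTokenChar(char c) { return c != '/'; }
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            // In practice: the modified ShingleFilter copy joining tokens with '/'.
            return new ShingleFilter(new SlashTokenizer(reader), 512);
        }
    }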

In this configuration only Q1 matches, but this query analyzer produces unnecessary tokens. For
Q1 it produces 10 tokens (token / type):

""                          word
""                          shingle
""                          shingle
""                          shingle
wiki                        word
wiki/production_code        shingle
wiki/production_code/test   shingle
production_code             word
production_code/test        shingle
test                        word

You need only the first 4; the rest are not harmful, just unnecessary. Maybe you can modify this
filter to output only the first n tokens.
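
If copying and trimming ShingleFilter feels like too much work, an alternative (not from the original post, just a sketch under the same Lucene 2.x/3.x assumptions) is to skip the query-time analyzer and build those first n prefix terms directly:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class UrlPrefixQueryBuilder {
        // One SHOULD term per path prefix of the incoming URL, so any
        // blacklist entry that equals one of the prefixes is a hit.
        public static BooleanQuery build(String field, String url) {
            BooleanQuery query = new BooleanQuery();
            String[] parts = url.split("/");
            StringBuilder prefix = new StringBuilder();
            for (int i = 0; i < parts.length; i++) {
                if (i > 0) prefix.append('/');
                prefix.append(parts[i]);
                query.add(new TermQuery(new Term(field, prefix.toString())),
                        BooleanClause.Occur.SHOULD);
            }
            return query;
        }
    }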

Hope this helps.

P.S. I didn't see any methods to change TOKEN_SEPARATOR in ShingleFilter.

