lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From AHMET ARSLAN <iori...@yahoo.com>
Subject Re: Lucene search in URL
Date Sat, 19 Sep 2009 22:00:40 GMT
> Dear List,
> 
> I'm working on a project where i have to check a Blacklist
> of URL's with Lucene. (about 500.000)
> Is it possible to search for a URL in a hierarchical
> context?
> 
> for Example:
> Blacklist entry: "en.wikipedia.org/wiki/production_code"
> 
> "en.wikipedia.org/wiki/production_code/test" should match
> "en.wikipedia.org/wiki/test" should not match

If any substring (0 to n) of your query matches a document completely than that query should
match, right? Thats what I understand from your examples.

You can achieve this bu using two different analyzers for index and query time.

query analyzer:

KeywordTokenizer
EdgeNGramTokenFilter (side = EdgeNGramTokenFilter.Side.FRONT , mingram = 1, maxgram=512)

index analyzer:

KeywordTokenizer

The index analyzer comes out-of-the-box: org.apache.lucene.analysis.KeywordAnalyzer
But you need to write query analyzer.

If you want case-insensitive search you can add LowercaseFilter to both of your analyzers.


By using this, your black list urls will be indexed verbatim. (one token)

Your query "en.wikipedia.org/wiki/production_code/test" 
will be broken in to these pieces and one of them will match your document:

e 
en 
en. 
en.w 
en.wi 
en.wik 
en.wiki 
en.wikip 
en.wikipe 
en.wikiped 
en.wikipedi 
en.wikipedia 
en.wikipedia. 
en.wikipedia.o 
en.wikipedia.or 
en.wikipedia.org 
en.wikipedia.org/ 
en.wikipedia.org/w 
en.wikipedia.org/wi 
en.wikipedia.org/wik 
en.wikipedia.org/wiki 
en.wikipedia.org/wiki/ 
en.wikipedia.org/wiki/p 
en.wikipedia.org/wiki/pr 
en.wikipedia.org/wiki/pro 
en.wikipedia.org/wiki/prod 
en.wikipedia.org/wiki/produ 
en.wikipedia.org/wiki/produc 
en.wikipedia.org/wiki/product 
en.wikipedia.org/wiki/producti 
en.wikipedia.org/wiki/productio 
en.wikipedia.org/wiki/production 
en.wikipedia.org/wiki/production_ 
en.wikipedia.org/wiki/production_c 
en.wikipedia.org/wiki/production_co 
en.wikipedia.org/wiki/production_cod 
* en.wikipedia.org/wiki/production_code  // this is your document a match
en.wikipedia.org/wiki/production_code/ 
en.wikipedia.org/wiki/production_code/t 
en.wikipedia.org/wiki/production_code/te 
en.wikipedia.org/wiki/production_code/tes 
en.wikipedia.org/wiki/production_code/test 

The none of the pieces of the query "en.wikipedia.org/wiki/test" will match your document.

Hope this helps.


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message