lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Florian Klingler <off...@florian-klingler.at>
Subject Re: Lucene search in URL
Date Sun, 20 Sep 2009 08:10:33 GMT
Thanks for all the Answers,


I'll now try to implement this.
But i have another question now:

Is there a possibility in Lucene to do a Exact Search with
Tokenized text?

Like: "en.wikipedia.org/wiki/production_code" is Tokenized in
"en.wikipedia.org"
"wiki"
"production"
"code"
with Standardanalyzer.

And a search will match iff(and only if) all the Tokens match?
Like "en.wikipedia.org/wiki/production_code" matches
"en.wikipedia.org" does not match.


The Purpose of this is following:
I have a Blacklist of URLs.
If i want to access a URL the Domain is searched in Lucene. (fast)
If there is a match, following will be searched (a bit slowlier)
"en.wikipedia.org/wiki" -> does not match
"en.wikipedia.org/wiki/production" -> does not match
* "en.wikipedia.org/wiki/production_code" -> Matches, so the URL and all subURLs are blocked.

So my Question is, is there a possibility to specify an Query to serch only for exact Document-Matches.


Thanks very much,
Florian Klingler

----- Urspr√ľngliche Mail -----
Von: "Anshum" <anshumg@gmail.com>
An: java-user@lucene.apache.org
Gesendet: Sonntag, 20. September 2009 06:58:24
Betreff: Re: Lucene search in URL

Hi Florian,
Perhaps you might run into issues with using an ngram. How I see it is that
you need tokenized urls and need to run an exact search using a keyword
tokenizer on the search string.
You could try this. I am assuming it'll work.
so something like
en.wikipedia.org/wiki/production_code/test
gets tokenized as
[en] [wikipedia] [org] [wiki[ [production_code] [test]

so an exact search for any set of subsequent (while maintaining the order)
would get you the result. And yes, you might want to look at your tokenizers
a little bit.

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............


On Sun, Sep 20, 2009 at 3:30 AM, AHMET ARSLAN <iorixxx@yahoo.com> wrote:

> > Dear List,
> >
> > I'm working on a project where i have to check a Blacklist
> > of URL's with Lucene. (about 500.000)
> > Is it possible to search for a URL in a hierarchical
> > context?
> >
> > for Example:
> > Blacklist entry: "en.wikipedia.org/wiki/production_code"
> >
> > "en.wikipedia.org/wiki/production_code/test" should match
> > "en.wikipedia.org/wiki/test" should not match
>
> If any substring (0 to n) of your query matches a document completely than
> that query should match, right? Thats what I understand from your examples.
>
> You can achieve this bu using two different analyzers for index and query
> time.
>
> query analyzer:
>
> KeywordTokenizer
> EdgeNGramTokenFilter (side = EdgeNGramTokenFilter.Side.FRONT , mingram = 1,
> maxgram=512)
>
> index analyzer:
>
> KeywordTokenizer
>
> The index analyzer comes out-of-the-box:
> org.apache.lucene.analysis.KeywordAnalyzer
> But you need to write query analyzer.
>
> If you want case-insensitive search you can add LowercaseFilter to both of
> your analyzers.
>
> By using this, your black list urls will be indexed verbatim. (one token)
>
> Your query "en.wikipedia.org/wiki/production_code/test"
> will be broken in to these pieces and one of them will match your document:
>
> e
> en
> en.
> en.w
> en.wi
> en.wik
> en.wiki
> en.wikip
> en.wikipe
> en.wikiped
> en.wikipedi
> en.wikipedia
> en.wikipedia.
> en.wikipedia.o
> en.wikipedia.or
> en.wikipedia.org
> en.wikipedia.org/
> en.wikipedia.org/w
> en.wikipedia.org/wi
> en.wikipedia.org/wik
> en.wikipedia.org/wiki
> en.wikipedia.org/wiki/
> en.wikipedia.org/wiki/p
> en.wikipedia.org/wiki/pr
> en.wikipedia.org/wiki/pro
> en.wikipedia.org/wiki/prod
> en.wikipedia.org/wiki/produ
> en.wikipedia.org/wiki/produc
> en.wikipedia.org/wiki/product
> en.wikipedia.org/wiki/producti
> en.wikipedia.org/wiki/productio
> en.wikipedia.org/wiki/production
> en.wikipedia.org/wiki/production_
> en.wikipedia.org/wiki/production_c
> en.wikipedia.org/wiki/production_co
> en.wikipedia.org/wiki/production_cod
> * en.wikipedia.org/wiki/production_code  // this is your document a match
> en.wikipedia.org/wiki/production_code/
> en.wikipedia.org/wiki/production_code/t
> en.wikipedia.org/wiki/production_code/te
> en.wikipedia.org/wiki/production_code/tes
> en.wikipedia.org/wiki/production_code/test
>
> The none of the pieces of the query "en.wikipedia.org/wiki/test" will
> match your document.
>
> Hope this helps.
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message