lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Franz Allan Valencia See <franz....@gmail.com>
Subject Re: Modifying IDF
Date Mon, 01 Feb 2010 13:04:54 GMT
Hmm....

My Analyzer is a Dictionary-based Analyzer. And so, it only recognizes
tokens in its dictionary. Adding every url (or domain) is not a viable
solution.

So how could I include that to my analyzer? Lucene Filter? FilterReader?

Thanks,

-- 
Franz Allan Valencia See | Java Software Engineer
franz.see@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see

On Sun, Jan 31, 2010 at 1:58 AM, Ian Lea <ian.lea@gmail.com> wrote:

> Are you asking how to get lucene.apache.org out of
> http://lucene.apache.org/ or how to get apache.org out of
> lucene.apache.org?  The getHost() method of java.net.URL will give you
> the former. Or use a regexp.  I don't know an easy way to do the
> latter, but depending on your requirements you could split
> lucene.apache.org into tokens "lucene.apache.org" and "apache.org" and
> "org" and index all of them.  You probably want to use an analyzer
> that doesn't split on the . character.
>
>
> --
> Ian.
>
>
> On Sat, Jan 30, 2010 at 12:12 AM, Franz Allan Valencia See
> <franz.see@gmail.com> wrote:
> > How should I go about identifying the domain?
> >
> > Thanks,
> >
> > --
> > Franz Allan Valencia See | Java Software Engineer
> > franz.see@gmail.com
> > LinkedIn: http://www.linkedin.com/in/franzsee
> > Twitter: http://www.twitter.com/franz_see
> >
> > On Fri, Jan 29, 2010 at 6:42 PM, Ian Lea <ian.lea@gmail.com> wrote:
> >
> >> Instead of playing around with tf/idf, how about just indexing and
> >> searching the domain.
> >>
> >>
> >> --
> >> Ian.
> >>
> >>
> >> On Fri, Jan 29, 2010 at 3:43 AM, Franz Allan Valencia See
> >> <franz.see@gmail.com> wrote:
> >> > Good day,
> >> >
> >> > I am currently using lucene for my searches. And one of the problems
> that
> >> Im
> >> > facing is when keyword is a url. The tokens such as http, https, ://,
> >> index,
> >> > html, etc seems to be messing up with our search results. The focus
> was
> >> > supposed to be only on the url domain.
> >> >
> >> > The idea that I have is modify the idf so that rare terms get boosted
> >> much
> >> > more than the default settings in lucene. Since there are probably a
> lot
> >> of
> >> > http, https://, etc, then matches to these terms should be really
> really
> >> > low, while matches to the domain (which is rare) should be high.
> >> >
> >> > Would this work or am I totally misunderstanding lucene's tf/idf? :-)
> >> >
> >> > Thanks,
> >> >
> >> > --
> >> > Franz Allan Valencia See | Java Software Engineer
> >> > franz.see@gmail.com
> >> > LinkedIn: http://www.linkedin.com/in/franzsee
> >> > Twitter: http://www.twitter.com/franz_see
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message