lucene-java-user mailing list archives

From Lucifer Hammer <luce...@gmail.com>
Subject Re: Implementing filtering based on multiple fields
Date Tue, 12 Jan 2010 14:11:55 GMT
Why not just append custom terms to the end of each user's query?
For example, when user X queries for "bananas" and has previously restricted
their search to the domains cnn and yahoo, why not rewrite the search query
as:   "fullText:bananas AND (domain:cnn OR domain:yahoo)"

Off the top of my head, there are a few caveats:

1) If the domain list is large, you'll have to deal with the maximum boolean
clause setting (BooleanQuery.setMaxClauseCount, 1024 by default).
2) Parsing the query can be slow. However, that slight performance hit has to
be weighed against managing thousands of indexes. (Or you can put the query
together without parsing; it depends on how you handle the user's query
terms.)
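
A minimal sketch of the string-appending idea, in plain Java with no Lucene dependency. The class name DomainFilterQuery, the method appendDomainFilter, and the constant ASSUMED_MAX_CLAUSES are hypothetical names made up for illustration. The 1024 limit is assumed to match the BooleanQuery default and should be checked against your Lucene version. Restricting domain values to letters, digits, and dots is a deliberately conservative choice for this sketch so that no query-syntax characters slip into the appended clause.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Lucene. */
final class DomainFilterQuery {

    // Assumption: mirrors the default BooleanQuery clause limit. Verify for your version.
    static final int ASSUMED_MAX_CLAUSES = 1024;

    // Conservative whitelist for this sketch: letters, digits, and dots only.
    private static final Pattern SAFE_DOMAIN = Pattern.compile("[A-Za-z0-9.]+");

    private DomainFilterQuery() {}

    /**
     * Wraps the user's clause in parentheses and ANDs it with an OR-list of
     * the user's domains, e.g. "(fullText:bananas) AND (domain:cnn OR domain:yahoo)".
     */
    static String appendDomainFilter(String userClause, List<String> domains, int maxClauses) {
        if (domains.isEmpty()) {
            throw new IllegalArgumentException("at least one domain is required");
        }
        // Caveat 1: one OR clause per domain, so a large list can exceed the limit.
        if (domains.size() > maxClauses) {
            throw new IllegalArgumentException(
                    domains.size() + " domains exceed the clause limit of " + maxClauses);
        }
        StringBuilder filter = new StringBuilder();
        for (String domain : domains) {
            if (!SAFE_DOMAIN.matcher(domain).matches()) {
                throw new IllegalArgumentException("unsupported domain value: " + domain);
            }
            if (filter.length() > 0) {
                filter.append(" OR ");
            }
            filter.append("domain:").append(domain);
        }
        // Parenthesize the user's clause so any OR inside it cannot escape the filter.
        return "(" + userClause + ") AND (" + filter + ")";
    }
}
```

Calling DomainFilterQuery.appendDomainFilter("fullText:bananas", Arrays.asList("cnn", "yahoo"), DomainFilterQuery.ASSUMED_MAX_CLAUSES) yields a string you would then hand to the query parser. If parsing turns out to be the bottleneck, the same structure could instead be assembled programmatically from query objects, which is the "without parsing" option in caveat 2.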

This seems like too simple an approach, so I'm sure I'm misunderstanding
something...

LH
On Fri, Jan 8, 2010 at 5:16 AM, Yaniv Ben Yosef <yanivby@gmail.com> wrote:

> Thanks Otis, that's very helpful.
>
> On Fri, Jan 8, 2010 at 2:08 AM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
>
> > Ah, well, masking it didn't help.  Yes, ignore Bixo, Nutch, and Droids
> > then.
> > Consider DataImportHandler from Solr or wait a bit for Lucene Connectors
> > Framework to materialize.  Or use LuSql, or DbSight, or Sematext's Database
> > Indexer.
> >
> > Yes, I was suggesting a separate index for each user.  That's what Simpy
> > uses and has some 200K indices on 1 box.... and I think dozens of QPS
> > without any caching, if I remember correctly.  Load is under 1.0.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> >
> >
> >
> > ----- Original Message ----
> > > From: Yaniv Ben Yosef <yanivby@gmail.com>
> > > To: java-user@lucene.apache.org
> > > Sent: Thu, January 7, 2010 6:55:18 PM
> > > Subject: Re: Implementing filtering based on multiple fields
> > >
> > > Thanks Otis.
> > >
> > > If I understand correctly - Bixo, Nutch and Droids are technologies to use
> > > for crawling the web and building an index. My project is actually about
> > > indexing a large database, where you can think of every row as a web page,
> > > and a particular column is the equivalent of a web site. (I didn't mention
> > > that in the previous post because I didn't want to complicate my question,
> > > and it seems equivalent to Google CSE given that Lucene can use virtually
> > > any input for indexing, AFAIK)
> > > Therefore I'm not sure if the frameworks you've mentioned are applicable to
> > > my project as they seem to be related to web page indexing, but perhaps I'm
> > > missing something.
> > > Also, what did you mean about isolating users and their data/indices. Did
> > > you mean that I should create a separate index per user?
> > >
> > > Thanks again!
> > >
> > > On Fri, Jan 8, 2010 at 12:35 AM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> > >
> > > > For something like CSE, I think you want to isolate users and their
> > > > data/indices.
> > > >
> > > > I'd look at Bixo or Nutch or Droids ==> Lucene or Solr
> > > >
> > > > Otis
> > > > --
> > > > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> > > >
> > > >
> > > >
> > > > ----- Original Message ----
> > > > > From: Yaniv Ben Yosef
> > > > > To: java-user@lucene.apache.org
> > > > > Sent: Thu, January 7, 2010 3:54:22 PM
> > > > > Subject: Implementing filtering based on multiple fields
> > > > >
> > > > > Hi,
> > > > >
> > > > > I'm very new to Lucene. In fact, I'm at the beginning of an evaluation
> > > > > phase, trying to figure whether Lucene is the right fit for my needs.
> > > > > The project I'm involved in requires something similar to the Google
> > > > > Custom Search Engine (CSE). In CSE, each user can
> > > > > define a set (could be a large set) of websites, and limit the search to
> > > > > only those websites. So for example, I can create a CSE that searches all
> > > > > web pages on cnn.com, msnbc.com and nytimes.com only.
> > > > > I am trying to understand whether and how I can do something similar in
> > > > > Lucene.
> > > > >
> > > > > The FAQ hints about this possibility here,
> > > > > but it mentions a class that no longer exists in 3.0 (QueryFilter), and is
> > > > > very laconic about the suggested options. Also I'm not sure how well it
> > > > > will perform in my use case (or even if it fits at all).
> > > > > I thought about creating a separate index for each user or CSE. However,
> > > > > my system should be able to handle tens of thousands of concurrent users. I
> > > > > haven't done any analysis yet on how this will affect CPU, RAM, I/O and
> > > > > storage size, but was wondering if any of you experienced Lucene
> > > > > users/developers think it's a good direction.
> > > > > If that's not a good idea, what would be a good strategy here?
> > > > >
> > > > > Any help will be much appreciated,
> > > > > Yaniv
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> >
> >
> >
> >
>
