lucene-dev mailing list archives

From "J. Delgado" <joaquin.delg...@gmail.com>
Subject Re: Realtime Search for Social Networks Collaboration
Date Mon, 08 Sep 2008 22:17:30 GMT
Yes, both Marcelo and I would be interested.

We looked into H2, and it looks like something similar to Oracle's ODCI can
be implemented there. Plus, its primitive full-text implementation is based on
Lucene.
I say primitive because, looking at the code, I saw that one cannot define an
Analyzer; that a searcher is opened and closed for each scan corresponding to
a WHERE clause, instead of being taken from a pool; and that there is no way
to queue changes to reduce the use of the IndexWriter, etc.
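
To make the last two points concrete, here is a minimal, language-neutral
sketch in Python. It is not H2's or Lucene's API: BatchingChangeQueue,
SharedSearcher, apply_batch and open_searcher are hypothetical names used only
for illustration. The idea is that changes are buffered and handed to the
writer in batches, and that one searcher is reused across scans and reopened
only after the index has changed.

```python
from typing import Callable, List, Optional, Tuple

Change = Tuple[str, str]  # (operation, document id), e.g. ("add", "doc42")


class BatchingChangeQueue:
    """Buffers index changes and applies them to the writer in batches."""

    def __init__(self, apply_batch: Callable[[List[Change]], None],
                 max_pending: int = 100) -> None:
        # apply_batch is a hypothetical hook that would open the index
        # writer once, apply every change in the batch, and commit.
        self._apply_batch = apply_batch
        self._max_pending = max_pending
        self._pending: List[Change] = []

    def enqueue(self, op: str, doc_id: str) -> None:
        self._pending.append((op, doc_id))
        if len(self._pending) >= self._max_pending:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return  # nothing buffered, so do not touch the writer at all
        batch, self._pending = self._pending, []
        self._apply_batch(batch)


class SharedSearcher:
    """Reuses one searcher across scans; a simple stand-in for a pool."""

    def __init__(self, open_searcher: Callable[[], object]) -> None:
        self._open_searcher = open_searcher  # hypothetical factory
        self._searcher: Optional[object] = None
        self._stale = True

    def acquire(self) -> object:
        # Reopen only when a flushed batch has invalidated the old view.
        if self._stale:
            self._searcher = self._open_searcher()
            self._stale = False
        return self._searcher

    def mark_stale(self) -> None:
        self._stale = True
```

In such a design the apply_batch hook would also call mark_stale(), so that
the next scan sees the newly written documents while scans between flushes
share one open searcher.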

But it's open source, and that is a great starting point!

-- Joaquin

On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen <jason.rutherglen@gmail.com> wrote:

> Perhaps an interesting project would be to integrate Ocean with H2
> www.h2database.com to take advantage of both models.  I'm not sure how
> exactly that would work, but it seems like it would not be too
> difficult.  Perhaps this would solve being able to perform faster
> hierarchical queries and perhaps other types of queries that Lucene is
> not capable of.
>
> Joaquin, is this something you are interested in collaborating on?  I
> am definitely interested in it.
>
> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <joaquin.delgado@gmail.com> wrote:
> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
> > <otis_gospodnetic@yahoo.com> wrote:
> >>
> >> Regarding real-time search and Solr, my feeling is the focus should be on
> >> first adding real-time search to Lucene, and then we'll figure out how to
> >> incorporate that into Solr later.
> >
> >
> > Otis, what do you mean exactly by "adding real-time search to Lucene"? Note
> > that Lucene, being an indexing/search library (and not a full-blown search
> > engine), is by definition "real-time": once you add/write a document to the
> > index it becomes immediately searchable, and once a document is logically
> > deleted it is no longer returned in a search, though physical deletion
> > happens during an index optimization.
> >
> > Now, the problem of adding/deleting documents in bulk as part of a
> > transaction, and making these documents available for search immediately
> > after the transaction is committed, sounds more like a search engine problem
> > (i.e. SOLR, Nutch, Ocean), especially if these transactions are known to be
> > I/O expensive and thus are usually implemented as batched processes with
> > some kind of sync mechanism, which makes them non-real-time.
> >
> > For example, in my previous life, I designed and helped implement a
> > quasi-realtime enterprise search engine using Lucene, having a set of
> > multi-threaded indexers hitting a set of multiple indexes allocated across
> > different search services, which powered a broker-based distributed search
> > interface. The most recent documents provided to the indexers were always
> > added to the smaller in-memory (RAM) indexes, which usually could absorb the
> > load of a bulk "add" transaction; these would later be merged into larger
> > disk-based indexes and then flushed to make them ready to absorb fresh docs.
> > We even had further partitioning of the indexes that reflected time periods,
> > with caps on size, for them to be merged into older, more archival indexes
> > which were used less (yes, the search engine's default search was on data
> > no more than 1 month old, though users could open the time window by
> > including archives).
> >
> > As for SOLR and OCEAN, I would argue that these semi-structured search
> > engines are becoming more and more like relational databases with full-text
> > search capabilities (without the benefit of full relational algebra -- for
> > example, joins are not possible using SOLR). Notice that "real-time" CRUD
> > operations and transactionality are core DB concepts and have been studied
> > and developed by database communities for quite a long time. There have been
> > recent efforts on how to efficiently integrate Lucene into relational
> > databases (see the Lucene JVM ORACLE integration:
> > http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
> > ).
> >
> > I think we should seriously look at joining efforts with open-source
> > database engine projects written in Java (see
> > http://java-source.net/open-source/database-engines) in order to blend IR
> > and ORM once and for all.
> >
> > -- Joaquin
> >
> >
> >>
> >> I've read Jason's Wiki as well.  Actually, I had to read it a number of
> >> times to understand bits and pieces of it.  I have to admit there is still
> >> some fuzziness about the whole thing in my head - is "Ocean" something that
> >> already works, a separate project on googlecode.com?  I think so.  If so,
> >> and if you are working on getting it integrated into Lucene, would it make
> >> it less confusing to just refer to it as "real-time search", so there is no
> >> confusion?
> >>
> >> If this is to be initially integrated into Lucene, why are things like
> >> replication, crowding/field collapsing, locallucene, name service, tag
> >> index, etc. all mentioned there on the Wiki and bundled with the
> >> description of how real-time search works and is to be implemented?  I
> >> suppose mentioning replication kind-of makes sense because the replication
> >> approach is closely tied to real-time search - all query nodes need to see
> >> index changes fast.  But Lucene itself offers no replication mechanism, so
> >> maybe the replication is something to figure out separately, say on the
> >> Solr level, later on "once we get there".  I think even just the essential
> >> real-time search requires substantial changes to Lucene (I remember seeing
> >> large patches in JIRA), which makes it hard to digest, understand, comment
> >> on, and ultimately commit (hence the lukewarm response, I think).  Bringing
> >> other non-essential elements into discussion at the same time makes it more
> >> difficult to process all this new stuff, at least for me.  Am I the only
> >> one who finds this hard?
> >>
> >> That said, it sounds like we have some discussion going (Karl...), so I
> >> look forward to understanding more! :)
> >>
> >>
> >> Otis
> >> --
> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>
> >>
> >>
> >> ----- Original Message ----
> >> > From: Yonik Seeley <yonik@apache.org>
> >> > To: java-dev@lucene.apache.org
> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
> >> > Subject: Re: Realtime Search for Social Networks Collaboration
> >> >
> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote:
> >> > > I also think it's got a
> >> > > lot of things now which makes integration difficult to do properly.
> >> >
> >> > I agree, and that's why the major bump in version number rather than
> >> > minor - we recognize that some features will need some amount of
> >> > rearchitecture.
> >> >
> >> > > I think the problem with integration with SOLR is it was designed with
> >> > > a different problem set in mind than Ocean, originally the CNET
> >> > > shopping application.
> >> >
> >> > That was the first use of Solr, but it actually existed before that
> >> > w/o any defined use other than to be a "plan B" alternative to MySQL
> >> > based search servers (that's actually where some of the parameter
> >> > names come from... the default /select URL instead of /search, the
> >> > "rows" parameter, etc).
> >> >
> >> > But you're right... some things like the replication strategy were
> >> > designed (well, borrowed from Doug to be exact) with the idea that it
> >> > would be OK to have slightly "stale" views of the data in the range of
> >> > minutes.  It just made things easier/possible at the time.  But tons
> >> > of Solr and Lucene users want almost instantaneous visibility of added
> >> > documents, if they can get it.  It's hardly restricted to social
> >> > network applications.
> >> >
> >> > Bottom line is that Solr aims to be a general enterprise search
> >> > platform, and getting as real-time as we can get, and as scalable as
> >> > we can get are some of the top priorities going forward.
> >> >
> >> > -Yonik
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
