lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Realtime Search for Social Networks Collaboration
Date Mon, 08 Sep 2008 04:39:21 GMT
Hi,


----- Original Message ----
From: J. Delgado <joaquin.delgado@gmail.com>
To: java-dev@lucene.apache.org
Sent: Sunday, September 7, 2008 4:04:58 AM
Subject: Re: Realtime Search for Social Networks Collaboration


On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:

Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time
search to Lucene, and then we'll figure out how to incorporate that into Solr later.
 
Otis, what do you mean exactly by "adding real-time search to Lucene"?  Note that Lucene,
being an indexing/search library (and not a full-blown search engine), is by definition "real-time":
once you add/write a document to the index it becomes immediately searchable, and a deleted document
is logically deleted and no longer returned in searches, though physical deletion only happens
during an index optimization.
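
In code terms, what I mean is roughly the following (just a sketch against the Lucene 2.x-era API; the field names and the RAMDirectory are only illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class AddDeleteSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        // Add a document; it is part of the index once the writer flushes it.
        Document doc = new Document();
        doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("body", "realtime search", Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);

        // Deletion is logical: the document is only marked deleted and filtered
        // out of results; its bytes are reclaimed when segments are merged.
        writer.deleteDocuments(new Term("id", "42"));
        writer.optimize();   // merging physically expunges the deleted document
        writer.close();
    }
}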

OG: When I think about real-time search I see it as: "Make the newly added document show up
in search results without closing and reopening the whole index with IndexWriter.  In other
words, minimize re-reading of the old/unchanged data just to be able to see the newly added
data."

I believe this is similar to what IndexReader.reopen does.... and Jason does make use of it.
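
Roughly something like this sketch (reopen has been there since Lucene 2.3; unchanged segments are shared with the old reader, so only the newly written segments are read):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

public final class ReaderRefresh {
    // Returns a reader that also sees segments committed after 'current' was
    // opened. Old segments are shared, so only the new data is actually read.
    public static IndexReader refresh(IndexReader current) throws IOException {
        IndexReader reopened = current.reopen();
        if (reopened != current) {
            current.close();   // safe once no searches are still running on it
        }
        return reopened;
    }
    // Usage: reader = ReaderRefresh.refresh(reader);
    //        then wrap it in a new IndexSearcher for subsequent queries.
}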

Otis


Now, the problem of adding/deleting documents in bulk, as part of a transaction, and making
these documents available for search immediately after the transaction is committed sounds
more like a search engine problem (i.e. SOLR, Nutch, Ocean), especially if these transactions
are known to be I/O expensive and thus are usually implemented as batched processes with some
kind of sync mechanism, which makes them non-real-time.

For example, in my previous life I designed and helped implement a quasi-realtime enterprise
search engine using Lucene, with a set of multi-threaded indexers hitting multiple indexes
allocated across different search services, which powered a broker-based distributed
search interface. The most recent documents handed to the indexers were always added to
the smaller in-memory (RAM) indexes, which usually could absorb the load of a bulk "add" transaction;
these were later merged into the larger disk-based indexes and then flushed so they were ready
to absorb fresh new docs. We even had further partitioning of the indexes into time periods,
with caps on size, so they could be merged into older, more archive-oriented indexes
which were used less (yes, the engine's default search covered data no more than one month
old, though users could widen the time window to include the archives).
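
Boiled down to the Lucene calls of that era, the arrangement looked roughly like this (the path is only illustrative, the on-disk index is assumed to already exist, and the merge was of course triggered periodically in the background rather than inline):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamThenDiskSketch {
    public static void main(String[] args) throws Exception {
        // Fresh documents land in a small in-memory index that can absorb
        // the load of a bulk "add" transaction.
        Directory ram = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ram, new StandardAnalyzer(), true);
        // ... ramWriter.addDocument(doc) for each incoming document ...
        ramWriter.close();

        // Periodically the RAM segment is folded into the larger on-disk index;
        // the RAM index is then discarded so it can absorb the next batch.
        Directory disk = FSDirectory.getDirectory("/path/to/disk/index"); // illustrative path
        IndexWriter diskWriter = new IndexWriter(disk, new StandardAnalyzer(), false);
        diskWriter.addIndexesNoOptimize(new Directory[] { ram });
        diskWriter.close();
    }
}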

As for SOLR and OCEAN, I would argue that these semi-structured search engines are becoming
more and more like relational databases with full-text search capabilities (without the benefit
of full relational algebra -- for example, joins are not possible using SOLR). Notice that
"real-time" CRUD operations and transactionality are core DB concepts and have been studied
and developed by database communities for quite a long time. There have been recent efforts
to efficiently integrate Lucene into relational databases (see the Lucene JVM Oracle integration:
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html)

I think we should seriously look at joining efforts with open-source database engine projects
written in Java (see http://java-source.net/open-source/database-engines) in order to blend
IR and ORM once and for all.

-- Joaquin 
 
 


I've read Jason's Wiki as well.  Actually, I had to read it a number of times to understand
bits and pieces of it.  I have to admit there is still some fuzziness about the whole thing
in my head - is "Ocean" something that already works, a separate project on googlecode.com?
 I think so.  If so, and if you are working on getting it integrated into Lucene, would it
be less confusing to just refer to it as "real-time search"?

If this is to be initially integrated into Lucene, why are things like replication, crowding/field
collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and
bundled with the description of how real-time search works and is to be implemented?  I suppose
mentioning replication kind of makes sense, because the replication approach is closely tied
to real-time search - all query nodes need to see index changes fast.  But Lucene itself offers
no replication mechanism, so maybe replication is something to figure out separately,
say on the Solr level, later on "once we get there".  I think even just the essential real-time
search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which
makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm
response, I think).  Bringing other non-essential elements into the discussion at the same time
makes it more difficult to process all this new stuff, at least for me.  Am I the only one who finds this hard?

That said, it sounds like we have some discussion going (Karl...), so I look forward to understanding
more! :)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




----- Original Message ----
> From: Yonik Seeley <yonik@apache.org>
> To: java-dev@lucene.apache.org
> Sent: Thursday, September 4, 2008 10:13:32 AM
> Subject: Re: Realtime Search for Social Networks Collaboration
>
> On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote:
> > I also think it's got a lot of things now which make integration difficult to do properly.
>
> I agree, and that's why the major bump in version number rather than
> minor - we recognize that some features will need some amount of
> rearchitecture.
>
> > I think the problem with integration with SOLR is it was designed with
> > a different problem set in mind than Ocean, originally the CNET
> > shopping application.
>
> That was the first use of Solr, but it actually existed before that
> w/o any defined use other than to be a "plan B" alternative to MySQL
> based search servers (that's actually where some of the parameter
> names come from... the default /select URL instead of /search, the
> "rows" parameter, etc).
>
> But you're right... some things like the replication strategy were
> designed (well, borrowed from Doug to be exact) with the idea that it
> would be OK to have slightly "stale" views of the data in the range of
> minutes.  It just made things easier/possible at the time.  But tons
> of Solr and Lucene users want almost instantaneous visibility of added
> documents, if they can get it.  It's hardly restricted to social
> network applications.
>
> Bottom line is that Solr aims to be a general enterprise search
> platform, and getting as real-time as we can get, and as scalable as
> we can get are some of the top priorities going forward.
>
> -Yonik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org