Message-ID: <85d3c3b60809081547u75f8ea80u11133ae28db8a8ad@mail.gmail.com>
Date: Mon, 8 Sep 2008 18:47:45 -0400
From: "Jason Rutherglen"
To: java-dev@lucene.apache.org
Subject: Re: Realtime Search for Social Networks Collaboration
References: <949792.30262.qm@web50305.mail.re2.yahoo.com>
 <85d3c3b60809081405q5d06e4c9j55fa064fa09a48dc@mail.gmail.com>

Cool. I mention H2 because it does have some Lucene code in it, yes.
Also, according to some benchmarks it's the fastest of the open source
databases. I think it's possible to integrate realtime search for H2.
I suppose there is no need to store the data in Lucene in this case?
One loses the multiple values per field Lucene offers, and the schema
becomes static. Perhaps it's a trade-off?

On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado wrote:
> Yes, both Marcelo and I would be interested.
>
> We looked into H2 and it looks like something similar to Oracle's ODCI
> can be implemented. Plus the primitive full-text implementation is
> based on Lucene.
> I say primitive because, looking at the code, I saw that one cannot
> define an Analyzer; for each scan corresponding to a where clause a
> searcher is opened and closed instead of being taken from a pool; and
> there is no way to queue changes to reduce the use of the IndexWriter,
> etc.
>
> But it's open source and that is a great starting point!
>
> -- Joaquin
>
> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen wrote:
>>
>> Perhaps an interesting project would be to integrate Ocean with H2
>> www.h2database.com to take advantage of both models. I'm not sure how
>> exactly that would work, but it seems like it would not be too
>> difficult. Perhaps this would make it possible to perform faster
>> hierarchical queries, and perhaps other types of queries that Lucene
>> is not capable of.
>>
>> Is this something you, Joaquin, are interested in collaborating on? I
>> am definitely interested in it.
>>
>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado wrote:
>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic wrote:
>> >>
>> >> Regarding real-time search and Solr, my feeling is the focus should
>> >> be on first adding real-time search to Lucene, and then we'll
>> >> figure out how to incorporate that into Solr later.
>> >
>> >
>> > Otis, what do you mean exactly by "adding real-time search to
>> > Lucene"? Note that Lucene, being an indexing/search library (and not
>> > a full-blown search engine), is by definition "real-time": once you
>> > add/write a document to the index it becomes immediately searchable,
>> > and if a document is logically deleted it is no longer returned in a
>> > search, though physical deletion happens during an index
>> > optimization.
>> >
>> > Now, the problem of adding/deleting documents in bulk, as part of a
>> > transaction, and making these documents available for search
>> > immediately after the transaction is committed sounds more like a
>> > search engine problem (i.e. SOLR, Nutch, Ocean), especially if these
>> > transactions are known to be I/O expensive and thus are usually
>> > implemented as batched processes with some kind of sync mechanism,
>> > which makes them non-real-time.
>> >
>> > For example, in my previous life, I designed and helped implement a
>> > quasi-realtime enterprise search engine using Lucene, having a set
>> > of multi-threaded indexers hitting a set of multiple indexes
>> > allocated across different search services which powered a
>> > broker-based distributed search interface. The most recent documents
>> > provided to the indexers were always added to the smaller in-memory
>> > (RAM) indexes, which usually could absorb the load of a bulk "add"
>> > transaction and later would be merged into larger disk-based indexes
>> > and then flushed to make them ready to absorb new fresh docs. We
>> > even had further partitioning of the indexes that reflected time
>> > periods, with caps on size, for them to be merged into older, more
>> > archive-based indexes which were used less (yes, the search engine's
>> > default search was on data no more than 1 month old, though the user
>> > could open the time window by including archives).
>> >
>> > As for SOLR and OCEAN, I would argue that these semi-structured
>> > search engines are becoming more and more like relational databases
>> > with full-text search capabilities (without the benefit of full
>> > relational algebra -- for example, joins are not possible using
>> > SOLR). Notice that "real-time" CRUD operations and transactionality
>> > are core DB concepts and have been studied and developed by database
>> > communities for quite a long time. There have been recent efforts on
>> > how to efficiently integrate Lucene into relational databases (see
>> > Lucene JVM ORACLE integration, see
>> >
>> > http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html)
>> >
>> > I think we should seriously look at joining efforts with open-source
>> > database engine projects written in Java (see
>> > http://java-source.net/open-source/database-engines) in order to
>> > blend IR and ORM once and for all.
>> >
>> > -- Joaquin
>> >
>> >
>> >>
>> >> I've read Jason's Wiki as well. Actually, I had to read it a number
>> >> of times to understand bits and pieces of it. I have to admit there
>> >> is still some fuzziness about the whole thing in my head - is
>> >> "Ocean" something that already works, a separate project on
>> >> googlecode.com? I think so. If so, and if you are working on
>> >> getting it integrated into Lucene, would it make it less confusing
>> >> to just refer to it as "real-time search", so there is no
>> >> confusion?
>> >>
>> >> If this is to be initially integrated into Lucene, why are things
>> >> like replication, crowding/field collapsing, locallucene, name
>> >> service, tag index, etc. all mentioned there on the Wiki and
>> >> bundled with the description of how real-time search works and is
>> >> to be implemented? I suppose mentioning replication kind-of makes
>> >> sense because the replication approach is closely tied to real-time
>> >> search - all query nodes need to see index changes fast. But Lucene
>> >> itself offers no replication mechanism, so maybe the replication is
>> >> something to figure out separately, say on the Solr level, later on
>> >> "once we get there". I think even just the essential real-time
>> >> search requires substantial changes to Lucene (I remember seeing
>> >> large patches in JIRA), which makes it hard to digest, understand,
>> >> comment on, and ultimately commit (hence the lukewarm response, I
>> >> think). Bringing other non-essential elements into discussion at
>> >> the same time makes it more difficult to process all this new
>> >> stuff, at least for me. Am I the only one who finds this hard?
>> >>
>> >> That said, it sounds like we have some discussion going (Karl...),
>> >> so I look forward to understanding more! :)
>> >>
>> >>
>> >> Otis
>> >> --
>> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >>
>> >>
>> >>
>> >> ----- Original Message ----
>> >> > From: Yonik Seeley
>> >> > To: java-dev@lucene.apache.org
>> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
>> >> > Subject: Re: Realtime Search for Social Networks Collaboration
>> >> >
>> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote:
>> >> > > I also think it's got a
>> >> > > lot of things now which make integration difficult to do
>> >> > > properly.
>> >> >
>> >> > I agree, and that's why the major bump in version number rather
>> >> > than minor - we recognize that some features will need some
>> >> > amount of rearchitecture.
>> >> >
>> >> > > I think the problem with integration with SOLR is it was
>> >> > > designed with a different problem set in mind than Ocean,
>> >> > > originally the CNET shopping application.
>> >> >
>> >> > That was the first use of Solr, but it actually existed before
>> >> > that w/o any defined use other than to be a "plan B" alternative
>> >> > to MySQL based search servers (that's actually where some of the
>> >> > parameter names come from... the default /select URL instead of
>> >> > /search, the "rows" parameter, etc).
>> >> >
>> >> > But you're right... some things like the replication strategy
>> >> > were designed (well, borrowed from Doug to be exact) with the
>> >> > idea that it would be OK to have slightly "stale" views of the
>> >> > data in the range of minutes. It just made things easier/possible
>> >> > at the time. But tons of Solr and Lucene users want almost
>> >> > instantaneous visibility of added documents, if they can get it.
>> >> > It's hardly restricted to social network applications.
>> >> >
>> >> > Bottom line is that Solr aims to be a general enterprise search
>> >> > platform, and getting as real-time as we can get, and as scalable
>> >> > as we can get are some of the top priorities going forward.
>> >> >
>> >> > -Yonik
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org