Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <85d3c3b60809081547u75f8ea80u11133ae28db8a8ad@mail.gmail.com>
Date: Mon, 8 Sep 2008 18:47:45 -0400
From: "Jason Rutherglen" <jason.rutherglen@gmail.com>
To: java-dev@lucene.apache.org
Subject: Re: Realtime Search for Social Networks Collaboration
In-Reply-To: <e6537db80809081517l151f4fbbr88b8baed06a522ab@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
References: <949792.30262.qm@web50305.mail.re2.yahoo.com>
	 <e6537db80809070104o346d290ax78aced2d067558bb@mail.gmail.com>
	 <85d3c3b60809081405q5d06e4c9j55fa064fa09a48dc@mail.gmail.com>
	 <e6537db80809081517l151f4fbbr88b8baed06a522ab@mail.gmail.com>

Cool.  I mention H2 because, yes, it does have some Lucene code in it.
Also, according to some benchmarks, it's the fastest of the open source
databases.  I think it's possible to integrate realtime search into H2.
I suppose there is no need to store the data in Lucene in this case?
One loses the multiple values per field that Lucene offers, and the
schema becomes static.  Perhaps it's a trade-off?

On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <joaquin.delgado@gmail.com> wrote:
> Yes, both Marcelo and I would be interested.
>
> We looked into H2 and it looks like something similar to Oracle's ODCI can
> be implemented. Plus the primitive full-text implementation is based on
> Lucene.
> I say primitive because, looking at the code, I saw that one cannot define
> an Analyzer, and for each scan corresponding to a where clause a searcher
> is opened and closed instead of being taken from a pool. It also does not
> have any way to queue changes to reduce the use of the IndexWriter, etc.
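
To make the "queue changes" point concrete, here is a minimal toy sketch in
Python. It is not H2 or Lucene code: the class name, the batch_size
threshold, and the dict standing in for the index are all assumptions made
for illustration. The idea is only that changes are buffered and applied in
one writer session per batch instead of one session per change.

```python
class BatchingIndexer:
    """Toy sketch: buffer changes and apply them in one writer session per batch."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size   # hypothetical threshold, not an H2 or Lucene setting
        self.pending = []              # queued (doc_id, text) changes
        self.index = {}                # doc_id -> text, stands in for the real index
        self.writer_sessions = 0       # how many times the "writer" was opened

    def enqueue(self, doc_id, text):
        # Queue the change; only touch the writer once enough changes have piled up.
        self.pending.append((doc_id, text))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Apply all queued changes in a single writer session.
        if not self.pending:
            return
        self.writer_sessions += 1
        for doc_id, text in self.pending:
            self.index[doc_id] = text
        self.pending = []
```

The cost of this design is that a queued change is not visible until the
next flush, so the batch size trades freshness against writer overhead.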
>
> But it's open source and that is a great starting point!
>
> -- Joaquin
>
> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
> <jason.rutherglen@gmail.com> wrote:
>>
>> Perhaps an interesting project would be to integrate Ocean with H2
>> www.h2database.com to take advantage of both models.  I'm not sure how
>> exactly that would work, but it seems like it would not be too
>> difficult.  Perhaps this would solve being able to perform faster
>> hierarchical queries and perhaps other types of queries that Lucene is
>> not capable of.
>>
>> Is this something Joaquin you are interested in collaborating on?  I
>> am definitely interested in it.
>>
>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <joaquin.delgado@gmail.com>
>> wrote:
>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
>> > <otis_gospodnetic@yahoo.com> wrote:
>> >>
>> >> Regarding real-time search and Solr, my feeling is the focus should be
>> >> on first adding real-time search to Lucene, and then we'll figure out
>> >> how to incorporate that into Solr later.
>> >
>> >
>> > Otis, what do you mean exactly by "adding real-time search to Lucene"?
>> > Note that Lucene, being an indexing/search library (and not a full-blown
>> > search engine), is by definition "real-time": once you add/write a
>> > document to the index it becomes immediately searchable, and once a
>> > document is logically deleted it is no longer returned in a search,
>> > though physical deletion happens during an index optimization.
>> >
>> > Now, the problem of adding/deleting documents in bulk, as part of a
>> > transaction, and making these documents available for search
>> > immediately after the transaction is committed sounds more like a search
>> > engine problem (i.e. SOLR, Nutch, Ocean), especially if these
>> > transactions are known to be I/O expensive and thus are usually
>> > implemented as batched processes with some kind of sync mechanism, which
>> > makes them non real-time.
>> >
>> > For example, in my previous life, I designed and helped implement a
>> > quasi-realtime enterprise search engine using Lucene, having a set of
>> > multi-threaded indexers hitting a set of multiple indexes allocated
>> > across different search services, which powered a broker-based
>> > distributed search interface. The most recent documents provided to the
>> > indexers were always added to the smaller in-memory (RAM) indexes, which
>> > usually could absorb the load of a bulk "add" transaction and later
>> > would be merged into larger disk-based indexes and then flushed to make
>> > them ready to absorb new fresh docs. We even had further partitioning of
>> > the indexes that reflected time periods, with caps on size for them to
>> > be merged into older, more archive-based indexes which were used less
>> > (yes, the search engine's default search was on data no more than 1
>> > month old, though the user could open the time window by including
>> > archives).
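
The RAM-tier/disk-tier arrangement described above can be sketched roughly
as follows. This is a toy model in Python, not the engine Joaquin describes
and not Lucene code: the TieredIndex name, the ram_cap parameter, and the use
of plain dicts for both tiers are assumptions made for illustration.

```python
from collections import defaultdict


class TieredIndex:
    """Toy two-tier inverted index: a small RAM tier in front of a larger 'disk' tier."""

    def __init__(self, ram_cap=3):
        self.ram_cap = ram_cap        # hypothetical cap on docs held in the RAM tier
        self.ram = defaultdict(set)   # term -> doc ids for recent documents
        self.disk = defaultdict(set)  # term -> doc ids, stands in for the on-disk index
        self.ram_docs = 0
        self.merges = 0

    def add(self, doc_id, text):
        # New documents always land in the small RAM tier first.
        for term in text.lower().split():
            self.ram[term].add(doc_id)
        self.ram_docs += 1
        if self.ram_docs >= self.ram_cap:
            self.merge()

    def merge(self):
        # Fold the RAM tier into the larger tier, then empty it for fresh docs.
        for term, ids in self.ram.items():
            self.disk[term] |= ids
        self.ram.clear()
        self.ram_docs = 0
        self.merges += 1

    def search(self, term):
        # A query consults both tiers, so a document is visible before any merge.
        term = term.lower()
        return sorted(self.ram.get(term, set()) | self.disk.get(term, set()))
```

Because search() reads both tiers, a freshly added document is findable
right away, while the expensive merge into the large tier happens only when
the RAM tier reaches its cap.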
>> >
>> > As for SOLR and OCEAN, I would argue that these semi-structured search
>> > engines are becoming more and more like relational databases with
>> > full-text search capabilities (without the benefit of full relational
>> > algebra -- for example, joins are not possible using SOLR). Notice that
>> > "real-time" CRUD operations and transactionality are core DB concepts
>> > and have been studied and developed by database communities for quite a
>> > long time. There have been recent efforts on how to efficiently
>> > integrate Lucene into relational databases (see Lucene JVM ORACLE
>> > integration, see
>> >
>> > http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html)
>> >
>> > I think we should seriously look at joining efforts with open-source
>> > database engine projects written in Java (see
>> > http://java-source.net/open-source/database-engines) in order to blend
>> > IR and ORM once and for all.
>> >
>> > -- Joaquin
>> >
>> >
>> >>
>> >> I've read Jason's Wiki as well.  Actually, I had to read it a number of
>> >> times to understand bits and pieces of it.  I have to admit there is
>> >> still some fuzziness about the whole thing in my head - is "Ocean"
>> >> something that already works, a separate project on googlecode.com?  I
>> >> think so.  If so, and if you are working on getting it integrated into
>> >> Lucene, would it make it less confusing to just refer to it as
>> >> "real-time search", so there is no confusion?
>> >>
>> >> If this is to be initially integrated into Lucene, why are things like
>> >> replication, crowding/field collapsing, locallucene, name service, tag
>> >> index, etc. all mentioned there on the Wiki and bundled with the
>> >> description of how real-time search works and is to be implemented?  I
>> >> suppose mentioning replication kind-of makes sense because the
>> >> replication approach is closely tied to real-time search - all query
>> >> nodes need to see index changes fast.  But Lucene itself offers no
>> >> replication mechanism, so maybe the replication is something to figure
>> >> out separately, say on the Solr level, later on "once we get there".  I
>> >> think even just the essential real-time search requires substantial
>> >> changes to Lucene (I remember seeing large patches in JIRA), which
>> >> makes it hard to digest, understand, comment on, and ultimately commit
>> >> (hence the lukewarm response, I think).  Bringing other non-essential
>> >> elements into discussion at the same time makes it more difficult to
>> >> process all this new stuff, at least for me.  Am I the only one who
>> >> finds this hard?
>> >>
>> >> That said, it sounds like we have some discussion going (Karl...), so I
>> >> look forward to understanding more! :)
>> >>
>> >>
>> >> Otis
>> >> --
>> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >>
>> >>
>> >>
>> >> ----- Original Message ----
>> >> > From: Yonik Seeley <yonik@apache.org>
>> >> > To: java-dev@lucene.apache.org
>> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
>> >> > Subject: Re: Realtime Search for Social Networks Collaboration
>> >> >
>> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
>> >> > wrote:
>> >> > > I also think it's got a
>> >> > > lot of things now which makes integration difficult to do properly.
>> >> >
>> >> > I agree, and that's why the major bump in version number rather than
>> >> > minor - we recognize that some features will need some amount of
>> >> > rearchitecture.
>> >> >
>> >> > > I think the problem with integration with SOLR is it was designed
>> >> > > with a different problem set in mind than Ocean, originally the
>> >> > > CNET shopping application.
>> >> >
>> >> > That was the first use of Solr, but it actually existed before that
>> >> > w/o any defined use other than to be a "plan B" alternative to MySQL
>> >> > based search servers (that's actually where some of the parameter
>> >> > names come from... the default /select URL instead of /search, the
>> >> > "rows" parameter, etc).
>> >> >
>> >> > But you're right... some things like the replication strategy were
>> >> > designed (well, borrowed from Doug to be exact) with the idea that it
>> >> > would be OK to have slightly "stale" views of the data in the range
>> >> > of minutes.  It just made things easier/possible at the time.  But
>> >> > tons of Solr and Lucene users want almost instantaneous visibility of
>> >> > added documents, if they can get it.  It's hardly restricted to
>> >> > social network applications.
>> >> >
>> >> > Bottom line is that Solr aims to be a general enterprise search
>> >> > platform, and getting as real-time as we can get, and as scalable as
>> >> > we can get are some of the top priorities going forward.
>> >> >
>> >> > -Yonik
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>
>
