lucene-dev mailing list archives

From "J. Delgado" <joaquin.delg...@gmail.com>
Subject Re: Realtime Search for Social Networks Collaboration
Date Mon, 22 Sep 2008 03:38:12 GMT
On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള്‍ नोब्ळ् <
noble.paul@gmail.com> wrote:

> Moving back to the RDBMS model will be a big step backwards where we miss
> multivalued fields and arbitrary fields.


No one is suggesting that we "lose" any of the virtues of the field-based
indexing that Lucene provides. Quite the contrary: by extending the RDBMS
model with Lucene-based indexes, one can map relational rows to documents and
columns to fields. Note that one relational column can be mapped to one or
more text-based fields, and multi-valued fields will still be allowed.
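
To make the row-to-document mapping concrete, here is a minimal sketch in
plain Java. It is not the Lucene OJVM code and it does not use Lucene classes:
RowToDocumentSketch, Doc and toDocument are hypothetical names, and treating a
comma-separated column as a multi-valued field is only an assumption made for
illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative only: plain-Java stand-ins for a document with multi-valued
 * fields. These are NOT Lucene classes and not the Lucene OJVM code.
 */
public class RowToDocumentSketch {

    /** A document maps each field name to one or more values. */
    static final class Doc {
        final Map<String, List<String>> fields = new LinkedHashMap<>();

        void add(String field, String value) {
            fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
        }

        List<String> values(String field) {
            return fields.getOrDefault(field, List.of());
        }
    }

    /**
     * Maps one relational row to one document. Each column becomes a field
     * of the same name. A column listed in multiValued is split on commas
     * (an assumed convention) so that it yields several values for one field.
     */
    static Doc toDocument(Map<String, String> row, Set<String> multiValued) {
        Doc doc = new Doc();
        for (Map.Entry<String, String> col : row.entrySet()) {
            if (col.getValue() == null) {
                continue; // SQL NULL: the document simply has no such field
            }
            if (multiValued.contains(col.getKey())) {
                for (String part : col.getValue().split(",")) {
                    String value = part.trim();
                    if (!value.isEmpty()) {
                        doc.add(col.getKey(), value);
                    }
                }
            } else {
                doc.add(col.getKey(), col.getValue());
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "42");
        row.put("title", "Realtime search");
        row.put("tags", "lucene, h2, oracle");
        Doc doc = toDocument(row, Set.of("tags"));
        System.out.println(doc.fields);
    }
}
```

In a real integration the Doc stand-in would be replaced by Lucene's own
document and field types, and the choice of which columns are multi-valued
would come from the index definition rather than from a hard-coded set.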

Please check the Lucene OJVM implementation for details on the implementation
and the philosophy of the converged RDBMS-Lucene model:

http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg

There is more discussion on the blog of Marcelo, who will be presenting at
Oracle World 2008 this week:
http://marceloochoa.blogspot.com/

BTW, it just happens that this was implemented using Oracle, but a similar
implementation in H2 seems not only feasible but desirable.

-- Joaquin



>
> On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
> <jason.rutherglen@gmail.com> wrote:
> > Cool.  I mention H2 because it does have some Lucene code in it, yes.
> > Also, according to some benchmarks, it's the fastest of the open source
> > databases.  I think it's possible to integrate realtime search for H2.
> > I suppose there is no need to store the data in Lucene in this case?
> > One loses the multiple values per field Lucene offers, and the schema
> > becomes static.  Perhaps it's a trade-off?
> >
> > On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <joaquin.delgado@gmail.com> wrote:
> >> Yes, both Marcelo and I would be interested.
> >>
> >> We looked into H2 and it looks like something similar to Oracle's ODCI
> >> can be implemented. Plus, the primitive full-text implementation is based
> >> on Lucene.
> >> I say primitive because, looking at the code, I saw that one cannot define
> >> an Analyzer; for each scan corresponding to a WHERE clause a searcher is
> >> opened and closed instead of being taken from a pool; and there is no way
> >> to queue changes to reduce the use of the IndexWriter, etc.
> >>
> >> But it's open source, and that is a great starting point!
> >>
> >> -- Joaquin
> >>
> >> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
> >> <jason.rutherglen@gmail.com> wrote:
> >>>
> >>> Perhaps an interesting project would be to integrate Ocean with H2
> >>> www.h2database.com to take advantage of both models.  I'm not sure how
> >>> exactly that would work, but it seems like it would not be too
> >>> difficult.  Perhaps this would solve being able to perform faster
> >>> hierarchical queries and perhaps other types of queries that Lucene is
> >>> not capable of.
> >>>
> >>> Is this something Joaquin you are interested in collaborating on?  I
> >>> am definitely interested in it.
> >>>
> >>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <joaquin.delgado@gmail.com>
> >>> wrote:
> >>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
> >>> > <otis_gospodnetic@yahoo.com> wrote:
> >>> >>
> >>> >> Regarding real-time search and Solr, my feeling is the focus should be
> >>> >> on first adding real-time search to Lucene, and then we'll figure out
> >>> >> how to incorporate that into Solr later.
> >>> >
> >>> >
> >>> > Otis, what do you mean exactly by "adding real-time search to Lucene"?
> >>> > Note that Lucene, being an indexing/search library (and not a full-blown
> >>> > search engine), is by definition "real-time": once you add/write a
> >>> > document to the index it becomes immediately searchable, and once a
> >>> > document is logically deleted it is no longer returned in a search,
> >>> > though physical deletion happens during an index optimization.
> >>> >
> >>> > Now, the problem of adding/deleting documents in bulk, as part of a
> >>> > transaction, and making these documents available for search immediately
> >>> > after the transaction is committed sounds more like a search engine
> >>> > problem (i.e. SOLR, Nutch, Ocean), especially if these transactions are
> >>> > known to be I/O expensive and thus are usually implemented as batched
> >>> > processes with some kind of sync mechanism, which makes them
> >>> > non-real-time.
> >>> >
> >>> > For example, in my previous life, I designed and helped implement a
> >>> > quasi-realtime enterprise search engine using Lucene, with a set of
> >>> > multi-threaded indexers hitting a set of multiple indexes allocated
> >>> > across different search services, which powered a broker-based
> >>> > distributed search interface. The most recent documents provided to the
> >>> > indexers were always added to the smaller in-memory (RAM) indexes, which
> >>> > usually could absorb the load of a bulk "add" transaction; these were
> >>> > later merged into larger disk-based indexes and then flushed to make
> >>> > them ready to absorb fresh docs.
> >>> > We even had further partitioning of the indexes that reflected time
> >>> > periods, with caps on size, so that they could be merged into older, more
> >>> > archive-like indexes which were used less (yes, the search engine's
> >>> > default search was on data no more than 1 month old, though the user
> >>> > could open the time window by including archives).
> >>> >
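
The RAM-then-disk tiering described in the quoted paragraph above can be
sketched roughly as follows. This is only an illustration under stated
assumptions, not the engine that was actually built: plain lists stand in for
the in-memory and disk-based Lucene indexes, substring matching stands in for
a real query, and TieredIndexSketch, ramCap and the method names are all
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: lists of strings stand in for RAM and disk indexes. */
public class TieredIndexSketch {
    private final int ramCap;                             // max docs held in memory
    private final List<String> ram = new ArrayList<>();   // small, fresh tier
    private final List<String> disk = new ArrayList<>();  // large, merged tier

    TieredIndexSketch(int ramCap) {
        this.ramCap = ramCap;
    }

    /** New documents always go to the in-memory tier first. */
    synchronized void add(String doc) {
        ram.add(doc);
        if (ram.size() >= ramCap) {
            mergeToDisk();
        }
    }

    /** Moves the in-memory tier into the disk tier, then empties it. */
    private void mergeToDisk() {
        disk.addAll(ram);
        ram.clear();
    }

    /** A search consults both tiers, so fresh documents are visible at once. */
    synchronized List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (String d : ram) {
            if (d.contains(term)) hits.add(d);
        }
        for (String d : disk) {
            if (d.contains(term)) hits.add(d);
        }
        return hits;
    }

    synchronized int ramSize() { return ram.size(); }

    synchronized int diskSize() { return disk.size(); }
}
```

The point of the design is that the expensive merge is paid only when the
small tier fills up, while searches never have to wait for it to see new
documents. The time-based archive partitions would add further tiers behind
the disk tier in the same way.
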
> >>> > As for SOLR and OCEAN, I would argue that these semi-structured search
> >>> > engines are becoming more and more like relational databases with
> >>> > full-text search capabilities (without the benefit of full relational
> >>> > algebra -- for example, joins are not possible using SOLR). Notice that
> >>> > "real-time" CRUD operations and transactionality are core DB concepts
> >>> > and have been studied and developed by database communities for quite a
> >>> > long time. There have been recent efforts on how to efficiently
> >>> > integrate Lucene into relational databases (see the Lucene JVM ORACLE
> >>> > integration:
> >>> > http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
> >>> > )
> >>> >
> >>> > I think we should seriously look at joining efforts with open-source
> >>> > database engine projects written in Java (see
> >>> > http://java-source.net/open-source/database-engines) in order to blend
> >>> > IR and ORM once and for all.
> >>> >
> >>> > -- Joaquin
> >>> >
> >>> >
> >>> >>
> >>> >> I've read Jason's Wiki as well.  Actually, I had to read it a number of
> >>> >> times to understand bits and pieces of it.  I have to admit there is
> >>> >> still some fuzziness about the whole thing in my head - is "Ocean"
> >>> >> something that already works, a separate project on googlecode.com?  I
> >>> >> think so.  If so, and if you are working on getting it integrated into
> >>> >> Lucene, would it make it less confusing to just refer to it as
> >>> >> "real-time search", so there is no confusion?
> >>> >>
> >>> >> If this is to be initially integrated into Lucene, why are things like
> >>> >> replication, crowding/field collapsing, locallucene, name service, tag
> >>> >> index, etc. all mentioned there on the Wiki and bundled with the
> >>> >> description of how real-time search works and is to be implemented?  I
> >>> >> suppose mentioning replication kind of makes sense because the
> >>> >> replication approach is closely tied to real-time search - all query
> >>> >> nodes need to see index changes fast.  But Lucene itself offers no
> >>> >> replication mechanism, so maybe replication is something to figure out
> >>> >> separately, say on the Solr level, later on "once we get there".  I
> >>> >> think even just the essential real-time search requires substantial
> >>> >> changes to Lucene (I remember seeing large patches in JIRA), which makes
> >>> >> it hard to digest, understand, comment on, and ultimately commit (hence
> >>> >> the lukewarm response, I think).  Bringing other non-essential elements
> >>> >> into the discussion at the same time makes it more difficult to process
> >>> >> all this new stuff, at least for me.  Am I the only one who finds this
> >>> >> hard?
> >>> >>
> >>> >> That said, it sounds like we have some discussion going (Karl...), so I
> >>> >> look forward to understanding more! :)
> >>> >>
> >>> >>
> >>> >> Otis
> >>> >> --
> >>> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>> >>
> >>> >>
> >>> >>
> >>> >> ----- Original Message ----
> >>> >> > From: Yonik Seeley <yonik@apache.org>
> >>> >> > To: java-dev@lucene.apache.org
> >>> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
> >>> >> > Subject: Re: Realtime Search for Social Networks Collaboration
> >>> >> >
> >>> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
> >>> >> > wrote:
> >>> >> > > I also think it's got a
> >>> >> > > lot of things now which makes integration difficult to do properly.
> >>> >> >
> >>> >> > I agree, and that's why the major bump in version number rather than
> >>> >> > minor - we recognize that some features will need some amount of
> >>> >> > rearchitecture.
> >>> >> >
> >>> >> > > I think the problem with integration with SOLR is it was designed
> >>> >> > > with a different problem set in mind than Ocean, originally the CNET
> >>> >> > > shopping application.
> >>> >> >
> >>> >> > That was the first use of Solr, but it actually existed before that
> >>> >> > w/o any defined use other than to be a "plan B" alternative to MySQL
> >>> >> > based search servers (that's actually where some of the parameter
> >>> >> > names come from... the default /select URL instead of /search, the
> >>> >> > "rows" parameter, etc).
> >>> >> >
> >>> >> > But you're right... some things like the replication strategy were
> >>> >> > designed (well, borrowed from Doug to be exact) with the idea that it
> >>> >> > would be OK to have slightly "stale" views of the data in the range
> >>> >> > of minutes.  It just made things easier/possible at the time.  But tons
> >>> >> > of Solr and Lucene users want almost instantaneous visibility of
> >>> >> > added documents, if they can get it.  It's hardly restricted to social
> >>> >> > network applications.
> >>> >> >
> >>> >> > Bottom line is that Solr aims to be a general enterprise search
> >>> >> > platform, and getting as real-time as we can get, and as scalable as
> >>> >> > we can get are some of the top priorities going forward.
> >>> >> >
> >>> >> > -Yonik
> >>> >> >
> >>> >> >
> >>> >> > ---------------------------------------------------------------------
> >>> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >
> >>> >
> >>>
> >>>
> >>
> >>
> >
> >
> >
>
>
>
> --
> --Noble Paul
>
>
>