lucene-dev mailing list archives

From Ian Boston <...@tfd.co.uk>
Subject Re: Ocean Documentation
Date Tue, 15 Jul 2008 09:55:01 GMT
If you are looking for another example along the same principles
(but considerably less sophisticated :) ), see
https://source.sakaiproject.org/svn//search/trunk/search-impl/

This manages a queue of change events on items to be indexed. That
queue is processed by a cluster of search indexer machines, which
create a stream of transaction logs containing segments. The search
server nodes take that stream of transaction logs and merge them into
their local indexes, optimizing periodically.
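
To make that flow concrete, here is a minimal sketch in plain Java. It is
not the Sakai code: every class and method name is hypothetical, and a
"segment" is modelled as a simple map of item id to content rather than a
real Lucene segment.

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

// Hypothetical names throughout; not the Sakai search-impl API.
class TxLogPipelineSketch {

    /** A change event for one item that needs (re)indexing. */
    static final class ChangeEvent {
        final String itemId;
        final String content;
        ChangeEvent(String itemId, String content) {
            this.itemId = itemId;
            this.content = content;
        }
    }

    /** One transaction-log entry: a numbered batch standing in for a Lucene segment. */
    static final class TxSegment {
        final long txId;
        final Map<String, String> docs;
        TxSegment(long txId, Map<String, String> docs) {
            this.txId = txId;
            this.docs = docs;
        }
    }

    /** Indexer side: drain up to batchSize events from the queue into one segment. */
    static TxSegment buildSegment(Queue<ChangeEvent> queue, long txId, int batchSize) {
        Map<String, String> docs = new TreeMap<>();
        for (int i = 0; i < batchSize && !queue.isEmpty(); i++) {
            ChangeEvent e = queue.poll();
            docs.put(e.itemId, e.content); // a later event for the same item wins
        }
        return new TxSegment(txId, docs);
    }

    /** Search-node side: a local index that merges segments in transaction order. */
    static final class SearchNode {
        final Map<String, String> localIndex = new TreeMap<>();
        long lastAppliedTx = 0;

        void merge(TxSegment seg) {
            if (seg.txId <= lastAppliedTx) {
                return; // already merged; replaying the log is harmless
            }
            localIndex.putAll(seg.docs);
            lastAppliedTx = seg.txId;
        }
    }

    public static void main(String[] args) {
        Queue<ChangeEvent> queue = new ArrayDeque<>();
        queue.add(new ChangeEvent("doc1", "first version"));
        queue.add(new ChangeEvent("doc2", "hello"));
        queue.add(new ChangeEvent("doc1", "second version"));

        SearchNode node = new SearchNode();
        node.merge(buildSegment(queue, 1, 2)); // first batch: doc1 and doc2
        node.merge(buildSegment(queue, 2, 2)); // second batch: updated doc1
        System.out.println(node.localIndex);
    }
}
```

The transaction id is what lets each search node apply the stream in order
and skip entries it has already merged.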

The code is based on Lucene 1.9.1 (waiting for an upgrade) and so  
performs its own transactions above the lucene layer (no transactions  
in 1.9.1 :( ).

It also performs consolidation of the transaction log of segments to  
reduce the cost of adding a new search node, although this is rather  
expensive.
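
As a rough illustration of why consolidation helps a new node, here is a
sketch under the same simplification (segments as maps, all names
hypothetical, deletes ignored). The first N log entries are folded into one
snapshot, so a new node loads the snapshot and replays only the tail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; segments are docId -> content maps, not Lucene segments.
class TxLogConsolidationSketch {

    /** Fold the first `count` log entries into a single snapshot map. */
    static Map<String, String> consolidate(List<Map<String, String>> log, int count) {
        Map<String, String> snapshot = new TreeMap<>();
        for (Map<String, String> segment : log.subList(0, count)) {
            snapshot.putAll(segment); // newer entries overwrite older ones
        }
        return snapshot;
    }

    /** Bring up a new node: load the snapshot, then replay only the log tail after it. */
    static Map<String, String> bootstrap(Map<String, String> snapshot,
                                         List<Map<String, String>> log,
                                         int snapshotCount) {
        Map<String, String> index = new TreeMap<>(snapshot);
        for (Map<String, String> segment : log.subList(snapshotCount, log.size())) {
            index.putAll(segment);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Map<String, String>> log = new ArrayList<>();
        log.add(Map.of("doc1", "v1"));
        log.add(Map.of("doc2", "v1"));
        log.add(Map.of("doc1", "v2"));
        log.add(Map.of("doc3", "v1"));

        Map<String, String> snapshot = consolidate(log, 3);      // one load instead of three merges
        Map<String, String> fresh = bootstrap(snapshot, log, 3); // snapshot plus one tail entry
        Map<String, String> full = consolidate(log, log.size()); // replaying the whole log
        System.out.println(fresh.equals(full));
    }
}
```

The cost moves from every new node to whoever builds the snapshot, which is
where the expense mentioned above comes from.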

The main difference between this and Jackrabbit (which we also use as
a JCR) is that Jackrabbit performs indexing on each node, injecting
directly into the Lucene index, whereas this parallelizes the
indexing operation. So the lag before an item appears in the
Jackrabbit index is very low, typically < 1s, but the CPU load of
indexing is not scalable with the number of indexing nodes. The
downside of parallel indexing is the delay, as documents need to be
batched to avoid excessive merge activity, and the network bandwidth
consumed by the transaction log and snapshots.
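
The batching trade-off can be sketched roughly like this. The thresholds and
names are illustrative assumptions, not values from our code: a batch is
released when it is either full or old enough, so fewer and larger segments
are written at the price of some delay.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; thresholds and names are illustrative assumptions.
class IndexBatcherSketch {
    private final int maxDocs;
    private final long maxDelayMillis;
    private final List<String> pending = new ArrayList<>();
    private long oldestArrival = -1;

    IndexBatcherSketch(int maxDocs, long maxDelayMillis) {
        this.maxDocs = maxDocs;
        this.maxDelayMillis = maxDelayMillis;
    }

    /**
     * Add a document at time nowMillis. Returns the batch to write as one
     * segment when a threshold is reached, or an empty list otherwise.
     * The age check only runs when a document arrives; a real system
     * would also need a timer to flush an idle batch.
     */
    List<String> add(String docId, long nowMillis) {
        if (pending.isEmpty()) {
            oldestArrival = nowMillis; // start the clock for this batch
        }
        pending.add(docId);
        boolean full = pending.size() >= maxDocs;
        boolean stale = nowMillis - oldestArrival >= maxDelayMillis;
        if (full || stale) {
            List<String> batch = new ArrayList<>(pending);
            pending.clear();
            return batch;
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        IndexBatcherSketch batcher = new IndexBatcherSketch(3, 1000);
        System.out.println(batcher.add("a", 0));    // still waiting
        System.out.println(batcher.add("b", 10));   // still waiting
        System.out.println(batcher.add("c", 20));   // size threshold reached
        System.out.println(batcher.add("d", 100));  // new batch starts
        System.out.println(batcher.add("e", 1200)); // age threshold reached
    }
}
```

Raising either threshold reduces merge activity on the search nodes but
increases the lag before a document becomes searchable.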

The method we used would never work for Jackrabbit as it uses the  
search index for query parsing (JCR-SQL and JCR-XQuery)...... and  
IMHO, the Jackrabbit approach is more elegant... but it would be nice  
to have it parallelize the indexing operation.

Hope that gives some contrast.
Ian

BTW, I understand Lucene 2.3 is much faster than 1.9, so I should  
upgrade?


On 14 Jul 2008, at 22:05, Jason Rutherglen wrote:

> I took a look at Jackrabbit, which is a very cool animal, and
> there are similar ideas in the Lucene portion.  I will try to take  
> a look at the source to get a better understanding.
>
> On Fri, Jul 11, 2008 at 9:09 AM, Ard Schrijvers  
> <a.schrijvers@onehippo.com> wrote:
> Hello Jason et al,
>
> Indeed there are plenty of use cases for instantly updated
> searches, for example the JSR-170 (JCR) compliant Jackrabbit
> implementation: it heavily relies on Lucene for searching and hierarchy
> resolving, and according to the JSR-170 spec, after a save(), changes
> need to be visible instantly.
>
> Also, I think a very similar solution to yours is implemented  
> there: See
> [1] if you like
>
> Regards Ard
>
> [1] http://jackrabbit.apache.org/index-readers.html
>
>
>
> > I started a wiki page at
> > http://wiki.apache.org/lucene-java/OceanRealtimeSearch linked
> > from http://wiki.apache.org/lucene-java/LuceneResources.
> >
> > Perhaps I should add some background on the wiki.  I can add
> > a little bit here.  I was an early Solr developer/user at a
> > social networking company when Google's GData came out.  It
> > looked similar to Solr so I took a look at it.  The one thing
> > it had over Solr was realtime updates or the ability to add,
> > delete, or update a document and be able to see the update in
> > search results immediately.  With Solr the company had
> > decided on a 10 minute interval of updating the index with
> > delta updates from an Oracle database.  I wanted to see if it
> > was possible with Lucene to create an approximation of what
> > GData does.  The result is Ocean.
> >
> > The use case it was designed for is websites with dynamic
> > data, some of which are social networking, photo sites,
> > discussion boards, blogs, wikis, and such.  More broadly it
> > is possible to use Ocean with any application that requires
> > the database-like feature of immediate updates.  Probably the
> > best example of this is that all of Google's web applications,
> > outside of web search, use a GData interface, meaning the
> > primary datastore is not MySQL or some equivalent; it is a
> > proprietary search-based database.  The best example of this
> > is Gmail.  If I receive an email through Gmail I can also
> > search on it immediately, there is no 10 minute delay.  Also
> > in Gmail I can change labels, a common example being changing
> > unread emails to read in bulk.  Presumably Gmail is not
> > reindexing the entire email for each label change.
> >
> > Most highly trafficked web applications do not use the
> > relational facilities like joins because they are too
> > expensive.  Lucene does not offer joins so this is fine.  The
> > only area Lucene is currently weak in is range queries.
> > Mysql uses a btree index whereas Lucene uses the time
> > consuming TermEnum and TermDocs combination.  This is an area
> > Tag Index addresses.
> >
> > The way Ocean is designed there should be no limitations to
> > using it compared to using Lucene IndexWriter.  It offers the
> > same functionality.  If one does not want to use the
> > transaction log Ocean offers because one simply wants to
> > index 1 million documents at once, Ocean offers what is
> > called a LargeBatch.  It is a way to perform a large number
> > of updates taking advantage of the new IndexWriter speedup,
> > combined with transactional semantics.
> >
> > Karl, does this answer your question or are there areas that
> > could use more explanation?
> >
> >
> > On Fri, Jul 11, 2008 at 6:20 AM, Karl Wettin
> > <karl.wettin@gmail.com> wrote:
> >
> >
> >
> >       10 jul 2008 kl. 22.08 skrev Jason Rutherglen:
> >
> >
> >
> >               Is there a good place to put Ocean
> > https://issues.apache.org/jira/browse/LUCENE-1313
> > documentation?  Is there a place on the wiki that is good?
> >
> >
> >
> >       Hi Jason,
> >
> >       the wiki is just fine.
> >
> >       I've been reading the docs and looked at your patch.
> > There is a lot of text about how it does what it does, but it
> > says nothing about the intended use. I honestly
> > don't even know what you mean by "real time search". You will
> > probably get more attention if the documentation starts out
> > with some use cases or thoughts on when and why it might make
> > sense to use your code.
> >
> >
> >             karl
> >
> >
> > ---------------------------------------------------------------------
> >       To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >       For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
> >
> >
> >
>
>
>

