incubator-clerezza-dev mailing list archives

From Daniel Spicar <daniel.spi...@trialox.org>
Subject Re: Composite Resource Indexing Service: ready for review.
Date Thu, 16 Jun 2011 15:06:37 GMT
After investigating the effects of optimize() and different strategies of
when to use it I am worried about using it in the current implementation of
GraphIndexer/LuceneTools. Given the high complexity of GraphIndexer I am
afraid making some fundamental changes now will break something. We already
spent a lot of time making sure the IndexSearcher/Writers are opened and
closed without causing fatal errors and limiting the use of file
descriptors. I strongly advise against making hasty changes to this, and
many approaches to making (more) intelligent use of optimize() would IMHO
require some changes. We have some problems that should be addressed:
 - I am not sure how many IndexReaders are opened by Lucene internally when
web resources using the GraphIndexer.findResources methods are called
concurrently.
 - as far as I can see, we do not close the IndexWriter unless we close the
entire GraphIndexer

These issues make it difficult to decide whether the IndexWriter is really
idle or not (this is based on what I have been reading, but the info I have
is limited to Lucene 2.3 - maybe someone knows how this is addressed in 3.0?).
If optimize is called on an IndexWriter while searches are going on, this
will theoretically work, but there is a risk of running out of disk space
because temporary disk usage can be a multiple of the index size (as far as
I understand, the upper bound is 3-4 times the index size, but if Lucene
opens more IndexReaders than I expect it may be even higher). Another
problem is that optimize cannot delete some segments while IndexReaders are
open on them.
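
To illustrate the disk space concern, here is a minimal sketch of a guard
that could run before optimize. OptimizeHeadroom and its methods are
hypothetical names, not part of CRIS or Lucene, and the factor of 4 in the
usage note is only my reading of the 3-4 times estimate above.

```java
import java.io.File;

/** Hypothetical helper; neither CRIS nor Lucene provides this class. */
final class OptimizeHeadroom {
    private OptimizeHeadroom() {}

    /** Sums the sizes of the regular files directly inside the index directory. */
    static long indexSizeBytes(File indexDir) {
        long total = 0L;
        File[] files = indexDir.listFiles(); // null if not a readable directory
        if (files != null) {
            for (File f : files) {
                if (f.isFile()) {
                    total += f.length();
                }
            }
        }
        return total;
    }

    /** True if usableBytes is at least indexBytes * factor. */
    static boolean hasHeadroom(long indexBytes, long usableBytes, int factor) {
        if (indexBytes < 0 || usableBytes < 0 || factor < 1) {
            throw new IllegalArgumentException("sizes must be >= 0 and factor >= 1");
        }
        // Dividing instead of multiplying avoids overflow for huge indexes.
        return indexBytes <= usableBytes / factor;
    }

    /** Convenience overload that measures the directory and its partition. */
    static boolean hasHeadroom(File indexDir, int factor) {
        return hasHeadroom(indexSizeBytes(indexDir), indexDir.getUsableSpace(), factor);
    }
}
```

A caller could skip optimize when OptimizeHeadroom.hasHeadroom(indexDir, 4)
returns false. This only checks free space at one moment; it does not account
for other processes writing to the same partition.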

As I see it, there are three strategies we can choose from:
- never optimize automatically (we may not run into problems at all) and
offer an optimize method on GraphIndexer that can be called manually by an
admin if needed.
- settle for optimization during "lull" times, e.g. a cron job that runs at
a time when little activity is expected.
- build some "intelligent" mechanism that can decide when to optimize
(based on the number of segments in the index, the number of newly indexed
documents, recognizing bulk additions/deletions, etc.)
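
To make the third strategy more concrete, here is a minimal sketch of such a
decision rule. OptimizePolicy, its thresholds and its method names are all
hypothetical; how the current segment count would be obtained from Lucene is
deliberately left to the caller.

```java
/** Hypothetical decision rule; not part of CRIS or Lucene. */
final class OptimizePolicy {
    private final int maxSegments;
    private final long maxDocsSinceOptimize;
    private long docsSinceOptimize = 0L;

    OptimizePolicy(int maxSegments, long maxDocsSinceOptimize) {
        if (maxSegments < 1 || maxDocsSinceOptimize < 1) {
            throw new IllegalArgumentException("thresholds must be >= 1");
        }
        this.maxSegments = maxSegments;
        this.maxDocsSinceOptimize = maxDocsSinceOptimize;
    }

    /** Call after documents have been added to or updated in the index. */
    synchronized void recordIndexed(long docs) {
        docsSinceOptimize += docs;
    }

    /**
     * True if the index has more segments than allowed, or if enough
     * documents were indexed since the last optimize.
     */
    synchronized boolean shouldOptimize(int currentSegments) {
        return currentSegments > maxSegments
                || docsSinceOptimize >= maxDocsSinceOptimize;
    }

    /** Call after an optimize run has finished to reset the counter. */
    synchronized void markOptimized() {
        docsSinceOptimize = 0L;
    }
}
```

A bulk addition would show up here as a single large recordIndexed call, so
the same rule covers that case without separate detection logic.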

Running optimize in a separate thread is a good idea for all strategies.
However, for all strategies but the first I think we should review again how
we open and close the IndexReaders and IndexWriters. Given our experience
with this issue, I suggest taking enough time to test all these changes
properly, because it is very easy to break things. That would make this a
bigger task.

For a quick change before the release (because the way we currently optimize
is really not good) I suggest removing optimize entirely as an automated
process and providing some place for admins to trigger optimize manually
(and making it run in a thread). Not optimizing will not break anything, but
searches may gradually become slower if we update the index often. That
depends on the application using the GraphIndexer anyway, so optimize may
not be required for all applications until we come up with a better
solution.
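
A minimal sketch of the manual trigger, assuming the actual optimize call is
passed in as a Runnable (in GraphIndexer this would wrap the call to
optimize() on the IndexWriter). ManualOptimizeTrigger and its methods are
hypothetical names, not existing CRIS code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical admin-facing trigger; not part of CRIS or Lucene. */
final class ManualOptimizeTrigger {
    private final Runnable optimizeTask;
    private final AtomicBoolean running = new AtomicBoolean(false);
    // One daemon worker thread: optimize never blocks the caller and never
    // keeps the JVM alive on shutdown.
    private final ExecutorService executor =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "index-optimize");
                t.setDaemon(true);
                return t;
            });

    ManualOptimizeTrigger(Runnable optimizeTask) {
        this.optimizeTask = optimizeTask;
    }

    /**
     * Starts optimize in the background. Returns a Future for the run, or
     * null if a previous run is still in progress.
     */
    Future<?> trigger() {
        if (!running.compareAndSet(false, true)) {
            return null;
        }
        try {
            return executor.submit(() -> {
                try {
                    optimizeTask.run();
                } finally {
                    running.set(false); // allow the next manual trigger
                }
            });
        } catch (RuntimeException e) {
            running.set(false); // e.g. submit rejected after shutdown()
            throw e;
        }
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

The AtomicBoolean only prevents overlapping optimize runs; it does not tell us
whether searches or index updates are in progress at the same time.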

Any thoughts?

Best,
Daniel


On Sun, Jun 12, 2011 at 8:49 PM, Tommaso Teofili
<tommaso.teofili@gmail.com> wrote:

> Hello all,
> after reviewing the CRIS module I am +1 for committing it.
> Regarding LuceneTools it looks good to me, only I'm not too sure about the
> call to optimize() just before the IndexWriter gets closed. With large
> indexes that could lead to long operations which one might want to
> postpone (the IndexWriter is exposed so explicit calls are allowed); I'd
> suggest adding a parameter to decide whether optimize() should be called
> on close and setting it to true by default. A possible enhancement could be
> to create an OptimizeThread with a timeInterval constructor which optimizes
> the index every timeInterval seconds if there is not "much" activity
> (i.e.: analyze isLocked, InfoStream, etc).
> Great work guys :)
> Tommaso
>
>
> 2011/6/3 Tommaso Teofili <tommaso.teofili@gmail.com>
>
> >
> >
> > 2011/6/2 Tsuyoshi Ito <tsuy.ito@trialox.org>
> >
> >> Dear all
> >>
> >> Composite Resource Indexing Service is now ready for review (issue
> >> CLEREZZA-501). JUnit tests and documentation are available (install
> >> rdf.cris/core on clerezza and search for Composite Resource Indexing
> >> Service under /documentation)
> >>
> >> excerpt:
> >>
> >> CRIS is based on Apache Lucene and provides means to index RDF
> >> resources. It works by indexing the values of properties on a
> >> resource. This makes it possible to search for the property values
> >> using CRIS.
> >> The results that CRIS delivers are the corresponding RDF resources.
> >>
> >> GraphIndexer
> >> The core of CRIS is the GraphIndexer class. Note that GraphIndexer is
> >> not an OSGi service, but it has to be instantiated by the user to
> >> provide an index. The GraphIndexer needs two graphs to work with. One
> >> graph contains the IndexDefinitions, that is the specification of
> >> which resources and properties to index (see IndexDefinitionManager).
> >> The other graph is the graph that contains the resources to index.
> >> Note that CRIS indexes RDF resources based on their rdf:type and that
> >> the indexing works on a per-property basis. That means, not all
> >> properties on a resource are indexed by default. The user has to
> >> specify which properties to index.
> >> GraphIndexer also provides the interface to search for resources using
> >> the findResources method. The search is specified using Conditions and
> >> optionally a SortSpecification and FacetCollectors. The findResources
> >> method is overloaded with methods that allow the specification of the
> >> resource type and search query directly.
> >>
> >>
> >> IndexDefinitionManager
> >> The IndexDefinitionManager helps to manage indexing specifications
> >> using the CRIS ontology in the index definition graph (see
> >> GraphIndexer). Indexing is enabled for resources according to their
> >> rdf:type. Additionally the index definitions specify the properties of
> >> the resource that are indexed.
> >>
> >> One can think of an index definition as specifying the keys
> >> (properties) that are mapped to the value (the resource URI) in the
> >> index.
> >>
> >> ....
> >>
> >>
> >> Note:
> >> - GraphIndexer is quite complex and has many responsibilities.
> >> - No other clerezza project depends on Composite Resource Indexing
> >> Service.
> >> - GraphIndexer is available as Platform CRIS Service in project
> >> platform.cris (for the contentgraph incl. additions)
> >>
> >> @Tommaso
> >> Lucene is used in LuceneTools.java in rdf.cris/core. Feedback
> >> appreciated - I have little experience with lucene, so feel free to
> >> improve it. Especially I am not sure when to call optimize (see
> >> comment in LuceneTools)
> >>
> >
> > Uber cool Tsuy and others, I'll definitely have a deep look there, thanks
> > for this awesome work!
> > Tommaso
> >
> >
> >>
> >> Thanks to Reto, Daniel and Hasan for the work! We already use it in a
> >> monitoring tool - the performance is outstanding compared to the
> >> available alternatives in clerezza (filter resp. sparql)
> >>
> >> Cheers
> >> Tsuy
> >>
> >
> >
>
