jackrabbit-users mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: Functionality to store indexes in database with jackrabbit 2.1.2 or upcoming releases.........
Date Mon, 29 Nov 2010 13:35:01 GMT
On Mon, Nov 29, 2010 at 2:19 PM, Alexander Klimetschek
<aklimets@adobe.com> wrote:
> On 29.11.10 13:21, "Ard Schrijvers" <a.schrijvers@onehippo.com> wrote:
>>And this is a big burden! I think, we could have a single big index
>>for the JCR spec implementation. But, I wouldn't solve this by having
>>more small indexes, as collections. I would like to have an option, in
>>case of XPath, like 'simpleXPath=true' where we limit some of the
>>options: In other words, not all the jcr spec queries are available,
>>but it is efficient and fast (we at Hippo limit ourselves to only
>>efficient xpath queries). If you do not by default store all
>>properties, and do not have to support complex path constraint (only
>>simple ones), then, you wouldn't have to bother that much about one
>>single Lucene index.
>
> As written in my other mail, there are good reasons to allow for separate
> indexes, to resolve conflicts of different indexing needs for different
> applications. Maybe this is only true for the (node-scoped) full text
> index, where you can't exclude certain properties at query time.
>
> And the big advantage of those collections is that you solve the path
> constraint issue, at least for those queries like:
>
> /content/siteA//*[jcr:contains(., 'term') and @myProp='foo']
>
> because you would have a collection for /content/siteA, /content/siteB,
> etc. with just the right full text / property index.

We achieved this in a much easier and more flexible way, as we also have
the *demand* for instant path constraints on any path. A little
background first: Jackrabbit has a very nice feature, namely that JCR
nodes are not aware of their actual location. Only the parent and the
children are known. This also holds for the index. This means that
moving a tree with thousands of nodes is a single node change, both in
the database and in the index.
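
To make the "single node change" concrete, here is a minimal sketch in
plain Java. It assumes a simplified model where each record stores only
its name and its parent's id; the class ParentLinkTree and its methods
are hypothetical and are not Jackrabbit's persistence classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model for illustration only, not Jackrabbit code.
class ParentLinkTree {
    // Each record stores only its own name and its parent's id.
    private final Map<String, String> parentOf = new HashMap<>();
    private final Map<String, String> nameOf = new HashMap<>();

    void add(String id, String name, String parentId) {
        nameOf.put(id, name);
        parentOf.put(id, parentId); // a null parent marks the root
    }

    // Moving a subtree rewrites exactly one record: the parent link of
    // the subtree root. No descendant record is touched.
    void move(String id, String newParentId) {
        parentOf.put(id, newParentId);
    }

    // The path is never stored; it is derived by walking up the parent links.
    String pathOf(String id) {
        StringBuilder path = new StringBuilder();
        for (String cur = id; cur != null && parentOf.get(cur) != null; cur = parentOf.get(cur)) {
            path.insert(0, "/" + nameOf.get(cur));
        }
        return path.length() == 0 ? "/" : path.toString();
    }

    public static void main(String[] args) {
        ParentLinkTree tree = new ParentLinkTree();
        tree.add("root", "", null);
        tree.add("a", "siteA", "root");
        tree.add("b", "siteB", "root");
        tree.add("n", "news", "a");
        tree.add("f", "foo", "n");
        System.out.println(tree.pathOf("f"));
        tree.move("n", "b"); // one record changes
        System.out.println(tree.pathOf("f"));
    }
}
```

The flip side is visible in pathOf: since no record knows its own
location, resolving a path means walking the parent links.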

However, this comes at the price of slow path constraint queries. This
was unacceptable for us. Hence, for each node, we also index all of its
ancestors in a multivalued Lucene field. Suppose my location
is: /content/document/news/2009/12/foo. My Lucene field will have the
terms:

 /content
 /content/document
 /content/document/news
 /content/document/news/2009
 /content/document/news/2009/12

So *any* simple path constraint in our repository is just a match on a
single Lucene term, which is instant. 'Give me all nodes below
/content/document/news' simply returns all the nodes that have the term
'/content/document/news' in our predefined Lucene field. (Note that we
actually use node ids for it, but for the picture, paths are easier to
understand.)
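
As a rough illustration of the idea, here is a minimal sketch in plain
Java. It is not our actual code and it does not use the Lucene API: a
HashMap stands in for the inverted index, and paths are used as terms
instead of node ids. The class AncestorPathIndex and its method names
are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a multivalued Lucene field; a plain map
// plays the role of the inverted index.
class AncestorPathIndex {
    // term (ancestor path) -> paths of the nodes that carry that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Every proper ancestor path of nodePath, shortest first.
    static List<String> ancestorTerms(String nodePath) {
        List<String> terms = new ArrayList<>();
        int slash = nodePath.indexOf('/', 1);
        while (slash != -1) {
            terms.add(nodePath.substring(0, slash));
            slash = nodePath.indexOf('/', slash + 1);
        }
        return terms;
    }

    // "Index" a node: register it under each of its ancestor terms.
    void index(String nodePath) {
        for (String term : ancestorTerms(nodePath)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(nodePath);
        }
    }

    // A simple path constraint is a single term lookup.
    Set<String> descendantsOf(String ancestorPath) {
        return postings.getOrDefault(ancestorPath, Collections.emptySet());
    }

    public static void main(String[] args) {
        AncestorPathIndex idx = new AncestorPathIndex();
        idx.index("/content/document/news/2009/12/foo");
        idx.index("/content/document/events/baz");
        System.out.println(ancestorTerms("/content/document/news/2009/12/foo"));
        System.out.println(idx.descendantsOf("/content/document/news"));
    }
}
```

In this sketch the ancestor terms depend on where the node lives, so
moving a subtree means the terms of every node below it have to be
rewritten. That is the cost of the instant lookup.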

>
>>Lucene 4.0 will be so blistering fast and efficient...
>
> Cool.
>
>>the figures we
>>need to index with Jackrabbit is peanuts for Lucene. *If* we improve
>>indexing, a couple of hundreds of millions of nodes is a no-brainer!
>
> With the exception of the path constrained, as this is not indexed. Maybe
> it will be easier with Lucene 4.0 to index the path, especially allow for
> fast updates of the path property when something is moved?

Lucene will hardly bring improvements for hierarchical structures. Note
that this is exactly what makes JCR indexing so complex: the
hierarchy! For small hierarchies, more at the Document level, a
NestedDocumentQuery might be added to avoid cross-matching; see [1].
But this is very simple compared to what Jackrabbit can do with XPath,
and it is still in development.


>>We should not be thinking about problems that are a result of the
>>current implementation and its short comings (they are a result that
>>it needed to work against Lucene 1.4, this is no critics to be sure!).
>
> Ok.
>
>>asynchronous indexing is already part of the jcr 283 afaik and is
>>allowed, certainly for binary content
>
> Sure, but still indexing takes a major part of a save() call, AFAIK.

True... which makes it all the more important that just one node in a
cluster does the actual indexing (or the text extraction, e.g. from PDF,
which is even more important!)

Regards Ard

[1] https://issues.apache.org/jira/browse/LUCENE-2454

> Regards,
> Alex
>
> --
> Alexander Klimetschek
> Developer // Adobe (Day) // Berlin - Basel
>



-- 
Hippo
Europe  •  Amsterdam  Oosteinde 11  •  1017 WT Amsterdam  •  +31 (0)20 522 4466
USA  • San Francisco  185 H Street Suite B  •  Petaluma CA 94952-5100
•  +1 (707) 773 4646
Canada    •   Montréal  5369 Boulevard St-Laurent  •  Montréal QC H2T
1S5  •  +1 (514) 316 8966
www.onehippo.com  •  www.onehippo.org  •  info@onehippo.com
