incubator-clerezza-dev mailing list archives

From "Daniel Spicar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CLEREZZA-388) Composite Resource Index Service
Date Thu, 17 Mar 2011 12:43:35 GMT

    [ https://issues.apache.org/jira/browse/CLEREZZA-388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007902#comment-13007902 ]

Daniel Spicar commented on CLEREZZA-388:
----------------------------------------

I'd like to give some feedback based on experience with a concrete use case.

The use case is a web site with a search interface that lets me search for
users on the platform. I'd like to be able to search "intuitively": when I enter
"jessica", I expect all users whose name string contains jessica as a single word. A
rough specification is:
- exact phrase matching with double quotes ("phrase")
- wildcard matching (*, ?)
- case-insensitive search ('jessica' and 'Jessica' should deliver the same results)
- boolean conditions for search terms (AND, OR, NOT)

Lucene provides a QueryParser that supports most of these things and more (fuzzy searches,
range searches, etc.). See http://lucene.apache.org/java/3_0_0/queryparsersyntax.html

Thus I implemented my own Condition that uses the QueryParser on the user input to generate
a query.

But I faced some problems which need to be resolved in CRIS:
1. CRIS indexes named resources with the Field.Index.NOT_ANALYZED attribute. This means each
value is indexed as a single term without tokenization, so matching is case-sensitive.
2. CRIS is currently hard-coded to deliver the top 10 results. For this use case the limit
would need to be configurable.

Concerning problem 1:
I resolved it locally by adding another field to the indexed document:
doc.add(new Field(vProperty.stringKey, propertyValue, Field.Store.YES, Field.Index.ANALYZED));

Because CRIS uses the StandardAnalyzer, the values in that new field are tokenized,
common English stop words (like "a") are omitted, and the terms are (according to my understanding)
lower-cased.
So there is now one field with the exact value and another field with a lower-cased,
tokenized index.
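
To illustrate the difference between the two fields, here is a minimal plain-Java sketch. It does
not use Lucene at all: the class AnalyzedFieldSketch, its methods and the small stop-word list are
hypothetical stand-ins that only approximate what NOT_ANALYZED and a StandardAnalyzer-style
ANALYZED field put into the index. The real analyzer has more elaborate tokenization rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of the two indexing modes; NOT the Lucene implementation.
class AnalyzedFieldSketch {

    // Assumed, tiny stop-word list for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "of", "and"));

    // NOT_ANALYZED-like: the whole value becomes one unchanged term.
    static List<String> exactTerms(String value) {
        List<String> terms = new ArrayList<String>();
        terms.add(value);
        return terms;
    }

    // ANALYZED-like: split on non-alphanumeric characters, lower-case
    // each token, and drop stop words.
    static List<String> analyzedTerms(String value) {
        List<String> terms = new ArrayList<String>();
        for (String token : value.split("[^\\p{L}\\p{N}]+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    // A single-term lookup succeeds only if the term is in the index as-is.
    static boolean matches(List<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    public static void main(String[] args) {
        String name = "Jessica A. Miller";
        System.out.println("exact terms:    " + exactTerms(name));
        System.out.println("analyzed terms: " + analyzedTerms(name));
        System.out.println("exact field matches 'jessica':    "
                + matches(exactTerms(name), "jessica"));
        System.out.println("analyzed field matches 'jessica': "
                + matches(analyzedTerms(name), "jessica"));
    }
}
```

In this toy model a lookup for "jessica" only succeeds against the analyzed terms, and only if the
query term is lower-cased the same way. That is the reason the query side should go through the
same Analyzer as the index side.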

One consequence of this solution is that it would be good if the GraphIndexer could somehow
expose the Lucene Version attribute and the Analyzer that it uses on its public interface,
so custom conditions (like mine) can use the same Analyzer the index was written with.

I'll attach the GenericCondition, GraphIndexer and ResourceFinder files for reference. They are
not production-level code, though.

> Composite Resource Index Service
> --------------------------------
>
>                 Key: CLEREZZA-388
>                 URL: https://issues.apache.org/jira/browse/CLEREZZA-388
>             Project: Clerezza
>          Issue Type: New Feature
>            Reporter: Reto Bachmann-Gmür
>            Assignee: Reto Bachmann-Gmür
>
> A service shall monitor a graph for resources of a specific type and provide composite
> indexes on specified properties. It shall support searching by exact value, by range as well
> as full-text search. This service shall make it possible to provide fast faceted searches.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
