Mailing-List: contact derby-dev-help@db.apache.org; run by ezmlm
Precedence: bulk
Reply-To: <derby-dev@db.apache.org>
Message-ID: <339046402.1223049286257.JavaMail.jira@brutus>
Date: Fri, 3 Oct 2008 08:54:46 -0700 (PDT)
From: "Rick Hillegas (JIRA)" <jira@apache.org>
To: derby-dev@db.apache.org
Subject: [jira] Commented: (DERBY-472) Full Text Indexing / Full Text Search
In-Reply-To: <1731265072.1122415098848.JavaMail.jira@ajax.apache.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/DERBY-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636662#action_12636662 ] 

Rick Hillegas commented on DERBY-472:
-------------------------------------

Hi Suran,

I think that this would be a great contribution. It might be good to start out by defining what you mean by Lucene/Derby integration. Here are some possibilities. Full disclosure: it has been a long time since I looked at Lucene and I have not prototyped any of the possibilities listed below:

1) Be able to run Lucene queries against text stored in Derby.

I believe that this has been working for a long time. Lucene can index documents that are stored inside RDBMSes, including Derby. Documents are persisted in Derby and the customer application periodically tells Lucene to re-index those documents (say once a night). The application runs queries against the Lucene indexes and then retrieves the relevant documents from Derby when the customer wants to drill down.

If that's all you want to do, then you may just need to write a primer, explaining how a developer would set up a combined Derby/Lucene installation.

2) Be able to do (1) and join Lucene results against other Derby data.

I believe that you can do this today without modifying Lucene or Derby. You need to do something like the following:

a) Write triggers which update the Lucene indexes when the customer inserts or updates a document.

b) Write a table function which is handed a Lucene query. The table function returns the relevant document ids. Those ids can then be joined back to Derby tables to retrieve documents.

No doubt there are a lot of interesting issues involved with keeping Derby and Lucene in sync.

The table function might look like this:

create function luceneQuery
(
    queryText             varchar( 32672 ),
    documentDomain varchar( 32672 ),
    rankCutoff           double
)
returns table
(
   documentDomain  varchar( 32672 ),
   documentID          int,
   rank                     double
)
language java
parameter style DERBY_JDBC_RESULT_SET
contains sql
external name 'LuceneSupport.luceneQuery'

Then you could pose Derby queries like this:

select reports.id, reports.document, luceneResults.rank
from docSchema.reports,
        table( luceneQuery( '+Apache +Derby -hat', 'docSchema.reports', 0.8 ) ) luceneResults
where reports.id = luceneResults.documentID
order by luceneResults.rank

If that's good enough, then your task might just be providing a template trigger and table function plus writing a whitepaper explaining how to wire the templates into an application.


3) Be able to do (2) with great performance.

It may be that (2) has some performance or scalability problems. Lucene or Derby might need to be enhanced to make (2) perform well.


4) Be able to do (3) easily.

A DBA might want help in setting up a combined Derby/Lucene installation. A DBA might want to backup and restore the Derby and Lucene databases together. Making this easily usable might require writing some tools and/or putting some hooks into Lucene and Derby.


5) Something else. What's on your mind?


> Full Text Indexing / Full Text Search
> -------------------------------------
>
>                 Key: DERBY-472
>                 URL: https://issues.apache.org/jira/browse/DERBY-472
>             Project: Derby
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 10.0.2.0
>         Environment: All environments
>            Reporter: Rick Hillegas
>
> Efficiently support full text search of string datatyped columns. Mag Gam raised this issue on the user's mailing list on 24 July 2005; the email thread is titled 'Full Text Indexing'. Mag wants to see something akin to the functionality in tsearch2 (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/). Dan points out that we may be able to re-use index building technology exposed by the apache Lucene project (http://lucene.apache.org/).
> Presumably we want to build inverted indexes on all string datatyped columns: CHAR, VARCHAR, LONG VARCHAR, CLOB,, and their national variants (when they are implemented). We should consider the following additional issues when specifying this feature:
> 1) Do we also want to support text search on XML columns?
> 2) Which human languages do we support initially? Each language has its own rules for lexing words and its own list of "noise" words which should not be indexed. Hopefully, we can plug-in some existing packages of lexers and noise filters. We should encourage users to donate additional lexers/fitlers.
> 3) The CREATE INDEX syntax (for these new inverted indexes)  should let us bind a lexing human language to a string-datatyped column.
> 4) How do we express the search condition? For case-sensitive searches we can get away with boolean expressions built out of standard LIKE clauses. However, in my opinion, case-sensitive searches are an edge case. The more useful situation is a case-insensitive search. Can we get away with introducing a non-standard function here or do we need to push a proposal through the standards commitees? Even more useful and non-standard are fuzzy searches, which tolerate bad spellers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.