Mailing-List: contact derby-dev-help@db.apache.org; run by ezmlm
Precedence: bulk
Reply-To: <derby-dev@db.apache.org>
Message-ID: <1641866269.1222884584476.JavaMail.jira@brutus>
Date: Wed, 1 Oct 2008 11:09:44 -0700 (PDT)
From: "Suran Jayathilaka (JIRA)" <jira@apache.org>
To: derby-dev@db.apache.org
Subject: [jira] Commented: (DERBY-472) Full Text Indexing / Full Text Search
In-Reply-To: <1731265072.1122415098848.JavaMail.jira@ajax.apache.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/DERBY-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636111#action_12636111 ] 

Suran Jayathilaka commented on DERBY-472:
-----------------------------------------

Hi,

The postgraduate course I'm following requires us to complete a development/research project spanning approximately one year. Having had a wonderful experience working with the Derby community on my Google Summer of Code project, I am considering adding a feature to Derby as my degree project.

After discussing this briefly with Kathey Marsden, I think this issue, Lucene integration for Derby for full text indexing / full text search, might be a good one to take up.

But I am not quite clear on the workload or timeframe that might be required to see it to completion.
Therefore, I would be grateful for some input on how this task could be broken up into viable subtasks, and a brief idea of what they might involve.

Thanks.
Suran

> Full Text Indexing / Full Text Search
> -------------------------------------
>
>                 Key: DERBY-472
>                 URL: https://issues.apache.org/jira/browse/DERBY-472
>             Project: Derby
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 10.0.2.0
>         Environment: All environments
>            Reporter: Rick Hillegas
>
> Efficiently support full text search of string datatyped columns. Mag Gam raised this issue on the user's mailing list on 24 July 2005; the email thread is titled 'Full Text Indexing'. Mag wants to see something akin to the functionality in tsearch2 (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/). Dan points out that we may be able to re-use index building technology exposed by the Apache Lucene project (http://lucene.apache.org/).
> Presumably we want to build inverted indexes on all string datatyped columns: CHAR, VARCHAR, LONG VARCHAR, CLOB, and their national variants (when they are implemented). We should consider the following additional issues when specifying this feature:
> 1) Do we also want to support text search on XML columns?
> 2) Which human languages do we support initially? Each language has its own rules for lexing words and its own list of "noise" words which should not be indexed. Hopefully, we can plug in some existing packages of lexers and noise filters. We should encourage users to donate additional lexers/filters.
> 3) The CREATE INDEX syntax (for these new inverted indexes) should let us bind a lexing human language to a string-datatyped column.
> 4) How do we express the search condition? For case-sensitive searches we can get away with boolean expressions built out of standard LIKE clauses. However, in my opinion, case-sensitive searches are an edge case. The more useful situation is a case-insensitive search. Can we get away with introducing a non-standard function here or do we need to push a proposal through the standards committees? Even more useful and non-standard are fuzzy searches, which tolerate bad spellers.
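
For readers unfamiliar with the terms used in the issue description above, here is a minimal sketch of an inverted index with noise-word filtering and case-insensitive lookup. It is illustrative only and is neither Derby nor Lucene code: the names NOISE_WORDS, lex, build_inverted_index and search are hypothetical, the noise-word list is a made-up English sample, and the lexer simply lowercases the text and splits on non-alphanumeric characters.

```python
import re
from collections import defaultdict

# Hypothetical English noise-word list; a real one would be chosen per language.
NOISE_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})


def lex(text):
    """Lowercase the text, split it into alphanumeric tokens, drop noise words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in NOISE_WORDS]


def build_inverted_index(rows):
    """Map each token to the set of row ids whose text contains it.

    rows is an iterable of (row_id, text) pairs, standing in for a
    string-datatyped column of a table.
    """
    index = defaultdict(set)
    for row_id, text in rows:
        for token in lex(text):
            index[token].add(row_id)
    return index


def search(index, query):
    """Return the sorted row ids that contain every non-noise token of the query."""
    tokens = lex(query)
    if not tokens:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in tokens)))
```

Because both the indexed text and the query pass through the same lexer, a lookup such as search(index, "INDEXING") matches rows containing "Indexing" without any LIKE scan. Fuzzy matching, as mentioned in item 4, is not covered by this sketch.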

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.