db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rick Hillegas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (DERBY-472) Full Text Indexing / Full Text Search
Date Wed, 09 Sep 2009 18:52:57 GMT

    [ https://issues.apache.org/jira/browse/DERBY-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753211#action_12753211

Rick Hillegas commented on DERBY-472:

A little more information on standards which apply to this work. The ANSI/ISO SQL standard
does address full-text search in part 2 of the section titled SQL/MM (SQL Multimedia and Application
Packages). The approach centers on a special full-text datatype, which is an ADT. The ADT
has many methods defined on it, some of which return BOOLEAN values. SQL/MM is part of the
1999 version of the standard. I asked the ANSI SQL committee if anyone on the committee had
implemented the spec. No-one replied. I think that means that no major vendor has implemented
this part of the SQL spec.

You can get a copy of part 2 of SQL/MM here: http://www.wiscorp.com/SQLStandards.html

At this time, I don't recommend building Derby text search capabilities on top of the SQL
standard--for these reasons:

1) I question the usefulness of a spec which has not been implemented by any major vendor.

2) Derby does not currently support ADTs.

3) Derby does not currently support a BOOLEAN datatype.

> Full Text Indexing / Full Text Search
> -------------------------------------
>                 Key: DERBY-472
>                 URL: https://issues.apache.org/jira/browse/DERBY-472
>             Project: Derby
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions:
>         Environment: All environments
>            Reporter: Rick Hillegas
> Efficiently support full text search of string datatyped columns. Mag Gam raised this
issue on the user's mailing list on 24 July 2005; the email thread is titled 'Full Text Indexing'.
Mag wants to see something akin to the functionality in tsearch2 (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/).
Dan points out that we may be able to re-use index building technology exposed by the apache
Lucene project (http://lucene.apache.org/).
> Presumably we want to build inverted indexes on all string datatyped columns: CHAR, VARCHAR,
LONG VARCHAR, CLOB,, and their national variants (when they are implemented). We should consider
the following additional issues when specifying this feature:
> 1) Do we also want to support text search on XML columns?
> 2) Which human languages do we support initially? Each language has its own rules for
lexing words and its own list of "noise" words which should not be indexed. Hopefully, we
can plug-in some existing packages of lexers and noise filters. We should encourage users
to donate additional lexers/fitlers.
> 3) The CREATE INDEX syntax (for these new inverted indexes)  should let us bind a lexing
human language to a string-datatyped column.
> 4) How do we express the search condition? For case-sensitive searches we can get away
with boolean expressions built out of standard LIKE clauses. However, in my opinion, case-sensitive
searches are an edge case. The more useful situation is a case-insensitive search. Can we
get away with introducing a non-standard function here or do we need to push a proposal through
the standards commitees? Even more useful and non-standard are fuzzy searches, which tolerate
bad spellers.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message