db-derby-dev mailing list archives

From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: [jira] Commented: (DERBY-472) Full Text Indexing / Full Text Search
Date Tue, 26 Jul 2005 23:19:50 GMT
Dan, do you have a vision of how such an integration would work?  I
understand using the libraries to process the CONTAINS function.  You
mentioned you tried some prototyping; did you put the actual index into
an existing Derby table, or did you store the index separately?

My questions are:

1) Would the actual index be stored in a Derby-style page container, which
would somehow map to the index structure that Lucene expects?  Does the
Lucene architecture map to a row-locked, multi-user environment?

2) Would there be work in the optimizer to get it to recognize that
there was an index that helped with the CONTAINS() function?  Is this
the sort of work needed for DERBY-455 (creation of indexes on expressions)?

Daniel John Debrunner (JIRA) wrote:

>     [ http://issues.apache.org/jira/browse/DERBY-472?page=comments#action_12316814 ]

> Daniel John Debrunner commented on DERBY-472:
> ---------------------------------------------
> Rather than invent text indexing, Derby should use the Apache text search library -
> http://lucene.apache.org/
> Then you would have a text query language like Google, complete with language support
> SELECT * FROM ARTICLES where CONTAINS('+Apache +Derby -hat') and rank > 0.8 order by rank
> or something like that.
>>Full Text Indexing / Full Text Search
>>         Key: DERBY-472
>>         URL: http://issues.apache.org/jira/browse/DERBY-472
>>     Project: Derby
>>        Type: New Feature
>>  Components: SQL
>>    Versions:
>> Environment: All environments
>>    Reporter: Rick Hillegas
>>Efficiently support full text search of string datatyped columns. Mag Gam raised this
>>issue on the user's mailing list on 24 July 2005; the email thread is titled 'Full Text Indexing'.
>>Mag wants to see something akin to the functionality in tsearch2 (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/).
>>Dan points out that we may be able to re-use index building technology exposed by the Apache
>>Lucene project (http://lucene.apache.org/).
>>Presumably we want to build inverted indexes on all string datatyped columns: CHAR,
>>VARCHAR, LONG VARCHAR, CLOB, and their national variants (when they are implemented). We
>>should consider the following additional issues when specifying this feature:
>>1) Do we also want to support text search on XML columns?
>>2) Which human languages do we support initially? Each language has its own rules
>>for lexing words and its own list of "noise" words which should not be indexed. Hopefully,
>>we can plug in some existing packages of lexers and noise filters. We should encourage users
>>to donate additional lexers/filters.
>>3) The CREATE INDEX syntax (for these new inverted indexes) should let us bind a
>>lexing human language to a string-datatyped column.
>>4) How do we express the search condition? For case-sensitive searches we can get
>>away with boolean expressions built out of standard LIKE clauses. However, in my opinion,
>>case-sensitive searches are an edge case. The more useful situation is a case-insensitive
>>search. Can we get away with introducing a non-standard function here or do we need to push
>>a proposal through the standards committees? Even more useful and non-standard are fuzzy searches,
>>which tolerate bad spellers.
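
For readers following the thread, here is a rough sketch of the ideas in the quoted
issue: an inverted index built with a case-insensitive lexer and a noise-word filter,
queried with the '+term -term' style from Dan's CONTAINS example. It is an illustration
only. It does not use Lucene or any Derby API; the class name InvertedIndexSketch, the
methods lex, add and contains, and the tiny English noise-word list are all hypothetical.
Ranking is not modelled, and query terms without a '+' or '-' prefix are simply ignored.

```java
import java.util.*;

// Hypothetical illustration only: not Lucene and not part of Derby.
class InvertedIndexSketch {
    // Assumed, deliberately tiny English noise-word list.
    private static final Set<String> NOISE = Set.of("the", "a", "an", "of", "and");

    // term -> ids of the rows whose text contains that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Set<Integer> allIds = new TreeSet<>();

    // Lex text into lower-cased tokens and drop noise words.
    static List<String> lex(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty() && !NOISE.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Index one row's text under the given row id.
    void add(int rowId, String text) {
        allIds.add(rowId);
        for (String t : lex(text)) {
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(rowId);
        }
    }

    // Evaluate a query such as "+Apache +Derby -hat":
    // '+term' must be present in the row, '-term' must be absent.
    Set<Integer> contains(String query) {
        Set<Integer> result = new TreeSet<>(allIds);
        for (String raw : query.trim().split("\\s+")) {
            if (raw.length() < 2) {
                continue; // empty piece or a bare operator
            }
            char op = raw.charAt(0);
            String term = raw.substring(1).toLowerCase(Locale.ROOT);
            Set<Integer> hits = postings.getOrDefault(term, Collections.emptySet());
            if (op == '+') {
                result.retainAll(hits);
            } else if (op == '-') {
                result.removeAll(hits);
            }
        }
        return result;
    }
}
```

Because both the indexed text and the query terms are lower-cased, the search is
case-insensitive, which is the case the issue calls the more useful one. A real
integration would still have to answer the storage, locking and optimizer questions
raised above.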
