db-derby-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Db-derby Wiki] Update of "LuceneIntegration" by RichardHillegas
Date Tue, 25 Oct 2005 21:38:50 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Db-derby Wiki" for change notification.

The following page has been changed by RichardHillegas:
http://wiki.apache.org/db-derby/LuceneIntegration

New page:
This page continues a discussion about how to integrate Lucene
with Derby.
[http://lucene.apache.org/  Lucene]
is an Apache text search engine.
The discussion began on the Derby user mailing
list with the
[http://mail-archives.apache.org/mod_mbox/db-derby-user/200507.mbox/browser  Full Text Indexing]
thread.
JIRA enhancement request
[http://issues.apache.org/jira/browse/DERBY-472  DERBY-472]
tracks this discussion.

This page briefly describes Lucene's capabilities and then explores
text-searching features and use cases which Derby might support.
Please feel free to expand this list of features and use cases.

[[TableOfContents]]

== Lucene's Capabilities ==

Lucene provides a Java library for indexing and searching
documents. Lucene ships with English, German, and Russian support, and
plugins are available for other languages, including Chinese, Japanese,
and Korean.
Plugins exist for the following document formats:

|| plain text|| HTML|| XML|| Open Office|| Word|| Excel|| PowerPoint|| IMAP mail|| RTF|| PDF||

The following high level concepts drive Lucene's design:

 * '''Crawling''' - A component crawls through some repository (say a web, filesystem, or database), looking for documents to index.
 * '''Analyzing''' - The resulting documents are analyzed into useful terms:
   * ''Lexing'' - The text is broken up into language-specific words.
   * ''Stemming'' - Inflectional markers are stripped and words are reduced to standard forms. For instance, English possessives, plurals, and tenses disappear and the words bat, bats, bat's, batted, and batting all become the word bat.
   * ''Stopping'' - Noise words (like "the" and "an") are thrown away.
 * '''Indexing''' - An index is built keyed by useful terms. For each useful term, the index tracks various statistics including the term's word offsets into documents.
 * '''Querying''' - Complex queries can be built out of words and phrases, arbitrarily connected by ANDs, ORs, and NOTs. Queries allow exact matches and various kinds of fuzzy matches. Queries may be expressed in a text-based query language or as graphs of Lucene search objects.
 * '''Filtering''' - Query results may be run through noise filters to sift out irrelevant documents.
 * '''Hits''' - Filtered query results, sorted by relevance, appear as lists of document hits.
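
The analyzing, indexing, and querying steps above can be sketched in a few lines of plain Java. This is only a toy illustration and does not use Lucene's API: the class name `ToyTextIndex`, the short stop list, and the naive suffix-stripping stemmer are all assumptions made for this example, and real analyzers are far more careful.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy illustration of analyzing, indexing, and querying. Not Lucene's API. */
public class ToyTextIndex {
    // Assumed stop list; real analyzers use longer, language-specific lists.
    private static final Set<String> STOP_WORDS = Set.of("the", "an", "a", "and", "of");

    // term -> (document id -> offsets of the term in the analyzed document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    /** Lexing, stopping, and stemming, in that order. */
    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.toLowerCase().split("[^a-z']+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
            terms.add(stem(word));
        }
        return terms;
    }

    /** Naive English stemmer: strips one suffix, then collapses a doubled final consonant. */
    public static String stem(String word) {
        String stem = word;
        for (String suffix : new String[] {"'s", "ing", "ed", "s"}) {
            if (stem.endsWith(suffix) && stem.length() - suffix.length() >= 3) {
                stem = stem.substring(0, stem.length() - suffix.length());
                break;
            }
        }
        int n = stem.length();
        if (n >= 4 && stem.charAt(n - 1) == stem.charAt(n - 2)
                && "aeiou".indexOf(stem.charAt(n - 1)) < 0) {
            stem = stem.substring(0, n - 1); // "batt" -> "bat"
        }
        return stem;
    }

    /** Indexing: record each term's offsets for the given document. */
    public void add(int docId, String text) {
        List<String> terms = analyze(text);
        for (int offset = 0; offset < terms.size(); offset++) {
            postings.computeIfAbsent(terms.get(offset), t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(offset);
        }
    }

    /** Ids of the documents that contain the analyzed form of the word. */
    public Set<Integer> docsWith(String word) {
        List<String> terms = analyze(word);
        if (terms.isEmpty()) return new TreeSet<>();
        return new TreeSet<>(postings.getOrDefault(terms.get(0), Map.of()).keySet());
    }

    // Querying: boolean connectives are set operations on document ids.
    public static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    public static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** Documents in a that are NOT in b. */
    public static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        ToyTextIndex index = new ToyTextIndex();
        index.add(1, "The bat's wings");
        index.add(2, "Batting practice and the pitcher");
        index.add(3, "An umpire and a pitcher");
        System.out.println(analyze("bat bats bat's batted batting"));
        System.out.println(and(index.docsWith("bats"), index.docsWith("pitchers")));
    }
}
```

Because the query words pass through the same analyzer as the documents, a search for "bats" finds documents that only mention "bat's" or "batting". The sketch leaves out relevance scoring, fuzzy matching, and filtering.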

== Features We Want ==

Integrating Lucene with Derby may involve some or all of the following
features. We would probably phase these features in over several
releases.

 * '''Complex Searches''' - Text-search documents. Restrict the search by metadata that is stored in Derby. Join search results with supplementary information stored in Derby.
 * '''Administration''' - Be able to use off-the-shelf tools to maintain and optimize Lucene indexes.
 * '''Import/Export''' - Rapid import/export of text-searchable documents.
 * '''Security''' - Restrict text-searching to authorized documents.
 * '''Recovery''' - Recover text-search indexes after a crash.
 * '''Parallelism''' - Be able to spread a text search across many processors.
 * '''Plugins''' - Lucene support should be an optional plugin so that it does not bloat the core Derby release.
 * '''Customizing''' - Customers should be able to supply their own analyzers and filters and store these in the database.
 * '''Query API''' - Customers should be able to express queries with Lucene's query language or with graphs of Lucene search objects.
 * '''Convenience''' - Make it easy to declare which document fields appear in Lucene indexes and which are stored in columns.
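
To make the '''Complex Searches''' and '''Security''' items more concrete, here is a minimal sketch of joining text-search hits with metadata and dropping documents the user may not see. Everything in it is hypothetical: `Hit`, `DocRow`, and `titlesVisibleTo` are names invented for this example, and the in-memory map merely stands in for a Derby table. In an actual integration the join and the authorization check would presumably be expressed in SQL.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy sketch: join text-search hits with metadata rows. All names are hypothetical. */
public class HitJoinSketch {
    /** One text-search hit: a document id plus a relevance score. */
    public static final class Hit {
        final int docId;
        final double score;

        public Hit(int docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Stands in for one row of a Derby table holding document metadata. */
    public static final class DocRow {
        final String title;
        final String owner;

        public DocRow(String title, String owner) {
            this.title = title;
            this.owner = owner;
        }
    }

    /**
     * Walks the hits in the order given (assumed to be relevance order),
     * looks up each hit's metadata row, and keeps the titles of the
     * documents owned by the given user.
     */
    public static List<String> titlesVisibleTo(
            String user, List<Hit> hits, Map<Integer, DocRow> table) {
        List<String> titles = new ArrayList<>();
        for (Hit hit : hits) {
            DocRow row = table.get(hit.docId);
            if (row != null && row.owner.equals(user)) {
                titles.add(row.title);
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        Map<Integer, DocRow> table = new HashMap<>();
        table.put(1, new DocRow("Lease agreement", "alice"));
        table.put(2, new DocRow("Merger memo", "bob"));
        table.put(3, new DocRow("Lease renewal", "alice"));

        List<Hit> hits = List.of(new Hit(3, 0.9), new Hit(2, 0.7), new Hit(1, 0.4));
        System.out.println(titlesVisibleTo("alice", hits, table));
    }
}
```

Filtering after the search, as done here, is the simplest approach but it wastes work on hits that are later discarded. Pushing the metadata restriction into the search itself is one of the design questions this page leaves open.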


== Use Cases to Support ==

||'''Use Case'''||'''Description'''||'''Example'''||
||Loose Coupling||Store documents outside Derby in a filesystem or web.||Web-advertising: Maintain a searchable web of content. When the user searches for content, return web pages as well as advertising JSPs bound to certain keywords.||
||Moderate Coupling||Store documents inside Derby but maintain text-search indexes outside Derby in a filesystem. Provides transactional versioning and audit trail for documents which can be text-searched.||Law office: Be able to transactionally store legal documents and search for them later.||
||Tight Coupling||Transactionally store documents and text-search indexes inside Derby.||Online market: Be able to search for an item immediately after its description is posted.||

== Issues ==

 * '''Index Latency''' - The first phase of Lucene support will probably not store the Lucene indexes in the database, so there will be some lag between storing a document and seeing it appear in searches. How long can this lag be? A minute? An hour? A day? Similarly, after a crash, we may need to rebuild the Lucene indexes. How long can this rebuilding take?
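
The lag described above can be illustrated with a toy model in which documents are stored immediately but the index is only brought up to date by a separate refresh step. This is a sketch under stated assumptions, not a proposed design: the class `LaggingIndexSketch` and its methods are hypothetical, one in-memory map stands in for the Derby table, and another stands in for an external Lucene index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of an index that lags behind the document store. Hypothetical names. */
public class LaggingIndexSketch {
    // Stands in for the Derby table that holds the documents.
    private final Map<Integer, String> documents = new HashMap<>();
    // Documents stored since the last refresh and not yet searchable.
    private final List<Integer> pending = new ArrayList<>();
    // Stands in for a Lucene index kept outside the database: word -> document ids.
    private final Map<String, Set<Integer>> index = new HashMap<>();

    /** Stores the document at once; it becomes searchable only after refresh(). */
    public void store(int docId, String text) {
        documents.put(docId, text);
        pending.add(docId);
    }

    /** Indexes the pending documents. How often this runs determines the lag. */
    public void refresh() {
        for (int docId : pending) {
            indexDocument(docId);
        }
        pending.clear();
    }

    /** Recovery after a crash: rebuild the whole index from the stored documents. */
    public void rebuild() {
        index.clear();
        pending.clear();
        for (int docId : documents.keySet()) {
            indexDocument(docId);
        }
    }

    private void indexDocument(int docId) {
        for (String word : documents.get(docId).toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of indexed documents containing the word. */
    public Set<Integer> search(String word) {
        return new TreeSet<>(index.getOrDefault(word.toLowerCase(), Set.of()));
    }

    public static void main(String[] args) {
        LaggingIndexSketch store = new LaggingIndexSketch();
        store.store(1, "Red bicycle, barely used");
        System.out.println(store.search("bicycle")); // not yet indexed
        store.refresh();
        System.out.println(store.search("bicycle")); // now visible
    }
}
```

In this model the cost of `rebuild()` grows with the total number of stored documents, while the cost of `refresh()` grows only with the number of documents stored since the last refresh. That difference is why the acceptable lag and the acceptable rebuild time are separate questions.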
