lucene-java-user mailing list archives

From "Hackl, Rene" <Rene.Ha...@FIZ-Karlsruhe.DE>
Subject Re: Can use Lucene be used for this
Date Thu, 13 Nov 2003 08:22:57 GMT
Hi John,

Indeed, the RCO index works fine for leading wildcards ("*bar"). But it
doesn't work for _simultaneous_ left and right truncation ("*oba*"). I have
no idea how often this kind of search is actually used, but in this
particular context it is really needed. I sketched this on the list before;
in brief: documents contain very long strings for chemical substances, and
users are interested in certain parts of the string, e.g. find all documents
that contain "*foo*", be it "1-foo-bar" or "rab-oof-13-foonyl-naphthalene".
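
A minimal sketch of why reversal alone does not cover this case. It is plain
Python, not Lucene code, and both helper names are hypothetical: reversing a
pattern that has a wildcard at both ends still leaves a wildcard in front.

```python
def reverse_pattern(pattern: str) -> str:
    """Reverse a wildcard pattern character by character."""
    return pattern[::-1]


def has_leading_wildcard(pattern: str) -> bool:
    """True if the pattern still starts with '*' and so is not a prefix query."""
    return pattern.startswith("*")


for query in ("*bar", "*oba*"):
    rewritten = reverse_pattern(query)
    print(query, "->", rewritten,
          "| leading wildcard remains:", has_leading_wildcard(rewritten))
```

"*bar" turns into the prefix query "rab*", whereas "*oba*" turns into
"*abo*", which is no easier to answer than the original.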

Suggestions on improvements are always welcome! :-)

Best regards,
René


-----Original Message-----
From: Majerus, John P. [mailto:Majerus.John@mayo.edu]
Sent: Thursday, November 13, 2003 00:41
To: 'Lucene Users List'
Subject: RE: Can use Lucene be used for this

Hello,
This has probably been put forth on the list before, but how about the
following approach for leftmost wildcard searches, at least for single term
searches?

Reverse the character order of all words after they're stemmed and before
they're added to a special reverse-character-order index. Any time a
wildcard was found at the beginning of the search term the special index
would be engaged. Then a search for "*bar" would be converted to a search
for "rab*" on the RCO index, and the search would find "raboof", and this
result would then be unreversed upon display to yield: "foobar". 
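
The idea above can be sketched roughly as follows. This is plain Python with
a sorted list standing in for a term dictionary, not Lucene's actual index;
all function names are hypothetical and stemming is left out.

```python
import bisect


def build_rco_index(terms):
    """Reverse every term and keep the results sorted, like a term dictionary."""
    return sorted({t[::-1] for t in terms})


def prefix_lookup(sorted_terms, prefix):
    """Return all entries that start with prefix, located by binary search."""
    start = bisect.bisect_left(sorted_terms, prefix)
    hits = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later entry can match
        hits.append(term)
    return hits


def leading_wildcard_search(rco_index, query):
    """Answer '*bar' by running 'rab*' on the reversed index."""
    if not query.startswith("*") or "*" in query[1:]:
        raise ValueError("expected exactly one leading wildcard")
    reversed_prefix = query[1:][::-1]
    # Un-reverse the hits before returning them for display.
    return [t[::-1] for t in prefix_lookup(rco_index, reversed_prefix)]


rco = build_rco_index(["foobar", "foo", "barfoo", "crowbar"])
print(leading_wildcard_search(rco, "*bar"))
```

The lookup only ever runs a prefix scan on sorted terms, which is the case a
term dictionary handles efficiently.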

Rene's special index could be several times larger in entry count, depending
on the average length of the contained terms. A reverse-character-order
index is the same size as its regular counterpart.
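
A small illustration of the entry-count difference, in plain Python with
made-up sample terms. It assumes every suffix down to a single character is
indexed, which the earlier post does not actually state.

```python
terms = ["foobar", "foonyl", "naphthalene"]

# Reverse-character-order index: exactly one entry per distinct term.
reversed_entries = {t[::-1] for t in terms}

# Suffix-style index: one entry per suffix of every term (assumed down to
# length 1), so the count grows with the average term length.
suffix_entries = {t[i:] for t in terms for i in range(len(t))}

print("distinct terms:  ", len(set(terms)))
print("reversed entries:", len(reversed_entries))
print("suffix entries:  ", len(suffix_entries))
```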

Cheers,
John
-----Original Message-----
From: Hackl, Rene [mailto:Rene.Hackl@FIZ-Karlsruhe.DE]
Sent: Wednesday, November 12, 2003 6:34 AM
To: 'Lucene Users List'
Subject: Re: Can use Lucene be used for this


>> col2 like %aa%

> Lucene doesn't handle queries where the start of the term is not known
> very efficiently.

Is it really able to handle them at all? I thought "*foo"-type queries were
not supported.

That's why I build two indexes for the purpose of simultaneous left and
right truncation: one "normal" index and a special one, which takes tokens
and breaks them down; for instance, "foobar" would also be indexed as
"oobar" and "obar". For a query "*oba*", the left wildcard causes the
special index to be searched for "oba*"; queries without left truncation
search the normal index.
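
A rough sketch of this two-index scheme, reduced to the special index. It is
plain Python rather than Lucene code; the whitespace tokenizer, the minimum
suffix length and all names are assumptions made for illustration only.

```python
import bisect

MIN_SUFFIX_LEN = 1  # assumption: the post gives no minimum suffix length


def suffixes(token, min_len=MIN_SUFFIX_LEN):
    """All suffixes of token with at least min_len characters."""
    return [token[i:] for i in range(len(token) - min_len + 1)]


def build_special_index(docs):
    """Map every suffix of every token to the ids of documents containing it."""
    postings = {}
    for doc_id, text in docs.items():
        for token in text.split():
            for suffix in suffixes(token):
                postings.setdefault(suffix, set()).add(doc_id)
    return sorted(postings), postings


def infix_search(index, query):
    """Answer '*oba*' by running the prefix query 'oba*' on the suffix index."""
    keys, postings = index
    prefix = query.strip("*")
    start = bisect.bisect_left(keys, prefix)
    hits = set()
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        hits |= postings[key]
    return sorted(hits)


docs = {
    1: "1-foo-bar",
    2: "rab-oof-13-foonyl-naphthalene",
    3: "foobar",
}
index = build_special_index(docs)
print(infix_search(index, "*oba*"))
print(infix_search(index, "*foo*"))
```

Because each suffix is stored as its own term, positions in the special index
no longer correspond to word positions in the document.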

The special index is created with maxFieldLength = 100000

build-time specialIndex vs. normalIndex: +60%
index size specialIndex vs. normalIndex: +240%
index size specialIndex vs. originalDocSize: +60%
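
Reading "+N%" as "N% larger than", the two size figures can be combined to
estimate the normal index relative to the original documents. This is only
arithmetic on the numbers above under that assumed reading, not a
measurement.

```python
# Assumed reading: "+240%" means special = normal * (1 + 2.40), and
# "+60%" means special = original_docs * (1 + 0.60).
special_vs_normal = 1 + 2.40
special_vs_docs = 1 + 0.60

# normal / original_docs = (special / original_docs) / (special / normal)
normal_vs_docs = special_vs_docs / special_vs_normal
print(round(normal_vs_docs, 2))  # 0.47
```

Under that reading, the normal index would be a bit under half the size of
the original documents.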

Query execution is still very fast on a 3GB specialIndex. 

I guess the usability depends on how large your document collection is and
what kind of search functionality you need. The drawback of this approach
is that proximity and phrase searches on the special index no longer work.

Would it make sense to skip creating the prx-file to reduce index size
when that kind of search isn't offered anyway? Is that possible at all?

Best regards,
René

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
