lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Lucene in the Humanities
Date Wed, 23 Feb 2005 03:36:27 GMT

On Feb 22, 2005, at 8:50 PM, Chris Hostetter wrote:

>
> : >>> Just curious: it would seem easier to use multiple fields for the
> : >>> original case and lowercase searching. Is there any particular reason
> : >>> you analyzed the documents to multiple indexes instead of multiple
> : >>> fields?
> : >>
> : >> I considered that approach, however to expose QueryParser I'd have to
> : >> get tricky.  If I have title_orig and title_lc fields, how would I
> : >> allow freeform queries of title:something?
>
> Why have separate fields?
>
> Why not index the title into the "title" field twice, once with each term
> lowercased and once with the case left alone? (Using an analyzer that
> tokenizes "The Quick BrOwN fox" as "[the] [quick] [brown] [fox] [The]
> [Quick] [BrOwN] [fox]")
>
> Then at search time, depending on the value of the checkbox, construct
> your QueryParser using the appropriate Analyzer.

I assume you mean to stack the tokens in the same positions, so it'd be 
like this:

	[the]	[quick]	[brown]	[fox]
	[The]	[Quick]	[BrOwN]	[fox]

Otherwise, if you simply string the tokens together as you show, then a
phrase query for "fox The Quick" would match, even though that phrase is
not in the original document.  Putting a large position gap between the
two copies would do the trick in your example, though.
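
As a rough illustration of the difference, here is a toy model in plain
Python. It is not Lucene or PyLucene code: the function names and the
gap parameter are invented for this sketch, and a phrase match is simply
modeled as the terms occurring at consecutive positions.

```python
def concatenated(text, gap=0):
    """Lowercased copy first, then the original-case copy after it.

    `gap` is a hypothetical number of extra empty positions inserted
    between the two copies.
    """
    words = text.split()
    postings = [(pos, w.lower()) for pos, w in enumerate(words)]
    offset = len(words) + gap
    postings += [(offset + pos, w) for pos, w in enumerate(words)]
    return postings


def stacked(text):
    """Lowercased and original-case tokens share the same position."""
    postings = []
    for pos, word in enumerate(text.split()):
        postings.append((pos, word.lower()))
        postings.append((pos, word))
    return postings


def phrase_matches(postings, phrase):
    """True if the phrase terms occur at consecutive positions."""
    positions = {}
    for pos, tok in postings:
        positions.setdefault(tok, set()).add(pos)
    terms = phrase.split()
    return any(
        all(start + i in positions.get(term, set())
            for i, term in enumerate(terms))
        for start in positions.get(terms[0], set())
    )


doc = "The Quick BrOwN fox"

# Strung together with no gap: the spurious phrase matches.
print(phrase_matches(concatenated(doc), "fox The Quick"))           # True
# A large gap between the two copies prevents it.
print(phrase_matches(concatenated(doc, gap=100), "fox The Quick"))  # False
# Stacked positions: no spurious match, both casings still found.
print(phrase_matches(stacked(doc), "fox The Quick"))                # False
print(phrase_matches(stacked(doc), "the quick brown"))              # True
print(phrase_matches(stacked(doc), "The Quick BrOwN"))              # True
```

In this model the stacked layout and the gapped layout both reject the
spurious phrase, while the stacked layout keeps the positions of the two
casings aligned.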

There is a fiddly issue with this technique that I'm not quite seeing 
at the moment, but I'll brainstorm on it and hopefully remember it or 
perhaps be proven wrong.

I'm Lucene-brain-dead.... I just did a presentation to our local Unix
Users Group.  I built a man page indexer/searcher with PyLucene
(thank you Andi!).  I had to learn Python as well, which was a good
exercise, and I learned lots from Andi's helpful private e-mails coaching
me through my learning curve.  Now that I've seen the beast known as
Python, I'm yearning for a Ruby version based on GCJ/SWIG.  A local
Ruby guru and I are planning on meeting for a few hours each week to
take a stab at it.  I'll commit whatever we do directly to a /ruby
directory in Subversion.

Here's an example of my PyLucene output:

$ mansearch.py interface section:5
remote - remote host description file
rtadvd.conf - config file for router advertisement daemon
ipnat - IP NAT file format
groff_out - groff intermediate output format
xinetd.conf - Extended Internet Services Daemon configuration file
plist - property list format
racoon.conf - configuration file for racoon
ssh_config - OpenSSH SSH client configuration files
sudoers - list of which users may execute what

Even with custom formatting:

$ mansearch.py --format=#filename interface section:5
/usr/share/man/man5/remote.5
/usr/share/man/man5/rtadvd.conf.5
/usr/share/man/man5/ipnat.5
/usr/share/man/man5/groff_out.5
/usr/share/man/man5/xinetd.conf.5
/usr/share/man/man5/plist.5
/usr/share/man/man5/racoon.conf.5
/usr/share/man/man5/ssh_config.5
/usr/share/man/man5/sudoers.5

suitable for xargs :)

	Erik


>
> The only problem I can think of would be inflated scores for terms that
> are naturally lowercased, because they would wind up getting added to the
> index twice, but based on what I've seen of the data you are working
> with, I imagine that if you used UPPERCASE instead of lowercase you
> could drastically reduce the likelihood of any problems with that.
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org



