lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: posting list strings
Date Tue, 09 Jul 2013 21:34:50 GMT
Hi,

You can replace the terms with their hashes directly in the analyzer chain. Just write a custom
TermToBytesRefAttribute implementation that hashes each term to a constant-length byte[] (plugged
in via an AttributeFactory)! :-) This would give you all the features of hashed, constant-length
terms, but you would lose prefix and wildcard queries. In fact, NumericTokenStream does this for
numeric values!
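
As a rough illustration of the hashing step only, here is a minimal sketch with no Lucene dependency. The class name TermHasher, the method hashTerm, the choice of SHA-256, and the 8-byte length are all assumptions made for this example. A custom TermToBytesRefAttribute implementation could call something like it when filling the term bytes, but that wiring is not shown. A truncated digest is not a perfect hash, so collisions remain possible.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper: maps a term to a constant-length byte[].
final class TermHasher {
    // Assumed hash length in bytes; every term maps to exactly this many bytes.
    static final int HASH_LENGTH = 8;

    // Hashes the UTF-8 bytes of the term with SHA-256 and keeps the first
    // HASH_LENGTH bytes. Equal terms always produce equal hashes.
    static byte[] hashTerm(String term) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] full = digest.digest(term.getBytes(StandardCharsets.UTF_8));
            return Arrays.copyOf(full, HASH_LENGTH);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Formats bytes as lowercase hex for printing.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "lucen" is a string prefix of "lucene", but the two hashes are unrelated,
        // which is why prefix and wildcard queries stop working on hashed terms.
        for (String term : new String[] {"lucene", "lucen"}) {
            System.out.println(term + " -> " + toHex(hashTerm(term)));
        }
    }
}
```

Exact-match lookups still work because the query term is hashed the same way at search time.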

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Adrien Grand [mailto:jpountz@gmail.com]
> Sent: Tuesday, July 09, 2013 11:25 PM
> To: java-user@lucene.apache.org
> Subject: Re: posting list strings
> 
> Hi,
> 
> Lucene stores the string because it may need it to run prefix or range
> queries. We don't have a hash-based terms dictionary right now but I know
> some people wrote one since they don't need support for these queries, see
> for instance the Earlybird paper[1]. Then if you can find a perfect hashing
> function, you can just replace your terms by their hash.
> 
> [1]
> http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf
> 
> --
> Adrien
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



