lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <>
Subject Re: Index a source, but not store it... can it be done?
Date Fri, 09 Mar 2007 03:32:52 GMT

: I don't know... hashing individual words is an extremely weak form of
: security that should be breakable without even using a computer... all
: the statistical information is still there (somewhat like 'encrypting'
: a message as a cryptoquote).
: Doron's suggestion is preferable: eliminate token position information
: from the index entirely.

i guess i wasn't thinking about this as a "security" issue, more a
"discouragement" issue ... reconstructing a doc from term vectors is easy,
reconstructing it from just term positions is harder but not impossible,
reconstructing from hashed tokens requires a lot of hard work.

if the issue is thta you want to be abel to ship an index that people can
manipulate as much as they want and you want to garuntee they can never
reconstruct the original docs you're pretty much screwed ... even if you
eliminate all of the position info statistical info about language
structure can help you gleen a lot about hte source data.

i'm not crypto expert, but i imagine it would probably take the same
amount of statistical guess work to reconstruct meaningful info from
either approach (hashing hte individual words compared to eliminating the
positions) so i would think the trade off of supporting phrase queries
would make the hasing approach more worthwhile.

i mean afterall: you still wnat the index to be useful for searching
right? ... if you are really paranoid don't just strip the positions,
strip all duplicate terms as well to prevent any attempt at statistical
sampling ... but now all you relaly have is a lookup table of word to
docid with no tf/idf or position info to improve scoring, so why bother
with Lucene, jsut use a BerkleyDB file to do your lookups.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message