lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mike Klaas" <>
Subject Re: Index a source, but not store it... can it be done?
Date Fri, 09 Mar 2007 04:04:18 GMT
On 3/8/07, Chris Hostetter <> wrote:

> if the issue is thta you want to be abel to ship an index that people can
> manipulate as much as they want and you want to garuntee they can never
> reconstruct the original docs you're pretty much screwed ... even if you
> eliminate all of the position info statistical info about language
> structure can help you gleen a lot about hte source data.


> i'm not crypto expert, but i imagine it would probably take the same
> amount of statistical guess work to reconstruct meaningful info from
> either approach (hashing hte individual words compared to eliminating the
> positions) so i would think the trade off of supporting phrase queries
> would make the hasing approach more worthwhile.

I suppose it also depends on how much access the user has to the
index.  If they have access to the physical index and means of
querying it, then they have access to the hashing algo (and/or key)
and so it is worthless.  If they don't, and their access is strictly
through queries, then I don't see what help hashing will provide, as
the result of any given query should be the same, hashing or not.

> i mean afterall: you still wnat the index to be useful for searching
> right? ... if you are really paranoid don't just strip the positions,
> strip all duplicate terms as well to prevent any attempt at statistical
> sampling ... but now all you relaly have is a lookup table of word to
> docid with no tf/idf or position info to improve scoring, so why bother
> with Lucene, jsut use a BerkleyDB file to do your lookups.

You could also do both.  Another thing that might help is relatively
aggressive stop word removal.  All these measures will raise the
"discouragement" bar slightly.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message