lucene-dev mailing list archives

From Grant Ingersoll <>
Subject Re: Apache logs and data
Date Mon, 19 Nov 2007 20:15:02 GMT
I'm not sure where the personal info would be leaked.  We aren't
proposing to make available who made the query, just what the query
is, and I suspect the IP info, etc. could be stripped fairly easily.
So, we wouldn't necessarily know who is searching for "Yonik Seeley"
when we see that query term, just that it was searched for.  Maybe we
can ask infrastructure what is even available first.  The other
question is whether the ASF has a disclaimer about information being
logged, etc.  For instance, all emails to public mailing lists are considered

At any rate, I think the bigger issue is finding a good set of data  
and query logs that we can use.  An alternate way is to just start  
creating a query set based on the Wikipedia data, but that isn't as  
"real world" as query logs are.

Here's another possible thought:  What if we took our own java-user
mailing list for a time period and used the subject line or some
other piece of info in the text, and treated them as queries?  Maybe
we can automatically identify questions; that's not hard to do for
simple cases (just identify sentences ending in "?"), which would
give us enough, methinks.  This may be a decent approximation of a
user's information need and probably wouldn't be all that hard to
crank out, and it has the nice feature that the user has consented to make the information
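
As a rough illustration of the "sentences ending in ?" idea, here is a minimal Python sketch.  The function name `extract_questions` and the naive sentence splitter are my own assumptions for illustration, not anything that exists in Lucene; it only handles the simple cases mentioned above and skips quoted reply lines so that only the sender's own text is scanned.

```python
import re

# Naive sentence boundary: whitespace that follows ".", "!" or "?".
# This is an assumption for simple cases; abbreviations will confuse it.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def extract_questions(body):
    """Return the sentences in a mail body that end in a question mark."""
    # Skip quoted reply lines (starting with ">") so we keep only new text.
    own_lines = [ln for ln in body.splitlines()
                 if not ln.lstrip().startswith(">")]
    # Collapse line wrapping so sentences split across lines are rejoined.
    text = " ".join(" ".join(own_lines).split())
    sentences = _SENTENCE_END.split(text)
    return [s.strip() for s in sentences if s.strip().endswith("?")]
```

Each returned sentence could then be treated as one candidate query, possibly alongside the message's subject line.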

Of course, we could see if there is a way to purchase the TREC data  
(donations, anyone?) and make it available to committers on zones.   
This is about the only legal way to do this, but to me is less than  
satisfactory as it doesn't allow much innovation from other  
contributors.  See;#52022

  for that discussion.


On Nov 19, 2007, at 1:46 PM, Chris Hostetter wrote:

> : > report of (querystring,accesscount)->url mappings based on requests that
> : > had a major search engine as the refer URL, that should be fine right?
> :
> : Query strings can leak personal info too (think of someone googling
> : themselves or their SSN)
> right ... i'm not suggesting we do this in an automatic un-human-involved
> way; i'm suggesting that a "trusted" person generate this report,
> ignore anything with a count less than some number (both to remove noise,
> and eliminate most of the random "identifiable" queries), and then
> manually remove anything that looks "personal"
> -Hoss
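
For what it's worth, the report Hoss describes (count query strings from search-engine referrers, drop anything below a threshold) might be sketched as below.  Everything here is an assumption for illustration: the `query_report` name, the list of search hosts, the query parameter names, and the input shape of (referrer URL, requested URL) pairs already pulled out of the access logs.  Note that no IP address is ever passed in, and the manual pass to remove anything that looks "personal" would still have to happen afterwards.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Assumed for illustration: host fragments of "major" search engines and
# the URL parameters in which they carry the user's query.
SEARCH_HOSTS = ("google.", "yahoo.", "bing.")
QUERY_PARAMS = ("q", "p")


def query_report(records, min_count=5):
    """Map (query string, requested url) -> access count.

    records is an iterable of (referrer_url, requested_url) pairs.
    Pairs seen fewer than min_count times are dropped, which removes
    noise and most one-off "identifiable" queries.
    """
    counts = Counter()
    for referrer, url in records:
        parts = urlsplit(referrer)
        host = parts.netloc.lower()
        if not any(h in host for h in SEARCH_HOSTS):
            continue  # referrer is not a known search engine
        params = parse_qs(parts.query)
        for name in QUERY_PARAMS:
            if name in params:
                query = params[name][0].strip().lower()
                if query:
                    counts[(query, url)] += 1
                break
    return {key: n for key, n in counts.items() if n >= min_count}
```

The threshold is only a first filter; a frequent query can still be personal, so the result is meant as input to the human review step rather than something to publish directly.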
