lucene-dev mailing list archives

From Chris Hostetter
Subject Re: Apache logs and data
Date Mon, 19 Nov 2007 20:41:46 GMT

: info, etc. could be stripped fairly easily.   So, we wouldn't necessarily know
: who is searching for "Yonik Seeley" when we see that query term, just that it
: was searched for.  Maybe we can inquire to infrastructure what is even

It's a largely theoretical argument (particularly relating to a subset of 
results on a specific domain as opposed to a subset from a specific search 
engine) but the nutshell is: there may in fact be identifiable info in 
the query string itself, so it's good to have some sanity checking before 
exposing the queries to the world.
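
A minimal sketch of what such a sanity check might look like, assuming the
queries have already been pulled out of the logs as plain strings. The
patterns and function names below are illustrative assumptions, not anything
agreed on in this thread, and pattern matching alone cannot recognise a
person's name in a query, so flagged and unflagged queries would both still
deserve a human look before release.

```python
import re

# Illustrative patterns only (assumptions); a real check would need a
# broader, reviewed list.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
LONG_DIGITS_RE = re.compile(r"\b\d{9,}\b")  # account-number-like digit runs


def looks_identifiable(query):
    """Return True if the query matches any of the illustrative patterns."""
    return any(p.search(query) for p in (EMAIL_RE, PHONE_RE, LONG_DIGITS_RE))


def split_for_review(queries):
    """Split queries into (unflagged, flagged) lists, preserving order."""
    unflagged, flagged = [], []
    for q in queries:
        (flagged if looks_identifiable(q) else unflagged).append(q)
    return unflagged, flagged
```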

: At any rate, I think the bigger issue is finding a good set of data and query
: logs that we can use.  An alternate way is to just start creating a query set
: based on the Wikipedia data, but that isn't as "real world" as query logs are.

I think looking at referer URLs containing query strings grouped by TLP site 
would give us lots of useful "small" collections of docs and query strings 
that are considered "relevant" (albeit: not by a human judgement, but by 
some other search engine -- it's a start)
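
As a rough sketch of that idea, assuming the logs are in Apache's "combined"
format (where the Referer is the second-to-last quoted field) and that each
site has its own log file: the parameter names in QUERY_PARAMS and the helper
names are assumptions for illustration, and different search engines use
different parameters.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse, parse_qs

# Assumes Apache "combined" log format:
#   host ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]*\] "\S+ (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)"'
)

# Assumed names of the search-term parameter; varies by search engine.
QUERY_PARAMS = ("q", "query", "p")


def extract_query(referer):
    """Return the search terms found in a referer URL, or None."""
    params = parse_qs(urlparse(referer).query)
    for name in QUERY_PARAMS:
        values = params.get(name)
        if values and values[0].strip():
            return values[0].strip()
    return None


def collect_queries(log_lines):
    """Map each query string to the set of paths it led visitors to."""
    hits = defaultdict(set)
    for line in log_lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        query = extract_query(match.group("referer"))
        if query:
            hits[query].add(match.group("path"))
    return dict(hits)
```

Running collect_queries once per site's log would give one small collection
of (query, document) pairs per TLP site.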

if you take something like the online HTTPD manual, each URL can be easily 
mapped to a machine parsable XML version, and i'm sure we can find plenty 
of good query strings in the referer logs for

: Here's another possible thought:  What if we took our own java-user mailing
: list for a time period and we used the subject line or some other piece of
: info in the text (maybe we can automatically identify questions (not hard to
: do for simple cases (just identify sentences ending in ?), which would give us
: enough, methinks) and treat them as queries?  This may be a decent

two concerns i would have:
  1) the person asking the question doesn't always know what to ask about 
     (the X/Y problem) which could lead to misleading query/result pairs
  2) people aren't always "on topic" ... discussions can branch/evolve 
     without subjects changing (formal documentation doesn't really have 
     this problem)
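
For reference, the "simple case" from the quoted text (sentences ending in ?)
could be sketched roughly as below. The function name and the quote-prefix
handling are assumptions, and nothing here addresses either concern above: it
only finds question-shaped sentences, it doesn't tell you whether they are
good queries.

```python
import re

# Lines starting with these prefixes are treated as quoted text from an
# earlier message and skipped (an assumption about list conventions).
QUOTE_PREFIXES = (">", ":")


def extract_questions(body):
    """Return the sentences in a message body that end with '?'."""
    own_lines = [
        line.strip()
        for line in body.splitlines()
        if line.strip() and not line.lstrip().startswith(QUOTE_PREFIXES)
    ]
    text = " ".join(own_lines)
    # Split after sentence-ending punctuation that is followed by whitespace,
    # so version numbers like "2.2" are not split.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if s.endswith("?")]
```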

: Of course, we could see if there is a way to purchase the TREC data
: (donations, anyone?) and make it available to committers on zones.  This is

if spending money is an option, but spending enough money for TREC isn't 
an option, something i've been considering is using Amazon's mechanical 
turk to generate judgements ... take some seed data (ie: referer log query 
strings and the title/summary/url of the top 5 URLs for each) and give 
mturk users $0.05 to rank those 5 in order of how well they match.
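
A rough sketch of how the seed data might be turned into ranking tasks, what
that would cost at $0.05 per task, and one way to combine several workers'
orderings. The data shapes, the function names, the idea of asking more than
one worker per query, and the mean-position aggregation are all assumptions
for illustration; nothing here talks to the actual Mechanical Turk service.

```python
from collections import defaultdict

REWARD_PER_TASK = 0.05   # dollars per ranking task, as suggested above
RESULTS_PER_QUERY = 5    # top 5 URLs per query


def build_tasks(seed):
    """seed maps a query string to a list of (title, summary, url) tuples.

    Queries with fewer than RESULTS_PER_QUERY results are skipped.
    """
    tasks = []
    for query, results in seed.items():
        top = results[:RESULTS_PER_QUERY]
        if len(top) == RESULTS_PER_QUERY:
            tasks.append({"query": query, "results": top})
    return tasks


def estimated_cost(num_tasks, workers_per_task=1):
    """Total reward in dollars, ignoring any service fees."""
    return round(num_tasks * workers_per_task * REWARD_PER_TASK, 2)


def aggregate_rankings(rankings):
    """Combine several best-first URL orderings by mean position."""
    positions = defaultdict(list)
    for ranking in rankings:
        for position, url in enumerate(ranking):
            positions[url].append(position)
    return sorted(positions, key=lambda u: sum(positions[u]) / len(positions[u]))
```

With these assumptions, 100 queries judged by 3 workers each would come to
estimated_cost(100, 3) dollars in rewards.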
