lucene-dev mailing list archives

From Grant Ingersoll <>
Subject Re: Apache logs and data
Date Mon, 19 Nov 2007 20:54:59 GMT

On Nov 19, 2007, at 3:41 PM, Chris Hostetter wrote:

> : info, etc. could be stripped fairly easily.  So, we wouldn't
> : necessarily know who is searching for "Yonik Seeley" when we see
> : that query term, just that it was searched for.  Maybe we can
> : inquire to infrastructure what is even
>
> It's a largely theoretical argument (particularly relating to a
> subset of results on a specific domain as opposed to a subset from a
> specific search engine) but the nutshell is: there may in fact be
> identifiable info in the query string itself, so it's good to have
> some sanity checking before exposing the queries to the world.


> : At any rate, I think the bigger issue is finding a good set of
> : data and query logs that we can use.  An alternate way is to just
> : start creating a query set based on the Wikipedia data, but that
> : isn't as "real world" as query logs are.
>
> I think looking at refer URLs containing query strings grouped by
> TLP site would give us lots of useful "small" collections of docs
> and query strings that are considered "relevant" (albeit: not by a
> human judgement, but by some other search engine -- it's a start)
>
> if you take something like the online HTTPD manual, each URL can be
> easily mapped to a machine parsable XML version, and i'm sure we can
> find plenty of good query strings in the refer logs for
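FWIW, pulling the query strings out of the refer logs should be pretty mechanical.  A rough sketch (assumes the combined log format and a google-style "q" parameter -- other engines name the parameter differently, so this would need a per-engine pattern in practice):

```java
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch: pull search-engine query strings out of Referer URLs in an
 * access log.  Assumes the combined log format and a "q" parameter;
 * both are assumptions, not a spec.
 */
public class RefererQueries {

  // Matches a q=... parameter inside the quoted Referer field.
  private static final Pattern QUERY_PARAM =
      Pattern.compile("[?&]q=([^&\"\\s]+)");

  /** Returns the decoded query from one log line, or null if none. */
  public static String extractQuery(String logLine) throws Exception {
    Matcher m = QUERY_PARAM.matcher(logLine);
    if (!m.find()) {
      return null;
    }
    // Referer query params are URL-encoded (e.g. "+" for space).
    return URLDecoder.decode(m.group(1), "UTF-8");
  }

  public static void main(String[] args) throws Exception {
    String line = "1.2.3.4 - - [19/Nov/2007:20:54:59 +0000] "
        + "\"GET /lucene/docs/queryparsersyntax.html HTTP/1.1\" 200 5120 "
        + "\"http://www.google.com/search?q=lucene+boolean+query\" "
        + "\"Mozilla/5.0\"";
    System.out.println(extractQuery(line));  // lucene boolean query
  }
}
```

Grouping the extracted queries by the requested URL then gives the (query, "relevant" doc) pairs per TLP site.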


> : Here's another possible thought:  What if we took our own
> : java-user mailing list for a time period and we used the subject
> : line or some other piece of info in the text (maybe we can
> : automatically identify questions (not hard to do for simple cases
> : (just identify sentences ending in ?), which would give us enough,
> : methinks) and treat them as queries?  This may be a decent
>
> two concerns i would have:
>  1) the person asking the question doesn't always know what to ask
>     about (the X/Y problem) which could lead to misleading
>     query/result matches.
>  2) people aren't always "on topic" ... discussions can branch/evolve
>     without subjects changing (formal documentation doesn't really
>     have this problem)

Both true, but as with the other scenarios (except TREC) there is a
human in the loop, and we don't have to take every question available,
just 100 or so good ones.  Maybe we could even use the FAQs applied
against the archive.

The other hard part about the mail archive is that you are likely to
get matches against the emails asking a question and not just the
emails answering it.  Not sure whether those count as relevant.
Sometimes, for me, just reading how someone else phrased the problem
is enough to spur an answer.
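For what it's worth, the naive "sentences ending in ?" heuristic is only a few lines.  A sketch (deliberately simple -- it will misfire on code snippets and multi-clause sentences, which is exactly why we'd hand-pick the good ones afterwards):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the "identify sentences ending in ?" heuristic for mining
 * queries from a mailing-list body.  Naive on purpose; the output is
 * meant to be filtered by a human, not used as-is.
 */
public class QuestionExtractor {

  public static List<String> extractQuestions(String body) {
    List<String> questions = new ArrayList<String>();
    // Split on sentence-ending punctuation, keeping it via lookbehind.
    for (String sentence : body.split("(?<=[.!?])\\s+")) {
      String s = sentence.trim();
      if (s.endsWith("?")) {
        questions.add(s);
      }
    }
    return questions;
  }

  public static void main(String[] args) {
    String body = "I indexed 1M docs. Why is my QueryParser throwing a "
        + "ParseException on AND? Thanks in advance.";
    for (String q : extractQuestions(body)) {
      System.out.println(q);
    }
  }
}
```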

> : Of course, we could see if there is a way to purchase the TREC
> : data (donations, anyone?) and make it available to committers on
> : zones.  This is
>
> if spending money is an option, but spending enough money for TREC
> isn't an option, something i've been considering is using Amazon's
> mechanical turk to generate judgements ... take some seed data (ie:
> refer log query strings and the title/summary/url of the top 5 URLs
> for each) and give mturk users $0.05 to rank those 5 in order of how
> well they match.

I believe the TREC collection costs somewhere around $300, so it isn't
going to break the bank.  Perhaps we could ask the board to pay for
it, or maybe we could arrange for donations.  I'd be willing to kick
in up to $50 to have it available, but I still don't like this route:
only committers can have access, b/c it is on zones, and I don't know
that this is that high of a priority for committers.  Instead, I want
something researchers and upstart grad students can easily download
and try out, and that we can all then discuss b/c we all have the
data.  Furthermore, by having multiple data sets, we can hopefully
avoid the overtuning problem.

