lucene-java-user mailing list archives

From "Peter W." <>
Subject Re: How to keep user search history and how to turn it into information?
Date Tue, 14 Aug 2007 20:28:47 GMT
Hey Lukas,

You can get a basic demo of this working in Lucene
first then make a more advanced and efficient version.

First, give each document in your index a score field
using NumberTools so it's sortable. When users perform
a search, log the unique document_id, IP address and
result position for the next step.
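The two pieces above can be sketched in plain Java. This is only an illustration: `encodeScore` mimics what Lucene's NumberTools does (zero-padded encoding so lexicographic order matches numeric order), and the tab-separated log format is made up for this example.

```java
import java.util.Arrays;

public class ScoreField {
    // Zero-pad a non-negative score so lexicographic order matches
    // numeric order -- the same idea NumberTools uses for longs.
    static String encodeScore(long score) {
        return String.format("%019d", score);
    }

    // One click-log line: document_id, IP address, result position
    // (tab-separated; a hypothetical format, not anything Lucene defines).
    static String logLine(String docId, String ip, int position) {
        return docId + "\t" + ip + "\t" + position;
    }

    public static void main(String[] args) {
        String[] encoded = { encodeScore(100), encodeScore(9), encodeScore(25) };
        Arrays.sort(encoded); // lexicographic sort now equals numeric sort
        System.out.println(Arrays.toString(encoded));
        System.out.println(logLine("doc42", "10.0.0.1", 3));
    }
}
```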

Use Hadoop to simplify your logs by mapping the
document_id and emitting IPs as intermediate values.
Have the reduce step collect the set of unique IP
addresses for each document_id.
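Here is a local, single-process sketch of what that map/reduce step computes (no Hadoop dependency; the grouping is done with an in-memory map, and the log format matches the hypothetical one above):

```java
import java.util.*;

public class ClickGrouper {
    // Simulates the MapReduce job locally: "map" each log line to
    // (document_id, IP), then "reduce" to the set of unique IPs per id.
    static Map<String, Set<String>> group(List<String> logLines) {
        Map<String, Set<String>> byDoc = new TreeMap<>();
        for (String line : logLines) {
            String[] parts = line.split("\t"); // document_id \t IP \t position
            byDoc.computeIfAbsent(parts[0], k -> new TreeSet<>()).add(parts[1]);
        }
        return byDoc;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "doc1\t10.0.0.1\t2",
            "doc1\t10.0.0.1\t5",   // same IP clicks again: counted once
            "doc1\t10.0.0.2\t1",
            "doc2\t10.0.0.3\t4");
        System.out.println(group(log));
    }
}
```

In real Hadoop this would be a Mapper emitting (document_id, IP) pairs and a Reducer deduplicating the values, but the grouping logic is the same.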

Read through the final output file, increment the score
value for each IP that clicked on the document_id,
re-index in Lucene, and sort results (in reverse order)
by the score field.
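A local sketch of that scoring pass, assuming the grouped output from the previous step (all names are illustrative; the actual re-indexing into Lucene is not shown):

```java
import java.util.*;

public class ScoreUpdater {
    // Bump each document's score by the number of unique IPs that
    // clicked it, then rank document ids by score, highest first.
    static List<String> rankByScore(Map<String, Long> oldScores,
                                    Map<String, Set<String>> clicks) {
        Map<String, Long> scores = new HashMap<>(oldScores);
        clicks.forEach((doc, ips) ->
            scores.merge(doc, (long) ips.size(), Long::sum));
        List<String> ranked = new ArrayList<>(scores.keySet());
        // Reverse (descending) order by score, as in the sorted Lucene field.
        ranked.sort((a, b) -> Long.compare(scores.get(b), scores.get(a)));
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Long> old = new HashMap<>();
        old.put("doc1", 10L);
        old.put("doc2", 10L);
        Map<String, Set<String>> clicks = new HashMap<>();
        clicks.put("doc2", new HashSet<>(Arrays.asList("10.0.0.1", "10.0.0.2")));
        System.out.println(rankByScore(old, clicks));
    }
}
```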

A more advanced version could store previous result
positions as Payloads, but I don't understand this
new Lucene concept yet.


Peter W.

On Aug 10, 2007, at 5:56 AM, Lukas Vlcek wrote:

> Enis,
> Thanks for your time.
> I gave a quick glance at Pig and it seems good (it seems to be directly
> based on Hadoop, which I am starting to play with :-). It's obvious that a
> huge amount of data (like user queries or access logs) should be stored in
> flat files, which makes it convenient for further analysis by Pig (or
> directly by Hadoop-based tasks) or other tools. And I agree with you that
> the size of the index can be tracked in a journal-based style in a separate
> log rather than with every single user query. That is the easier part of my
> original question :-)
> The true art starts with the mining tasks themselves. How to efficiently
> use such data for bettering the user experience with the search engine...
> On 8/10/07, Enis Soztutar <> wrote:
>>>> ...
>>>> Web server log analysis is a very popular topic nowadays, and you can
>>>> check the literature, especially on clickthrough data analysis. All the
>>>> major search engines have to interpret this data to improve their
>>>> algorithms and to learn from the latent "collective knowledge" hidden
>>>> in web server logs.
>>>> ...

>> ...
>> You do not have to implement this from scratch. You just have to specify
>> your data mining tasks, then write scripts (in Pig Latin) or write
>> map-reduce programs (in Hadoop). Neither of these is that hard. I do not
>> think there is any tool that will satisfy all your information needs. So,
>> at the risk of repeating myself, I suggest you look at Pig and write some
>> scripts to mine the data...
