lucene-java-user mailing list archives

From "Namit Yadav" <namitya...@gmail.com>
Subject Re: Index Rows as Documents? Help me design a solution
Date Wed, 26 Jul 2006 15:57:54 GMT
Thanks all for the responses. I am very pleasantly surprised at the helpful
responses that I am getting.

Okay, I think I still haven't understood Lucene well. I am sure that I am
not solving the problem the right way. So I am explaining the problem at a
very high level here; please tell me what my design should be:

I have GBs of logs where each row is of the type
"Col1#Col2#Col3#Col4#Col5...". Now I want to be able to search the logs by
Col1 or Col2 and get all the rows that match.

Now what I do is, I run a shell script to split the logs into smaller files
of 1MB each, then index all the files just as the lucene example works. Then
when I search for a term, I get the log-file names that contain the data.
Then I buffer-read those files and find out rows containing the data.

I am very sure this is a very bad way of solving the problem. There should
be some way of me telling Lucene that it just needs to make sure that the
two columns Col1 and Col2 can be searched, and skip the rest. Then there
should be some way of telling Lucene to store the indexes in a way that a
search query can return the complete row when searched for Col1 or Col2,
instead of file-names containing the data.
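
To make the requirement above concrete, here is a minimal plain-Java sketch
of the behaviour being asked for: each row is one entry, only the first two
columns are searchable, and a hit returns the complete row. It is an
illustration only, not Lucene code; the class name RowIndex, its methods, and
the field names "col1" and "col2" are hypothetical, and it assumes the rows
fit in memory, which GBs of logs would not.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration (not the Lucene API): one entry per log row,
// with only the first two columns searchable and the full row kept as-is.
class RowIndex {
    // Maps "field:value" to every row whose indexed column has exactly that value.
    private final Map<String, List<String>> postings = new HashMap<>();

    // Index one row of the form "Col1#Col2#Col3#...".
    void add(String row) {
        String[] columns = row.split("#", -1);
        if (columns.length < 2) {
            return; // malformed row: nothing to index
        }
        put("col1", columns[0], row);
        put("col2", columns[1], row);
        // Columns 3 and onward are deliberately not indexed.
    }

    private void put(String field, String value, String row) {
        postings.computeIfAbsent(field + ":" + value, k -> new ArrayList<>()).add(row);
    }

    // Exact-match lookup on an indexed column; returns the complete rows.
    List<String> search(String field, String value) {
        return postings.getOrDefault(field + ":" + value, Collections.emptyList());
    }
}
```

The column values are matched whole rather than split into words, which is
what one would expect for identifier-like columns.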

I tried to have each row as a document, but as my first mail says, I didn't
get the kind of performance I wanted. I am going to run some checks (as
Erick suggested). But Doron's email has made me wonder if I am doing it
right at all.

Can you guys please help me understand how this problem can be best solved?

Thanks a lot for the help so far

On 7/26/06, Mike Streeton <mike.streeton@ardentia.co.uk> wrote:
>
> The only way you might get the performance you want is to have multiple
> IndexWriters writing to different indexes and then addIndexes at the end.
> You would obviously have to handle the multi threading and distribution
> of the parts of the log to each writer.
>
> Mike
>
> www.ardentia.com the home of NetSearch
>
> -----Original Message-----
> From: Doron Cohen [mailto:DORONC@il.ibm.com]
> Sent: 25 July 2006 22:23
> To: java-user@lucene.apache.org
> Subject: Re: Index Rows as Documents? Help me design a solution
>
> Few comments -
>
> > (from first posting in this thread)
> > The indexing was taking much more than minutes for a 1 MB log file.
> ...
> > I would expect to be able to index at least a GB of logs within 1 or 2
> > minutes.
>
> 1-2 minutes per GB would be 30-60 GB/hour, which for a single machine/JVM
> is a lot - well, at least I did not see Lucene index this fast.
>
> > doc.add(new Field("msisdn", columns[0], Field.Store.YES, Field.Index.TOKENIZED));
> > doc.add(new Field("messageid", columns[2], Field.Store.YES, Field.Index.TOKENIZED));
>
> Is it really required to analyze the text for these fields - "msisdn",
> "messageid"?
>
> > doc.add(new Field("line", line, Field.Store.YES, Field.Index.NO));
>
> This is storing the original text of all input lines that are indexed -
> quite an overhead.
>
> - Doron
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org