lucene-java-user mailing list archives

From "Jeremy Bensley" <jbensley...@gmail.com>
Subject Re: Index Rows as Documents? Help me design a solution
Date Wed, 26 Jul 2006 16:23:44 GMT
Just my $0.02, but I think you are correct in creating one document per log
line in order to achieve your desired result.

In my experience, there are a few things that you might do differently:

The MaxBufferedDocs parameter has a huge impact on indexing speed. The
default of 10 is very 'safe', in that it uses very little memory, but it
severely throttles the indexer. Depending on the length of the lines in your
log and the RAM on the indexing machine, you could likely set this to
1000 or more; if you start running out of memory, scale it back.
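
For illustration, a rough sketch of raising that buffer (assuming the
Lucene 2.0-era IndexWriter API; the path and the value 1000 are just
examples to tune against your heap):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    IndexWriter writer = new IndexWriter("/tmp/logindex", new StandardAnalyzer(), true);
    writer.setMaxBufferedDocs(1000); // default is 10; buffer more docs in RAM per flush
    // ... addDocument() calls for each log line ...
    writer.close();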

Also, it is probably not necessary to STORE the first two fields: those are
your search fields, and you can retrieve their contents from the full line
that does get stored. It might also help slightly to index those two fields
UN_TOKENIZED, which avoids running the analyzer on them.
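
Concretely, something like this (a sketch reusing the columns[] and line
variables from the code in your first mail; UN_TOKENIZED skips the
analyzer, and Store.NO is safe because the full line is stored separately):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // searchable, not analyzed, not stored
    doc.add(new Field("msisdn", columns[0], Field.Store.NO, Field.Index.UN_TOKENIZED));
    doc.add(new Field("messageid", columns[2], Field.Store.NO, Field.Index.UN_TOKENIZED));
    // stored verbatim for retrieval, not indexed
    doc.add(new Field("line", line, Field.Store.YES, Field.Index.NO));

At search time, a hit then gives you back the whole row from the stored
field (the query value here is made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    IndexSearcher searcher = new IndexSearcher("/tmp/logindex");
    Hits hits = searcher.search(new TermQuery(new Term("msisdn", "12345")));
    for (int i = 0; i < hits.length(); i++)
        System.out.println(hits.doc(i).get("line")); // the complete row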

If you are certain that your lines will not exceed a maximum length, you can
also call IndexWriter.setMaxFieldLength(max), which may also save you some
resources (memory, disk space). The default is 10,000 terms per field, which
seems far larger than your application needs.
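
For example, if no row ever produces more than a handful of terms:

    writer.setMaxFieldLength(100); // default is 10,000 terms per field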

I'm not an expert on Lucene, but I do know that changing these parameters
had a huge impact on performance in my application.

Jeremy

On 7/26/06, Namit Yadav <namityadav@gmail.com> wrote:
>
> Thanks, all, for the responses. I am very pleasantly surprised at the
> helpful responses that I am getting.
>
> Okay, I think I still haven't understood Lucene well. I am sure that I am
> not solving the problem the right way, so I am explaining the problem at a
> very high level here; please tell me what my design should be:
>
> I have GBs of logs where each row is of the form
> "Col1#Col2#Col3#Col4#Col5...". I want to be able to search the logs on
> Col1 or Col2 and get back all the rows matching those values.
>
> Right now I run a shell script to split the logs into smaller files of
> 1 MB each, then index all the files just as the Lucene demo does. When I
> search for a term, I get the names of the log files that contain the
> data, and I then buffer-read those files to find the rows containing it.
>
> I am quite sure this is a bad way of solving the problem. There should
> be some way for me to tell Lucene that it only needs to make the two
> columns Col1 and Col2 searchable and can skip the rest. And there should
> be some way of telling Lucene to store the index so that a search on
> Col1 or Col2 returns the complete row, instead of the names of the files
> containing the data.
>
> I tried to have each row as a document, but as my first mail says, I
> didn't get the kind of performance I wanted. I am going to run some
> checks (as Erick suggested), but Doron's email has made me wonder if I
> am doing it right at all.
>
> Can you guys please help me understand how this problem can be best
> solved?
>
> Thanks a lot for the help so far
>
> On 7/26/06, Mike Streeton <mike.streeton@ardentia.co.uk> wrote:
> >
> > The only way you might get the performance you want is to have multiple
> > IndexWriters writing to different indexes and then add them all together
> > at the end, as in the sketch below. You would obviously have to handle
> > the multithreading and the distribution of parts of the log to each
> > writer.
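> >
> > A rough sketch of the merge step (assuming Lucene 2.0's
> > IndexWriter.addIndexes(Directory[]); the paths are made up, and the
> > threading that feeds the per-part writers is omitted):
> >
> >     import org.apache.lucene.analysis.standard.StandardAnalyzer;
> >     import org.apache.lucene.index.IndexWriter;
> >     import org.apache.lucene.store.Directory;
> >     import org.apache.lucene.store.FSDirectory;
> >
> >     IndexWriter merged = new IndexWriter("/tmp/merged", new StandardAnalyzer(), true);
> >     Directory[] parts = {
> >         FSDirectory.getDirectory("/tmp/part1", false),
> >         FSDirectory.getDirectory("/tmp/part2", false)
> >     };
> >     merged.addIndexes(parts); // merges the partial indexes into one
> >     merged.close();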
> >
> > Mike
> >
> > www.ardentia.com the home of NetSearch
> >
> > -----Original Message-----
> > From: Doron Cohen [mailto:DORONC@il.ibm.com]
> > Sent: 25 July 2006 22:23
> > To: java-user@lucene.apache.org
> > Subject: Re: Index Rows as Documents? Help me design a solution
> >
> > A few comments -
> >
> > > (from first posting in this thread)
> > > The indexing was taking much more than minutes for a 1 MB log file.
> > ...
> > > I would expect to be able to index at least a GB of logs within 1 or
> > > 2 minutes.
> >
> > 1-2 minutes per GB would be 30-60 GB/hour, which is a lot for a single
> > machine/JVM - at least I have not seen Lucene index that fast.
> >
> > > doc.add(new Field("msisdn", columns[0], Field.Store.YES,
> > Field.Index.TOKENIZED));
> > > doc.add(new Field("messageid", columns[2], Field.Store.YES,
> > Field.Index.TOKENIZED));
> >
> > Is it really required to analyze the text for these fields - "msisdn"
> > and "messageid"?
> >
> > > doc.add(new Field("line", line, Field.Store.YES, Field.Index.NO));
> >
> > This is storing the original text of all input lines that are indexed -
> > quite an overhead.
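> >
> > If that overhead matters, one option (a sketch - assuming your Lucene
> > version has Field.Store.COMPRESS, which appeared around 1.9/2.0) is to
> > compress the stored line, trading CPU for disk:
> >
> >     doc.add(new Field("line", line, Field.Store.COMPRESS, Field.Index.NO));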
> >
> > - Doron
> >
> >
>
>
