lucene-solr-user mailing list archives

From Shyam Bhaskaran <Shyam.Bhaska...@synopsys.com>
Subject RE: highlighting performance poor with *.tar, *.gz files
Date Sat, 26 Nov 2011 03:54:58 GMT
Hi Erick,

Thanks for the response.

I am already using termVectors with offsets & positions enabled as shown below.


<field name="attachment_bodies" type="text_rev" indexed="true" stored="true" multiValued="true"
       termVectors="true" termPositions="true" termOffsets="true" />


I am indexing FAQ content, and some of these FAQs have attachments linked to them. The
attachments include PDF, DOC, *.TAR, and *.GZIP files that contain additional information
related to the FAQ, and all of this content is indexed. When searching and highlighting,
however, search performance degrades for archive files such as *.gz, *.tar, and *.zip, and
the debug flag shows that highlighting takes longer for these archive files than for the
others.

What could be the reason behind this? Is it because these files are unzipped and then
highlighted from the index at display time?

Does highlighting depend on file size? That is, if the file is larger, does search
performance degrade because of the highlighting?

I have tried reducing the maxAnalyzedChars value from 5 MB to 1 MB but still do not see any
significant improvement in search and highlighting for these kinds of files.
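
For reference, a minimal sketch of how such a limit might be set as request-handler
defaults in solrconfig.xml. The handler name and the choice of defaults are assumptions
for illustration, not the actual configuration in use here. hl.maxAnalyzedChars is counted
in characters, so 1048576 stands in for the 1 MB value mentioned above.

```xml
<!-- Hypothetical handler; only the hl.* parameter names are standard Solr options. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Turn highlighting on and restrict it to the attachment field. -->
    <str name="hl">true</str>
    <str name="hl.fl">attachment_bodies</str>
    <!-- Upper bound on the characters of each document examined for snippets. -->
    <str name="hl.maxAnalyzedChars">1048576</str>
  </lst>
</requestHandler>
```

The same parameters can also be passed per request (for example &hl.maxAnalyzedChars=1048576),
which makes it easier to compare timings with the debug flag.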

Let me know if you can suggest any workaround for improving highlighting and search
performance for these kinds of files, or for large files in general.


Thanks
Shyam

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Saturday, November 26, 2011 8:57 AM
To: solr-user@lucene.apache.org
Subject: Re: highlighting performance poor with *.tar, *.gz files

Highlighting is dependent on the size of the
data being fed through the highlighter. Unless you have
termVectors & offsets & positions enabled, the text
must be re-analyzed, see:
http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=%28termvector%29%7C%28retrieve%29%7C%28contents%29

But highlighting compressed files seems like an odd
use-case, what is the business reason you need to do this?

Best
Erick

On Thu, Nov 24, 2011 at 10:28 AM, Shyam Bhaskaran
<Shyam.Bhaskaran@synopsys.com> wrote:
> Hi,
>
> It is observed that highlighting of search results is taking too much time especially
> for highlighting terms for archived files like *.gz, *.tar, *.zip.
> What could be the reason behind it ? Is it because these files are unzipped and then
> highlighted from the index during display time ?
> Or is it dependent on the size of the file ? Is there any way by which the search &
> highlighter performance improves for these kind of archived files (*.tar, *.zip etc)
>
> Let me know if there is any workaround for improving the highlighting and search performance
> for these kind of files?
>
> -Shyam
>
