lucene-solr-user mailing list archives

From: Peter Spam <ps...@mac.com>
Subject: Re: Solr searching performance issues, using large documents
Date: Mon, 02 Aug 2010 17:01:46 GMT
What would happen if the search query phrase spanned separate document chunks?

Also, what would the optimal chunk size be?

Thanks!


-Peter

On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:

> Not that I know of.
> 
> The DataImportHandler has the ability to create multiple documents
> from one input stream. It is possible to create a DIH file that reads
> large log files and splits each one into N documents, with the file
> name as a common field. The DIH wiki page tells you in general how to
> make a DIH file.
> 
> http://wiki.apache.org/solr/DataImportHandler
> 
> From this, you should be able to make a DIH file that puts log files
> in as separate documents. As for splitting files up into
> mini-documents, you might have to write a bit of JavaScript to achieve
> this. Solr has no data structure or feature that implements
> structured documents.
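> 
> As an untested sketch of that idea (the baseDir, field names, and the
> one-line-per-document granularity are placeholders; FileListEntityProcessor
> and LineEntityProcessor are the standard DIH processors), a
> data-config.xml could look something like:
> 
>   <dataConfig>
>     <dataSource type="FileDataSource"/>
>     <document>
>       <!-- Outer entity: walk the log directory, one row per file. -->
>       <entity name="files" processor="FileListEntityProcessor"
>               baseDir="/var/logs" fileName=".*\.log" rootEntity="false">
>         <!-- Inner entity: emit one Solr document per line, carrying
>              the file name as the common grouping field. -->
>         <entity name="chunks" processor="LineEntityProcessor"
>                 url="${files.fileAbsolutePath}"
>                 transformer="TemplateTransformer">
>           <field column="rawLine" name="body"/>
>           <field column="filename" template="${files.file}"/>
>         </entity>
>       </entity>
>     </document>
>   </dataConfig>
> 
> Each chunk document would still need its own unique id, and batching N
> lines into one document (instead of one line each) is where the bit of
> JavaScript (a ScriptTransformer) would come in.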
> 
> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <pspam@mac.com> wrote:
>> Thanks for the pointer, Lance!  Is there an example of this somewhere?
>> 
>> 
>> -Peter
>> 
>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>> 
>>> Ah! You're not just highlighting, you're snippetizing. This makes it easier.
>>> 
>>> Highlighting does not stream - it pulls the entire stored contents into
>>> one string and then pulls out the snippet.  If you want this to be
>>> fast, you have to split up the text into small pieces and only
>>> snippetize from the most relevant text. So, use separate documents with
>>> a common group id pointing back to the file each came from. You might
>>> have to do two queries to achieve what you want, but the second query
>>> for the same terms will be blindingly fast - often <1ms.
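>>> 
>>> A rough Ruby sketch of that two-pass flow (untested; the group_id
>>> field and the chunked schema are assumptions about your setup, not
>>> anything built into Solr, and group_id would come out of the pass-1
>>> response):
>>> 
>>>   require 'cgi'
>>> 
>>>   q = CGI::escape('body:' + p['q'].to_s)
>>>   # Pass 1: find which log files contain a hit. No highlighting, so
>>>   # the huge stored field is never re-analyzed.
>>>   pass1 = '/solr/select?q=' + q + '&fl=group_id&rows=10'
>>>   # Pass 2: restrict to the small chunk documents of one matching
>>>   # file and highlight only those.
>>>   pass2 = '/solr/select?q=' + q + '&fq=group_id:' + group_id.to_s +
>>>     '&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400'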
>>> 
>>> Good luck!
>>> 
>>> Lance
>>> 
>>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <pspam@mac.com> wrote:
>>>> However, I do need to search the entire document, or else the
>>>> highlighting will sometimes be blank :-(
>>>> Thanks!
>>>> 
>>>> - Peter
>>>> 
>>>> ps. sorry for the many responses - I'm rushing around trying to get
>>>> this working.
>>>> 
>>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>> 
>>>>> Correction - it went from 17 seconds to 10 seconds - I was changing
>>>>> the hl.regex.maxAnalyzedChars the first time.
>>>>> Thanks!
>>>>> 
>>>>> -Peter
>>>>> 
>>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>> 
>>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>> 
>>>>>>> Did you already try other values for hl.maxAnalyzedChars=2147483647?
>>>>>> 
>>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an
>>>>>> impact (one search I just tried went from 17 seconds to 15.8 seconds,
>>>>>> and this is an 8-core Mac Pro with 6GB RAM - 4GB for java).
>>>>>> 
>>>>>>> Also, regular expression highlighting is more expensive, I think.
>>>>>>> What does the 'fuzzy' variable mean? If you use this to query via
>>>>>>> "~someTerm" instead of "someTerm", then you should try the trunk of
>>>>>>> Solr, which is a lot faster for fuzzy and other wildcard searches.
>>>>>> 
>>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>> 
>>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>> 
>>>>>> 
>>>>>> - Peter
>>>>>> 
>>>>>>> Regards,
>>>>>>> Peter.
>>>>>>> 
>>>>>>>> Data set: About 4,000 log files (will eventually grow to millions).
>>>>>>>> Average log file is 850KB.  Largest log file (so far) is about 70MB.
>>>>>>>> 
>>>>>>>> Problem: When I search for common terms, the query time goes from
>>>>>>>> under 2-3 seconds to about 60 seconds.  TermVectors etc. are
>>>>>>>> enabled.  When I disable highlighting, performance improves a lot,
>>>>>>>> but is still slow for some queries (7 seconds).  Thanks in advance
>>>>>>>> for any ideas!
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -Peter
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>> 
>>>>>>>> 4GB RAM server
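>>>>>>>> # 2GB initial / 3GB max heap; on a 4GB box that leaves only ~1GB
>>>>>>>> # for the OS and the filesystem cache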
>>>>>>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>>>>>> 
>>>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>> 
>>>>>>>> schema.xml changes:
>>>>>>>> 
>>>>>>>>  <fieldType name="text_pl" class="solr.TextField">
>>>>>>>>    <analyzer>
>>>>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
>>>>>>>>              generateNumberParts="0" catenateWords="0" catenateNumbers="0"
>>>>>>>>              catenateAll="0" splitOnCaseChange="0"/>
>>>>>>>>    </analyzer>
>>>>>>>>  </fieldType>
>>>>>>>> 
>>>>>>>> ...
>>>>>>>> 
>>>>>>>> <field name="body" type="text_pl" indexed="true" stored="true"
multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />
>>>>>>>>  <field name="timestamp" type="date" indexed="true" stored="true"
default="NOW" multiValued="false"/>
>>>>>>>> <field name="version" type="string" indexed="true" stored="true"
multiValued="false"/>
>>>>>>>> <field name="device" type="string" indexed="true" stored="true"
multiValued="false"/>
>>>>>>>> <field name="filename" type="string" indexed="true" stored="true"
multiValued="false"/>
>>>>>>>> <field name="filesize" type="long" indexed="true" stored="true"
multiValued="false"/>
>>>>>>>> <field name="pversion" type="int" indexed="true" stored="true"
multiValued="false"/>
>>>>>>>> <field name="first2md5" type="string" indexed="false"
stored="true" multiValued="false"/>
>>>>>>>> <field name="ckey" type="string" indexed="true" stored="true"
multiValued="false"/>
>>>>>>>> 
>>>>>>>> ...
>>>>>>>> 
>>>>>>>> <dynamicField name="*" type="ignored" multiValued="true"/>
>>>>>>>> <defaultSearchField>body</defaultSearchField>
>>>>>>>> <solrQueryParser defaultOperator="AND"/>
>>>>>>>> 
>>>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>> 
>>>>>>>> solrconfig.xml changes:
>>>>>>>> 
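>>>>>>>>  <!-- 2147483647 = Integer.MAX_VALUE: never truncate how many
>>>>>>>>       tokens of a field get indexed -->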
>>>>>>>>  <maxFieldLength>2147483647</maxFieldLength>
>>>>>>>>  <ramBufferSizeMB>128</ramBufferSizeMB>
>>>>>>>> 
>>>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>> 
>>>>>>>> The query:
>>>>>>>> 
>>>>>>>> rowStr = "&rows=10"
>>>>>>>> facet = "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>>>>>>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>>>>>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>>>>>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
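>>>>>>>> # (?m) makes ^/$ match at line boundaries, so the pattern below
>>>>>>>> # selects a window of three consecutive log lines per fragment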
>>>>>>>> regexv = "(?m)^.*\n.*\n.*$"
>>>>>>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) +
>>>>>>>>   "&hl.regex.slop=1&hl.fragmenter=regex" +
>>>>>>>>   "&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>>>>>>> justq = '&q=' + CGI::escape('body:' + fuzzy +
>>>>>>>>   p['q'].to_s.gsub(/\\/, '').gsub(/([:~!<>="])/, '\\\\\1') +
>>>>>>>>   fuzzy + minLogSizeStr)
>>>>>>>> 
>>>>>>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' +
>>>>>>>>   (p['fq'].empty? ? '' : ('&fq=' + p['fq'].to_s)) + justq + rowStr +
>>>>>>>>   facet + fields + termvectors + hl + hl_regex
>>>>>>>> 
>>>>>>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) +
>>>>>>>>   '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> http://karussell.wordpress.com/
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>> 
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com

