lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Localize the largest fields (content) in index
Date Thu, 29 Mar 2012 19:54:30 GMT
I don't think there's really any reason SolrCloud won't work with
Tomcat; the setup is probably just tricky. See:
http://lucene.472066.n3.nabble.com/SolrCloud-new-td1528872.html
It's about a year old, but might prove helpful.

Best
Erick

On Thu, Mar 29, 2012 at 3:41 PM, Vadim Kisselmann
<v.kisselmann@googlemail.com> wrote:
> Yes, I think so, too :)
> MLT doesn't really need termVectors, but it's faster with them. I
> found out that MLT works better on the title field in my case than
> on big text fields.
>
> Sharding is planned, but my setup with SolrCloud, ZK and Tomcat
> doesn't work,
> see here: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201203.mbox/%3CCA+GXEZE3LCTtgXFzn9uEdRxMymGF=z0UJB9s8b0qkipAfn6fsA@mail.gmail.com%3E
> I split my huge index (the 150GB index in this case is my test index) and
> want to use SolrCloud,
> but it isn't runnable with Tomcat at this time.
>
> Best regards
> Vadim
>
>
> 2012/3/29 Erick Erickson <erickerickson@gmail.com>:
>> Yeah, it's worth a try. The term vectors aren't entirely necessary for
>> highlighting, although they do make things more efficient.
>>
>> As far as MLT, does MLT really need such a big field?
>>
>> But you may be on your way to sharding your index if you remove this info
>> and testing shows problems....
>>
>> Best
>> Erick
>>
>> On Thu, Mar 29, 2012 at 9:32 AM, Vadim Kisselmann
>> <v.kisselmann@googlemail.com> wrote:
>>> Hi Erick,
>>> thanks :)
>>> The admin UI gives me the counts, so I can identify fields with large
>>> numbers of unique terms.
>>> I knew this wiki page, but I read it once more.
>>> List of my file extensions with sizes (index size ~150GB):
>>> tvf 90GB
>>> fdt 30GB
>>> tim 18GB
>>> prx 15GB
>>> frq 12GB
>>> tip 200MB
>>> tvx 150MB
>>>
>>> tvf is my biggest file extension.
>>> Wiki: This file contains, for each field that has a term vector
>>> stored, a list of the terms, their frequencies and, optionally,
>>> position and offset information.
>>>
>>> Hmm, I use termVectors on my biggest fields because of MLT and highlighting.
>>> But I think I should test my performance without termVectors. Good idea? :)
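A quick way to judge how much there is to gain is to work out what share of the listed sizes belongs to term vector files. The sketch below uses only the figures reported in this message. It assumes that .tvf, .tvx and .tvd are the term vector files (per the Lucene 3.x file formats page) and that they would shrink or disappear for those fields after a full reindex without termVectors. Note that the listed sizes add up to more than the quoted ~150GB, so treat the result as approximate.

```python
# Sizes as reported in the message, in GB (tip and tvx converted from MB).
sizes_gb = {
    "tvf": 90, "fdt": 30, "tim": 18, "prx": 15,
    "frq": 12, "tip": 0.2, "tvx": 0.15,
}

# Assumption: these extensions hold term vector data (Lucene 3.x naming).
TERM_VECTOR_EXTS = {"tvf", "tvx", "tvd"}


def share(sizes, exts):
    """Fraction of the summed sizes that belongs to the given extensions."""
    total = sum(sizes.values())
    return sum(size for ext, size in sizes.items() if ext in exts) / total


tv_share = share(sizes_gb, TERM_VECTOR_EXTS)
print(f"listed total: {sum(sizes_gb.values()):.2f} GB")
print(f"term vector files: {100 * tv_share:.1f}% of the listed total")
```

With these numbers the term vector files come to a little over half of the listed total, which is why testing without termVectors looks worthwhile.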
>>>
>>> What do you think about my file extension sizes?
>>>
>>> Best regards
>>> Vadim
>>>
>>>
>>>
>>>
>>> 2012/3/29 Erick Erickson <erickerickson@gmail.com>:
>>>> The admin UI (schema browser) will give you the counts of unique terms
>>>> in your fields, which is where I'd start.
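The same per-field counts can probably be pulled in bulk instead of clicking through the schema browser. The sketch below is only an illustration: it assumes the Luke request handler is reachable at /admin/luke under the core URL, that it accepts fl=* and wt=json, and that the JSON response has a top-level "fields" object whose entries carry a "distinct" count. The base URL is hypothetical, and the response shape should be checked against your Solr version. Keep in mind that a high unique-term count hints at a large field but is not a direct measure of bytes on disk.

```python
import json
import urllib.request


def fetch_luke(base_url):
    """Fetch field statistics from the Luke request handler.

    Assumption: the handler lives at <base_url>/admin/luke and returns JSON.
    """
    url = base_url.rstrip("/") + "/admin/luke?fl=*&wt=json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def rank_fields_by_distinct_terms(luke_response):
    """Return (field, distinct_terms) pairs, largest first.

    Assumption: luke_response["fields"] maps field names to dicts that may
    contain a "distinct" entry; fields without one are counted as 0.
    """
    fields = luke_response.get("fields", {})
    ranked = [
        (name, info.get("distinct", 0))
        for name, info in fields.items()
        if isinstance(info, dict)
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

Usage would look like rank_fields_by_distinct_terms(fetch_luke("http://localhost:8983/solr")), with the URL replaced by your own core.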
>>>>
>>>> I suspect you've already seen this page, but if not:
>>>> http://lucene.apache.org/java/3_5_0/fileformats.html#file-names
>>>> The .fdt and .fdx files are where data goes when
>>>> you set stored="true". These files don't affect search speed;
>>>> they just contain the verbatim copy of the data.
>>>>
>>>> The relative sizes of the various files above should give
>>>> you a hint as to what's using the most space, but it'll be a bit
>>>> of a hunt for you to pinpoint what's actually up. Term vectors
>>>> and norms are often what uses up the space.
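To get the relative file sizes without adding them up by hand, a small script can total the bytes per file extension in the index directory. This is a minimal sketch, not a Solr tool: the function names are made up for illustration, and it assumes the index files sit directly in one directory (typically data/index under the core).

```python
import os
from collections import defaultdict


def size_by_extension(index_dir):
    """Return {extension: total_bytes} for files directly inside index_dir."""
    totals = defaultdict(int)
    for entry in os.scandir(index_dir):
        if entry.is_file():
            # Files without an extension (e.g. segments_N) are grouped together.
            ext = os.path.splitext(entry.name)[1].lstrip(".") or "(none)"
            totals[ext] += entry.stat().st_size
    return dict(totals)


def report(totals):
    """Format the totals as text lines, largest extension first."""
    grand_total = sum(totals.values()) or 1  # avoid division by zero
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [
        f"{ext:<8}{size / 1024 ** 3:10.2f} GB{100 * size / grand_total:7.1f}%"
        for ext, size in ordered
    ]
```

Calling print("\n".join(report(size_by_extension("/path/to/data/index")))) with your own path lists each extension with its size and share, which can then be matched against the file formats page.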
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Wed, Mar 28, 2012 at 10:55 AM, Vadim Kisselmann
>>>> <v.kisselmann@googlemail.com> wrote:
>>>>> Hello folks,
>>>>>
>>>>> I work with Solr 4.0 r1292064 from trunk.
>>>>> My index grows fast; with 10 million docs I get an index size of 150GB
>>>>> (25% stored, 75% indexed).
>>>>> I want to find out which fields (content) are too large, so that I can
>>>>> consider measures.
>>>>>
>>>>> How can I localize/discover the largest fields in my index?
>>>>> Luke (latest from trunk) doesn't work
>>>>> with my Solr version. I built the Lucene/Solr .jars and tried to feed Luke
>>>>> with these, but I get many errors
>>>>> and can't build it.
>>>>>
>>>>> What other options do I have?
>>>>>
>>>>> Thanks and best regards
>>>>> Vadim
