lucene-solr-user mailing list archives

From Michael Della Bitta <michael.della.bi...@appinions.com>
Subject Re: What should focus be on hardware for solr servers?
Date Wed, 13 Feb 2013 17:25:24 GMT
Ooops: https://code.google.com/p/solrmeter/



Michael Della Bitta

------------------------------------------------
Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Wed, Feb 13, 2013 at 12:25 PM, Michael Della Bitta
<michael.della.bitta@appinions.com> wrote:
> Matthew,
>
> With an index that small, you should be able to build a proof of
> concept on your own hardware and discover how it performs using
> something like SolrMeter:
>
>
> Michael Della Bitta
>
> ------------------------------------------------
> Appinions
> 18 East 41st Street, 2nd Floor
> New York, NY 10017-6271
>
> www.appinions.com
>
> Where Influence Isn’t a Game
>
>
> On Wed, Feb 13, 2013 at 12:21 PM, Matthew Shapiro <me@mshapiro.net> wrote:
>> Thanks for the reply.
>>
>>> If the bulk of the searches are exactly the same (e.g. the empty search),
>>> the result will be cached. If 5,683 searches/month is the real count, this
>>> sounds like a very low number of searches over a very limited corpus. Just
>>> about any machine should be fine. I guess I am missing something here.
>>> Could you elaborate a bit? How large is a document, how many do you expect
>>> to handle, what do you expect a query to look like, and how should the
>>> results be presented?
>>
>>
>> Sorry, I should clarify our current statistics.  First of all, I meant 183k
>> documents (not 183, woops).  Around 100k of those are full-fledged HTML
>> articles (not web pages but articles in our CMS with HTML content inside
>> them); the rest of the data are more like key/value records with a lot of
>> attached metadata for searching.
>>
>> Also, what I meant by search without a search term is that probably 80%
>> (hard to confirm due to the lack of stats given by the GSA) of our searches
>> are done on pure metadata clauses without any searching through the content
>> itself, so for example "give me documents that have a content type of
>> video, that are marked for client X, have a category of Y or Z, and were
>> published to platform A, ordered by date published".  The searches that use
>> a search term look like the same query from the example before, but finding
>> all the documents that have the string "My Video" in their title and
>> description.  From the way the GSA provides us statistics (which are pretty
>> bare), it appears they do not count "no search term" searches as part of
>> those statistics (the GSA is not really built for queries without search
>> terms either, and we've had various issues using it this way because of
>> it).
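In Solr, that metadata-only pattern maps naturally onto filter queries. A rough sketch of what such a request might look like, assuming hypothetical field names (content_type, client, category, platform, date_published) and a local Solr core named collection1:

```python
from urllib.parse import urlencode

# A sketch of the metadata-only query pattern in Solr. Field names are
# made up for illustration. q=*:* matches everything; each metadata
# clause goes into its own fq (filter query), which Solr caches
# independently in the filterCache.
params = [
    ("q", "*:*"),                      # no search term: match all documents
    ("fq", "content_type:video"),      # metadata clauses as filter queries
    ("fq", "client:X"),
    ("fq", "category:(Y OR Z)"),
    ("fq", "platform:A"),
    ("sort", "date_published desc"),   # ordered by date published
    ("wt", "json"),
]
query_string = urlencode(params)
url = "http://localhost:8983/solr/collection1/select?" + query_string
print(url)
```

Keeping the metadata clauses in fq rather than q means repeated filters are served from the filter cache, which suits a workload dominated by the same metadata combinations.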
>>
>> The reason we are using the GSA for this and not our MSSQL database is
>> that some of this data requires multiple, expensive joins, and we do need
>> full-text search for when users want to use that option.  Also for
>> faceting.
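For the faceting side, Solr exposes this through facet parameters on the same request. A minimal sketch, again with illustrative field names:

```python
from urllib.parse import urlencode

# A sketch of a faceted version of the same search. facet.field names
# are illustrative. Solr returns a count per distinct value in each
# facet field alongside the normal result list.
params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.field", "content_type"),   # one count bucket per content type
    ("facet.field", "category"),
    ("facet.mincount", "1"),           # drop zero-count facet values
    ("rows", "10"),
]
query_string = urlencode(params)
print(query_string)
```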
>>
>>
>> On Wed, Feb 13, 2013 at 11:24 AM, Toke Eskildsen <te@statsbiblioteket.dk>wrote:
>>
>>> Matthew Shapiro [me@mshapiro.net] wrote:
>>>
>>> [Hardware for Solr]
>>>
>>> > What type of hardware (at a high level) should I be looking for?  Are
>>> > the main constraints disk I/O, memory size, processing power, etc.?
>>>
>>> That depends on what you are trying to achieve. Broadly speaking, "simple"
>>> search and retrieval is mainly I/O bound. The easy way to handle that is to
>>> use SSDs as storage. However, a lot of people like the old-school solution
>>> and compensate for the slow seeks of spinning drives by adding RAM and
>>> warming up the searcher or index files. So either SSD or RAM on the
>>> I/O side. If the corpus is non-trivial in size, that is, which brings us
>>> to...
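For reference, the searcher warmup Toke mentions is typically configured in solrconfig.xml via a QuerySenderListener; a minimal sketch, with placeholder queries:

```xml
<!-- solrconfig.xml: run a few representative queries against a new
     searcher before it serves traffic, so the relevant index files and
     caches are already hot. The queries here are placeholders. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">date_published desc</str>
    </lst>
  </arr>
</listener>
```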
>>>
>>> > Right now we have about 183 documents stored in the GSA (which will go up
>>> > a lot once we are on Solr since the GSA is limiting).  The search systems
>>> > are used to display core information on several of our homepages, so our
>>> > search traffic is pretty significant (the GSA reports 5,683 searches in
>>> > the last month, however I am 99% sure this is not correct and is not
>>> > counting search requests without any search terms, which consists of most
>>> > of our search traffic).
>>>
>>> If the bulk of the searches are exactly the same (e.g. the empty search),
>>> the result will be cached. If 5,683 searches/month is the real count, this
>>> sounds like a very low number of searches over a very limited corpus. Just
>>> about any machine should be fine. I guess I am missing something here.
>>> Could you elaborate a bit? How large is a document, how many do you expect
>>> to handle, what do you expect a query to look like, and how should the
>>> results be presented?
>>>
>>> Regards,
>>> Toke Eskildsen
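The caching Toke refers to is Solr's queryResultCache, which serves repeated identical queries (such as the empty search) without re-executing them. It is configured in solrconfig.xml; the sizes below are illustrative, not recommendations:

```xml
<!-- solrconfig.xml: repeated identical queries are answered from the
     queryResultCache. autowarmCount re-runs the most recently used
     entries when a new searcher opens. Sizes are illustrative. -->
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="128"/>
```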
