lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: query parsing
Date Sun, 27 Sep 2015 03:05:20 GMT
No need to reinstall Solr; just create a new core. This time it'd probably be
easiest to use the bin/solr create_core command. In the Solr
directory, just type bin/solr create_core -help to see the options.
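
A minimal sketch of what that could look like, assuming a Solr 5.x install. The core name "eventlog" is hypothetical, and "basic_configs" is assumed to be the non-schemaless example configset shipped with that release; check bin/solr create_core -help for what your version actually offers.

```shell
#!/bin/sh
# Hypothetical core name, not taken from the thread.
CORE_NAME="eventlog"
# Assumption: "basic_configs" is the non-schemaless configset in Solr 5.x.
CONFIGSET="basic_configs"

# Build the command as a string so it can be reviewed before running it.
create_core_cmd() {
  printf 'bin/solr create_core -c %s -d %s\n' "$1" "$2"
}

CREATE_CMD=$(create_core_cmd "$CORE_NAME" "$CONFIGSET")
echo "$CREATE_CMD"

# To actually create the core, run the printed command from the Solr
# install directory while Solr is running.
```

Printing the command first is only a convenience here; running it directly from the Solr directory works just as well.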

We're pretty much trying to migrate to using bin/solr for all the maintenance
we can, but as always the documentation lags the code.

Yeah, things are a bit ragged. The admin UI/core UI is really a legacy
bit of code that has _always_ been confusing. I'm hoping we can pretty
much remove it at some point, since it's so trappy.

Best,
Erick

On Sat, Sep 26, 2015 at 12:49 PM, Mark Fenbers <mark.fenbers@noaa.gov> wrote:
> OK, a lot of dialog while I was gone for two days!  I read the whole thread,
> but I'm a newbie to Solr, so some of the dialog was Greek to me.  I
> understand the words, of course, but applying it so I know exactly what to
> do without screwing something else up is the problem.  After all, that is
> how I got into the mess in the first place.  I'm glad I have good help to
> untangle the knots I've made!
>
> I'd like to start over (option 1 below), but does this mean deleting all my
> config and reinstalling Solr?  Maybe that is not a bad idea, but I will at
> least save off my data-config.xml as that is clearly the one thing that is
> probably working right.  However, I did do quite a bit of editing that I
> would have to do again. Please advise...
>
> To be fair, I must answer Erick's question of how I created the data index
> in the first place, because this might be relevant...
>
> The bulk of the data is read from 9000+ text files, where each file was
> manually typed.  Before inserting into the database, I do a little bit of
> processing of the text using "sed" to delete the top few and bottom few
> lines, and to substitute each single-quote character with a pair of
> single-quotes (so PostgreSQL doesn't choke).  Line-feed characters are
> preserved as ASCII 10 (hex 0A), but there shouldn't be (and I am not aware
> of) any characters aside from what is on the keyboard.
>
> Next, I insert it with this command:
> psql -U awips -d OHRFC -c "INSERT INTO EventLogText VALUES('$postDate',
> '$user', '$postDate', '$entryText', '$postCatVal');"
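
The quote-doubling step described above can be sketched as a small sed filter. This is only an illustration of the technique, not the poster's actual script; the function name and the sample text are made up.

```shell
#!/bin/sh
# Double every single quote so the text can sit inside a SQL string literal.
# Hypothetical helper name; the thread only says that "sed" is used.
escape_sql_quotes() {
  sed "s/'/''/g"
}

# Made-up sample entry, for illustration only.
entryText=$(printf '%s\n' "River's crest at 5 o'clock" | escape_sql_quotes)
echo "$entryText"
```

The escaped value can then be placed between the single quotes of the INSERT statement, as in the psql command above.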
>
> In case you are wondering about my table, it is defined in this way:
> CREATE TABLE eventlogtext (
>   posttime timestamp without time zone NOT NULL, -- Timestamp of this entry's original posting
>   username character varying(8), -- username (logname) of the original poster
>   lastmodtime timestamp without time zone, -- Last time record was altered
>   logtext text, -- text of the log entry
>   category integer, -- bit-wise category value
>   CONSTRAINT eventlogtext_pkey PRIMARY KEY (posttime)
> )
>
> To do the indexing, I merely use /dataimport?full-import, but it knows what
> to do from my data-config.xml; which is here:
>
> <dataConfig>
>     <dataSource driver="org.postgresql.Driver"
> url="jdbc:postgresql://dx1f/OHRFC" user="awips" />
>     <document>
>         <entity name="eventlogtext" query="SELECT posttime AS id, username,
> logtext, category FROM eventlogtext;"
>                 deltaQuery="SELECT posttime AS id FROM eventlogtext WHERE
> lastmodtime > '${dataimporter.last_index_time}';">
>             <entity name="categorytypes" query="SELECT catname FROM
> categorytypes WHERE catid='${eventlogtext.category}';">
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
>
> Hope this helps!
>
> Thanks,
> Mark
>
> On 9/24/2015 10:57 AM, Erick Erickson wrote:
>>
>> Geraint:
>>
>> Good Catch! I totally missed that. So all of our focus on schema.xml has
>> been... totally irrelevant. Now that you pointed that out, there's also
>> the addition: add-unknown-fields-to-the-schema, which indicates you started
>> this up in "schemaless" mode.
>>
>> In short, Solr is trying to guess what your field types should be and
>> guessing wrong (again and again and again). This is the classic weakness
>> of schemaless. It's great for indexing stuff fast, but if it guesses wrong
>> you're stuck.
>>
>>
>> So to the original problem: I'd start over and either
>> 1> use the regular setup, not schemaless
>> or
>> 2> use the _managed_ schema API to explicitly add fields and fieldTypes to
>> the managed schema
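
For option 2, a rough sketch of adding a field through the Schema API with curl. The core name "eventlog", the default port 8983, and the presence of a "text_general" field type in the managed schema are all assumptions; only the field name logtext comes from the thread.

```shell
#!/bin/sh
# Assumptions: a core named "eventlog" on the default port, and a
# "text_general" field type already defined in the managed schema.
SOLR_URL="http://localhost:8983/solr/eventlog"

# Schema API "add-field" command for the logtext field.
PAYLOAD='{"add-field":{"name":"logtext","type":"text_general","indexed":true,"stored":true}}'

# Defined but not called here, since it needs a running Solr.
add_logtext_field() {
  curl -X POST -H 'Content-type:application/json' \
       --data-binary "$PAYLOAD" "$SOLR_URL/schema"
}

echo "$PAYLOAD"
```

Call add_logtext_field once Solr is up, then reindex so the field is analyzed with the explicit type instead of a guessed one.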
>>
>> Best,
>> Erick
>>
>> On Thu, Sep 24, 2015 at 2:02 AM, Duck Geraint (ext) GBJH <
>> Geraint.Duck@syngenta.com> wrote:
>>
>>> Okay, so maybe I'm missing something here (I'm still relatively new to
>>> Solr myself), but am I right in thinking the following is still in your
>>> solrconfig.xml file:
>>>
>>>    <schemaFactory class="ManagedIndexSchemaFactory">
>>>      <bool name="mutable">true</bool>
>>>      <str name="managedSchemaResourceName">managed-schema</str>
>>>    </schemaFactory>
>>>
>>> If so, wouldn't using a managed schema make several of your field
>>> definitions inside the schema.xml file semi-redundant?
>>>
>>> Regards,
>>> Geraint
>>>
>>>
>>> Geraint Duck
>>> Data Scientist
>>> Toxicology and Health Sciences
>>> Syngenta UK
>>> Email: geraint.duck@syngenta.com
>>>
>>>
>>> -----Original Message-----
>>> From: Alessandro Benedetti [mailto:benedetti.alex85@gmail.com]
>>> Sent: 24 September 2015 09:23
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: query parsing
>>>
>>> I would focus on this :
>>>
>>> "
>>>
>>>> 5> now kick off the DIH job and look again.
>>>>
>>> Now it shows a histogram, but most of the "terms" are long -- the full
>>> texts of (the table.column) eventlogtext.logtext, including the
>>> whitespace (with %0A used for newline characters)...  So, it appears it
>>> is not being tokenized properly, correct?"
>>> Can you open the schema XML from your Solr UI and show us the snippets
>>> for the field that does not seem to be tokenized?
>>> Can you show us (even a screenshot is fine) the related schema browser
>>> page?
>>> Could it be an encoding problem?
>>> Following Erick's details about the analysis, what are your results?
>>>
>>> Cheers
>>>
>>> 2015-09-24 8:04 GMT+01:00 Upayavira <uv@odoko.co.uk>:
>>>
>>>> Typically, the index dir is inside the data dir. Delete the index dir
>>>> and you should be good. If there is a tlog next to it, you might want
>>>> to delete that also.
>>>>
>>>> If you don't have a data dir, I wonder whether you set the data dir
>>>> when creating your core or collection. Typically the instance dir and
>>>> data dir aren't needed.
>>>>
>>>> Upayavira
>>>>
>>>> On Wed, Sep 23, 2015, at 10:46 PM, Erick Erickson wrote:
>>>>>
>>>>> OK, this is bizarre. You'd have had to set up SolrCloud by
>>>>> specifying the -zkRun command when you start Solr or the -zkHost;
>>>>> highly unlikely. On the admin page there would be a "cloud" link on
>>>>> the left side, I really doubt one's there.
>>>>>
>>>>> You should have a data directory; it should be the parent of the
>>>>> index and tlog directories. As a sanity check, try looking at the
>>>>> analysis page. Type a bunch of words in the left-hand indexing box
>>>>> and uncheck the verbose box. As you can tell, I'm grasping at straws.
>>>>> I'm still puzzled why you don't have a "data" directory here, but that
>>>>> shouldn't really matter. How did you create this index? I don't mean
>>>>> the data import handler; rather, how did you create the core that
>>>>> you're indexing to?
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>> On Wed, Sep 23, 2015 at 10:16 AM, Mark Fenbers
>>>>> <mark.fenbers@noaa.gov> wrote:
>>>>>
>>>>>> On 9/23/2015 12:30 PM, Erick Erickson wrote:
>>>>>>
>>>>>>> Then my next guess is you're not pointing at the index you think
>>>>>>> you are when you 'rm -rf data'
>>>>>>>
>>>>>>> Just ignore the Elall field for now I should think, although get
>>>>>>> rid of it if you don't think you need it.
>>>>>>>
>>>>>>> DIH should be irrelevant here.
>>>>>>>
>>>>>>> So let's back up.
>>>>>>> 1> go ahead and "rm -fr data" (with Solr stopped).
>>>>>>>
>>>>>> I have no "data" dir.  Did you mean "index" dir?  I removed 3
>>>>>> index directories (2 for spelling):
>>>>>> cd /localapps/dev/eventLog; rm -rfv index solr/spFile solr/spIndex
>>>>>>
>>>>>>> 2> start Solr
>>>>>>> 3> do NOT re-index.
>>>>>>> 4> look at your index via the schema-browser. Of course there
>>>>>>> should be nothing there!
>>>>>>>
>>>>>> Correct!  It said "there is no term info :("
>>>>>>
>>>>>>> 5> now kick off the DIH job and look again.
>>>>>>>
>>>>>> Now it shows a histogram, but most of the "terms" are long -- the
>>>>>> full texts of (the table.column) eventlogtext.logtext, including
>>>>>> the whitespace (with %0A used for newline characters)...  So, it
>>>>>> appears it is not being tokenized properly, correct?
>>>>>>
>>>>>>> Your logtext field should have only single tokens. The fact that
>>>>>>> you have some very long tokens (presumably with whitespace)
>>>>>>> indicates that you aren't really blowing the index away between
>>>>>>> indexing.
>>>>>>>
>>>>>> Well, I did this time for sure.  I verified that initially,
>>>>>> because it showed there was no term info until I DIH'd again.
>>>>>>
>>>>>>> Are you perhaps in Solr Cloud with more than one replica?
>>>>>>>
>>>>>> Not that I know of, but being new to Solr, there could be things
>>>>>> going on that I'm not aware of.  How can I tell?  I certainly didn't
>>>>>> set anything up for solrCloud deliberately.
>>>>>>
>>>>>>> In that case you
>>>>>>> might be getting the index replicated on startup assuming you
>>>>>>> didn't blow away all replicas. If you are in SolrCloud, I'd just
>>>>>>> delete the collection and start over, after ensuring that you'd
>>>>>>> pushed the configset up to Zookeeper.
>>>>>>>
>>>>>>> BTW, I always look at the schema.xml file from the Solr admin
>>>>>>> window just as a sanity check in these situations.
>>>>>>>
>>>>>> Good idea!  But the one shown in the browser is identical to the
>>>>>> one I've been editing!  So that's not an issue.
>>>>>>
>>>>>>
>>>
>>>
>>> --
>>> --------------------------
>>>
>>> Benedetti Alessandro
>>> Visiting card - http://about.me/alessandro_benedetti
>>> Blog - http://alexbenedetti.blogspot.co.uk
>>>
>>> "Tyger, tyger burning bright
>>> In the forests of the night,
>>> What immortal hand or eye
>>> Could frame thy fearful symmetry?"
>>>
>>> William Blake - Songs of Experience -1794 England
>>> ________________________________
>>>
>>>
>>> Syngenta Limited, Registered in England No 2710846; Registered Office:
>>> Syngenta Limited, European Regional Centre, Priestley Road, Surrey
>>> Research Park, Guildford, Surrey, GU2 7YH, United Kingdom
>>> ________________________________
>>> This message may contain confidential information. If you are not the
>>> designated recipient, please notify the sender immediately, and delete
>>> the original and any copies. Any use of the message by you is prohibited.
>>>
>
