lucene-solr-user mailing list archives

From Joe Obernberger <joseph.obernber...@gmail.com>
Subject Re: model building
Date Wed, 22 Mar 2017 13:58:43 GMT
Thank you, Tim.  I appreciate the tips.  At this point, I'm just trying
to understand how to use it.  The 30 tweets that I've selected so far
are, in fact, threatening.  The things people say!  My favorite so far is
'disingenuous twat waffle'.  No kidding.

The issue that I'm having is not with the model itself; it's with creating
the model from a query other than *:*.

Example:

update(models2, batchSize="50",
              train(TRAINING,
                       features(TRAINING,
                                      q="*:*",
                                      featureSet="threat1",
                                      field="ClusterText",
                                      outcome="out_i",
                                      positiveLabel=1,
                                      numTerms=100),
                       q="*:*",
                       name="threat1",
                       field="ClusterText",
                       outcome="out_i",
                       maxIterations="100"))

Works great.  Makes a model - model works - can see reasonable results.  
However, say I've tagged a training set inside a larger collection 
called COL1 with a field called JoeID - like this:

update(models2, batchSize="50",
              train(COL1,
                       features(COL1,
                                      q="JoeID:Training",
                                      featureSet="threat2",
                                      field="ClusterText",
                                      outcome="out_i",
                                      positiveLabel=1,
                                      numTerms=1000),
                       q="JoeID:Training",
                       name="threat2",
                       field="ClusterText",
                       outcome="out_i",
                       maxIterations="100"))

This does not work as expected.  I can query the COL1 collection for 
JoeID:Training, and get a result set that I want to train on, but the 
model creation seems to not work.  At this point, if I want to make a 
model, I need to create a collection, put the training set into it, and 
then train on *:*.  This is fine, but I'm not sure if it's how it is 
supposed to work.
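
One way to narrow this down, offered as a sketch rather than a confirmed
diagnosis: run the inner features() expression by itself against COL1.  It
should emit one tuple per selected term (with fields such as term_s,
score_f and idf_d), which would show whether the JoeID:Training filter is
honored at the feature-selection step, before train() or update() are
involved.  The featureSet name "threat2check" and numTerms=10 are
placeholders chosen for illustration only.

features(COL1,
         q="JoeID:Training",
         featureSet="threat2check",
         field="ClusterText",
         outcome="out_i",
         positiveLabel=1,
         numTerms=10)

If the selected terms look like what a dedicated collection produces with
q="*:*", the problem is more likely in the train() step; if they do not,
the filter query is the first place to look.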

-Joe


On 3/21/2017 10:17 PM, Tim Casey wrote:
> Joe,
>
> To do this correctly, soundly, you will need to sample the data and mark
> them as threatening or neutral.  You can probably expand on this quite a
> bit, but that would be a good start.  You can then draw another set of
> samples and see how you did.  You use one to train and one to validate.
>
> What you are doing is probably just noise, from a model point of view, and
> it will probably not make too much difference how you index/query/model
> through the noise.
>
> I don't mean this critically, just plainly.  Effectively the less
> mathematically correctly you do this process, the more anecdotal the result.
>
> tim
>
>
> On Mon, Mar 20, 2017 at 4:42 PM, Joel Bernstein <joelsolr@gmail.com> wrote:
>
>> I've only tested with the training data in its own collection, but it was
>> designed for multiple training sets in the same collection.
>>
>> I suspect your training set is too small to get a reliable model from.
>> The training sets we tested with were considerably larger.
>>
>> All the idfs_ds values being the same seems odd though. The idfs_ds in
>> particular were designed to be accurate when there are multiple training
>> sets in the same collection.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Mon, Mar 20, 2017 at 5:41 PM, Joe Obernberger <
>> joseph.obernberger@gmail.com> wrote:
>>
>>> If I put the training data into its own collection and use q="*:*", then
>>> it works correctly.  Is that a requirement?
>>> Thank you.
>>>
>>> -Joe
>>>
>>>
>>>
>>> On 3/20/2017 3:47 PM, Joe Obernberger wrote:
>>>
>>>> I'm trying to build a model using tweets.  I've manually tagged 30 tweets
>>>> as threatening, and 50 random tweets as non-threatening.  When I build the
>>>> model with:
>>>>
>>>> update(models2, batchSize="50",
>>>>               train(UNCLASS,
>>>>                        features(UNCLASS,
>>>>                                       q="ProfileID:PROFCLUST1",
>>>>                                       featureSet="threatFeatures3",
>>>>                                       field="ClusterText",
>>>>                                       outcome="out_i",
>>>>                                       positiveLabel=1,
>>>>                                       numTerms=250),
>>>>                        q="ProfileID:PROFCLUST1",
>>>>                        name="threatModel3",
>>>>                        field="ClusterText",
>>>>                        outcome="out_i",
>>>>                        maxIterations="100"))
>>>>
>>>> It appears to work, but all the idfs_ds values are identical. The
>>>> terms_ss values look reasonable, but nearly all the weights_ds are 1.0.
>>>> The out_i value is either -1 for non-threatening tweets or +1 for
>>>> threatening tweets.  I'm trying to follow along with Joel Bernstein's
>>>> excellent post here:
>>>> http://joelsolr.blogspot.com/2017/01/deploying-ai-alerting-system-with-solrs.html
>>>>
>>>> Tips?
>>>>
>>>> Thank you!
>>>>
>>>> -Joe
>>>>
>>>>

