lucene-solr-user mailing list archives

From Ravi Kiran <ravi.bhas...@gmail.com>
Subject Re: Weird Facet and KeywordTokenizerFactory Issue
Date Wed, 07 Oct 2009 04:13:58 GMT
Hello Mr. Hostetter,
Thank you for patiently reading through my post. I apologize for being
cryptic in my previous messages.

>>when you cut/pasted the facet output, you excluded the field names.  based
>>on the schema & solrconfig.xml snippets you posted later, i'm assuming
>>they are usstate, and keyword, but you have to be explicit so that people
>>can help correlate the results you are getting with the schema you posted

I had to be brief because my facets are on the order of 100K values over
800K documents, and I was afraid nobody would read my message if I included
the complete schema.xml :-) Hence I showed only the relevant pieces of the
result, which show different fields having the same problem.

>>i'm assuming they are usstate, and keyword, but you have to be explicit so
>>that people can help correlate the results you are getting with the schema
>>you posted -- for example, you haven't posted anything that would verify
>>that the usstate field actually uses your keywordText field

Yes, you are right. Here is the complete relevant snippet for keywordText
and its associated fields. keyword, keywordlower and keywordformatted are
each aggregations of other fields such as person, personformatted,
organization and location. location is itself an aggregation of usstate and
country. The aggregation is done separately in custom code before indexing
into Solr.

    <fieldType name="keywordText" class="solr.TextField"
               sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="person" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>
    <field name="organization" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>
    <field name="location" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>
    <field name="country" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>
    <field name="usstate" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>
    <field name="subject" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>
    <field name="keyword" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>
    <field name="keywordlower" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>
    <field name="personformatted" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>
    <field name="keywordformatted" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>

>>A huge gap is in what your synonym files contain ... something weird in
>>there could easily explain superfluous terms getting added to your data.
Here are my synonym entries:
-------------------------------------------------------

#Persons
barack obama, barak obama, barack h. obama, barack hussein obama, barak hussein obama
hillary clinton, hillary r. clinton, hillary rodham clinton
timothy geithner, tim geithner, timothy f. geithner, geithner, timothy franz geithner
vladimir putin, putin

#Organizations
U.N, U.N., u.n, un, UN, United Nations => U.N
DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security => D.H.S
USCIS, United States Citizenship and Immigration Services, U.S.C.I.S. => United States Citizenship and Immigration Services, U.S.C.I.S
SEC, Securities and Exchange Commission, S.E.C, S.E.C, SEC. => Securities and Exchange Commission, S.E.C
FCC, Federal Communications Commission, F.C.C, F.C.C. => Federal Communications Commission, F.C.C
GSA, General Services Administration, G.S.A, G.S.A. => General Services Administration, G.S.A
SBA, Small Business Administration, S.B.A, S.B.A. => Small Business Administration, S.B.A.
FEMA, Federal Emergency Management Agency, FEMA. => FEMA
AT&T, ATT, ATT., AT&T., AT&T Wireless => AT&T
BBC, British Broadcasting Corporation, B.B.C, B.B.C. => B.B.C,BBC
Bank of America, BOA, B.O.A, Bank of America Corp, Bank of America Corp. => B.O.A
General Motors, G.M., G.M, GM, General Motors Corp., General Motors Corp => General Motors, G.M
NFL, National Football League, N.F.L, N.F.L. => N.F.L
Exxon Mobil, Exxon Mobil Corp => Exxon Mobil
Google, Google Inc, Google Inc. => Google
AIG, A.I.G, A.I.G., American International Group => American International Group, A.I.G
Goldman Sachs, Goldman Sachs Inc., Goldman Sachs Group Inc, Goldman Sachs Group Inc. => Goldman Sachs
GE, General Electric Co., General Electric Co, G.E, G.E., General Electric => G.E, General Electric
General Dynamics, General Dynamics Corp,General Dynamics Corp., General Dynamics Information Technology, General Dynamics Advanced Information Systems => General Dynamics
HP, Hewlett Packard Co,Hewlett Packard Co., Hewlett Packard, Hewlett-Packard, Hewlett-Packard Corp,H.P, H.P. => Hewlett Packard, H.P
IBM, International Business Machines, I.B.M, International Business Machines Corp => I.B.M
Johns Hopkins University, Johns Hopkins, JHU, J.H.U, J.H.U. => Johns Hopkins University, JHU, J.H.U
J.C. Penney, J.C. Penney Co. => J.C. Penney
JPMorgan Chase, JPMorgan Chase & Co., JPMorgan Chase & Co, JPMorgan => JPMorgan Chase & Co.
Lockheed Martin, Lockheed Martin Corp, Lockheed Martin Corp., Lockheed, Lockheed VH => Lockheed Martin
Merrill Lynch, Merrill Lynch & Co., Merrill, Merrill. => Merrill Lynch
Microsoft, Microsoft Corp., Microsoft Corp, Microsoft. => Microsoft
Northrop Grumman, Northrop Grumman Corp., Northrop Grumman Corp, Northrop, Northrop Corp. => Northrop Grumman
Smyth Co., Smyth Co
Sony, Sony Corp., Sony Corp => Sony Corp.
TJX Companies, TJX, TJX Cos. => TJX Companies
Target Corp., Target Corp, Target Corp stores => Target Corp.
Walmart, WalMart Inc, WalMart Stores, WalMart Stores Inc, WalMart Stores Inc. => WalMart Inc.
Yahoo, Yahoo Inc co, Yahoo Inc. => Yahoo Inc.
AP, AP., A.P, A.P., Associated Press => Associated Press

#Countries
USA,USA.,U.S.A.,u.s.a,u.s.a.,U.S,U.S.,US,US.,u.s, u.s.,United States,United States of America,United States Of America,united states,united states of america,united states of america => U.S.A
UAE,U.A.E.,United Arab Emirates,united arab emirates,uae,u.a.e, u.a.e. => United Arab Emirates,U.A.E
UK,U.K.,u.k,u.k.,United Kingdom,united kingdom => United Kingdom,U.K
USSR,U.S.S.R,U.S.S.R.,ussr,u.s.s.r,u.s.s.r.,Soviet Union,soviet union,Russia,russia => U.S.S.R,Soviet Union,Russia

#usa states
Alabama, Ala., AL => Alabama
Alaska, AK => Alabama
Arizona, Ariz., Ariz, AZ => Arizona
Arkansas, Ark., AR => Arkansas
California, Calif., CA => California
Colorado, Colo., CO => Colorado
Connecticut, Conn., CT => Connecticut
Delaware, Del., DE => Delaware
Florida, Fla., FL => Florida
Georgia, Ga., GA => Georgia
Hawaii, Hawaii, HI => Hawaii
Idaho, Idaho, ID => Idaho
Illinois, Ill., IL => Illinois
Indiana, Ind., IN => Indiana
Iowa, IA => Iowa
Kansas, Kans., KS => Kansas
Kentucky, Ky., KY => Kentucky
Louisiana, La., LA => Louisiana
Maine, ME => Maine
Maryland, Md., MD => Maryland
Massachusetts, Mass., MA => Massachusetts
Michigan, Mich., MI => Michigan
Minnesota, Minn., MN => Minnesota
Mississippi, Miss., MS => Mississippi
Missouri, Mo., MO => Missouri
Montana, Mont., MT => Montana
Nebraska, Nebr., NE => Nebraska
Nevada, Nev., NV => Nevada
New Hampshire, N.H., NH => New Hampshire
New Jersey, N.J., NJ => New Jersey
New Mexico, N.M., NM => New Mexico
New York, N.Y., NY => New York
North Carolina, N.C., NC => North Carolina
North Dakota, N.D., ND => North Dakota
Ohio, OH => Ohio
Oklahoma, Okla., OK => Oklahoma
Oregon, Ore., OR => Oregon
Pennsylvania, Pa., PA => Pennsylvania
Rhode Island, R.I., RI => Rhode Island
South Carolina, S.C., SC => South Carolina
South Dakota, S.D., SD => South Dakota
Tennessee, Tenn., TN => Tennessee
Texas, Tex., TX => Texas
Utah, UT => Utah
Vermont, Vt., VT => Vermont
Virginia, Va., VA => Virginia
Washington, Wash., WA => Washington
West Virginia, W.Va., WV => West Virginia
Wisconsin, Wis., WI => Wisconsin
Wyoming, Wyo., WY => Wyoming

#US TERRITORIES
American Samoa, AS => American Samoa
DC,D.C.,D.C,dc,d.c,d.c.,District of Columbia,District Of Columbia,district of columbia,Washington D.C.,Washington DC.,Washington DC,washington d.c,washington dc,washington d.c.,washington,Washington => D.C,Washington D.C
Federated States of Micronesia, FSM, FM => Micronesia
Guam, GU => Guam
Marshall Islands, MH => Marshall Islands
Northern Mariana Islands, MP => Northern Mariana Islands
Palau, PW => Palau
Puerto Rico, P.R., PR => Puerto Rico
Virgin Islands, V.I., VI => Virgin Islands, VI
-------------------------------------------------------
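
A quick way to audit a list like the one above for a bare term such as "New" is sketched below in Python. It is only an illustration: parse_synonym_line and find_term are hypothetical helper names, the sample lines are a few entries copied from my file, and it assumes each rule sits on a single line. It reports whether the term appears as a whole entry or only as one word inside a multi-word entry. Whether the second case matters depends on how the synonym filter tokenizes the file, which I have not verified.

```python
def parse_synonym_line(line):
    """Return the distinct entries on one synonym line ([] for comments and blanks)."""
    line = line.strip()
    if not line or line.startswith("#"):
        return []
    entries = []
    # A rule is either "a, b, c" or "a, b => c, d"; collect entries from both sides.
    for side in line.split("=>"):
        entries.extend(e.strip() for e in side.split(",") if e.strip())
    # Drop repeats while keeping the original order.
    return list(dict.fromkeys(entries))


def find_term(lines, term):
    """List (kind, entry) pairs where `term` is a whole entry or one word of an entry."""
    hits = []
    for line in lines:
        for entry in parse_synonym_line(line):
            if entry == term:
                hits.append(("whole", entry))
            elif term in entry.split():
                hits.append(("word", entry))
    return hits


# A few rules taken from the list above (assumed to be one rule per line).
sample = [
    "#usa states",
    "New Mexico, N.M., NM => New Mexico",
    "New York, N.Y., NY => New York",
    "Ohio, OH => Ohio",
]

for kind, entry in find_term(sample, "New"):
    print(kind, entry)
```

To run it over the real file, pass the lines of synonyms.txt to find_term instead of the sample list.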

>>all that said: my best guess is that you have old data in your index from
>>an older version of your schema when you had differnet analyzers
>>configured.
I reindexed all 800K articles twice, wiping the entire index each time, and
got the same result.

>>if a term is showing up in the facet counts, you can search on it -- find
>>the first doc that matches, verify that the term isn't actually in the
>>data, and then reindex that one doc -- if it stops matching your search
>>(and the facet count drops by one) then i'm right, just reindex
>>everything

That was the first thing I did. I ran the analyzer on the field types and
fields and found no problem there.
Then I queried keyword:"New" via the Solr admin console. It returned docs
that had N.Y, New York, New Mexico etc. (because of the synonyms) but no
docs that had just "New". Yet I could still see "New" in the facets, as I
mentioned in my previous posts. That is what was baffling me.
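
For anyone who wants to repeat the check outside the admin console, here is a minimal Python sketch that only builds the request URL. The base URL, the unique key field name "id" and the helper name facet_check_url are assumptions for illustration, not my actual setup. The request parameters (q, rows, fl, facet, facet.field, facet.prefix, facet.mincount) are the standard Solr ones.

```python
from urllib.parse import urlencode


def facet_check_url(base_url, field, term, rows=1):
    """Build a /select URL that searches for `term` and facets on the same field."""
    params = [
        ("q", f'{field}:"{term}"'),   # phrase query on the suspicious term
        ("rows", rows),               # only the first matching doc is needed
        ("fl", f"id,{field}"),        # "id" is an assumed unique key field
        ("facet", "true"),
        ("facet.field", field),
        ("facet.prefix", term),       # keep the facet list short: terms starting with `term`
        ("facet.mincount", 1),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr instance; adjust host, port and path as needed.
url = facet_check_url("http://localhost:8983/solr", "keyword", "New")
print(url)
```

Comparing the stored values of the returned doc with the facet terms that start with "New" shows whether the bare term exists in any document's data or only in the index.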


On Tue, Oct 6, 2009 at 7:58 PM, Chris Hostetter <hossman_lucene@fucit.org> wrote:

>
> A few comments about the info you've provided...
>
> when you cut/pasted the facet output, you excluded the field names.  based
> on the schema & solrconfig.xml snippets you posted later, i'm assuming
> they are usstate, and keyword, but you have to be explicit so that people
> can help correlate the
> results you are getting with the schema you posted -- for example, you
> haven't posted anything that would verify that the usstate field actually
> uses your keywordText field, for ll we know it has a different field type
> by mistake (which would explain your problem). ... you have to post
> everything that would let us connect the dots from input to output in
> order to see where things might be going wrong.
>
> A huge gap is in what your synonym files contain ... something weird in
> there could easily explain superfluous terms getting added to your data.
>
> all that said: my best guess is that you have old data in your index from
> an older version of your schema when you had differnet analyzers
> configured.
>
> if a term is showing up in the facet counts, you can search on it -- find
> the first doc that matches, verify that the term isn't actually in the
> data, and then reindex that one doc -- if it stops matching your search
> (and the facet count drops by one) then i'm right, just reindex
> everything.
>
> (this is where a timestamp field recording exactly when each doc was added
> to the index comes in handy, you can compare it with the file modification
> time on your schema.xml and be certain which docs where indexed prior to
> you changes)
>
>
>
> -Hoss
>
>
