lucene-solr-user mailing list archives

From: "Jack Krupansky" <j...@basetechnology.com>
Subject: Re: Unsuccessful queries for terms next to tabs and newlines in uploaded Word documents
Date: Mon, 31 Mar 2014 09:29:57 GMT

What field type and analyzer are you using? Normally, both the standard and
whitespace tokenizers will break tokens at all whitespace, which includes
tabs.
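
As a rough illustration of what "breaking at all whitespace" means for the
sample text from the question, here is a minimal Python sketch. It is not
Lucene's tokenizer code; the helper name whitespace_tokens is made up for this
example, and it only mimics the splitting behaviour using Python's own
whitespace rules.

```python
def whitespace_tokens(text):
    # str.split() with no arguments splits on runs of any whitespace
    # (spaces, tabs, newlines) and discards the empty pieces.
    # This imitates whitespace tokenization; it is not Lucene code.
    return text.split()


# Sample value from the question, with real tab characters between the words.
sample = "Vorname:\t\t\tYasmin"
tokens = whitespace_tokens(sample)
print(tokens)  # ['Vorname:', 'Yasmin']
```

If the field is tokenized this way, "Yasmin" ends up as its own token, so the
tabs alone should not make the term unsearchable.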

Check your df and qf parameters to see that they are querying the 
attr_content field. Query the attr_content field directly, as a test.
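
A small Python sketch of the direct-field test, assuming the host, port and
core name from the query URL in the original message. It only builds the two
request URLs (one with the field named in q, one using df) and does not send
them; the helper name build_select_url is hypothetical.

```python
from urllib.parse import urlencode

# Base URL taken from the query in the original message.
SELECT_URL = "http://127.0.0.1:8983/solr/collection1/select"


def build_select_url(params):
    # urlencode percent-encodes the values, so ":" becomes "%3A".
    return SELECT_URL + "?" + urlencode(params)


# Variant 1: name the field explicitly in the query string.
direct_url = build_select_url(
    {"q": "attr_content:Yasmin", "wt": "json", "indent": "true"}
)

# Variant 2: keep the bare term and set the default field with df.
df_url = build_select_url(
    {"q": "Yasmin", "df": "attr_content", "wt": "json", "indent": "true"}
)

print(direct_url)
print(df_url)
```

If either URL returns the document while the plain q=Yasmin query does not,
the default query was most likely searching a different field.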

-- Jack Krupansky

-----Original Message----- 
From: chtjfi
Sent: Monday, March 31, 2014 3:23 AM
To: solr-user@lucene.apache.org
Subject: Unsuccessful queries for terms next to tabs and newlines in 
uploaded Word documents

Short Version: What do I need to do to successfully query for terms that are
adjacent to tabs and newlines (i.e. \t, \n) in an uploaded Word document?

Long Version:

I am using Solr 4.6.1. I am running an unmodified version of the example
core that is started by running java -jar start.jar in the example
directory. The schema.xml in use is example/solr/collection1/conf/schema.xml
and is unmodified (it is the one downloaded with the distribution), so I
won't post it unless someone says it is helpful.

After uploading a Word document to Solr with the command
http://localhost:8983/solr/update/extract?literal.id=yabba&uprefix=attr_&fmap.content=attr_content&commit=true
there are hundreds of tab and newline characters (i.e. \n and \t) in the
attr_content field. When a string occurs only once in the document, and is
adjacent to one of these characters, queries for that term are not
successful.

A specific example is an uploaded Word document that after upload contains
"Vorname:\t\t\tYasmin" in the attr_content field. The original document
contained "Vorname:", then two tab characters, then "Yasmin" (the string
"\t" does not appear in the document). The string "Yasmin" appears only in
that location in the document.

When I query for "Yasmin" with the query
http://127.0.0.1:8983/solr/collection1/select?q=Yasmin&wt=json&indent=true I
get no results. Queries for terms that are not next to a \t or a \n are
successful.

What can I do so that a query for a term next to a tab or newline will be
successful? Must I change the way the document is uploaded? Or change the
way the search is performed?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unsuccessful-queries-for-terms-next-to-tabs-and-newlines-in-uploaded-Word-documents-tp4128090.html
Sent from the Solr - User mailing list archive at Nabble.com. 

