lucene-solr-user mailing list archives

From Simon Blandford <simon.blandf...@bkconnect.net>
Subject Metadata and HTML ending up in searchable text
Date Thu, 26 May 2016 13:48:44 GMT
Hi,

I am using Solr 6.0 on Ubuntu 14.04.

I am ending up with a lot of junk in the indexed text body.

The JSON output of a search result shows the indexed text starting with:
body_txt_en: " stream_size 36499 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By...."

Then, once it gets to the actual text, CSS class names that were in 
<p>, <div> and similar tags appear in it,
e.g. "....the power of calibre3 silence calibre2 and....", where 
"calibre3" etc. are the CSS class names.

All this junk is searchable and is polluting the index.

I would like to index _only_ the actual content I am interested in 
searching for.

Steps to reproduce:

1) Solr installed by untarring the Solr tgz in /opt.

2) Core created by typing "bin/solr create -c mycore"

3) Solr started with bin/solr start

4) TXT document indexed using the following command:
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
  -F "content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"

5) HTML document indexed using the following command:
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true" \
  -F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"

6) Query using URL: 
http://localhost:8983/solr/mycore/select?q=especially&wt=json
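
For anyone reproducing this, steps 4 and 5 can be wrapped in a small script. This is only a sketch: the function names (extract_url, index_file, extract_only_url) are hypothetical, the host, port and parameters are copied from the commands above, and the last helper assumes the extract handler accepts Solr Cell's extractOnly and extractFormat parameters, which should return the extracted content without indexing it.

```shell
#!/bin/sh
# Hypothetical helper: builds the /update/extract URL used in steps 4 and 5.
# Arguments: core name, document id. Host and port are taken from the post.
extract_url() {
    printf 'http://localhost:8983/solr/%s/update/extract?literal.id=%s&uprefix=attr_&fmap.content=body_txt_en&commit=true' "$1" "$2"
}

# Hypothetical helper: posts one file to the extract handler, as in steps 4 and 5.
# Arguments: core name, document id, path to the file.
index_file() {
    curl "$(extract_url "$1" "$2")" -F "content/$(basename "$3")=@$3"
}

# Diagnostic variant (assumption: extractOnly and extractFormat are accepted
# by this handler): asks for the extracted text without indexing anything.
# Argument: core name.
extract_only_url() {
    printf 'http://localhost:8983/solr/%s/update/extract?extractOnly=true&extractFormat=text&wt=json' "$1"
}
```

Usage would be along the lines of: index_file mycore doc1 /home/user/Documents/library/UsingMailingLists.txt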

Result:

For the txt file, I get the following JSON for the document...

{
     id: "doc1",
     attr_stream_size: [
         "8107"
     ],
     attr_x_parsed_by: [
         "org.apache.tika.parser.DefaultParser",
         "org.apache.tika.parser.txt.TXTParser"
     ],
     attr_stream_content_type: [
         "text/plain"
     ],
     attr_stream_name: [
         "UsingMailingLists.txt"
     ],
     attr_stream_source_info: [
         "content/UsingMailingLists.txt"
     ],
     attr_content_encoding: [
         "ISO-8859-1"
     ],
     attr_content_type: [
         "text/plain; charset=ISO-8859-1"
     ],
     body_txt_en: " stream_size 8107 X-Parsed-By 
org.apache.tika.parser.DefaultParser X-Parsed-By 
org.apache.tika.parser.txt.TXTParser stream_content_type text/plain 
stream_name UsingMailingLists.txt stream_source_info 
content/UsingMailingLists.txt Content-Encoding ISO-8859-1 Content-Type 
text/plain; charset=ISO-8859-1 Search: [value ] [Titles] [Text] 
Solr_Wiki Login ****** UsingMailingLists ****** * FrontPage * 
RecentChanges...etc",
     _version_: 1535398235801124900
}

For the HTML file, I get the following JSON for the document...

{
     id: "doc2",
     attr_stream_size: [
         "20440"
     ],
     attr_x_parsed_by: [
         "org.apache.tika.parser.DefaultParser",
         "org.apache.tika.parser.html.HtmlParser"
     ],
     attr_stream_content_type: [
         "text/html"
     ],
     attr_stream_name: [
         "UsingMailingLists.html"
     ],
     attr_stream_source_info: [
         "content/UsingMailingLists.html"
     ],
     attr_dc_title: [
         "UsingMailingLists - Solr Wiki"
     ],
     attr_content_encoding: [
         "UTF-8"
     ],
     attr_robots: [
         "index,nofollow"
     ],
     attr_title: [
         "UsingMailingLists - Solr Wiki"
     ],
     attr_content_type: [
         "text/html; charset=utf-8"
     ],
     body_txt_en: " stylesheet text/css utf-8 all 
/wiki/modernized/css/common.css stylesheet text/css utf-8 screen 
/wiki/modernized/css/screen.css stylesheet text/css utf-8 print 
/wiki/modernized/css/print.css stylesheet text/css utf-8 projection 
/wiki/modernized/css/projection.css alternate Solr Wiki: 
UsingMailingLists 
/solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1

application/rss+xml Start /solr/FrontPage Alternate Wiki Markup 
/solr/UsingMailingLists?action=raw Alternate print Print View 
/solr/UsingMailingLists?action=print Search /solr/FindPage Index 
/solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting 
stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser 
X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type 
text/html stream_name UsingMailingLists.html stream_source_info...etc",
     _version_: 1535398408383103000
}
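
To check for this symptom without reading the whole field by eye, a small test like the following could be used. It is a rough sketch: has_metadata_noise is a hypothetical name, and the patterns are just the metadata keys seen in the two results above, not a complete list.

```shell
# Hypothetical check: succeeds (exit status 0) when a field value contains
# one of the metadata keys seen in the body_txt_en output above.
# Argument: the text of the field to inspect.
has_metadata_noise() {
    case "$1" in
        *X-Parsed-By*|*stream_size*|*stream_content_type*) return 0 ;;
        *) return 1 ;;
    esac
}
```

The argument could be the response body of a select query restricted to the body_txt_en field, fetched with curl -s; the exact query is left to the reader.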



