lucene-solr-user mailing list archives

From "Ahmed Hammad" <ahm...@gmail.com>
Subject Re: Regex Transformer Error
Date Mon, 17 Nov 2008 21:19:47 GMT
Hi All,

Although HTMLStripStandardTokenizerFactory removes HTML tags during analysis,
the tags are still stored in the index and have to be removed at search time.
In my case the HTML tags are not needed at all, so I created an
HTMLStripTransformer for the DIH that removes the tags and saves space in the
index. I used the HTML parser included with Lucene
(org.apache.lucene.demo.html). It performs well and worked for me when I was
using Lucene, before moving to Solr.
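
To make the idea of a row-level HTML stripper concrete, here is a minimal sketch. It is not the HTMLStripTransformer described above: it swaps the Lucene demo HTML parser for a plain regular expression, and the class name HtmlStripSketch is hypothetical. It also assumes that DIH will accept a custom transformer class exposing a transformRow(Map<String, Object> row) method; check the DIH documentation for your Solr version before relying on that.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch of a row-level HTML stripper.
 * Uses a regex rather than the Lucene demo HTML parser.
 */
public class HtmlStripSketch {

    // Matches anything that looks like a tag; [^>] also matches newlines.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Removes tag-like spans from the given text. */
    public static String stripTags(String text) {
        return TAG.matcher(text).replaceAll("");
    }

    /**
     * Assumed entry point: a method of this shape is expected to be called
     * once per imported row. Every String value is stripped in place;
     * non-String values are left untouched.
     */
    public Object transformRow(Map<String, Object> row) {
        for (Map.Entry<String, Object> entry : row.entrySet()) {
            if (entry.getValue() instanceof String) {
                entry.setValue(stripTags((String) entry.getValue()));
            }
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("content", "<p>Hello <b>Solr</b></p>");
        System.out.println(new HtmlStripSketch().transformRow(row));
    }
}
```

A regex like this does not decode entities such as &amp;amp; and keeps the text inside script or style elements, which is one reason a real HTML parser may be the better choice.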

What do you think? Is it worth contributing?

My best wishes,

Regards,
Ahmed

On Thu, Nov 6, 2008 at 2:39 AM, Norskog, Lance <lance@divvio.com> wrote:

> There is a nice HTML stripper inside Solr.
> "solr.HTMLStripStandardTokenizerFactory"
>
> -----Original Message-----
> From: Ahmed Hammad [mailto:ahm507@gmail.com]
> Sent: Wednesday, November 05, 2008 10:43 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Regex Transformer Error
>
> Hi,
>
> It works with the attribute regex="&lt;(.|\n)*?&gt;"
>
> Sorry for the disturbance.
>
> Regards,
>
> ahmd
>
>
> On Wed, Nov 5, 2008 at 8:18 PM, Ahmed Hammad <ahm507@gmail.com> wrote:
>
> > Hi,
> >
> > I am using the Solr 1.3 data import handler. One of my table fields
> > contains HTML tags, and I want to strip them from the field text. So
> > obviously I need the Regex Transformer.
> >
> > I added transformer="RegexTransformer" attribute to my entity and a
> > new field with:
> >
> > <field sourceColName="content" column="content" regex="English"
> > replaceWith="XXXXX"/>
> >
> > Everything works fine. The text is replaced without any problem. The
> > problem happened with my regular expression to strip HTML tags. So I
> > use regex="<(.|\n)*?>". Of course the characters '<' and '>' are not
> > allowed in XML. I tried the following regex="&lt;(.|\n)*?&gt;" and
> > regex="&#3C;(.|\n)*?&#3E;" but I get the following error:
> >
> > The value of attribute "regex" associated with an element type "field"
> > must not contain the '<' character. at
> > com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown
> > Source) ...
> >
> > The full stack trace is following:
> >
> > *FATAL: Could not create importer. DataImporter config invalid
> > org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
> >   at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:114)
> >   at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:206)
> >   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
> >   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
> >   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> >   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> >   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> >   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> >   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
> >   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> >   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
> >   at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
> >   at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
> >   at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
> >   at java.lang.Thread.run(Unknown Source)
> > Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context Processing Document #
> >   at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:176)
> >   at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:93)
> >   at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:106)
> >   ... 17 more
> > Caused by: org.xml.sax.SAXParseException: The value of attribute "regex" associated with an element type "field" must not contain the '<' character.
> >   at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
> >   at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> >   at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:166)
> >   ... 19 more *
> >
> > *description* *The server encountered an internal error (FATAL: Could
> > not create importer. DataImporter config invalid; stack trace identical
> > to the one above) that prevented it from fulfilling this request.*
> >
> > I appreciate your help.
> >
> > Regards,
> > ahmd
> >
> >
>
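
The escaping issue discussed in the quoted messages can be reproduced outside Solr. The sketch below uses the JDK's own XML parser to show that an attribute written as regex="&amp;lt;(.|\n)*?&amp;gt;" reaches the application as the pattern <(.|\n)*?>, while the unescaped form is rejected by the parser before any regex is compiled. The class name RegexAttributeSketch and the one-element config snippet are illustrative assumptions, not part of DIH.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class RegexAttributeSketch {

    /**
     * Parses a one-element XML snippet and returns its "regex" attribute,
     * or null when the snippet is not well-formed XML.
     */
    public static String readRegexAttribute(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Element field = builder
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return field.getAttribute("regex");
        } catch (SAXException e) {
            return null; // the parser rejected the snippet
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Escaped form: the parser hands back the raw '<' and '>' characters.
        String escaped =
                "<field column=\"content\" regex=\"&lt;(.|\\n)*?&gt;\" replaceWith=\" \"/>";
        String regex = readRegexAttribute(escaped);
        System.out.println(regex);
        System.out.println(
                Pattern.compile(regex).matcher("<p>Hello\n<b>Solr</b></p>").replaceAll(""));

        // Unescaped form: not well-formed, so no attribute value is returned.
        String raw =
                "<field column=\"content\" regex=\"<(.|\\n)*?>\" replaceWith=\" \"/>";
        System.out.println(readRegexAttribute(raw));
    }
}
```

The lazy quantifier in <(.|\n)*?> stops at the first closing '>', so each tag is matched separately even when it spans a line break.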
