lucene-general mailing list archives

From "Fredrik Andersson" <fidde.anders...@gmail.com>
Subject Re: Get element Class DOM !!!!
Date Tue, 13 Jan 2009 17:42:16 GMT
This has nothing to do with Lucene, but as I have written something very
similar I'm taking the bait. You're best off using XPath or a similar XML/HTML
query language to extract the product specs, prices, or whatever you're after.
Each webshop you're indexing will need its own set of query expressions for
extracting the data you need. So, extract the data with a query language and
then write a clean Lucene index from the parsed data.

http://www.nabble.com/w3.org---www-xpath-comments-f11758.html

http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/xpath/package-summary.html
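
To make the idea concrete, here is a minimal sketch using the javax.xml.xpath
package linked above. The class name PriceExtractor, the shop names, and the
XPath expressions are all hypothetical; each real shop needs an expression
written against its own markup. The sketch also assumes the page is already
well-formed XML. Real-world HTML usually has to be cleaned up by a tolerant
HTML parser first.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PriceExtractor {

    // Hypothetical per-shop expressions: one XPath query per webshop.
    static final Map<String, String> PRICE_XPATH = new HashMap<String, String>();
    static {
        PRICE_XPATH.put("shopA", "//b[@class='Price']");
        PRICE_XPATH.put("shopB", "//span[@class='cost']");
    }

    // Parses well-formed markup and returns the text of the first node
    // matched by the shop's price expression ("" if nothing matches).
    static String extractPrice(String shop, String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(PRICE_XPATH.get(shop), doc).trim();
    }

    public static void main(String[] args) throws Exception {
        String pageA = "<html><body><b class=\"Price\"> $40 </b></body></html>";
        String pageB = "<html><body><span class=\"cost\">35 EUR</span></body></html>";
        System.out.println(extractPrice("shopA", pageA));
        System.out.println(extractPrice("shopB", pageB));
    }
}
```

XPath.evaluate accepts any DOM node as its context, so the same call should
also work on the Element that your getTitle method already receives. The
extracted strings can then be stored as fields of the Lucene document.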

On Tue, Jan 13, 2009 at 6:29 PM, ppuyen <khongkhi02@mail.ru> wrote:

>
> Hi everyone,
> I am running the HTML file indexing example from "Lucene in Action".
> It can extract the title and body of an HTML file with getTitle and getBody.
>
> protected String getTitle(Element rawDoc) {
>    if (rawDoc == null) {
>      return null;
>    }
>    //System.out.println("getTitle");
>    String title = "";
>    NodeList children = rawDoc.getElementsByTagName("title");
>    if (children.getLength() > 0) {
>      Element titleElement = ((Element) children.item(0));
>      Text text = (Text) titleElement.getFirstChild();
>      if (text != null) {
>        title = text.getData();
>      }
>    }
>    System.out.println("getTitle:" + title);
>    return title;
>  }
>
>
> My project is a commercial search engine. That is, when I search for a product
> (for example, Nokia N72) and click the "Submit" button, the results need to
> show the product name and its price at each shop.
> When I run the HTML file indexer, it can get the title and body.
> My problem now is getting an element by its class (for example:
> <b class="Price">$40</b>), but each website uses a different class name.
> How could I do this?
> Thanks so much.
>
>
> --
> View this message in context:
> http://www.nabble.com/Get-element-Class-DOM-%21%21%21%21-tp21440434p21440434.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
>
>
