any23-user mailing list archives

From Frank Apap <fsa...@gmail.com>
Subject Re: Extracting Meta Tags
Date Tue, 08 Dec 2015 17:48:34 GMT
Here is one such URL:
http://www.yummly.com/recipe/Gabis-Low-Carb-Yeast-Bread-1073667?columns=4&position=1%2F74

I'm currently able to get meta tags using the following code:

HTTPDocumentSource doc = new HTTPDocumentSource(
    DefaultHTTPClient.createInitializedHTTPClient(), info.getId());
InputStream documentInputStream = doc.openInputStream();
TagSoupParser tagSoupParser =
    new TagSoupParser(documentInputStream, doc.getDocumentURI());
Document document = tagSoupParser.getDOM();

NodeList nl = document.getElementsByTagName("meta");
for (int i = 0; i < nl.getLength(); i++) {
    //System.out.println(nl.item(i).getNodeType());
    //System.out.println(nl.item(i).getNodeName());
    Element e = (Element) nl.item(i);

    // Use the first non-empty attribute among property, name and itemprop.
    String name = e.getAttribute("property");
    if (name == null || name.trim().length() == 0) {
        name = e.getAttribute("name");
    }
    if (name == null || name.trim().length() == 0) {
        name = e.getAttribute("itemprop");
    }

    if (name != null && name.trim().length() > 0) {
        String value = e.getAttribute("content");
        logger.info(name + " " + value);
        info.addInfo("meta_" + name, value);
    }
}
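
For anyone who wants to try the attribute fallback in isolation, here is a minimal, self-contained sketch of the same idea. It is not Any23 code: it assumes well-formed markup and uses the JDK's own XML parser in place of TagSoupParser, and the class name MetaTagSketch, the method extractMeta and the sample page are made up for illustration. It collects results into a Map rather than calling info.addInfo, so a repeated key overwrites the earlier value.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Any23.
class MetaTagSketch {

    // Attributes checked in order; the first non-blank one names the tag.
    private static final String[] KEY_ATTRIBUTES = {"property", "name", "itemprop"};

    // Assumes the markup is well-formed XML; real pages need a lenient parser.
    static Map<String, String> extractMeta(String markup) {
        Document document;
        try {
            document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }

        Map<String, String> result = new LinkedHashMap<>();
        NodeList metas = document.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            for (String attribute : KEY_ATTRIBUTES) {
                // Element.getAttribute returns "" (not null) when the attribute is absent.
                String key = meta.getAttribute(attribute).trim();
                if (!key.isEmpty()) {
                    result.put("meta_" + key, meta.getAttribute("content"));
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample page; the charset tag has no key attribute and is skipped.
        String page = "<html><head>"
                + "<meta property=\"og:title\" content=\"Low Carb Yeast Bread\"/>"
                + "<meta name=\"description\" content=\"A bread recipe\"/>"
                + "<meta itemprop=\"image\" content=\"bread.jpg\"/>"
                + "<meta charset=\"utf-8\"/>"
                + "</head><body/></html>";
        extractMeta(page).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Because getAttribute on a DOM Element returns an empty string for a missing attribute, the null checks in the loop above are harmless but not strictly needed.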

On Mon, Dec 7, 2015 at 10:59 PM, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:

> Hi Frank,
>
> On Mon, Dec 7, 2015 at 3:50 PM, <user-digest-help@any23.apache.org> wrote:
>
>>
>> I'm trying to extract meta tags from webpages. I'm using the code below
>> but am finding that only a small subset of the meta tags is returned.
>> Some meta tags I am interested in, such as those for Facebook Open Graph,
>> are not being returned.
>>
>
> Any23's default configuration [0] specifies that HTML head meta tags are
> extracted. There is therefore no need to change this behaviour, as
> extraction of HTML meta tags 'should' be happening by default.
> You are also correctly defining this within your code as below!
> Can you please post an example of a URL we can test against?
> Thanks
> Lewis
>
> [0]
> https://github.com/apache/any23/blob/master/api/src/main/resources/default-configuration.properties#L70
>
>
>
