incubator-any23-dev mailing list archives

From armon <zhime...@gmail.com>
Subject Re: about the supported input format of any23
Date Fri, 22 Jun 2012 09:26:36 GMT
Hi Lewis, 

I even saved the XML data to a file and ran the command: ./any23 rover @filepath , but it still
doesn't work. Finally, I created a simple XML data file to test, and again nothing was retrieved,
so I think it may not be a URL issue but something related to the parser engine.

Is Any23 0.7 coming soon, and will it meet my particular request? If so, I will just get the
latest 0.7 and test it again.

Thanks for your reply.

All the best!

armon.chen



On Friday, June 22, 2012 at 5:13 PM, Lewis John Mcgibbney wrote:

> So I suppose there are a couple of options here.
> 
> On Fri, Jun 22, 2012 at 10:02 AM, armon <zhimeng9@gmail.com> wrote:
> > 
> > But we know that there is other data in the page that can't be retrieved,
> > such as the XML data (in the attachment to my last email).
> 
> Yes, there is a good bit more content, but the parsing implementations
> within Any23 do not aim to extract content strings. Instead, the
> project (the parsing side anyway) gains its strength from extracting
> triples and the like.
> 
> You could quickly fire up a Nutch instance to gather content then use
> the basic-crawler from Any23 for triples... this is until we implement
> an Any23 parsing and indexing filter within Nutch which will provide a
> complete solution to your particular request.
> 
> You could easily implement the above programmatically which would
> enable you to fetch page content as well as extract the triples from
> it separately.
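
A rough sketch of that programmatic split, for illustration only. It is written in Python with the standard library rather than against the Any23 or Nutch APIs. The names extract_text, process_page, and the extract_triples callback are hypothetical. The callback stands in for whatever triple extractor is actually used (for example a wrapper around Any23), and is an assumption, not part of Any23 itself.

```python
from html.parser import HTMLParser
from typing import Callable, List, Tuple

# A triple is modelled here as a plain (subject, predicate, object) tuple.
Triple = Tuple[str, str, str]


class TextCollector(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the page's content strings joined by single spaces."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)


def process_page(
    html: str,
    extract_triples: Callable[[str], List[Triple]],
) -> Tuple[str, List[Triple]]:
    """Run content extraction and triple extraction as two separate steps.

    extract_triples is a hypothetical hook: plug in a real extractor there.
    """
    return extract_text(html), extract_triples(html)
```

Keeping the two steps behind separate functions means the content side (a crawler such as Nutch, or the simple parser above) can be swapped without touching the triple extraction, and vice versa.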

