creadur-dev mailing list archives

From Marija Šljivović <mak...@gmail.com>
Subject A way to implement Krugle and Koders code search parsers for apache-rat-pd
Date Sun, 09 Aug 2009 22:34:41 GMT
Hi

For some time I have been working on an implementation of a Koders [1] search parser.
The aim is a working and maintainable version of the Koders engine parser.
It must be able to load the koders.com web page, enter a query, parse
the retrieved result page, and extract useful information from it.
Sometimes there can be more than one page of results to check. This
code must be easy to maintain and to change (if the site changes).
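
As a rough sketch of the multi-page case, the loop over result pages can be kept separate from the site-specific parsing, so that a change to the site only touches one class. All names here (ResultPage, PageParser, PagedSearch) are hypothetical and not part of apache-rat-pd; the real parser would implement PageParser on top of whatever library loads the page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** One parsed result page: the hits on it plus the link to the next page, if any. */
final class ResultPage {
    final List<String> hits;
    final Optional<String> nextPageUrl;

    ResultPage(List<String> hits, Optional<String> nextPageUrl) {
        this.hits = hits;
        this.nextPageUrl = nextPageUrl;
    }
}

/** Loads one URL and parses it; the only part that depends on the site's HTML. */
interface PageParser {
    ResultPage parse(String url);
}

final class PagedSearch {
    /** Follows "next page" links starting at firstUrl, visiting at most maxPages pages. */
    static List<String> collectAll(PageParser parser, String firstUrl, int maxPages) {
        List<String> all = new ArrayList<>();
        Optional<String> next = Optional.of(firstUrl);
        for (int i = 0; i < maxPages && next.isPresent(); i++) {
            ResultPage page = parser.parse(next.get());
            all.addAll(page.hits);
            next = page.nextPageUrl;
        }
        return all;
    }
}
```

The maxPages limit is an assumption added here so that a very common code fragment cannot trigger an unbounded crawl. A fake PageParser backed by a map is enough to exercise the loop without network access.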

The same must be done for Krugle code search [2].
Krugle has an additional option called "advanced search". It can
be used to search for large fragments of code.

With Google Code Search it is easy because Google provides a library to
access that service.

After some research I found a library that gives us the ability to
access these sites.
This tool is HTMLUnit [3].
It is a "GUI-Less browser for Java programs". It provides an API to access
any interesting information on a web page, even if the page uses a lot of
JavaScript.
With it I can already parse the Koders code search result page and read
code from Google Code Search (GWT is supported), which cannot
normally be retrieved through the gdata-codesearch API. The gdata-codesearch
API also does not support retrieving the language of a search result, but
with HTMLUnit it is possible.
I know of no other library that can parse GWT (Google Code Search) and
other JavaScript pages with this degree of success.
HTMLUnit is licensed under the Apache 2.0 license. It is already mavenized.
The only disadvantages of using this library in our code are its many
project dependencies and its name, but even though it is mainly used for
testing, it can also be used very well to retrieve information from the
web.
So I believe that using this library will help us to work with all
three parsers in a common way.
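
To illustrate what "a common way" could look like, here is a minimal sketch of a shared interface that the Koders, Krugle and Google Code Search parsers might all implement. The names (SearchHit, CodeSearchParser, ParserRegistry) are hypothetical and do not come from the apache-rat-pd code; how each parser loads and reads its pages is left out on purpose.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** A single match reported by a code search engine. */
final class SearchHit {
    final String engine;
    final String fileUrl;
    final String language;

    SearchHit(String engine, String fileUrl, String language) {
        this.engine = engine;
        this.fileUrl = fileUrl;
        this.language = language;
    }
}

/** Common contract that each engine-specific parser would implement. */
interface CodeSearchParser {
    String engineName();

    List<SearchHit> search(String codeFragment);
}

/** Runs one query against every registered parser, in registration order. */
final class ParserRegistry {
    private final Map<String, CodeSearchParser> parsers = new LinkedHashMap<>();

    void register(CodeSearchParser parser) {
        parsers.put(parser.engineName(), parser);
    }

    List<SearchHit> searchAll(String codeFragment) {
        List<SearchHit> hits = new ArrayList<>();
        for (CodeSearchParser parser : parsers.values()) {
            hits.addAll(parser.search(codeFragment));
        }
        return hits;
    }
}
```

With this shape the rest of the tool only depends on CodeSearchParser, so adding or replacing an engine does not affect the callers.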

What is your opinion on using HTMLUnit in the apache-rat-pd project?

On the apache-rat-pd project page there is a sample that uses HTMLUnit to
parse the Koders code search engine [4].

Best regards,
Marija

[1] http://www.koders.com/

[2] http://www.krugle.com/

[3] http://htmlunit.sourceforge.net/

[4] http://code.google.com/p/apache-rat-pd/downloads/list
