Date: Thu, 13 Aug 2009 10:20:02 +0400
Message-ID: <5b06ca0e0908122320j40af6ad5l1269c8292b8140db@mail.gmail.com>
In-Reply-To: <5b553b550908111522l7b713725odd17503e6106b5e2@mail.gmail.com>
Subject: Re: A way to implement Krugle and Koders code search parsers for apache-rat-pd
From: Egor Pasko
To: rat-dev@incubator.apache.org

2009/8/12 Marija Šljivović:
> Hi to all,
>
> I just wanted to share with you some of my notes about HTMLUnit.

thanks! ;)

>> I do not completely understand the problem being solved using this
>> library, and what kind of GWT support is discussed.
>
> The aim is to make apache-rat-pd work with the Koders.com,
> GoogleCodeSearch and Krugle.com code search engines.
>
> Unfortunately, only GoogleCodeSearch provides a library with an API
> for using the engine programmatically; Koders and Krugle don't
> provide anything like that. The GoogleCode search engine is very
> powerful and has great regex support, but the gdata-codesearch API
> is missing two things: a way to get the source file, and a way to
> get the language of the source file.

I do not understand why we need to know the language in the response.
We have a file, say Foo.java, and we want to find who has stolen parts
of it. The stolen parts are very unlikely to be in Python, C++,
Malbolge or any other language, so we won't even try to search in that
space.
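As an illustration, restricting the query to one language could be sketched roughly like this. The `lang:` operator and the feed URL are my assumptions about the public Code Search query syntax of the time; none of this is apache-rat-pd code.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: build a Google Code Search query restricted to the language
// guessed from the file extension. The lang: operator and the feed URL
// are assumptions, not part of any shipped API.
public class QuerySketch {
    static String buildQuery(String snippet, String fileName) {
        // Naive extension-to-language mapping; extend as needed.
        String lang = fileName.endsWith(".java") ? "java"
                    : fileName.endsWith(".py")   ? "python"
                    : null;
        String q = snippet + (lang != null ? " lang:" + lang : "");
        return "http://www.google.com/codesearch/feeds/search?q="
                + URLEncoder.encode(q, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("public static void main", "Foo.java"));
    }
}
```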
Let's restrict the search language to Java for Foo.java; Google Code
Search allows that. OK, the full source file is a problem, I agree.
However, that is a long-term problem that does not fit into SoC, IMHO.

> I found that HTMLUnit can easily do these missing things.
> GWT support matters only for GoogleCodeSearch; almost all of
> Google's sites are written using GWT.

To be honest, this is not true. Google products built on top of GWT
are a minority.

> GoogleCodeSearch too. HTMLUnit works great with GoogleCodeSearch. I
> was wondering if there are some other libraries which can do it so
> well.
>
>> Is this to later parse the entire file and analyse it using more
>> heavyweight heuristics than regexps? What kind of heuristics are they?
>
> If we have the source file loaded in our application, we can do any
> heuristic check we can imagine.
> Now, we can only ask GoogleCodeSearch whether something was found
> using a limited regex (and even this is much more freedom than the
> other search engines provide). The returned information is a list of
> matched code parts (single-line) and a link to a site where we can
> see the matched code. If we had the source file available for
> post-processing, we would be able to do, at least, a Levenshtein
> distance analysis and check whether only the names of identifiers
> were changed. We cannot do that now. We could then show matched code
> in our reporting tool without needing to view it on the Google Code
> Search site.

Thanks, now I understand your intentions better. But you should
understand that relying on a library that 'does everything' is hard
and time consuming. There will be various incompatibilities between
how HTMLUnit works and how browsers work. There is no good agreement
between browsers themselves about how they should work; each one
treats pages in its own, unique way. For example, what happens when
there is an illegal array access? Browsers try to prevent it and
silently ignore it, each in their own way, probably. Will HTMLUnit do
the same or just crash?
I have no idea. So, generally speaking, although the approach might
seem universal, in fact it is not.

>> My first concern is obviously the size of the dependency, the
>> example zip archive was about 7 MB, which is probably too much :)
>> Another concern is that waiting for javascript to be interpreted in
>> java is a very unreliable process.
>
> I totally agree. 7 MB is a large dependency size. I hope that we can
> find a better solution :(
>
> I think that using Krugle advanced search will be very difficult
> without using some library which supports JavaScript.

I had never touched Krugle before; I tried it today. Well, my opinion
might seem irrelevant and rude, but ... Krugle sucks! Really! After
entering the search query foo() I had to wait tens of seconds while
the javascript-heavy page loaded its banner. Wow, how "impressive"!
Loading the final results took an enormous amount of time .. like 40
seconds. All this time the javascript code was doing something,
sending some irrelevant data to my browser and failing with array
bounds checks (I noticed all this with Firebug). All this time
someone was scratching my disk intensively. Do they spy on me?
Creepy.

Yes, I agree, it is nearly impossible to parse Krugle results without
being able to execute javascript code. With Firebug I found the final
AJAX request that fetches the results (by the way, that request
fetches the results in less than 10 seconds, a fact that blew my
mind). The request contains magic keys in its parameters that were
obtained by some nontrivial javascript activity.

Look, these guys are completely crazy: they cook a uselessly slow
interface, and they fetch tons of irrelevant data with their buggy
code that fails all over the place. Do you believe they can actually
search for things? For code? I seriously doubt it. I would not even
be surprised to find out that a bunch of monkeys write code with your
keywords each time they get a request from you. I suggest we let them
go!

> HTMLUnit can do it.
> Parsing JavaScript is a slow process, but again it is much faster
> than watching it in a web browser :)

yeah, right :)

> So, the advantages of HTMLUnit are:
>
> - It can provide all the information we are interested in.
> - It supports all three code search engines well,

I'd say 'if we are lucky'

> - Code written for scraping data with HTMLUnit is more or less
>   readable and maintainable.
> - HTMLUnit is a stable project and has been very popular in recent
>   years [1].
> - It has an Apache license.
> - It is already mavenized.
>
> The disadvantages are:
>
> - HTMLUnit is very large.

I know an alternative approach: run an OS in a virtual machine, with
internet, browsers, etc. Take screenshots at certain points in time,
OCR them, and pick the code out of them. The only disadvantage is
that the stack is not completely Apache v.2. Yeah, and do not forget
about the artificial intelligence that would recognize the right
suspected code with 100% precision. That's pretty universal :)

> - It is used mainly to test web pages, not to extract information
>   from them.
> - There are probably other difficulties with using it which I have
>   not noticed so far...
>
> Because of that, and because it is the tool of choice when people
> need a web-site scraping API, I thought that we could use it in our
> tool. I also thought about the HTMLUnit size problem - we can make
> the parsers which use HTMLUnit optional parts of apache-rat-pd (some
> pluggable architecture can be used).
>
> Of course, we can search more and eventually find an alternative to
> this library, maybe some of these:
> http://www.manageability.org/blog/stuff/screen-scraping-tools-written-in-java
> or this: http://sourceforge.net/projects/nekohtml/

Thanks for the deep analysis, Marija. It is very cool that you
actively investigate possible uses of 3rd party libraries, write code
samples to try them, and argue about the pros and cons. This is all
very impressive and you should definitely continue doing that in your
professional career.
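By the way, the Levenshtein check proposed earlier in the thread could be sketched like this: a textbook dynamic-programming edit distance, plus a deliberately crude normalization pass that replaces identifier-like tokens with a placeholder so renamed variables do not count as differences. This is purely illustrative, not apache-rat-pd code.

```java
// Sketch of the Levenshtein idea from the thread: normalize identifiers,
// then compare edit distance. Illustrative only; a real checker would use
// a proper lexer instead of the regex below (which also replaces keywords).
public class SimilaritySketch {
    // Classic two-row dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    // Crude normalization: every identifier-like token (keywords included)
    // becomes "id", so only the code's shape is compared.
    static String normalize(String code) {
        return code.replaceAll("[A-Za-z_][A-Za-z0-9_]*", "id");
    }

    public static void main(String[] args) {
        String a = "int total = counter + 1;";
        String b = "int sum = ticks + 1;";
        // Renamed identifiers normalize to identical strings.
        System.out.println(levenshtein(normalize(a), normalize(b))); // prints 0
    }
}
```

A distance of 0 after normalization flags the "only identifiers were renamed" case; small nonzero distances could be reported as suspicious with a tunable threshold.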
In this specific case I am not against the tool, rather the approach.
To sum up: in the short term I do not see opportunities to improve
anything dramatically with this approach. In the long term it might
make sense to use super-duper javascript execution to scrape
websites. I am not an expert in this area, and I am afraid of many
unexpected difficulties :) You should consult a real web hacker :)

-- 
Egor Pasko