Date: Thu, 13 Aug 2009 10:20:02 +0400
Message-ID: <5b06ca0e0908122320j40af6ad5l1269c8292b8140db@mail.gmail.com>
In-Reply-To: <5b553b550908111522l7b713725odd17503e6106b5e2@mail.gmail.com>
Subject: Re: A way to implement Krugle and Koders code search parsers for apache-rat-pd
From: Egor Pasko
To: rat-dev@incubator.apache.org

2009/8/12 Marija Šljivović:
> Hi to all,
>
> I just wanted to share with you some of my notes about HTMLUnit.

thanks! ;)

>> I do not completely understand the problem being solved using this
>> library, and what kind of GWT support is discussed.
>
> The aim is to make apache-rat-pd work with the Koders.com,
> GoogleCodeSearch and Krugle.com code search engines.
>
> Unfortunately, only GoogleCodeSearch provides a library with an API
> for using the engine programmatically; Koders and Krugle don't
> provide anything like that. The GoogleCode search engine is very
> powerful and has great regex support, but the gdata-codesearch API
> is missing two things: a way to get the source file, and a way to
> get the language of the source file.

I do not understand why we need to know the language in the response.
We have a file, say Foo.java, and we want to find who has stolen parts
of it. The stolen parts are very unlikely to be in Python, C++,
Malbolge or any other language, so we won't even try to search in that
space.
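As an illustration, restricting the query to one language could be sketched roughly like this. The `lang:` operator and the feed URL are my assumptions about the public Code Search query syntax of the time; none of this is apache-rat-pd code.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: build a Google Code Search query restricted to the language
// guessed from the file extension. The lang: operator and the feed URL
// are assumptions, not part of any shipped API.
public class QuerySketch {
    static String buildQuery(String snippet, String fileName) {
        // Naive extension-to-language mapping; extend as needed.
        String lang = fileName.endsWith(".java") ? "java"
                    : fileName.endsWith(".py")   ? "python"
                    : null;
        String q = snippet + (lang != null ? " lang:" + lang : "");
        return "http://www.google.com/codesearch/feeds/search?q="
                + URLEncoder.encode(q, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("public static void main", "Foo.java"));
    }
}
```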
Let's restrict the search language to Java for Foo.java; Google Code
Search allows that. OK, the full source file is a problem, I agree.
However, that is a long-term problem that does not fit into SoC, IMHO.

> I found that HTMLUnit can easily do these missing things.
> GWT support matters only for GoogleCodeSearch; almost all of
> Google's sites are written using GWT.

To be honest, this is not true. Google products built on top of GWT
are a minority.

> GoogleCodeSearch too. HTMLUnit works great with GoogleCodeSearch. I
> was wondering if there are some other libraries which can do it so
> well.
>
>> Is this to later parse the entire file and analyse it using more
>> heavyweight heuristics than regexps? What kind of heuristics are they?
>
> If we have the source file loaded in our application, we can do any
> heuristic check we can imagine.
> Now, we can only ask GoogleCodeSearch whether something was found
> using a limited regex (and even this is much more freedom than the
> other search engines provide). The returned information is a list of
> matched code parts (single-line) and a link to a site where we can
> see the matched code. If we had the source file available for
> post-processing, we would be able to do, at least, a Levenshtein
> distance analysis and check whether only the names of identifiers
> were changed. We cannot do that now. We could then show matched code
> in our reporting tool without needing to view it on the Google Code
> Search site.

Thanks, now I understand your intentions better. But you should
understand that relying on a library that 'does everything' is hard
and time consuming. There will be various incompatibilities between
how HTMLUnit works and how browsers work. There is no good agreement
between browsers themselves about how they should work; each one
treats pages in its own, unique way. For example, what happens when
there is an illegal array access? Browsers try to prevent it and
silently ignore it, each in their own way, probably. Will HTMLUnit do
the same or just crash?
I have no idea. So, generally speaking, although the approach might
seem universal, in fact it is not.

>> My first concern is obviously the size of the dependency, the
>> example zip archive was about 7 MB, which is probably too much :)
>> Another concern is that waiting for javascript to be interpreted in
>> java is a very unreliable process.
>
> I totally agree. 7 MB is a large dependency size. I hope that we can
> find a better solution :(
>
> I think that using Krugle advanced search will be very difficult
> without using some library which supports JavaScript.

I had never touched Krugle before; I tried it today. Well, my opinion
might seem irrelevant and rude, but ... Krugle sucks! Really! After
entering the search query foo() I had to wait tens of seconds while
the javascript-heavy page loaded its banner. Wow, how "impressive"!
Loading the final results took an enormous amount of time .. like 40
seconds. All this time the javascript code was doing something,
sending some irrelevant data to my browser and failing with array
bounds checks (I noticed all this with Firebug). All this time
someone was scratching my disk intensively. Do they spy on me?
Creepy.

Yes, I agree, it is nearly impossible to parse Krugle results without
being able to execute javascript code. With Firebug I found the final
AJAX request that fetches the results (by the way, that request
fetches the results in less than 10 seconds, a fact that blew my
mind). The request contains magic keys in its parameters that were
obtained by some nontrivial javascript activity.

Look, these guys are completely crazy: they cook a uselessly slow
interface, and they fetch tons of irrelevant data with their buggy
code that fails all over the place. Do you believe they can actually
search for things? For code? I seriously doubt it. I would not even
be surprised to find out that a bunch of monkeys write code with your
keywords each time they get a request from you. I suggest we let them
go!

> HTMLUnit can do it.
> Parsing JavaScript is a slow process, but again it is much faster
> than watching it in a web browser :)

yeah, right :)

> So, the advantages of HTMLUnit are:
>
> - It can provide all the information we are interested in.
> - It supports all three code search engines well,

I'd say 'if we are lucky'

> - Code written for scraping data with HTMLUnit is more or less
>   readable and maintainable.
> - HTMLUnit is a stable project and has been very popular in recent
>   years [1].
> - It has an Apache license.
> - It is already mavenized.
>
> The disadvantages are:
>
> - HTMLUnit is very large.

I know an alternative approach: run an OS in a virtual machine, with
internet, browsers, etc. Take screenshots at certain points in time,
OCR them, and pick the code out of them. The only disadvantage is
that the stack is not completely Apache v.2. Yeah, and do not forget
about the artificial intelligence that would recognize the right
suspected code with 100% precision. That's pretty universal :)

> - It is used mainly to test web pages, not to extract information
>   from them.
> - There are probably other difficulties with using it which I have
>   not noticed so far...
>
> Because of that, and because it is the tool of choice when people
> need a web-site scraping API, I thought that we could use it in our
> tool. I also thought about the HTMLUnit size problem - we can make
> the parsers which use HTMLUnit optional parts of apache-rat-pd (some
> pluggable architecture can be used).
>
> Of course, we can search more and eventually find an alternative to
> this library, maybe some of these:
> http://www.manageability.org/blog/stuff/screen-scraping-tools-written-in-java
> or this: http://sourceforge.net/projects/nekohtml/

Thanks for the deep analysis, Marija. It is very cool that you
actively investigate possible uses of 3rd party libraries, write code
samples to try them, and argue about the pros and cons. This is all
very impressive and you should definitely continue doing that in your
professional career.
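By the way, the Levenshtein check proposed earlier in the thread could be sketched like this: a textbook dynamic-programming edit distance, plus a deliberately crude normalization pass that replaces identifier-like tokens with a placeholder so renamed variables do not count as differences. This is purely illustrative, not apache-rat-pd code.

```java
// Sketch of the Levenshtein idea from the thread: normalize identifiers,
// then compare edit distance. Illustrative only; a real checker would use
// a proper lexer instead of the regex below (which also replaces keywords).
public class SimilaritySketch {
    // Classic two-row dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    // Crude normalization: every identifier-like token (keywords included)
    // becomes "id", so only the code's shape is compared.
    static String normalize(String code) {
        return code.replaceAll("[A-Za-z_][A-Za-z0-9_]*", "id");
    }

    public static void main(String[] args) {
        String a = "int total = counter + 1;";
        String b = "int sum = ticks + 1;";
        // Renamed identifiers normalize to identical strings.
        System.out.println(levenshtein(normalize(a), normalize(b))); // prints 0
    }
}
```

A distance of 0 after normalization flags the "only identifiers were renamed" case; small nonzero distances could be reported as suspicious with a tunable threshold.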
In this specific case I am not against the tool, rather the approach.
To sum up: in the short term I do not see opportunities to improve
anything dramatically with this approach. In the long term it might
make sense to use super-duper javascript execution to scrape
websites. I am not an expert in this area, and I am afraid of many
unexpected difficulties :) You should consult a real web hacker :)

-- 
Egor Pasko