Mailing-List: contact vxquery-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: vxquery-dev@incubator.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Message-ID: <5164B499.20205@yahoo.com>
Date: Tue, 09 Apr 2013 17:38:49 -0700
From: Vinayak Borkar <vborky@yahoo.com>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7;
 rv:17.0) Gecko/20130328 Thunderbird/17.0.5
MIME-Version: 1.0
To: vxquery-dev@incubator.apache.org
Subject: Re: [Shepherd] Checking In With A Challenge
References: <E91AA032-DC42-4DC9-AB69-50AE0C0906E2@comcast.net>
In-Reply-To: <E91AA032-DC42-4DC9-AB69-50AE0C0906E2@comcast.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Hi Dave,

The goal of VXQuery is to provide a parallel XML data processing system 
based on the XQuery language (http://www.w3.org/TR/xquery/). VXQuery 
achieves high performance by using lots of commodity machines to execute 
parts of a query in parallel, much like Hive does with SQL queries. In 
contrast with systems like Hive and Pig, VXQuery is built upon a more 
flexible data-parallel runtime platform called Hyracks 
(http://hyracks.googlecode.com).

To process large amounts of EDGAR data, one would start by distributing 
the XML files across the disks of the machines running VXQuery, placing 
them in the same folder on each machine (say, /data/edgar).

Q1:
The following query would then count the total number of XML files 
across all the machines:

count(collection('/data/edgar'))


Q2:
To answer your question about finding the 10-Ks for a company, sorted 
newest to oldest, you would write the following (I am making up the 
field names, but EDGAR contains equivalent fields in each document):


for $doc in collection('/data/edgar')
where $doc/documentType = '10-K'
   and $doc/company = 'ACME Inc.'
order by xs:dateTime($doc/filingDate) descending
return $doc


In the upcoming release of VXQuery, the only technique for speeding up 
queries will be parallel scanning: each machine scans its own data, 
filters and aggregates the results locally, and the partial results 
from all the machines are then combined.
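
As a rough sketch of that scan-locally-then-combine strategy, here is 
a toy version in plain Python (not XQuery, since this is about the 
runtime rather than the query). Everything in it is an assumption made 
for illustration: the in-memory lists stand in for the /data/edgar 
folder on each machine, threads stand in for machines, and none of the 
names are VXQuery or Hyracks APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for the files stored in
# /data/edgar on one machine. Field names are made up, as in Q2.
PARTITIONS = [
    [{"documentType": "10-K", "company": "ACME Inc."},
     {"documentType": "10-Q", "company": "ACME Inc."}],
    [{"documentType": "10-K", "company": "Other Corp."}],
    [{"documentType": "10-K", "company": "ACME Inc."}],
]


def local_count(docs, doc_type=None):
    """Scan one machine's documents, filter them, return a partial count."""
    return sum(1 for d in docs
               if doc_type is None or d["documentType"] == doc_type)


def parallel_count(partitions, doc_type=None):
    """Run the local scans concurrently, then combine the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda docs: local_count(docs, doc_type),
                                 partitions))
    return sum(partials)


total_files = parallel_count(PARTITIONS)        # analogue of Q1
total_10k = parallel_count(PARTITIONS, "10-K")  # Q1 with a filter added
```

Only the small partial counts cross machine boundaries here; the 
documents themselves never leave the machine that stores them.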

One of the projects planned for the future (hopefully the next release) 
is to build an index locally on each machine over various fields of the 
documents, so that Q2 can be answered even more quickly by performing 
an index lookup instead of scanning all the files.
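
To give a feel for what such a local index would buy, here is a toy 
sketch in plain Python. It is only an illustration under made-up 
assumptions: the dictionary-based index, the function names, and the 
sample documents are all hypothetical, and the actual index design for 
VXQuery has not been decided.

```python
from collections import defaultdict
from datetime import datetime


def build_local_index(docs, fields):
    """Map each combination of field values to positions in one machine's docs."""
    index = defaultdict(list)
    for pos, doc in enumerate(docs):
        index[tuple(doc[f] for f in fields)].append(pos)
    return index


def lookup(docs, index, key):
    """Fetch the matching documents without scanning the whole list."""
    return [docs[pos] for pos in index.get(key, [])]


def q2_with_index(partitions, indexes, doc_type, company):
    """Look up matches on every machine, then sort newest to oldest."""
    hits = []
    for docs, index in zip(partitions, indexes):
        hits.extend(lookup(docs, index, (doc_type, company)))
    return sorted(hits,
                  key=lambda d: datetime.fromisoformat(d["filingDate"]),
                  reverse=True)


# Hypothetical documents on two machines; field names are made up, as in Q2.
PARTITIONS = [
    [{"documentType": "10-K", "company": "ACME Inc.",
      "filingDate": "2012-03-01T00:00:00"},
     {"documentType": "10-Q", "company": "ACME Inc.",
      "filingDate": "2012-08-01T00:00:00"}],
    [{"documentType": "10-K", "company": "ACME Inc.",
      "filingDate": "2013-03-01T00:00:00"},
     {"documentType": "10-K", "company": "Other Corp.",
      "filingDate": "2013-02-15T00:00:00"}],
]

# Each machine builds its index once, ahead of query time.
INDEXES = [build_local_index(docs, ("documentType", "company"))
           for docs in PARTITIONS]

results = q2_with_index(PARTITIONS, INDEXES, "10-K", "ACME Inc.")
```

Each machine pays the cost of building its index once, and after that 
a query like Q2 touches only the matching documents on that machine.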


Hope that gives you a glimpse into XML query processing with VXQuery.


Vinayak

On 4/8/13 5:00 PM, Dave Fisher wrote:
> Hi VXQuery Devs,
>
> I'm your volunteer incubator shepherd for this report. I can see from your activity and report that you seem to again have a need for mentors.
>
> I signed up in part because I was intrigued by the comments regarding querying the filings at sec.gov managed by Edgar. I know a bit about the structure and challenges of this data. XBRL, HTML and PDF documents contain the accounting data with some great variation, etc.
>
> Tell me how VXQuery would provide a quick, performant solution for a query like latest 10-K, Total Revenue, All US Companies? Or all 13D filings of >5% ownership of US companies in the last year?
>
> Regards,
> Dave
>