lucene-lucene-net-user mailing list archives

From Shashi Kant <sk...@sloan.mit.edu>
Subject Re: Lucene .Net for real-time full-text search
Date Fri, 11 Jun 2010 14:58:44 GMT
I think you have two different tasks here:

1) Parse the HTML financial reports and extract the data elements in them.
For parsing, I would highly recommend HtmlAgilityPack [1].
You can parse the HTML and load the results into a database for querying. If
you get 200-300 files, you can process them in parallel, with multiple threads
or on several servers, to speed things up.
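
To make task 1 concrete, here is a minimal sketch of the extraction step. It
is written in Python with the standard-library html.parser rather than
HtmlAgilityPack, purely for illustration. The report layout (a table whose
first column names the indicator), the sample data, and the names
TableRowParser, extract_indicators, and extract_many are all assumptions, not
taken from the actual filings.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_indicators(html, wanted):
    """Return {indicator: value} for rows whose first cell is in `wanted`.

    Assumes (hypothetically) that each indicator sits in a two-column row.
    """
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    found = {}
    for row in parser.rows:
        if len(row) >= 2 and row[0].lower() in wanted:
            found[row[0].lower()] = row[1]
    return found


def extract_many(reports, wanted, workers=4):
    """Process several reports concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: extract_indicators(r, wanted), reports))


# Made-up sample report used only to exercise the sketch.
SAMPLE_REPORT = """
<html><body><table>
<tr><th>Indicator</th><th>Q1 2010</th></tr>
<tr><td>Sales</td><td>1 234</td></tr>
<tr><td>Gross margin</td><td>41.5%</td></tr>
<tr><td>Operating income</td><td>210</td></tr>
</table></body></html>
"""

if __name__ == "__main__":
    print(extract_indicators(SAMPLE_REPORT, {"sales", "operating income"}))
```

In Python a thread pool mainly helps when the work is I/O-bound, such as
reading the files; for CPU-bound parsing, separate processes or servers are
likely the better fit. Real filings will probably need layout-specific rules
per company.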

2) Search the contents of the filings to find the appropriate
filing(s) and then the related data. For this purpose you can use
Lucene.Net, although my suggestion would be Solr, since it abstracts away
several of the issues involved in building a search server.
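
As a rough illustration of task 2, a search against Solr is just an HTTP
request to its select handler. The sketch below only builds such a request
URL. The base URL, the core name "filings", and the field names content,
company, and period are hypothetical and would depend on your own schema; q,
fq, rows, and wt are standard Solr query parameters.

```python
from urllib.parse import urlencode


def _phrase(field, value):
    """Build a field:"phrase" clause, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def build_solr_query(base_url, company, period, indicator, rows=10):
    """Return a select URL that searches filing text for `indicator`,
    filtered by company and period (all field names are assumptions)."""
    params = {
        "q": _phrase("content", indicator),
        "fq": [_phrase("company", company), _phrase("period", period)],
        "rows": rows,
        "wt": "json",
    }
    # doseq=True emits one fq parameter per filter clause.
    return f"{base_url}/select?{urlencode(params, doseq=True)}"


if __name__ == "__main__":
    print(build_solr_query(
        "http://localhost:8983/solr/filings", "Example AB", "2010-Q1", "sales"
    ))
```

Using filter queries (fq) for company and period keeps the main query focused
on the text match, and Solr can cache the filters separately.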

Hope that helps
Shashi

[1] http://htmlagilitypack.codeplex.com/

On Fri, Jun 11, 2010 at 10:02 AM, Lidia Rozhentsova
<Lidia.Rozhentsova@direkt.se> wrote:

> Hi!
>
> My name is Lidia. I’m currently looking for a search engine to build an
> application for the Swedish financial news agency Direkt.se.
>
> My goal is to find a search engine that allows real-time full-text
> search. Briefly, the business process that requires such a solution is:
>
>    1. Different companies announce that they will publish particular
>    financial information at a particular date and time. This information
>    usually consists of company name, financial period, and financial
>    indicator (sales, gross margin, operating income).
>    2. At that date and time we receive an HTML file with the financial
>    report (I attached an example of such a file).
>    3. In the received file we have to find the information that was
>    described in the first step, for example what sales the company had in
>    the first quarter of 2010.
>
> We can have up to 100-200 files at one time, and we have to find the
> information we’re interested in ASAP, since time is extremely critical for
> a news agency. So we don’t have time for indexing files.
>
> I’ve read that Lucene supports near real-time search starting from version
> 2.9, but I’m not sure how fast it will be for the task I’ve described.
> Also, my company is interested in Microsoft technologies, which is why I’m
> writing to the .NET community.
>
> Could you please clarify whether Lucene is capable of supporting the task
> I described, or give me a link where I can read about it?
>
> Thank you very much for your assistance!
>
> Best regards,
>
> *Lidia Rozhentsova*
> Developer, Nyhetsbyrån Direkt
