Friday, February 29, 2008

Project idea: structured financial data

As they are, things are pretty good for everyday investors on Main Street. The United States Securities and Exchange Commission makes the filings of public companies readily available on its website. If someone were interested in researching a company thoroughly, pretty much all the information is there: how much debt a company is carrying around, what kind of revenues it's pulling in, what the company's assets are worth, and so on.

So, things are pretty good, but they could be a lot better.

Now, I say "investors" and not "traders" because traders will typically need more than just information from the SEC. It's also a clear distinction that will limit the scope of the project I'm about to propose so that it's very clear what will be needed (and, more importantly, what will not).

The current situation

All this data is in HTML, which is great for human readability. Right now, that's exactly what everyone is doing: reading the filings one by one and poring over them. This is probably because there are a lot of loose, interesting facts specific to each company's way of reporting things. But among all that loose, unstructured data are structured tables containing information like a company's balance sheet, its cash flow, and its earnings.

For a stock market related project my team and I are doing, such data in a structured format would go a long way. Currently, there are data providers who would provide this kind of thing for a monstrously large monthly fee, perhaps because they bundle it up with other things. But this is public data. Nobody should have to pay a fee for it.

I personally tried to write a parser to automatically pull this data, but what a human being can deduce at a quick glance, a programmed parser has trouble with. I was reduced to entering some data by hand, which was easy but really tedious; with around 8,000 companies to deal with, it would take me a couple of years, full-time. That is simply not feasible.
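To make the difficulty concrete, here is a minimal scraping sketch using only Python's standard library. The HTML fragment and line-item labels are invented for illustration; real filings vary wildly in table layout, which is exactly what makes a general parser hard.

```python
from html.parser import HTMLParser

class RowExtractor(HTMLParser):
    """Collects the text of each <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows = []        # list of rows, each a list of cell strings
        self._cells = None    # cells of the row currently being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td":
            self._in_cell = True
            self._cells.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._cells:
            self.rows.append(self._cells)
            self._cells = None
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._cells:
            self._cells[-1] += data.strip()

# A toy fragment standing in for one filing's balance-sheet table.
fragment = """
<table>
  <tr><td>Total assets</td><td>1,234,567</td></tr>
  <tr><td>Total liabilities</td><td>890,123</td></tr>
</table>
"""

parser = RowExtractor()
parser.feed(fragment)
balance = {label: int(value.replace(",", "")) for label, value in parser.rows}
print(balance["Total assets"])  # 1234567
```

Even this toy version bakes in assumptions (two cells per row, commas as the only formatting) that break the moment one company indents its labels, nests tables, or reports in thousands.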

Such a task would be, in my opinion, well-suited for being built and maintained by an open community. It's a big job — not particularly hard, just a lot of work. It doesn't make sense for any one company to feed these in; the ones that have this data are charging high rates for it. Everyone should be able to benefit from it, so contributors should include people from various organizations.

A proposed solution

The best approach I can think of would be for me to take the lead on the project initially. There would be a project website running a simple web application. Since the number of companies is known, there would be a frequently updated status report saying what percentage of the task is done. For example, the front page could say "3000 out of 7500 companies with complete data" or "400 companies with incomplete data." To ensure the project's open nature, there would be a daily or weekly dump of the database. That way, no one party would control the data completely, and the community would have a reasonably complete backup.
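The status report itself would be trivial to generate. A sketch, with hypothetical counts (the real total would come from the SEC's list of filers):

```python
def status_line(complete, incomplete, total):
    """Render the front-page progress summary from three counts."""
    pct = 100 * complete // total
    return (f"{complete} out of {total} companies with complete data ({pct}%), "
            f"{incomplete} companies with incomplete data")

print(status_line(3000, 400, 7500))
# → 3000 out of 7500 companies with complete data (40%), 400 companies with incomplete data
```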

It is essentially a massive data entry project. Not so fun, but for the participants there is potential fame and fortune as well as a strong mission. Once the data is available, programmers and financial analysts could graph figures that are now buried in tables and compare multiple companies in a matter of seconds, rather than the minutes, hours, or even days the current process takes.

The fields that would be stored would be standard items on a company's balance sheet, earnings statement, and cash flow statement. Eventually, if further patterns can be found, more fields would be added. There is a much more ambitious initiative which attempts to address the problem of all the fields, which I will discuss below.
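As a sketch of what one record might look like, here is a hypothetical schema; the field names are illustrative, not a proposed standard, and a real schema would follow the standard line items on each statement:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnualFigures:
    """One fiscal year of reported figures for one company.

    Optional fields stay None until a contributor enters them, which
    also makes 'incomplete data' easy to detect for the status report.
    """
    ticker: str
    fiscal_year: int
    # Balance sheet
    total_assets: Optional[int] = None
    total_liabilities: Optional[int] = None
    # Earnings statement
    revenue: Optional[int] = None
    net_income: Optional[int] = None
    # Cash flow statement
    operating_cash_flow: Optional[int] = None

row = AnnualFigures(ticker="XYZ", fiscal_year=2007, revenue=5_000_000)
print(row.revenue)  # 5000000
```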

The XBRL alternative

Right now, there's a major push underway to address this exact problem: XBRL, which stands for eXtensible Business Reporting Language. I have looked into it several times and found it characterized by slow progress and a quiet reluctance from many of the parties involved. Companies in the XBRL test program are few and far between. A lot of big names in financial research are entrenched, and their revenue depends on keeping this structured data proprietary. It should be noted that if XBRL somehow succeeds, the open community project I'm proposing would not be needed at all.

The eventual XBRL specification used for public companies' SEC filings would have structured fields covering all the data reported, but the learning curve of the new software tools that report preparers would have to adopt could be a huge problem.
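For readers who haven't seen it, an XBRL instance is just XML: each reported figure becomes a tagged "fact" that software can pull out directly, with no table scraping. A heavily simplified sketch, using Python's standard library; the namespace and element names here are stand-ins, and real instances also carry contexts, units, and taxonomy references:

```python
import xml.etree.ElementTree as ET

# A toy, heavily simplified XBRL-style instance fragment.
instance = """<xbrl xmlns:us-gaap="http://example.com/us-gaap">
  <us-gaap:Assets contextRef="FY2007" unitRef="USD">1234567</us-gaap:Assets>
  <us-gaap:Liabilities contextRef="FY2007" unitRef="USD">890123</us-gaap:Liabilities>
</xbrl>"""

root = ET.fromstring(instance)
# ElementTree expands each tag to "{namespace}LocalName"; keep the local name.
facts = {child.tag.split("}", 1)[1]: int(child.text) for child in root}
print(facts["Assets"])  # 1234567
```

Compare this with the HTML-scraping sketch earlier: the value is addressed by a stable tag name rather than guessed from table position, which is the whole promise of the format.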

If there are any interested readers out there, please leave a comment and let me know what you think. This idea has been in my head for a couple of weeks now and I'm certain I left some gaping holes in my discussion of what I have in mind for this project.


Steven Loi said...

I just wanted to let you know I reblogged this entry onto my website to gain further reach.

jgardner said...

I wanted to let you know we have already developed an XBRL parsing engine called iParser. With countries such as China, Singapore, and many others around the world having already mandated this new format, the pending SEC mandate should not be far behind.

Simply stated, our solution “parses” the XBRL documents into an organized data store for ready consumption:

NeoClarus iParser Capabilities:
o Stages, parses, and validates multiple XBRL instances as a whole, or embedded within an XML/HTML DOM, into a query-ready relational database
o Automatically identifies XBRL linkbases referenced locally or over the Internet
o Automatically recognizes and parses newly added instances
o Pinpoints any semantic, syntax, and parsing errors, describing them in an error log
o Creates a relational database backward compatible with instance changes and downstream compatible with Oracle, MS SQL Server, MySQL, and IBM DB2
NeoClarus iParser Benefits:
o Automated batch parsing reduces manual effort and speeds instance database creation
o Built-in validation and error reporting eliminates errors prior to data publishing
o Builds a scalable instance database for reporting or incorporation into other common databases
o Easily incorporates into existing data and reporting infrastructures
o Serves XBRL taxonomies of all participating countries (e.g., Chinese, Japanese, Spanish, US-GAAP, etc.)

I look forward to speaking with you further to see how we could work together. Please contact me at 408-858-4994 or


John Gardner
Vice President, Business Development
