Finance Crawler (Part 1)


Jason Kwok's Finance Crawler #1

Ever since moving to Seattle, I have been meeting a lot of awesome people who are deeply involved in the technology scene. This can largely be attributed to the influence of Microsoft and Amazon, both of which employ a huge tech-interested and tech-fluent workforce. Compared to Los Angeles, this has been a refreshing change of company. More importantly, it has been a revitalizing reminder to work on some projects that have been on the back burner.

I finally created a semi-functional financial web crawler that organizes some very elementary financial data and stores it in a MySQL table. It works, but it doesn’t work perfectly, and there is a ton of stuff I would love to get input on.

Current Status
How it works: My crawler pulls the HTML from one big page of Yahoo Finance, stores all the industries in an array, and dives into the industry pages one by one. On each industry page, it pulls all the companies in that industry and stores their ticker symbols. For each ticker symbol, it then goes to that company’s page and pulls company-specific information. This was necessary because I had no idea what the current stocks are, what their ticker symbols are, or what industry they belong to. Now that I have a record of a large number of ticker symbols, I plan to store the ticker symbols, basic company information, and industry information in a separate table, and draw from that table in order to make sure no stocks are missed. It doesn’t make sense to draw from an unordered list of stocks on Yahoo Finance’s page every single time. Computer science folks, do you have any thoughts about how to structure this, or whether this is even a good idea?
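As a rough sketch of the structure described above, the crawl can be split into a discovery pass that builds the master list and a daily pass that walks that list in a fixed order. Everything here is hypothetical: the page contents in `FAKE_PAGES`, the link patterns, and the names `discover_tickers` and `crawl_companies` are stand-ins, not the real Yahoo Finance markup or the actual crawler code.

```python
import re

# Hypothetical stand-in pages; the real crawler would download these.
FAKE_PAGES = {
    "/industries": '<a href="/industry/1">Software</a><a href="/industry/2">Banks</a>',
    "/industry/1": '<a href="/q?s=AAA">Alpha</a><a href="/q?s=BBB">Beta</a>',
    "/industry/2": '<a href="/q?s=CCC">Gamma</a>',
}

# Assumed link formats, chosen only to match the fake pages above.
INDUSTRY_LINK = re.compile(r'<a href="(/industry/\d+)">([^<]+)</a>')
TICKER_LINK = re.compile(r'<a href="/q\?s=([A-Z.\-]+)">([^<]+)</a>')


def discover_tickers(fetch, index_url="/industries"):
    """Discovery pass: index page -> industry pages -> {ticker: (company, industry)}."""
    master = {}
    for industry_url, industry in INDUSTRY_LINK.findall(fetch(index_url)):
        for ticker, company in TICKER_LINK.findall(fetch(industry_url)):
            master[ticker] = (company, industry)
    return master


def crawl_companies(master, fetch_company):
    """Daily pass: visit every ticker in the stored master list, in sorted order."""
    return {ticker: fetch_company(ticker) for ticker in sorted(master)}


# Here `fetch` is just a dictionary lookup; in practice it would download a page.
master_list = discover_tickers(FAKE_PAGES.__getitem__)
```

With this split, the discovery pass only needs to run occasionally to pick up new listings, while the daily pass never depends on the order of the index page.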

Financial data: Right now my crawler looks at only two pieces of data: the P/E (price-to-earnings) ratio of the industry and the P/E ratio of the company. I would love to expand this to the point where I can actually analyze a company’s financial data and make some educated decisions. According to my current crawler, and assuming that a P/E lower than the industry’s necessarily means the company is undervalued, the best stock to buy is China Nutrifruit Group Limited (ticker symbol: CNGL) at a price-to-earnings ratio of .06! Interestingly enough, under that same definition, 12 of the top 36 stocks have “China” in their name. Are Chinese companies inherently sketchier? I would love to understand the qualitative reasoning behind these valuations and why so many Chinese companies are either undervalued or about to die. Bankers and finance-eers: what other data should I look at? What else would be super interesting to draw together? Remember, this doesn’t need to be limited to specific, public financial information. My goal is to accumulate other bits of data from other websites in an attempt to make rational predictions of trends.
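The ranking rule above (company P/E divided by industry P/E, lowest first) can be sketched as follows. The tickers and numbers in `sample` are made up for illustration, and the function names are hypothetical. One design choice worth noting: missing or non-positive values are skipped, because a P/E that was stored as 0 when it was really missing would otherwise float to the top of the list.

```python
def relative_pe(company_pe, industry_pe):
    """Return company P/E divided by industry P/E, or None if not meaningful."""
    if company_pe is None or industry_pe is None:
        return None
    if company_pe <= 0 or industry_pe <= 0:
        return None  # zero or negative values would distort the ranking
    return company_pe / industry_pe


def rank_by_relative_pe(rows):
    """rows: iterable of (ticker, company_pe, industry_pe). Lowest ratio first."""
    scored = []
    for ticker, company_pe, industry_pe in rows:
        ratio = relative_pe(company_pe, industry_pe)
        if ratio is not None:
            scored.append((ratio, ticker))
    return [ticker for ratio, ticker in sorted(scored)]


# Made-up sample rows: (ticker, company P/E, industry P/E)
sample = [
    ("AAA", 10.0, 20.0),
    ("BBB", 0.06, 15.0),
    ("CCC", None, 15.0),   # missing P/E is skipped, not treated as 0
    ("DDD", 30.0, 20.0),
]
ranking = rank_by_relative_pe(sample)
```

A very low ratio only says the number is small relative to the industry; whether that reflects a bargain, a data error, or a company in trouble is a separate question.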

Regular Expressions: My regular expressions do not cover all cases, and fail to properly extract information for 413 of the 969 stocks in the sample I have attached here. Also, some of the “nulls” were stored as 0s before I changed the code to store “null.” A missing value could mean either that P/E doesn’t actually apply to the stock or asset type, or that my regular expression just fails to cover all cases. My code for P/E is comp_price_e = re.findall('(Trailing P/E \(ttm, intraday\): )([0-9]*)(\.?)([0-9]*)', urllib.urlopen(''+comp_specificpage[0][1]+'+Key+Statistics').read()), where comp_specificpage[0][1] is the ticker symbol. How do I fix this? Or, at a more basic level, does anyone have any clever ways of troubleshooting this besides manually visiting the pages that failed and investigating the problems?
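One possible way to make the pattern more tolerant, and to separate "the page says N/A" from "the pattern did not match", is sketched below. The real page markup is not shown in this post, so the assumptions here are guesses: that HTML tags may sit between the label and the value, that large values may contain commas, and that missing values appear as "N/A". The names `PE_PATTERN` and `parse_trailing_pe` are hypothetical.

```python
import re

PE_PATTERN = re.compile(
    r"Trailing P/E \(ttm, intraday\):"   # the label from the original pattern
    r"(?:\s|<[^>]*>)*"                    # skip whitespace and any HTML tags (assumed)
    r"(N/A|-?[0-9][0-9,]*(?:\.[0-9]+)?|-?\.[0-9]+)"  # N/A, 1,234.5, 41.19 or .06
)


def parse_trailing_pe(html):
    """Return ('ok', float), ('n/a', None) or ('no_match', None)."""
    match = PE_PATTERN.search(html)
    if match is None:
        return ("no_match", None)   # pattern problem: worth inspecting the page
    raw = match.group(1)
    if raw == "N/A":
        return ("n/a", None)        # the page itself reports no P/E
    return ("ok", float(raw.replace(",", "")))


# Made-up snippets standing in for downloaded pages.
examples = [
    "Trailing P/E (ttm, intraday): 41.19",
    'Trailing P/E (ttm, intraday):</td><td class="x">1,234.50</td>',
    "Trailing P/E (ttm, intraday): N/A",
    "Trailing P/E (ttm, intraday): .06",
    "page without the label",
]
results = [parse_trailing_pe(text) for text in examples]
```

Storing the status next to each ticker makes troubleshooting less manual: only the `no_match` pages need to be opened, and saving the raw HTML of a few of them shows what the pattern is missing.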

Asset types: What asset types exist out there that I should be aware of? Right now, my crawler assumes that every ticker belongs to a profit-generating company. I don’t know what happens when it gets to asset types such as commodities.

Missing Data: I tried to query data for APPL from my list, and came up blank. In fact, a large number of really well-known companies, such as INTC, also don’t show up in the list of companies that I crawled. I don’t know where I am going wrong, but I suspect that my occasionally faulty internet connection is interrupting the script, rather than that my regular expressions are incomplete. This actually leads to my biggest question right now. Does anyone know of an API that I can use to draw tons of financial information easily? While reinventing the wheel can be very educational, there has got to be a better way of gathering this data.
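If a dropped connection is the cause, one failed download should not end the whole run. Below is a minimal sketch, written for Python 3's `urllib.request` rather than the Python 2 `urllib.urlopen` call quoted earlier, that retries each page a few times and records the URLs that still fail. The function names, retry count, and delays are illustrative assumptions, not tuned values.

```python
import time
import urllib.request


def fetch_with_retry(url, attempts=3, delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the page body as bytes, or None after `attempts` failures."""
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=10) as response:
                return response.read()
        except OSError:  # covers urllib.error.URLError and socket timeouts
            if attempt < attempts:
                sleep(delay * 2 ** (attempt - 1))  # wait 1s, 2s, 4s, ...
    return None


def crawl_pages(urls, fetch=fetch_with_retry):
    """Fetch every URL; return (pages, failed) so gaps are visible afterwards."""
    pages, failed = {}, []
    for url in urls:
        body = fetch(url)
        if body is None:
            failed.append(url)   # keep going instead of aborting the run
        else:
            pages[url] = body
    return pages, failed
```

Comparing the `failed` list against the tickers missing from the table would show whether the gaps come from the network or from the parsing step.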

Storing numbers into MySQL: This one was a little tricky to get around. My regular expression call, re.findall(), returns the numbers as strings. I had to convert them to numbers while preserving the decimal part, and I found that converting them to integers rounded the values. To get around this, every number is currently stored multiplied by 100, so a P/E ratio of 41.19 is actually stored as 4119. Computer science folks: is this the proper way of doing things?
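Scaled integers are a legitimate technique, but an alternative worth considering is to keep the values as exact decimals. Python's `decimal.Decimal` parses the matched string without binary floating-point error, and MySQL has a fixed-point `DECIMAL` column type (for example `DECIMAL(10,2)`) that stores such values exactly. The sketch below shows both representations; the helper name `to_hundredths` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_hundredths(value):
    """Scaled-integer form of a Decimal, rounded to two places: 41.19 -> 4119."""
    scaled = (value * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return int(scaled)


# The regex returns strings; Decimal parses them exactly.
pe = Decimal("41.19")
tiny = Decimal(".06")

as_scaled_int = to_hundredths(pe)      # what the current table stores
back_to_decimal = Decimal(as_scaled_int) / 100
```

The scaled-integer approach works as long as every reader of the table remembers the factor of 100; a `DECIMAL` column avoids that hidden convention.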

MySQL backend: I currently store my data in one table, with columns for the date, the industry, the industry URL, the company, and the respective company information. This is something that I would LOVE to hear a computer scientist’s perspective on. What is the best way to store this data? Assume there is one master list of companies, plus daily entries of thousands of data points for the price, the earnings, and so on.
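One common way to organize this kind of data is to separate the slowly changing master lists (industries, companies) from the daily measurements. The sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server; the table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE industries (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    url  TEXT
);
CREATE TABLE companies (
    ticker      TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    industry_id INTEGER NOT NULL REFERENCES industries(id)
);
CREATE TABLE daily_metrics (
    ticker      TEXT NOT NULL REFERENCES companies(ticker),
    as_of       TEXT NOT NULL,          -- ISO date, one row per ticker per day
    trailing_pe NUMERIC,                -- NULL when the page reports no P/E
    PRIMARY KEY (ticker, as_of)
);
""")

# Made-up sample rows.
conn.execute("INSERT INTO industries (id, name) VALUES (1, 'Software')")
conn.execute("INSERT INTO companies VALUES ('AAA', 'Alpha Corp', 1)")
conn.executemany(
    "INSERT INTO daily_metrics VALUES (?, ?, ?)",
    [("AAA", "2011-01-03", 41.19), ("AAA", "2011-01-04", None)],
)

# Join the daily rows back to the master lists when reporting.
rows = conn.execute("""
    SELECT c.ticker, i.name, m.as_of, m.trailing_pe
    FROM daily_metrics m
    JOIN companies c ON c.ticker = m.ticker
    JOIN industries i ON i.id = c.industry_id
    ORDER BY m.as_of
""").fetchall()
```

The composite primary key on `(ticker, as_of)` prevents the same day from being recorded twice, and the company and industry names are stored once instead of being repeated on every daily row.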

Server: I’m running this on an Ubuntu VirtualBox machine on my tower. It works, but I need to run it manually, my internet connection isn’t the fastest, and I can’t really have it running all day long. I’m thinking of switching to some online hosting, such as AWS. Given my direction from above, does anyone have any thoughts? I also have yet to look into cron jobs, but since my crawler doesn’t really work too well right now, I don’t want it to run all day.

I have attached a .csv file of my crawler’s output for people to look at. Please let me know what you think!
