No Data Lakes Please!

Gideon Smith of Rosenberg Equities believes machine learning will offer investors plenty of opportunities to create a special sauce for their investment process. But if you think it is simply about capturing large amounts of data, then you’ve got it wrong, he says.

Gideon Smith has just downloaded the latest version of the company’s natural language processing engine on his laptop and he is visibly excited.

Smith, who is AXA Investment Managers’ Chief Investment Officer & Global Head of Portfolio Management – Rosenberg Equities, and his team have been building different lexicons for machines to read specialised content, including corporate filings, traditional and social media content and even environmental, social and corporate governance (ESG) documents.

For example, the team has built a language model for reading Twitter feeds. “This model is trained for social media, so it finds words like ‘swift’ … and it will even recognise emoticons,” he says in an interview with [i3] Insights.

As quantitative investors, Smith and his team started out by building a lexicon for business language and they have been running the system over 10-K filings in the United States, which are annual reports from listed companies on their financial performance.

These reports, filed with the US Securities and Exchange Commission and accessible through its EDGAR database, are far more detailed than the standard company annual report and allow the team to build a picture of a company’s health not only through the figures provided, but also through analysing the sentiment of the language in the reports.

“[The model] is counting the numeric language that is being used. How much quantitative language are people using versus qualitative language? We are looking at that numeric ratio,” Smith says.

“What the research is showing us is that when you are reading earnings transcripts, companies that have more numeric language, literally talking facts and figures, there is more information in that transcript and it is a positive signal for the forward earnings.”
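
To make the idea of a numeric ratio concrete, here is a minimal sketch of how such a measure could be computed. It is not Rosenberg's model: the tokenising rule, the function name numeric_ratio and the sample sentence are all assumptions made for illustration.

```python
import re

# Assumed tokenising rule (not from the article): a token is either a number,
# optionally with thousands separators, decimals and a trailing % sign,
# or a plain alphabetic word.
TOKEN_RE = re.compile(r"\d[\d,]*(?:\.\d+)?%?|[A-Za-z]+(?:'[A-Za-z]+)?")


def numeric_ratio(text: str) -> float:
    """Return the share of tokens in `text` that are numeric."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        return 0.0
    # A token counts as numeric if it starts with a digit.
    numeric = sum(1 for token in tokens if token[0].isdigit())
    return numeric / len(tokens)


# Hypothetical sentence, invented for this example.
sample = "Revenue grew 12% to 4,500 million"
ratio = numeric_ratio(sample)  # 2 numeric tokens out of 6
```

A real system would need a far richer definition of quantitative language (written-out numbers, units, dates), but the principle of comparing numeric tokens with all tokens is the same.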

The model does the same for the analysis of ESG language and searches for certain business themes, not just across financial reports, but also in mainstream media.

“We run exactly the same models for reading newspapers. One of them is reading The Guardian, because The Guardian has a very nice news service feed that you can pick up digitally,” Smith says.

“I wanted to compare a business language model where we are looking for words like claim, concede, enforce and enable. So there are different language models that you can use when you do this sort of thing.”

It is applications like these that get Smith excited about the possibilities of machine learning. It is not that this type of analysis has never taken place; it has. But now it is possible at an unprecedented scale.

“We have an army of robot accountants that restate the financial statements of 20,000 companies on a consistent basis. I started my life as an accountant and I know what goes into accounting,” he says.

“Accounting standards have evolved, which is great, but the devil is in the detail and two sets of accounts are not comparable and so you have to make a whole series of adjustments. In this case, we have robots do that for us.”

Smith likes to be specific in his application of machine learning and harnesses the power of computing to scale up the tried and tested techniques of fundamental analysis. He is less impressed by broad sweeping attempts to capture as much data as possible with no clear idea as to what benefit it might bring to an organisation.

“People talk about having a ‘data lake’. It is people’s response to Big Data, where you have them storing all of their data in a big pool. You just dump it all in there and then you hope to extract value from it,” he says.

“I think that is an incredibly unhelpful metaphor and you miss the most important thing, which is the refining element. You have to process the data before you can use it. Whether it is restating accounts, or running natural language processing over earnings transcripts, all of that has to happen before it is of any use whatsoever.

“I use the metaphor: yes, data is the new oil, but not because it’s valuable, although that’s also true, but mainly because it needs refining.”

Integrity of Data

Today, more and more alternative data sets are becoming available. The trick is to figure out whether you can extract value from them, and to do it quickly, because you can spend years analysing data sets without getting anything in return.

But many new data sets come from providers that are themselves new, so investors need to run the ruler over what type of organisation provides them and how it operates, Smith says.

“We increasingly have to ask ourselves questions about the data vendors. Are they credible? What is their background? How long have they been doing this? How do they produce the data? Is it still going to be there next month?” he says.

“For example, Apple changed the terms of its app store a few months ago and a whole set of data vendors ceased to have viable business models because they were relying on people’s geolocation data or they were scraping people’s iPhones.

“You can’t do that anymore, and rightfully so.”

Smith is less concerned about the problem of signal decay because the firm isn’t chasing short-lived factors that rely on no one else knowing about them.

“There are some choices you have to make in your investment process. You can build an investment process which is all about finding the five new factors of this month, arbitraging the hell out of them quickly and then moving on to find the next five things,” he says.

“Or, and I think this is what we do, you can adopt a more fundamental approach. I’m looking for things that have a grounding in economics and have an economic rationale to them. And then we ask: Can I make a better investment decision by having more information?

“That means there are a lot of alternative data sets that I’m not interested in because they are spurious, they don’t have great coverage and they don’t necessarily tell me something about the fundamentals of companies.”

It doesn’t necessarily mean Rosenberg only uses financial data. There are plenty of sources that provide information on the future direction of companies that deal with non-financial data, including the level of diversity in a company or its use of fossil fuels.

But all information ultimately has a connection to a company’s future growth potential and this relationship can change over time, Smith says.

“I think the language around factors can obscure what’s really going on. Say I want to build a value factor. What I really want to do is to buy companies who are going to deliver more earnings for every dollar I invest on average. If I understand that connection, then it’s going to allow me to actually build a better factor,” he says.

“[But] what we are increasingly seeing is a breakdown in the correlations of what I call simple, naive factors and the return premiums associated with them. Historically, there was a premium associated with buying cheap companies based on book-to-price. I think that’s beginning to break down. The reason for that is because the value inherent in companies isn’t necessarily expressed in the book anymore.”
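
The "simple, naive" value factor Smith describes can be sketched in a few lines: divide book value per share by share price and rank companies from cheapest to most expensive. The company names and figures below are invented for illustration, and the helper names are hypothetical. This is not a description of Rosenberg's process.

```python
def book_to_price(book_value_per_share: float, price_per_share: float) -> float:
    """Book-to-price ratio: a higher value means the company looks cheaper."""
    if price_per_share <= 0:
        raise ValueError("price_per_share must be positive")
    return book_value_per_share / price_per_share


def rank_by_book_to_price(companies: dict[str, tuple[float, float]]) -> list[str]:
    """Return company names sorted from cheapest (highest ratio) to dearest."""
    return sorted(
        companies,
        key=lambda name: book_to_price(*companies[name]),
        reverse=True,
    )


# Made-up data: name -> (book value per share, price per share).
universe = {
    "AssetHeavyCo": (80.0, 100.0),  # ratio 0.8
    "MiddleCo": (50.0, 100.0),      # ratio 0.5
    "DataRichCo": (10.0, 100.0),    # ratio 0.1
}

ranking = rank_by_book_to_price(universe)
```

On this measure a company whose value sits largely in assets that do not appear in its book lands at the bottom of the ranking, which is the weakness Smith points to.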

He refers here not only to the more capital-light models of large technology firms, but also to the increased weight of intangibles in valuations across various industries.

“When I started at Rosenberg 20 years ago, intangibles were 20 per cent of the book. Now they are 80 per cent of the book and they don’t necessarily represent all of the value that is there,” he says.

“The valuations of Google, Alphabet and Amazon don’t reflect the data assets that they own. But somehow the market is considering them in how they value the company.

“So I think with some of the traditional metrics of building factors, seeking value is beginning to break down and I think there’s space there for managers to create some proprietary insights and secret sauce that lets them build a better investment model.”

Are Factors Behavioural?

There is an argument to say certain factors are more persistent than others because they are rooted in the way investors behave in markets. It is these factors that are often easier to exploit in the long term, Smith says.

“I’ve made the association between factors, future fundamentals and the economic rationale [before]. The reason they persist and don’t get arbitraged away immediately is precisely because of those behavioural facts,” he notes.

“We’ve been managing low-volatility, quality strategies for a long time now and it continues to surprise me how the markets, how individuals collectively, even with all the knowledge we have about how these things work, can persist in some of the biases that affect their investment decisions.

“Many people prefer to get rich quick than get rich slow. People have more faith in stuff they touched and felt and are overconfident about companies that they’ve examined versus the ones they missed out on.

“So I do think that the discipline of a quant approach, using a computer to do this work, can create opportunities and advantages. But it has to come with a decent dose of humility, which is about this idea of understanding what’s going on. That’s why you’ve got to find the economic connection.”

Changing the Organisation

The fact that investors use computers more intensively to help them with their strategies also means the organisation has to change. It is no longer just about the star fund manager; increasingly, data science teams are an integral part of the process. But how these two groups work together is where the real competitive advantage lies, Smith says.

“We’ve learned as much about how to organise ourselves as we have about the technology,” he says.

“The barriers to entry on technology are becoming lower and lower, but the winners and losers in this space are going to be the people who can actually organise themselves to take advantage of new forms of machine learning, more new data sets, and can get it to market fastest.”

As machine learning techniques become more prevalent, gaining an edge from them might also become harder, as stakeholders in the financial system catch on to how to deal with these models. A cynical version of the future could see an arms race in analytical tools between actors with different interests, Smith says.

“We are building natural language models that read companies’ 10-K filings. Those models will form a view, make assessments, create a signal on how positive or negative we view this company’s earnings,” he says.

“If I have computers and robots reading these filings, then somebody is going to create a robot to write these things and might try to game [the process]. Again, I’m not worried that we won’t find interesting ways to do this stuff, but there are all sorts of things to speculate about in the future.”

__________

[i3] Insights is the official educational bulletin of the Investment Innovation Institute [i3]. It covers major trends and innovations in institutional investing, providing independent and thought-provoking content about pension funds, insurance companies and sovereign wealth funds across the globe.