Research from the University of Chicago shows GPT-4 achieving 60% accuracy in predicting the direction of corporate earnings, outperforming human analysts and raising concerns about potential job losses. AI experts, however, raise doubts about possible training-data contamination.

【New Wisdom Introduction】GPT-4 outperformed most human analysts, and rivaled professional models trained for finance, at analyzing financial statements and picking stocks. It did so without any narrative context, a finding that shocked many industry experts. The good times did not last long, however: an AI expert pointed out a likely flaw in the research, namely that the training data may have been contaminated.

Recently, all the industry leaders were shocked by a paper from the University of Chicago.


Researchers found that the stocks selected by GPT-4 beat those picked by humans! They also beat many machine learning models trained specifically for finance.

What shocked them most was that the LLM could successfully analyze the numbers in financial statements without any narrative context!

Paper address: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4835311

Specifically, the LLM is better than experienced financial analysts at predicting changes in earnings. This is especially true in the difficult scenarios where human analysts tend to produce biased and inefficient forecasts.


Moreover, the LLM's predictions go beyond mere recall of training data; the analysis GPT-4 provides offers genuine insight into a company's future performance.

GPT-4's performance was far superior, with trading strategies based on its forecasts achieving a higher Sharpe ratio and alpha than those of other models.
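For reference, these are the standard portfolio metrics. A minimal sketch of the definitions, assuming the single-factor (CAPM) form of alpha for illustration (the paper may use a multi-factor model):

```latex
\text{Sharpe ratio} = \frac{\mathbb{E}[R_p - R_f]}{\sigma(R_p - R_f)},
\qquad
\alpha = \bar{R}_p - R_f - \beta\,(\bar{R}_m - R_f)
```

where \(R_p\) is the strategy's return, \(R_f\) the risk-free rate, and \(R_m\) the market return.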

Wharton School professor Ethan Mollick praised it: this is a paper everyone has been looking forward to.

Some netizens also lamented: it is hard to say whether humans or AI will be running the stock market in the future…

However, just as everyone was getting excited, some careful researchers poured cold water on the study: the result is likely due to contamination of the training data.

AI expert Tian Yuandong also suggested that GPT-4's excellent performance may be because the training data includes future stock prices, so GPT-4 was effectively cheating when picking from stock samples that start in 2021.

Testing whether GPT-4 is cheating is not complicated in theory: take a stock's price history, rename it to a new ticker, and feed it in for testing.
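A minimal sketch of such a test, assuming API access; the model name, prompt wording, and tickers are illustrative, not from the paper or Tian's post:

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_direction(prices: list[float], ticker: str) -> str:
    """Ask the model for an up/down call on a (possibly renamed) ticker."""
    prompt = (
        f"Monthly closing prices for stock {ticker}: {json.dumps(prices)}. "
        "Will it go up or down next month? Answer only 'up' or 'down'."
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower()

history = [310.2, 315.8, 299.5, 321.0]
# If the answer flips when the same history is relabeled from a real ticker
# to a made-up one, the model was likely keying on memorized identities.
print(predict_direction(history, "MSFT"), predict_direction(history, "ZZQX"))
```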

Research content

How can the role of LLMs in future decision-making be measured? In this study, the researchers measured it by asking the LLM to conduct financial statement analysis (FSA).

The main reason for conducting an FSA is to understand the financial health of a company and determine whether its performance is sustainable.

FSA is not simple. It is a quantitative task requiring extensive analysis of trends and ratios, and it also involves critical thinking, reasoning skills, and complex judgment. Usually, this task is done by financial analysts and investment professionals.

In the study, researchers gave GPT-4 Turbo two standard financial statements, a balance sheet and an income statement. Its task: analyze whether the company's earnings will increase or decrease in the future.

Note a key design choice in this study: no textual information was provided to the LLM. The only thing the LLM could refer to was the bare statements.

The researchers expected that LLMs would likely perform worse than professional human analysts.

The reason is that financial statement analysis is a very complex task involving much ambiguity, which demands common sense, intuition, and the flexibility of human thinking.

Moreover, LLMs’ current reasoning and judgment abilities are still insufficient, and they also lack understanding of the industry and macroeconomics.

Additionally, the researchers expected LLMs to underperform specialized machine learning applications, such as artificial neural networks (ANNs) built for earnings forecasting.

ANNs can learn deep interactions in the data that carry important clues, which are hard for a general-purpose model to pick up unless it can make intuitive inferences and form hypotheses from incomplete information or scenarios it has never seen before.

The experimental results surprised them: the LLM actually outperformed many human analysts and the dedicated neural networks!

Experimental procedures

To evaluate the LLM's performance, the researchers proceeded in the following two steps.

First, the researchers anonymized and standardized the companies' financial statements to prevent the LLM from potentially remembering the companies.

In particular, they omitted the company names from the balance sheets and income statements and replaced the years with labels such as t and t-1.

In addition, the researchers standardized the formats of balance sheets and income statements following Compustat's balancing model.

This ensures that the format of the financial statements is the same for every company-year observation, so the LLM cannot tell which company or time period its analysis corresponds to.
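A minimal sketch of this anonymization step, assuming the statements sit in a pandas DataFrame; the column names and figures are hypothetical:

```python
import pandas as pd

def anonymize_statement(df: pd.DataFrame, company_col: str = "company",
                        year_col: str = "fiscal_year") -> pd.DataFrame:
    """Drop identifying fields and relabel years as t, t-1, ... (newest first)."""
    out = df.drop(columns=[company_col])
    years = sorted(out[year_col].unique(), reverse=True)
    labels = {y: "t" if i == 0 else f"t-{i}" for i, y in enumerate(years)}
    out[year_col] = out[year_col].map(labels)
    return out

raw = pd.DataFrame({
    "company": ["Acme Corp", "Acme Corp"],
    "fiscal_year": [2020, 2021],
    "total_assets": [120.0, 135.0],
    "net_income": [8.5, 10.2],
})
print(anonymize_statement(raw))  # years now appear only as t-1 and t
```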

In the second step, the researchers designed an instruction to guide the LLM in conducting financial statement analysis and determining the direction of future earnings.

In addition to a simple instruction, they also developed a chain-of-thought (CoT) instruction that effectively "teaches" the LLM to analyze statements using the thought process of a human financial analyst.

Specifically, financial analysts identify significant trends in financial statements, calculate key financial ratios (such as operating efficiency, liquidity, and leverage), synthesize this information, and form expectations about future earnings.

The CoT instructions created by the researchers implement this thought process through a series of steps.
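An illustrative CoT instruction in this spirit; the wording below is a paraphrase of the described steps, not the paper's exact prompt:

```python
COT_PROMPT = """You are a financial analyst. Given the anonymized balance sheet
and income statement below, work through the following steps:
1. Identify notable trends across years t-1 and t in the major line items.
2. Compute key financial ratios (e.g., operating margin, current ratio,
   asset turnover, leverage).
3. Interpret what each ratio implies about efficiency, liquidity, and risk.
4. Synthesize these observations into an overall assessment.
5. Conclude with one prediction: will earnings INCREASE or DECREASE at t+1?
   Also state the expected magnitude and your confidence from 0 to 1.
"""
```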

For the dataset, the researchers used the Compustat database to test the model's performance, cross-referencing it with the IBES database where necessary.

The sample covers 150,678 firm-year observations from 15,401 companies between 1968 and 2021.

The analyst sample covers the period 1983-2021 and contains 39,533 observations on 3,152 companies.

Why the LLM is so successful

The researchers proposed two hypotheses for this result.

The first hypothesis is that GPT's performance is driven entirely by near-perfect memory.

Under this hypothesis, GPT inferred the company's identity and year from the data and then matched that information with the sentiment it had learned about the company from news.

The researchers ruled out this possibility through the anonymization described above, and also replicated the results on new data from outside GPT-4's training period.

The second hypothesis is that GPT can infer the direction of future earnings because it generates genuinely useful insights.

For example, the model often computes the same ratios that financial analysts compute, and then, following the CoT prompt, generates narratives analyzing those ratios.

To test this, the researchers aggregated all the narratives generated by the model for a given company-year and encoded them into 768-dimensional vectors (embeddings) using BERT. These vectors were then fed into an ANN trained to predict the direction of future earnings.
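A sketch of that pipeline, using Hugging Face's bert-base-uncased (whose [CLS] vector is 768-dimensional) and a small scikit-learn MLP standing in for the paper's ANN; the narratives and labels are made up:

```python
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from sklearn.neural_network import MLPClassifier

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def embed(text: str) -> np.ndarray:
    """Encode one narrative as its 768-d [CLS] embedding."""
    tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = bert(**tokens)
    return out.last_hidden_state[:, 0, :].squeeze().numpy()

narratives = ["Operating margin improved while leverage declined.",
              "Liquidity deteriorated and asset turnover slowed."]
labels = [1, 0]  # 1 = earnings increased at t+1, 0 = decreased

X = np.stack([embed(t) for t in narratives])
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, labels)
print(clf.predict(X))
```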

As a result, the ANN trained on GPT's narrative insights achieved an accuracy of 59%, almost as high as GPT's own prediction accuracy of 60%.

This result directly demonstrates that the narrative insights generated by the model are informative about future performance.

It can also be observed that there is a 94% correlation between the GPT forecasts and the ANN forecasts based on the GPT narratives, indicating that the information encoded in these narratives is the basis for GPT's forecasts. In explaining the direction of future earnings, the narratives related to ratio analysis are the most important.

In summary, the model's superior performance stems from the narratives it generates through CoT reasoning.

Experimental Results

The experimental evaluation results in the latest research can be summarized into the following three highlights.

GPT outperforms human financial analysts

To assess the accuracy of analysts' forecasts, the researchers computed the "consensus forecast" (the median of analysts' forecasts issued in the month after the financial statements were released) as the expectation of earnings for the next year.

This ensures comparability between analyst forecasts and model forecasts.

In addition, the authors used "consensus forecasts" issued three and six months after release as alternative benchmarks.

These benchmarks are disadvantageous to the LLM because they incorporate information that arrives during the year. However, given that analysts may be slow to incorporate new information into their forecasts, the researchers chose to report them for comparison.
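A minimal sketch of the consensus computation, with hypothetical IBES-style rows (one forecast per analyst) and the release-window filtering omitted:

```python
import pandas as pd

forecasts = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B"],
    "analyst": [1, 2, 3, 1, 2],
    "eps_forecast": [1.10, 1.25, 1.18, 0.40, 0.35],
})

# Consensus = median forecast across analysts within the chosen window.
print(forecasts.groupby("company")["eps_forecast"].median())
```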

The researchers first analyzed the performance of GPT in predicting future “earnings direction” and compared it with the performance of securities analysts.

They note that predicting changes in earnings per share (EPS) is a highly complex task, because the EPS time series approximates a random walk and contains a large unpredictable component.

The random-walk benchmark simply extrapolates the most recent change in earnings: it predicts that earnings will keep moving in the direction of the change from the previous year to the current one.
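A sketch of that naive benchmark (a hypothetical helper, not the paper's code): predict that the next change in earnings has the same sign as the last one.

```python
def random_walk_direction(eps_prev: float, eps_curr: float) -> str:
    """Extrapolate the most recent change in EPS to the next period."""
    return "increase" if eps_curr > eps_prev else "decrease"

print(random_walk_direction(2.10, 2.45))  # -> increase
```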

The figure below shows the comparison of the prediction performance of GPT and human financial analysts.

The results showed that analysts' one-month forecasts were 53% accurate in predicting the direction of future earnings, beating the 49% accuracy of the simple model that extrapolates the previous year's change.

Analysts' forecasts at three and six months were 56% and 57% accurate respectively, which is reasonable because they incorporate more timely information.

GPT predictions based on the "simple" (non-CoT) prompt achieved 52% accuracy, below the human analyst baseline, which was consistent with the researchers' expectations.

However, when using CoT prompting to simulate human reasoning, they found that GPT reached 60% accuracy, significantly higher than the analysts'.

Similar conclusions hold when checking the F1-score, an alternative metric for evaluating a model's predictive power that combines precision and recall.
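For reference, the F1-score is the harmonic mean of precision and recall:

```latex
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```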

This suggests that GPT significantly beats the median financial analyst at analyzing financial statements to determine the direction of a company's earnings.

That said, human analysts may add value by drawing on soft information or broader context unavailable to the model.

Indeed, the researchers also found that analysts’ forecasts contain useful insights about future performance that are not captured by GPT.

Furthermore, research shows that GPT’s insights are more valuable when humans have difficulty making future predictions.

Likewise, in cases where human forecasts are prone to bias or inefficiency (i.e., they fail to properly incorporate information), GPT's forecasts are more useful in predicting the direction of future earnings.

GPT is on par with specialized neural networks

The researchers also compared the prediction accuracy of GPT and various ML models.

They used three forecasting models.

  • The first model, “Stepwise Logistic”, follows the Ou and Penman framework and uses 59 financial indicator predictor variables.

  • The second model was an ANN using the same 59 predictor variables while also exploiting nonlinearities and interactions among them.

  • Third, to ensure consistency between GPT and ANN, the researchers also used an ANN model trained on the same set of information provided to GPT (income statement and balance sheet).

Importantly, the researchers trained these models on Compustat observations using rolling five-year historical windows. All predictions were out of sample.
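A rough sketch of this rolling out-of-sample setup, with scikit-learn's MLPClassifier standing in for the paper's ANN and random placeholder data in place of the 59 Compustat predictors:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_years, n_firms, n_features = 10, 200, 59  # 59 Ou-Penman-style predictors
X = rng.normal(size=(n_years, n_firms, n_features))
y = rng.integers(0, 2, size=(n_years, n_firms))  # 1 = earnings up next year

# Train on a rolling five-year window, then predict the following year.
for t in range(5, n_years):
    X_tr = X[t - 5:t].reshape(-1, n_features)
    y_tr = y[t - 5:t].ravel()
    scaler = StandardScaler().fit(X_tr)
    clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=300)
    clf.fit(scaler.transform(X_tr), y_tr)
    acc = clf.score(scaler.transform(X[t]), y[t])
    print(f"year {t}: out-of-sample accuracy = {acc:.2f}")
```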

Using the entire Compustat sample, stepwise logistic regression achieves an accuracy of 52.94% (F1 score 57.23%), which is comparable to the performance of human analysts and consistent with previous studies.

In contrast, the ANN trained on the same data achieved a higher accuracy of 60.45% (F1 score 61.62%), within the range of state-of-the-art earnings prediction models.

When GPT (with CoT) was used for prediction, the model achieved 60.31% accuracy on the entire sample, very close to the ANN's.

In fact, the F1 score of GPT is significantly higher than that of ANN (63.45% vs. 61.6%).

Furthermore, when the researchers trained the ANN using only the data from the two financial statements (the same inputs fed to GPT), its predictive ability was slightly lower, with an accuracy of 59.02% (F1 score 60.66%).

Overall, these results suggest that GPT’s accuracy is comparable to (or slightly higher than) that of state-of-the-art dedicated machine learning models.

ANN and GPT predictions complement each other

The researchers further observed that the predictions from ANN and GPT are complementary as they both contain useful incremental information.

And there are indications that when the ANN performs poorly, GPT tends to perform well.

In particular, the ANN predicts earnings based on the training examples it has seen in past data, and given that many examples are very complex and highly multidimensional, its ability to learn from them may be limited.

In contrast, GPT made relatively few errors when predicting earnings for small or loss-making companies, likely benefiting from its human-like reasoning and extensive knowledge.

In addition, the researchers ran several further experiments, partitioning the sample by GPT's confidence in its answers and testing different LLM families.

When GPT answers with higher confidence, its predictions tend to be more accurate than its low-confidence predictions.

At the same time, the study proved that this result can be generalized to other large models. In particular, Google’s recently released Gemini Pro has an accuracy comparable to GPT-4.

Forecast sources: Growth and operating margin

The figure below shows the frequency statistics of bigrams and unigrams in GPT's responses.

Here, a bigram is a pair of consecutive words appearing together in a text; a unigram is a single word.

The left panel shows the ten most common bigrams found in GPT's answers about financial ratio analysis.

The right panel lists the ten most frequently occurring words in GPT's binary earnings predictions.

This analysis was conducted to determine the most common terms and phrases used by GPT in different financial analysis contexts.
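A minimal sketch of that frequency count using only the standard library; the sample responses are invented:

```python
from collections import Counter

responses = [
    "operating margin improved and revenue growth remained strong",
    "revenue growth slowed while operating margin held steady",
]

unigrams, bigrams = Counter(), Counter()
for text in responses:
    words = text.lower().split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

print(unigrams.most_common(10))
print(bigrams.most_common(10))
```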

Interestingly, the terms “Operating Margin” and “Growth” had the highest predictive power.

It seems that GPT has internalized the "Rule of 40" (the heuristic that a healthy company's revenue growth rate plus operating margin should total at least 40%).

In summary, all results suggest that as AI accelerates, the role of financial analysts will change.

Admittedly, human expertise and judgement are unlikely to be completely replaced any time soon.

But powerful AI tools like GPT-4 could greatly enhance and simplify the analyst’s job, perhaps even reshaping the field of financial statement analysis in the coming years.

References:

  • https://www.newsletter.datadrivenvc.io/p/financial-statement-analysis-with

  • https://x.com/tydsh/status/1794137012532081112

  • https://x.com/emollick/status/1794056462349861273

  • https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4835311

