Apr 25, 2024

OpenBB is experimenting with AI-driven stock screening based on unstructured data



Screening securities is one of the first steps most analysts will take in their workflow.

The idea is simple: reduce the universe of all possible securities to a more manageable number based on easy-to-filter criteria. Then, perform more in-depth research on this reduced set to narrow down to your final choice of potential securities to invest in.

Traditional screening is relatively easy; a wide variety of tools and data providers will allow you to screen stocks based on criteria, such as earnings-per-share, industry sector, market cap, price-to-earnings ratio, return on equity, etc. Setting up the various criteria yourself can be tedious, but it’s doable.

But what if you want to screen for securities that are planning to invest in AI? Or expanding into new territories? Or expecting layoffs? Or being outcompeted by rivals? Or involved in legal action?

Screening for these criteria involves consuming swathes of unstructured data, such as earnings transcripts, SEC filings, and news articles. This is time-consuming and not scalable.

In this post, we’ll share how we’ve been approaching this problem at OpenBB.

Structured vs. Unstructured data

First, let’s define what we mean by structured and unstructured data.

Structured data is organized, has a fixed schema, and is typically tabular. Think Excel spreadsheets, CSV files, SQL databases, etc.

Crucially for our use case, structured data is easily queryable.
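For example, a screen over tabular fundamentals is a single expression. Here is a toy sketch (the data and column names are purely illustrative):

import pandas as pd

# A toy table of fundamentals; in practice this comes from a data provider.
df = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC"],
    "marketcap": [1.5e9, 3.2e9, 8.0e8],
    "revenuegrowth": [0.25, 0.05, 0.40],
})

# One boolean expression performs the whole screen.
small_cap_growers = df[(df["marketcap"] < 2e9) & (df["revenuegrowth"] > 0.2)]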

Unstructured data is essentially the opposite. It is flexible, unorganized, non-tabular, freeform, and rarely follows a fixed schema. Think music, videos, news articles, earnings transcripts, etc.

Naturally, unstructured data is difficult to query, especially when trying to extract semantic information (i.e., the meaning behind the query).

Faster stock screening using natural language

Before showing how you could tackle screening stocks by extracting criteria from unstructured data, let’s first look at automating the initial screening step: screening stocks using structured criteria.

The idea here is to rapidly create a set of screening criteria from natural language. Then, the resulting set of screening criteria can either be used as a starting point, where the user can explore and then iteratively add or remove certain criteria, or used outright as the final structured screening step.

Either way, we want to improve the tedious process of setting up our screening criteria ourselves.

How to screen stocks using natural language

Workflow for screening stocks with structured filters using natural language

The workflow is as follows:

  • Capture the user’s natural language query.

  • Split the query into multiple criteria and their logical relationship.

  • For each criterion, we search through a database of possible fields that could screen for it. Each field also has metadata, for example the unit it is in, a description, and the category it belongs to.

  • Then, we combine the criteria, their logical relationship, and the retrieved fields to formulate the final filter description that can be passed to our screening tool.

For each step in the process, we use an LLM (in this case, we experimented with a combination of OpenAI’s gpt-3.5-turbo and Anthropic’s claude-3-haiku chat models). We use a RAG process to retrieve the fields from the database.
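To illustrate, here’s a minimal sketch of the criteria-splitting step, assuming OpenAI’s Python client and gpt-3.5-turbo (the prompt here is a simplified stand-in, not our production prompt):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SPLIT_INSTRUCTIONS = (
    "Split the user's stock screening query into individual criteria joined "
    "by their logical operator, e.g.: criteria one |AND| criteria two. "
    "Return only the formatted string."
)

def split_criteria(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep the splitting behavior as predictable as possible
        messages=[
            {"role": "system", "content": SPLIT_INSTRUCTIONS},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content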

Why search for the relevant fields using RAG? Why not just include all of the possible fields in the context directly? This is a great question. We do this for two reasons.

First, we want to avoid context limits: the number of possible fields is large (and is expanding all the time). In our particular example, we have 200+ possible fields to screen on. Including all of them would consume a lot of context and a lot of tokens (making things more expensive).

Second, LLMs tend to behave more predictably and reliably when they are constrained. By only including fields relevant to the criteria we are screening for, we provide the LLM with fewer opportunities to misbehave.
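To make the retrieval step concrete, here is a sketch of a minimal RAG lookup over the field database using embedding similarity (the field metadata is abbreviated, and the embedding model choice here is ours for illustration):

import numpy as np
from openai import OpenAI

client = OpenAI()

# Each screenable field carries metadata: description, unit, category, etc.
FIELDS = {
    "marketcap": "Market capitalization of the company, in USD.",
    "revenuegrowth": "Year-over-year revenue growth, as a ratio.",
    "profitmargin": "Net profit margin, as a ratio.",
    # ... 200+ more fields in practice
}

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

FIELD_NAMES = list(FIELDS)
FIELD_VECTORS = embed([f"{name}: {desc}" for name, desc in FIELDS.items()])

def retrieve_fields(criterion: str, k: int = 3) -> list[str]:
    """Return the k fields whose metadata best matches the criterion."""
    query_vec = embed([criterion])[0]
    scores = FIELD_VECTORS @ query_vec / (
        np.linalg.norm(FIELD_VECTORS, axis=1) * np.linalg.norm(query_vec)
    )
    return [FIELD_NAMES[i] for i in np.argsort(scores)[::-1][:k]]

Only the retrieved fields and their metadata are then placed in the LLM’s context when generating the final filter.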

Let’s send a natural language query through the system above to illustrate how it works.

Take the following query as an example:

Select stocks with a small market cap and high revenue growth and a negative profit margin.

First, split the query into criteria and logical relationships:

small market cap |AND| high revenue growth |AND| negative profit margin

Then, we search the database for the correct fields for each criterion:

['marketcap', 'enterprisevalue', 'revenuegrowth', 'revenueqoqgrowth', 'profitmargin', 'nopatmargin', 'normalizednopatmargin']

Finally, we combine the criteria, their relationship, and the relevant fields to generate the final screening definition. Notice how the LLM has inferred by itself what “small market cap” and “high revenue growth” mean:

{
  "operator": "AND",
  "clauses": [
    {
      "field": "marketcap",
      "operator": "lte",  # <-- less than or equal to
      "value": "2000000000"
    },
    {
      "field": "revenuegrowth",
      "operator": "gte",  # <-- greater than or equal to
      "value": "0.2"
    },
    {
      "field": "profitmargin",
      "operator": "lt",  # <-- less than
      "value": "0"
    }
  ]
}

We can then pass this to the screening tool of our choice via an API call to retrieve the results:

>>> retrieved_securities
['AADI',
 'AAMC',
 'ABEO',
 'ACIU',
 'ACON',
 ...
 ]

Now, we have a system that allows us to map directly from a user’s natural language query to a set of filters that will screen stocks for us (based on structured criteria)!
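As an aside, if your universe of fundamentals lives in a local table rather than behind a screener API, the same filter definition is easy to evaluate directly. Here is a minimal sketch, assuming a pandas DataFrame whose columns match the field names:

import operator

import pandas as pd

OPS = {
    "lt": operator.lt,   # less than
    "lte": operator.le,  # less than or equal to
    "gt": operator.gt,   # greater than
    "gte": operator.ge,  # greater than or equal to
}

def apply_filter(df: pd.DataFrame, definition: dict) -> pd.DataFrame:
    """Evaluate a filter definition like the one above against a DataFrame."""
    masks = [
        # Values arrive as strings in the definition, hence the float() cast.
        OPS[clause["operator"]](df[clause["field"]], float(clause["value"]))
        for clause in definition["clauses"]
    ]
    combined = masks[0]
    for mask in masks[1:]:
        combined = combined & mask if definition["operator"] == "AND" else combined | mask
    return df[combined]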

Here is another example:

Select stocks with a large market cap and a current ratio above 0.8 and a P/E ratio below 70 and a low debt ratio.

Output:

{'operator': 'AND',
 'clauses': [{'field': 'marketcap', 'operator': 'gt', 'value': '1000000000'},
             {'field': 'currentratio', 'operator': 'gte', 'value': '0.8'},
             {'field': 'pricetoearnings', 'operator': 'lte', 'value': '70'},
             {'field': 'debttototalcapital',
              'operator': 'lte',
              'value': '0.5'}]}

We’ve found that, even for a large number of criteria (5+), we get robust outputs.

This process takes ~6-10s end-to-end, which is dramatically faster than manually searching for, selecting and entering criteria.

Screening stocks based on unstructured data

Screening stocks based on calculated criteria is nothing new, even if we’ve now made it a little more efficient using natural language. What if, instead of filtering only on structured criteria, we could also screen for criteria that are only present in unstructured data?

For example, what if we could go from this (structured criteria):

select stocks with more than a trillion dollars in market cap

To this (structured and unstructured criteria):

select stocks with more than a trillion dollars in market cap and that are planning to invest in AI technology

Setting up the “market cap” criteria in a screening tool is tedious, but researching which companies plan to invest in AI is a much larger process: you’ll need to read multiple news articles, multiple earnings transcripts, comb through multiple SEC filings, and more. This is the nature of unstructured data: messy, disorganized, and hard to query.

And that’s precisely where LLMs can shine.

If we can use AI to automate this process, we will be able to utilize unstructured data in a way that wasn’t possible before, leading to a significant informational edge.

So, let’s see if we can achieve just that.

How to screen stocks using unstructured criteria

For now, we’ll use earnings transcripts as the only unstructured data source. This can easily be expanded later.

The entire workflow is as follows:

  • Split the user’s query into structured and unstructured criteria.

  • Create a shortlist of screened stocks using the “structured criteria” workflow from earlier.

  • For each stock in the shortlist, fetch the last four earnings transcripts and search through them to see if they match the unstructured criteria.

  • For each criterion that matches, extract citations (in this case, direct quotes) to explain which excerpt of the data matched the criterion.

  • Return the final list of screened stocks (based on both structured and unstructured criteria), as well as the citations from the unstructured data.
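Here is a minimal sketch of the matching step under some assumptions: the prompt is a simplified stand-in, and fetch_transcripts is a hypothetical helper representing your transcript data provider:

import json

from openai import OpenAI

client = OpenAI()

MATCH_INSTRUCTIONS = (
    "You are screening stocks. Given an earnings call transcript, decide "
    "whether the company matches this criterion: '{criterion}'. Respond in "
    'JSON: {{"matches_criteria": true or false, "references": ["direct '
    'quotes from the transcript that support the decision"]}}.'
)

def matches_criterion(transcript_text: str, criterion: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        response_format={"type": "json_object"},  # ask for strict JSON back
        messages=[
            {"role": "system", "content": MATCH_INSTRUCTIONS.format(criterion=criterion)},
            {"role": "user", "content": transcript_text},
        ],
    )
    return json.loads(response.choices[0].message.content)

def screen_unstructured(shortlist: list[str], criterion: str) -> list[dict]:
    results = []
    for symbol in shortlist:
        # fetch_transcripts is hypothetical: it returns the last N transcripts
        # for a ticker as {"text": ..., "metadata": ...} dicts.
        for transcript in fetch_transcripts(symbol, last_n=4):
            match = matches_criterion(transcript["text"], criterion)
            match["metadata"] = transcript["metadata"]
            results.append(match)
    return results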

An important observation is that we can easily decouple the structured and unstructured flows. This means that we can, for example, allow the user to manually adjust the structured criteria in a UI before running the unstructured screening workflow.

Why not screen the entire universe of stocks for the unstructured criteria?
Unfortunately, this isn’t practical due to LLM costs, API rate limits, and response times. This is why we rely on the structured criteria screening as an initial step: to filter down the number of potential stocks to a manageable shortlist. This might become possible in the future with some kind of database-esque solution that can execute LLM queries across rows in a table in parallel. For now, we’ll aim for a shortlist of around 10 stocks.

Let’s see how this works for a single stock ticker.

Let’s use META and screen for the following criteria: is investing in opensource.

This yields the following results:

{'matches_criteria': True,
 'references': ['Our long-standing strategy has been to build an open-source '
                'general infrastructure while keeping our specific product '
                'implementations proprietary. In the case of AI, the general '
                'infrastructure includes our Llama models, including Llama 3, '
                "which is training now, and it's looking great so far, as well "
                "as industry standard tools like PyTorch that we've "
                'developed.'],
 'metadata': {'symbol': 'META', 'year': 2023, 'quarter': 4}}
---
{'matches_criteria': True,
 'references': ['We have a pretty long history of open sourcing parts of our '
                'infrastructure that are not kind of the direct product code. '
                'And a lot of the reason why we do this is because it '
                'increases adoption and creates a standard around the '
                'industry, which often drives forward innovation faster so we '
                "benefit, our products benefit, as well as there's more "
                'scrutiny on kind of security and safety-related things so we '
                "think that there's a benefit there.",
                'We also are building foundation models like Llama 2, which we '
                'believe is now the leading open source model with more than '
                '30 million Llama downloads last month.'],
 'metadata': {'symbol': 'META', 'year': 2023, 'quarter': 3}}
---
{'matches_criteria': True,
 'references': ['We partnered with Microsoft to open-source Llama-2, the '
                'latest version of our large language model and to make it '
                'available for both research and commercial use.',
                'We have a long history of open-sourcing our infrastructure '
                'and AI work from PyTorch, which is the leading '
                'machine-learning framework to models like segment anything, '
                'image bind and DINO to basic infrastructure as part of the '
                'open compute project. And we found that open-sourcing our '
                'work allows the industry, including us, to benefit from '
                'innovations that come from everywhere.'],
 'metadata': {'symbol': 'META', 'year': 2023, 'quarter': 2}}
---
{'matches_criteria': True,
 'references': ["Unlike some of the other companies in the space, we're not "
                'selling a cloud computing service where we try to keep the '
                "different software infrastructure that we're building "
                "proprietary. For us, it's way better if the industry "
                "standardizes on the basic tools that we're using. And "
                'therefore, we can benefit from the improvements that others '
                'make and others use of those tools can, in some cases, like '
                'Open Compute, drive down the costs of those things, which '
                'make our business more efficient, too.',
                'We open source many of our state-of-the-art models so people '
                'can experiment and build with them. This quarter we released '
                'our LLaMa LLM to researchers. It has 65 billion parameters '
                'but outperformed larger models and has proven quite popular. '
                "We've also open sourced 3 other groundbreaking visual models "
                'along with their training data and model weights, Segment '
                "Anything, DINOv2 and our animated drawings tool, and we've "
                'gotten some positive feedback on all of those as well.'],
 'metadata': {'symbol': 'META', 'year': 2023, 'quarter': 1}}

We can see that there is plenty of evidence in the last four transcripts that META is investing heavily in open source. This request took ~3-5s to execute, which is a lot faster than reading the transcripts. Let’s take one of the quotes at random and see whether the model has quoted the transcript correctly.

Let’s use the following quote (from our first result), which was extracted from the META Q4 2023 transcript:

Our long-standing strategy has been to build an open-source general infrastructure while keeping our specific product implementations proprietary. In the case of AI, the general infrastructure includes our Llama models, including Llama 3, which is training now, and it's looking great so far, as well as industry standard tools like PyTorch that we've developed.

If we look at the transcript itself and search for the extract, we find a perfect match.
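Checks like this are easy to automate. A minimal sketch that normalizes whitespace (transcripts often differ only in line breaks) and verifies that the citation is verbatim:

def quote_in_transcript(quote: str, transcript_text: str) -> bool:
    """Check that a cited quote appears verbatim, ignoring whitespace differences."""
    def normalize(s: str) -> str:
        return " ".join(s.split())
    return normalize(quote) in normalize(transcript_text)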

Since unstructured data is large and difficult to query, it is immensely valuable to have direct quotes that can be used as evidence or as a direct reference, for example when writing a report.

Let’s run the same search, this time for AAPL. We get the following results:

{'matches_criteria': False, 'references': [], 'metadata': {'symbol': 'AAPL', 'year': 2024, 'quarter': 1}}
---
{'matches_criteria': False, 'references': [], 'metadata': {'symbol': 'AAPL', 'year': 2023, 'quarter': 4}}
---
{'matches_criteria': False, 'references': [], 'metadata': {'symbol': 'AAPL', 'year': 2023, 'quarter': 3}}
---
{'matches_criteria': False, 'references': [], 'metadata': {'symbol': 'AAPL', 'year': 2023, 'quarter': 2}}

Clearly (at least from the four most recent earnings transcripts), we can see that open-source software is not a high priority for Apple.

If investment in open-source software were a large part of your investment thesis, our approach would’ve screened out AAPL and kept META.

And it would’ve taken only a handful of seconds. Nice!

Pursuing the future of screening stocks using AI at OpenBB

This was a technical overview of how we can leverage AI and LLMs to gain a significant informational edge when selecting and screening stocks.

First, we showed how we can map directly from natural language to stock screening criteria, speeding up the traditional stock screening workflow.

Second (and more significantly!), we’ve shown how you can use LLMs to screen stocks based on criteria that are only found in unstructured data, such as earnings transcripts. Performing this process manually across multiple candidate stocks would take a human analyst multiple days; instead, we can perform the same analysis in a handful of seconds.

This forms just a part of our ongoing research and investment at OpenBB into AI and LLM tooling that aims to improve and rethink how investment analysts perform their day-to-day work.

If this is a use case that is interesting to you, please feel free to reach out!

We’d love to have a conversation about what we’re working on.

See you next time!
