Introduction
When working in FinTech, dealing with vast amounts of data, in all its shapes and sizes, can be a real pain point.
Data can be wildly different between providers, it often requires cleaning and can be hard to reconcile against other sources of truth.
API conventions differ between providers, the data often requires transformations, and it can be hard to quickly navigate to the point of failure, to name a few.
If you have never encountered the data science acronym ETL, it stands for:
Extract
Transform
Load
They are general processes that acquire some data, clean it in preparation for use, and then load it into memory or storage for the end user.
At OpenBB, we conquer data with a TET (Transform—Extract—Transform) pipeline.
The OpenBB Platform plays the role of the staging area and divides the process for fetching data into three steps — TET:
Transform the query.
Extract the data.
Transform the data.
The TET Pipeline
The TET pipeline is contained to the Fetcher class, and it is the main executor possessing the logic for how it should behave. Each data source and provider inherits from this class and sets the QueryParams and Data definitions as the instructions for execution and validation.
It begins with QueryParams
, which are Pydantic models of parameters for querying a specific provider endpoint. The model is validated from the user-supplied input. For example,
obb.equity.price.historical("aapl", start_date="2024-01-01", provider="yfinance")
The function inputs go into the first stage of the pipeline, Transform Query.
1. Transform Query
Every provider has different words for the same thing, but OpenBB translates them in a standardized interface.
Furthermore, some may require pagination to get all the data requested, and others may not allow filtering the request directly.
There are incorrectly documented parameters and response types to deal with; worse yet, undocumented. It can feel like they don’t even want you to know how to use the service they provide.
The table below compares a few basic parameter names for historical stock prices.
Parameter | Polygon | Intrinio | FMP |
---|---|---|---|
symbol | stocksTicker | identifier | symbol |
start_date | from | start_date | from |
end_date | to | end_date | to |
interval | multiplier + timespan | Interval_size | timeframe |
Requires Pagination | Yes | Yes | Unclear |
Max Results Per Page | 50,000 | 10,000 for daily/monthly/weekly but 500 for intraday | Unclear, maybe limited via date range |
Additional Considerations | Interval can be any combination of multiplier and minute/day/week/month/year | Different endpoints for intraday and daily/weekly/monthly. Deviation from conventional meaning of, intraday | Endpoints are poorly labeled and documented. |
Request Structure | Standard URL query string | Standard URL query string | Same as Intrinio |
Parameter | Polygon | Intrinio | FMP |
---|---|---|---|
Interval | multiplier =60 timespan=”minute” | interval_size=”1h” | timeframe=”1hour” |
Similar differences will exist between the accepted values for these parameters. What is the syntax for a one-hour interval?
The OpenBB Platform smooths these differences, seamlessly switching between sources.
Comparing multiple sources of data is as simple as changing the “provider” parameter.
import pandas as pd
from openbb import obb
compare = pd.DataFrame()
for provider in ["yfinance", "alpha_vantage", "tiingo", "polygon", "intrinio", "fmp"]:
compare[provider] = obb.equity.price.historical(
"AAPL",
start_date="2024-03-03",
end_date="2024-03-15",
interval="1d",
provider=provider
).to_df().get("close")
2. Extract Data
When you call
obb.equity.price.historical(
“AAPL”,
interval=”1d”,
start_date=”2024-03-03”,
end_date=”2024-03-15”,
provider=”yfinance”
)
a standardized object is returned.
Once the parameters are molded to fit a particular provider-specific schema, they are passed on to the second stage which is the Extract Data stage.
This stage is where the bulk of the work gets done, grabbing the data from the provider. The transformed parameters are used to query the source and return the raw data.
Depending on the type and complexity of the data, multiple API calls may be required. The final results are a concatenation of each request, typically served as a list of dictionaries on output.
Some light parsing and shaping may be required during this stage, but generally, the data is still “as-is” from the provider. Breaking up the process here allows the developer to understand if a point of failure is within the data request or if it is an error within the data itself.
The sample data below is from Polygon. Here, “results” are what will be passed through to the next stage.
{
"adjusted": true,
"queryCount": 3,
"results": [
{
"T": "KIMpL",
"c": 25.9102,
"h": 26.25,
"l": 25.91,
"n": 74,
"o": 26.07,
"t": 1602705600000,
"v": 4369,
"vw": 26.0407
},
{
"T": "TANH",
"c": 23.4,
"h": 24.763,
"l": 22.65,
"n": 1096,
"o": 24.5,
"t": 1602705600000,
"v": 25933.6,
"vw": 23.493
},
{
"T": "VSAT",
"c": 34.24,
"h": 35.47,
"l": 34.21,
"n": 4966,
"o": 34.9,
"t": 1602705600000,
"v": 312583,
"vw": 34.4736
}
],
"resultsCount": 3,
"status": "OK"
}
3. Transform Data
The Transform Data stage is where the finishing touches are made. Much like the parameters, data fields will vary greatly between providers. Standardized names and type enforcement are applied here.
The sample data above would be mapped to any standard fields using the __alias_dict__
property of the Pydantic model. Extra, provider-specific, fields are then defined below in the model.
class PolygonEquityHistoricalData(EquityHistoricalData):
"""Polygon Equity Historical Price Data."""
__alias_dict__ = {
"date": "t",
"open": "o",
"high": "h",
"low": "l",
"close": "c",
"volume": "v",
"vwap": "vw",
}
transactions: Optional[PositiveInt] = Field(
default=None,
description="Number of transactions for the symbol in the time period.",
alias="n",
)
As part of a general ETL pipeline, the output of this process fits somewhere near the end of the “Transform” stage. It may need further transformation to conform with any specific requirements, but most of the annoying details have been taken care of, such as:
Data is guaranteed to be JSON serializable.
Data is delivered as a validated Pydantic model.
Types are strictly enforced.
Numbers are numbers, dates are dates, and strings are strings.
NaN, empty strings, and the various string representations of None are converted to null.
Field names are always in “lower_snake_case” and have standardized names across all data sources, wherever possible.
OHLC+V will always be: “open”, “high”, “low”, “close”, and “volume”.
When you start comparing the same data between several sources, the transformation is what makes it possible. The OpenBB Platform takes care of the details so as a user, you don’t have to worry about whether a particular dataset returns numbers as a string or figure out which format the date is stored in and how to parse it.
Why TET?
Working with TET allowed us to understand its strengths and weaknesses better.
Pros of TET:
It helps us build and ship faster by having a clear and defined structure to follow.
It allows us to test each part of the Data retrieval logic.
It segregates errors, making them easier to spot and deal with.
It keeps the codebase organized and tidy.
It introduces ease of extendability and modularity.
Cons of TET:
It can be an overly verbose approach for quick implementations.
It heavily relies on the data standardization framework.
Some steps can be redundant (e.g., not all data or parameters require transformations)
Conclusion
OpenBB uses the Transform-Extract-Transform pattern because it provides the framework for standardizing queries and data across all sources.
You can change your investment research for free today.
Check out our Platform or explore the Documentation to start now!