Mar 14, 2023 5 MIN

OpenBB "Whisper": Intelligent Video Analysis utilizing Transformers

Martin Bufi, our resident AI expert, talks about how he ideated and created a new OpenBB feature set that can listen, analyze, and provide summaries of video content.

Over the weekend, I was searching for software that could summarize a YouTube video, but I couldn't find anything that met my specific requirements. Being the engineer I am, I decided to build a tool that could do the job. Using some great open source projects, I created a prototype that showed real potential.

What started as a personal project soon turned into something much bigger. I realized that this tool could also be used for financial research by summarizing informational videos and earnings calls, even when the videos are in different languages! Excited by this prospect, I shared my discovery with my colleagues during the week, and with the collective effort of my talented team, we quickly integrated the feature into our terminal.

The team at OpenBB is thrilled to introduce this feature to our users. It lets you automatically grab any YouTube video (e.g. an earnings call), transcribe and translate its audio, summarize the content, and analyze its sentiment.

The inspiration for this feature came from a desire to streamline the process of analyzing financial news and market trends. With so much information available online, it can be overwhelming to keep up with all the latest news and events. Our feature aims to simplify the process by extracting the key information from any given YouTube video and presenting it in a concise summary, along with an analysis of its sentiment. This feature is in beta, and we plan to make updates based on user feedback. We can't wait for you to try it out and see how it streamlines your financial analysis!

The general pipeline for our feature involves several language models, each of which performs a specific task. First, the OpenAI Whisper library transcribes the audio into text, translating it where needed. Then the Hugging Face Transformers library runs two models over the transcript: BART generates a summary, and DistilBERT performs sentiment analysis.

Let's take a closer look at each model and its task.

Translating and transcribing audio to text - OpenAI Whisper

OpenAI Whisper: This model is the most important component of the entire pipeline. It is a Transformer sequence-to-sequence model that is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.

(Source: GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision )
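
As a rough illustration of this step, here is a minimal sketch that uses the open-source `whisper` Python package. The `base` checkpoint and the helper names are assumptions made for the example, not details of the OpenBB implementation. Passing `task="translate"` asks Whisper to produce English text whatever language is spoken.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Render an offset in seconds as HH:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_lines(segments: List[Dict]) -> List[str]:
    """Turn Whisper segments into '[HH:MM:SS] text' lines."""
    return [
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in segments
    ]


def transcribe_to_english(audio_path: str, model_name: str = "base") -> Dict:
    """Transcribe an audio file, translating non-English speech to English."""
    # Imported lazily: loading the package and the model weights is slow.
    import whisper

    model = whisper.load_model(model_name)  # "base" is an assumed checkpoint
    return model.transcribe(audio_path, task="translate")
```

The dictionary returned by `transcribe_to_english` holds the full transcript under `"text"`, and its `"segments"` list can be passed to `segments_to_lines` for a timestamped view.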

Generating summaries - BART

BART: The BART (Bidirectional and Auto-Regressive Transformers) model, which we load through Hugging Face Transformers, is a language model known for its ability to generate high-quality summaries. BART pairs a bidirectional encoder that reads the whole input with an auto-regressive decoder that writes the output one token at a time, which helps it produce text that is coherent and relevant to the input. BART is particularly well-suited for summarizing news articles and other text-based content, making it a great choice for summarizing the transcribed audio from YouTube videos.

(Source: https://arxiv.org/pdf/1706.03762.pdf)
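
The summarization step could be sketched as follows with the Transformers `pipeline` API. The `facebook/bart-large-cnn` checkpoint, the 400-word chunk size, and the length limits are assumptions chosen for illustration, and the terminal may use different values. Because BART only accepts a limited number of input tokens, the sketch splits a long transcript into chunks, summarizes each one, and joins the results.

```python
from typing import List


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize(transcript: str, model_name: str = "facebook/bart-large-cnn") -> str:
    """Summarize each chunk with BART and join the partial summaries."""
    chunks = chunk_words(transcript)
    if not chunks:
        return ""

    # Imported lazily: the model is downloaded on first use.
    from transformers import pipeline

    summarizer = pipeline("summarization", model=model_name)
    # truncation=True guards against a chunk that still exceeds the model limit.
    results = summarizer(chunks, max_length=130, min_length=30, truncation=True)
    return " ".join(r["summary_text"] for r in results)
```

Splitting on word counts is a simple approximation. Splitting on sentence boundaries or on tokenizer counts would keep chunks cleaner, at the cost of a little more code.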

Sentiment analysis - DistilBERT

DistilBERT: The DistilBERT model, also loaded through Hugging Face Transformers, is a highly efficient variant of the popular BERT (Bidirectional Encoder Representations from Transformers) model. DistilBERT is designed to be lighter and faster than its predecessor: its authors report a model about 40% smaller and 60% faster that retains about 97% of BERT's language understanding performance. That makes it well-suited for sentiment analysis on short pieces of text. DistilBERT uses a pre-trained transformer architecture to extract features from text, which are then fed into a classifier to predict the sentiment of the text.

(Source: https://www.researchgate.net/figure/The-DistilBERT-model-architecture-and-components_fig2_358239462)
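
A minimal sketch of the sentiment step, again with the Transformers `pipeline` API, might look like this. The `distilbert-base-uncased-finetuned-sst-2-english` checkpoint and the averaging scheme are assumptions for the example. That checkpoint labels each passage `POSITIVE` or `NEGATIVE` with a confidence score, so the sketch folds the two into one signed number per passage and averages them.

```python
from typing import Dict, List


def signed_score(result: Dict) -> float:
    """Map a {'label', 'score'} result to [-1, 1]; NEGATIVE becomes negative."""
    sign = 1.0 if result["label"].upper() == "POSITIVE" else -1.0
    return sign * result["score"]


def overall_sentiment(results: List[Dict]) -> float:
    """Average the signed scores; 0.0 when there is nothing to score."""
    if not results:
        return 0.0
    return sum(signed_score(r) for r in results) / len(results)


def classify(
    passages: List[str],
    model_name: str = "distilbert-base-uncased-finetuned-sst-2-english",
) -> List[Dict]:
    """Run DistilBERT sentiment classification on short passages."""
    # Imported lazily: the model is downloaded on first use.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=model_name)
    # truncation=True keeps over-long passages within the model's input limit.
    return classifier(passages, truncation=True)
```

Feeding the output of `classify` into `overall_sentiment` gives a single value, where numbers near 1 suggest a mostly positive video and numbers near -1 a mostly negative one.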

Each of these models was chosen for its accuracy and efficiency at its specific task, as well as its ability to run on a wide range of hardware. Because our users run the terminal on very different machines, speed and hardware requirements were important considerations. This efficient combination allows the feature to be used by as many of OpenBB's users as possible.

Getting started

To get started, create a new conda environment and install all terminal dependencies.

conda env create -n whisper --file build/conda/conda-3-9-env-full.yaml
conda activate whisper
poetry install -E all

The feature also requires the command-line tool ffmpeg to be installed on your system. It is available from most package managers:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

Then head to the forecasting menu in the terminal and enjoy!

forecasting
whisper --video https://....
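
Before launching the command above, it can help to confirm that ffmpeg is reachable from the active environment, since Whisper calls the ffmpeg executable to decode audio. The small check below is only a sketch, and its helper names are hypothetical. It relies on the standard library's `shutil.which`, which searches the directories on `PATH`.

```python
import shutil
from typing import Optional


def find_tool(name: str) -> Optional[str]:
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


def ffmpeg_available() -> bool:
    """Report whether an ffmpeg executable can be found on PATH."""
    return find_tool("ffmpeg") is not None
```

If `ffmpeg_available()` returns `False`, install ffmpeg with one of the package managers listed earlier and reopen your shell so the updated `PATH` is picked up.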

Future plans

As newer and more advanced language models become available, the development team is exploring new models and techniques to improve the accuracy and efficiency of the feature. For example, the team plans to explore the use of ChatGPT in future revisions. ChatGPT is built on OpenAI's GPT (Generative Pre-trained Transformer) family of models and can generate human-like text in response to user prompts. We are also discussing adding embeddings for efficient search, which would let users find keywords and mentions within a video.

We hope that this new feature will be useful to our users in their financial analysis and decision-making. As always, we welcome any feedback or suggestions for how we can continue to improve our platform.
