The market capitalization of cryptocurrencies increased from $17 billion in 2017 to $2.25 trillion in 2021. That's over a 13,000% ROI in a short span of 5 years! Even with this growth, cryptocurrencies remain incredibly volatile, with their value being impacted by a multitude of factors: market trends, politics, technology… and Twitter. There have been instances where their prices were impacted on account of tweets by famous personalities.

As part of a data engineering and analytics course at the Harvard Extension School, our group worked on a project to create a cryptocurrency data lake for different data personas (including data engineers, ML practitioners and BI analysts) to analyze trends over time, particularly the impact of social media on the price volatility of a crypto asset, such as Bitcoin (BTC). We leveraged the Databricks Lakehouse Platform to ingest unstructured data from Twitter using the Tweepy library and traditional structured pricing data from Yahoo Finance to create a machine learning prediction model that analyzes the impact of investor sentiment on crypto asset valuation. The aggregated trends and actionable insights are presented on a Databricks SQL dashboard, allowing for easy consumption by relevant stakeholders. This blog walks through how we built this ML model in just a few weeks by leveraging Databricks and its collaborative notebooks.

We would like to thank the Databricks University Alliance program and the extended team for all the support.

One advantage of cryptocurrency for investors is that it is traded 24/7 and the market data is available round the clock. This makes it easier to analyze the correlation between the tweets and crypto prices. A high-level architecture of the data and ML pipeline is presented in Figure 1 below. The full orchestration workflow runs a sequence of Databricks notebooks that perform the following tasks:
Imports the raw data into the Cryptocurrency Delta Lake Bronze tables.
Cleans data and applies the Twitter sentiment machine learning model into Silver tables.
Aggregates the refined Twitter and Yahoo Finance data into an aggregated Gold Table.
Computes the correlation ML model between price and sentiment.
Runs updated SQL BI queries on the Gold Table.

The Lakehouse paradigm combines key capabilities of Data Lakes and Data Warehouses to enable all kinds of BI and AI use cases. As a team, we played specific roles to mimic different data personas, and this paradigm facilitated seamless handoffs between data engineering, machine learning, and business intelligence roles without requiring data to be moved across systems.

Data/ML pipeline

Ingestion using a Medallion Architecture

The use of the Lakehouse architecture enabled rapid acceleration of the pipeline creation to just one week. The two primary data sources were Twitter and Yahoo Finance. A lookup table was used to hold the crypto tickers and their Twitter hashtags to facilitate the subsequent search for associated tweets.

We used the yfinance Python library to download historical crypto exchange market data from Yahoo Finance's API in 15-minute intervals.
YAHOO CRYPTOCURRENCY TICKER DOWNLOAD
The raw data was stored in a Bronze table containing information such as ticker symbol, datetime, open, close, high, low and volume.
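As a rough illustration of the Yahoo Finance download step described above, the sketch below pulls 15-minute bars with yfinance and flattens them into records carrying the Bronze columns named in the text. It is not the project's actual notebook code: the function names `to_bronze_rows` and `download_bars`, the `BTC-USD` symbol, and the `5d` lookback period are assumptions made for illustration, and writing the records to a Delta table is left out.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, List, Mapping, Tuple

# Column names taken from the Bronze table description in the text.
BRONZE_COLUMNS = ("ticker", "datetime", "open", "close", "high", "low", "volume")


def to_bronze_rows(
    ticker: str, bars: Iterable[Tuple[datetime, Mapping[str, Any]]]
) -> List[Dict[str, Any]]:
    """Flatten (timestamp, OHLCV mapping) pairs into Bronze-style records.

    `bars` can be any iterable of (timestamp, row) pairs where each row
    exposes "Open", "High", "Low", "Close" and "Volume" keys, for example
    the output of pandas' DataFrame.iterrows() on a yfinance history frame.
    """
    rows = []
    for ts, bar in bars:
        rows.append(
            {
                "ticker": ticker,
                "datetime": ts,
                "open": float(bar["Open"]),
                "close": float(bar["Close"]),
                "high": float(bar["High"]),
                "low": float(bar["Low"]),
                "volume": int(bar["Volume"]),
            }
        )
    return rows


def download_bars(
    ticker: str, period: str = "5d", interval: str = "15m"
) -> List[Dict[str, Any]]:
    """Download recent bars for one ticker (hypothetical helper; needs network)."""
    import yfinance as yf  # third-party; imported lazily so the module loads without it

    history = yf.Ticker(ticker).history(period=period, interval=interval)
    return to_bronze_rows(ticker, history.iterrows())


# Example call (assumed symbol; requires yfinance and network access):
# rows = download_bars("BTC-USD")
```

Keeping the reshaping in `to_bronze_rows` separate from the network call makes the column mapping easy to check against hand-written bars before pointing it at live data.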