Bitcoin and cryptocurrency have been all the rage… but as data scientists, we’re empiricists, right? We don’t want to just take others’ word for it… we want to look at the data firsthand! In this tutorial, we’ll introduce common and powerful techniques for data wrangling in Python.
Broadly speaking, data wrangling is the process of reshaping, aggregating, separating, or otherwise transforming your data from one format to a more useful one.
Let's say we wanted to run a step-forward analysis of a very rudimentary momentum trading strategy that goes as follows:
- At the start of every month, we buy the cryptocurrency that had the largest price gain over the previous 7, 14, 21, or 28 days. We want to evaluate each of these time windows.
- Then, we hold for exactly 7 days and sell our position.
Please note: this is a purposefully simple strategy that is only meant for illustrative purposes.
How would we go about evaluating this strategy?
This is a great question for showcasing data wrangling techniques because all the hard work lies in molding your dataset into the proper format.
Once you have the appropriate analytical base table (ABT), answering the question becomes simple.
What this guide is not:
This is not a guide about investment or trading strategies, nor is it an endorsement for or against cryptocurrency.
Potential investors should form their own views independently, but this guide will introduce tools for doing so.
Again, the focus of this tutorial is on data wrangling techniques and the ability to transform raw datasets into formats that help you answer interesting questions.
A quick tip before we begin:
This tutorial is designed to be streamlined, and it won’t cover any one topic in too much detail.
It may be helpful to have the Pandas library documentation open beside you as a supplemental reference.
Python Data Wrangling Tutorial Contents
Here are the steps we’ll take for our analysis:
- Set up your environment.
- Import libraries and dataset.
- Understand the data.
- Filter unwanted observations.
- Pivot the dataset.
- Shift the pivoted dataset.
- Melt the shifted dataset.
- Reduce-merge the melted data.
- Aggregate with group-by.
Step 1: Set up your environment.
First, make sure you have the following installed on your computer:
- Python 2.7+ or Python 3
- Pandas
- Jupyter Notebook (optional, but recommended)
We strongly recommend installing the Anaconda Distribution, which comes with all of those packages.
Simply follow the instructions on that download page.
Once you have Anaconda installed, simply start Jupyter (either through the command line or the Navigator app) and open a new notebook:
Python 3 or Python 2.7+ are both fine.
Step 2: Import libraries and dataset.
Let's start by importing Pandas, the best Python library for wrangling relational (i.e. table-format) datasets.
Pandas will be doing most of the heavy lifting for this tutorial.
- Tip: we'll give Pandas an alias. Later, we can invoke the library with pd.
Next, let's tweak the display options a bit.
First, let's display floats with 2 decimal places to make tables less crowded. Don't worry... this is only a display setting that doesn't reduce the underlying precision. Let's also expand the limits for the number of rows and columns displayed.
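As a minimal sketch, the import and display settings described above might look like the following. The specific limits (200 rows, 100 columns) and the thousands separator are illustrative assumptions, not values confirmed by this tutorial.

```python
# Pandas for managing datasets (pd is the conventional alias)
import pandas as pd

# Display floats with 2 decimal places (the comma separator is an assumption)
pd.options.display.float_format = '{:,.2f}'.format

# Expand display limits (200 and 100 are arbitrary choices)
pd.options.display.max_rows = 200
pd.options.display.max_columns = 100

# The setting changes how values are printed, not what is stored
prices = pd.Series([12326.2349, 0.4791])
print(prices)
```

The stored values keep their full precision, so later calculations are unaffected by the rounding you see on screen.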
For this tutorial, we'll be using a price dataset managed by Brave New Coin and distributed on Quandl.
The full version tracks price indices for 1,900+ fiat-crypto trading pairs, but it requires a premium subscription, so we've provided a small sample with a handful of cryptocurrencies.
To follow along, you can download BNC2_sample.csv.
Clicking that link will take you to Google Drive, and then simply click the download icon in the top right:
Once you've downloaded the dataset and put it in the same directory as your Jupyter notebook, you can run the following code to read the dataset into a Pandas dataframe and display example observations.
Note that we use the names= argument for pd.read_csv() to set our own column names because the original dataset does not have any.
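Here is a hedged sketch of reading a headerless CSV with names=. To stay self-contained it reads two made-up rows from memory instead of BNC2_sample.csv, and the column order is an assumption that should be checked against the real file.

```python
import io
import pandas as pd

# Made-up, headerless rows standing in for BNC2_sample.csv
raw_csv = io.StringIO(
    "GWA_BTC,2018-01-01,100.0,110.0,90.0,105.0,1000.0,102.0,101.0\n"
    "GWA_ETH,2018-01-01,10.0,11.0,9.0,10.5,5000.0,10.2,10.1\n"
)

# Assumed column order; verify it against the file before relying on it
columns = ['Code', 'Date', 'Open', 'High', 'Low',
           'Close', 'Volume', 'VWAP', 'TWAP']

# names= supplies the header, so the first row is kept as data
df = pd.read_csv(raw_csv, names=columns)

# Display the first observations
print(df.head())
```

With the downloaded file in place, the same call would take the filename, as in pd.read_csv('BNC2_sample.csv', names=columns).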
Data Dictionary (for code GWA_BTC):
- Date: The day on which the index values were calculated.
- Open: The day's opening price index for Bitcoin in US dollars.
- High: The highest value for the price index for Bitcoin in US dollars that day.
- Low: The lowest value for the price index for Bitcoin in US dollars that day.
- Close: The day's closing price index for Bitcoin in US dollars.
- Volume: The volume of Bitcoin traded that day.
- VWAP: The volume-weighted average price of Bitcoin traded that day.
- TWAP: The time-weighted average price of Bitcoin traded that day.
Step 3: Understand the data.
One of the most common reasons to wrangle data is when there's "too much" information packed into a single table, especially when dealing with time series data.
Generally, all observations should be equivalent in granularity and in units.
There will be exceptions, but for the most part, this rule of thumb can save you from many headaches.
- Equivalence in Granularity - For example, you could have 10 rows of data from 10 different cryptocurrencies.
However, you should not have an 11th row with average or total values from the other 10 rows.
That 11th row would be an aggregation, and thus not equivalent in granularity to the other 10.
- Equivalence in Units - You could have 10 rows with prices in USD collected at different dates. However, you should not then have another 10 rows with prices quoted in EUR.
Any aggregations, distributions, visualizations, or statistics would become meaningless.
Our current raw dataset breaks both of these rules!
Data stored in CSV files or databases are often in “stacked” or “record” format. They use a single 'Code' column as a catch-all for metadata. For example, in the sample dataset, we have the following codes:
First, see how some codes start with GWA and others with MWA?
These are actually completely different types of indicators according to the documentation page.
- MWA stands for "market-weighted average," and they show regional prices. There are multiple MWA codes for each cryptocurrency, one for each local fiat currency.
- On the other hand, GWA stands for "global-weighted average," which shows globally indexed prices. GWA is thus an aggregation of MWA and not equivalent in granularity.
(Note: only a subset of regional MWA codes are included in the sample dataset.)
For instance, let's look at Bitcoin's codes on the same date:
As you can see, we have multiple entries for a cryptocurrency on a given date.
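The two inspections above (listing the codes, then looking at one coin on one date) can be sketched as follows. The dataframe is a made-up stand-in: only the style of the Code values mirrors the sample dataset, and the prices are placeholders.

```python
import pandas as pd

# Made-up rows in "stacked" format
df = pd.DataFrame({
    'Code': ['GWA_BTC', 'MWA_BTC_EUR', 'MWA_BTC_JPY', 'GWA_ETH'],
    'Date': ['2018-01-01'] * 4,
    'VWAP': [100.0, 85.0, 11000.0, 10.0],
})

# Unique codes in the dataset
codes = df['Code'].unique()
print(codes)

# All Bitcoin entries recorded on the same date
btc_codes = ['GWA_BTC', 'MWA_BTC_EUR', 'MWA_BTC_JPY']
same_day = df[df['Code'].isin(btc_codes) & (df['Date'] == '2018-01-01')]
print(same_day)
```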
To further complicate things, the regional MWA data are denominated in their local currency (i.e. nonequivalent units), so you would also need historical exchange rates.
Having different levels of granularity and/or different units makes analysis unwieldy at best, or downright impossible at worst.
Luckily, once we've spotted this issue, fixing it is actually trivial!
Step 4: Filter unwanted observations.
One of the simplest yet most useful data wrangling techniques is removing unwanted observations.
In the previous step, we learned that GWA codes are aggregations of the regional MWA codes.
Therefore, to perform our analysis, we only need to keep the global GWA codes:
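A minimal sketch of that filter, run on a made-up stacked dataframe. The name gwa_codes and the prefix test are assumptions about how one might write it.

```python
import pandas as pd

# Made-up stacked data mixing global (GWA) and regional (MWA) codes
df = pd.DataFrame({
    'Code': ['GWA_BTC', 'MWA_BTC_EUR', 'MWA_BTC_JPY', 'GWA_ETH', 'MWA_ETH_USD'],
    'Date': ['2018-01-01'] * 5,
    'VWAP': [100.0, 85.0, 11000.0, 10.0, 10.0],
})
print('Before:', len(df))

# Get all the GWA codes
gwa_codes = [code for code in df['Code'].unique() if code.startswith('GWA_')]

# Only keep GWA observations
df = df[df['Code'].isin(gwa_codes)]
print('After:', len(df))
```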
Now that we only have GWA codes left, all of our observations are equivalent in granularity and in units.
We can confidently proceed.
Step 5: Pivot the dataset.
Next, in order to analyze our momentum trading strategy outlined above, for each cryptocurrency, we'll need to calculate returns over the prior 7, 14, 21, and 28 days... for the first day of each month.
However, it would be a huge pain to do so with the current "stacked" dataset.
It would involve writing helper functions, loops, and plenty of conditional logic. Instead, we'll take a more elegant approach....
First, we'll pivot the dataset while keeping only one price column.
For this tutorial, let's keep the VWAP (volume weighted average price) column, but you could make a good case for most of them.
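The pivot itself could be sketched like this on made-up data. The name pivoted_df is an assumption used for illustration.

```python
import pandas as pd

# Made-up stacked data: two coins on two dates
df = pd.DataFrame({
    'Code': ['GWA_BTC', 'GWA_ETH', 'GWA_BTC', 'GWA_ETH'],
    'Date': ['2018-01-01', '2018-01-01', '2018-01-02', '2018-01-02'],
    'VWAP': [100.0, 10.0, 110.0, 9.0],
    'Volume': [1.0, 2.0, 3.0, 4.0],
})

# One row per date, one column per code, VWAP as the cell value
pivoted_df = df.pivot(index='Date', columns='Code', values='VWAP')
print(pivoted_df)
```

Columns that are not named in values= (here, Volume) are simply left out of the pivoted result.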
As you can see, each column in our pivoted dataset now represents the price for one cryptocurrency and each row contains prices from one date.
All the features are now aligned by date.
Step 6: Shift the pivoted dataset.
To easily calculate returns over the prior 7, 14, 21, and 28 days, we can use Pandas's shift method.
This method shifts the values of the dataframe by some number of periods while the index labels stay in place.
For example, here's what happens when we shift our pivoted dataset by 1:
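A small sketch of a one-period shift on made-up prices:

```python
import pandas as pd

# Made-up pivoted prices: one row per date, one column per coin
pivoted_df = pd.DataFrame(
    {'GWA_BTC': [100.0, 110.0, 121.0], 'GWA_ETH': [10.0, 9.0, 12.0]},
    index=pd.Index(['2018-01-01', '2018-01-02', '2018-01-03'], name='Date'),
)

# shift(1) moves every value down one row; the first row becomes NaN
shifted = pivoted_df.shift(1)
print(shifted)
```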
Notice how the shifted dataset now has values from 1 day before? We can take advantage of this to calculate prior returns for our 7, 14, 21, and 28-day windows.
For example, to calculate returns over the 7 days prior, we would need prices_today / prices_7_days_ago - 1.0, which translates to:
Calculating returns for all of our windows is as easy as writing a loop and storing them in a dictionary:
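A sketch of both calculations, assuming a pivoted dataframe named pivoted_df. The prices are made up, and the loop variable offset is an illustrative name.

```python
import pandas as pd

# Made-up daily prices for 29 consecutive days
dates = pd.date_range('2018-01-01', periods=29, freq='D')
pivoted_df = pd.DataFrame(
    {'GWA_BTC': [100.0 + day for day in range(29)],
     'GWA_ETH': [50.0] * 29},
    index=pd.Index(dates.strftime('%Y-%m-%d'), name='Date'),
)

# Returns over the 7 days prior: prices_today / prices_7_days_ago - 1.0
delta_7 = pivoted_df / pivoted_df.shift(7) - 1.0

# Returns over each window, stored in a dictionary keyed by feature name
delta_dict = {}
for offset in [7, 14, 21, 28]:
    delta_dict['delta_{}'.format(offset)] = pivoted_df / pivoted_df.shift(offset) - 1.0
```

The first rows of each returns dataframe are NaN because there is no price far enough back to compare against.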
Note: Calculating returns by shifting the dataset requires 2 assumptions to be met: (1) the observations are sorted ascending by date and (2) there are no missing dates. We checked this "off-stage" to keep this tutorial concise, but we recommend confirming this on your own.
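One possible way to confirm both assumptions is sketched below. This is not necessarily the check the authors ran, and the toy index deliberately skips a day so the second check has something to find.

```python
import pandas as pd

# Made-up pivoted prices with 2018-01-03 missing on purpose
pivoted_df = pd.DataFrame(
    {'GWA_BTC': [100.0, 110.0, 121.0]},
    index=pd.Index(['2018-01-01', '2018-01-02', '2018-01-04'], name='Date'),
)

# Parse the Date labels so they can be compared as calendar dates
dates = pd.to_datetime(pivoted_df.index)

# (1) Observations are sorted ascending by date
is_sorted = dates.is_monotonic_increasing

# (2) No missing dates: compare against a complete daily calendar
expected = pd.date_range(dates.min(), dates.max(), freq='D')
missing = expected.difference(dates)
print(is_sorted, list(missing))
```

An empty missing index together with is_sorted being true would indicate that shifting by N rows really does mean shifting by N days.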
Step 7: Melt the shifted dataset.
Now that we've calculated returns using the pivoted dataset, we're going to "unpivot" the returns.
By unpivoting, or melting the data, we can later create an analytical base table (ABT) where each row contains all of the relevant information for a particular coin on a particular date.
We couldn't directly shift the original dataset because the data for different coins were stacked on each other, so the boundaries would've overlapped.
In other words, BTC data would leak into ETH calculations, ETH data would leak into LTC calculations, and so on.
To melt the data, we'll...
- reset_index() so we can call the columns by name.
- Call the melt() method.
- Pass the column(s) to keep into the id_vars= argument.
- Name the melted column using the value_name= argument.
Here's how that looks for one dataframe:
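The steps listed above can be sketched on a made-up returns dataframe. Passing var_name='Code' is an addition for clarity: melt falls back to the columns' name, which a pivot on 'Code' would already have set.

```python
import pandas as pd

# Made-up 7-day returns in pivoted form
delta_7 = pd.DataFrame(
    {'GWA_BTC': [0.10, 0.20], 'GWA_ETH': [-0.05, 0.30]},
    index=pd.Index(['2018-01-08', '2018-01-09'], name='Date'),
)

# Melt delta_7 returns: one row per (Date, Code) pair
melted_7 = delta_7.reset_index().melt(
    id_vars=['Date'], var_name='Code', value_name='delta_7'
)
print(melted_7)
```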
To do so for all of the returns dataframes, we can simply loop through delta_dict, like so:
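A sketch of that loop, using a made-up delta_dict with only two windows to keep it short. The loop variable names are assumptions.

```python
import pandas as pd

# Made-up returns dataframes keyed by feature name
index = pd.Index(['2018-01-29', '2018-01-30'], name='Date')
delta_dict = {
    'delta_7': pd.DataFrame({'GWA_BTC': [0.10, 0.20]}, index=index),
    'delta_14': pd.DataFrame({'GWA_BTC': [0.30, 0.40]}, index=index),
}

# Melt all the delta dataframes and store them in a list
melted_dfs = []
for key, delta_df in delta_dict.items():
    melted_dfs.append(
        delta_df.reset_index().melt(id_vars=['Date'], var_name='Code', value_name=key)
    )
```

Using the dictionary key as value_name means each melted dataframe carries its own feature name into the later merge.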
Finally, we can create another melted dataframe that contains the forward-looking 7-day returns.
This will be our "target variable" for evaluating our trading strategy.
Simply shift the pivoted dataset by -7 to get "future" prices, like so:
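A hedged sketch of the forward-looking calculation on made-up prices. The names return_df and melted_return are assumptions.

```python
import pandas as pd

# Made-up daily prices for 10 consecutive days
dates = pd.date_range('2018-01-01', periods=10)
pivoted_df = pd.DataFrame(
    {'GWA_BTC': [100.0 + day for day in range(10)]},
    index=pd.Index(dates.strftime('%Y-%m-%d'), name='Date'),
)

# shift(-7) pulls each price up from 7 rows later, i.e. the "future" price
return_df = pivoted_df.shift(-7) / pivoted_df - 1.0

# Melt the forward returns; in the tutorial this is appended to melted_dfs
melted_return = return_df.reset_index().melt(
    id_vars=['Date'], var_name='Code', value_name='return_7'
)
```

The last 7 rows are NaN because no price exists 7 days after them.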
We now have 5 melted dataframes stored in the melted_dfs list, one for each of the backward-looking 7, 14, 21, and 28-day returns and one for the forward-looking 7-day returns.
Step 8: Reduce-merge the melted data.
All that's left to do is join our melted dataframes into a single analytical base table.
We'll need two tools.
The first is Pandas's merge function, which works like SQL JOIN. For example, to merge the first two melted dataframes...
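A minimal sketch of merging two melted dataframes on their shared keys. The values are made up, and the second dataframe is deliberately in a different row order to show that the join matches on keys, not on position.

```python
import pandas as pd

melted_7 = pd.DataFrame({
    'Date': ['2018-01-29', '2018-01-29'],
    'Code': ['GWA_BTC', 'GWA_ETH'],
    'delta_7': [0.10, -0.05],
})
melted_14 = pd.DataFrame({
    'Date': ['2018-01-29', '2018-01-29'],
    'Code': ['GWA_ETH', 'GWA_BTC'],
    'delta_14': [0.30, 0.20],
})

# Join on both keys, similar to a SQL inner join on Date and Code
merged = pd.merge(melted_7, melted_14, on=['Date', 'Code'])
print(merged)
```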
See how we now have delta_7 and delta_14 in the same row?
This is the start of our analytical base table. All we need to do now is merge all of our melted dataframes together with a base dataframe of other features we might want.
The most elegant way to do this is using Python's reduce function, which is a built-in in Python 2 and lives in the functools module in Python 3. First we'll need to import it:
Next, before we use that function, let's create a feature_dfs list that contains base features from the original dataset plus the melted datasets.
Now we're ready to use the reduce function.
Reduce applies a function of two arguments cumulatively to the objects in a sequence (e.g. a list). For example, reduce(lambda x, y: x + y, [1, 2, 3, 4, 5]) calculates ((((1+2)+3)+4)+5).
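To see the left-to-right grouping concretely, here is a short sketch. The second call builds a string instead of a sum, purely to make the nesting visible.

```python
# Importing from functools works in both Python 3 and Python 2.6+
from functools import reduce

# Sum the list cumulatively, left to right
total = reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])

# Same reduction, but recording how the pairs are grouped
grouping = reduce(lambda x, y: '({}+{})'.format(x, y), [1, 2, 3, 4, 5])
print(total, grouping)
```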
Thus, we can reduce-merge all of the features like so:
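A sketch of the reduce-merge on made-up frames. The names base_df and abt, and the choice of base columns, are assumptions for illustration.

```python
from functools import reduce
import pandas as pd

# Made-up base features and melted feature frames sharing Date and Code
keys = {'Date': ['2018-01-29', '2018-01-29'], 'Code': ['GWA_BTC', 'GWA_ETH']}
base_df = pd.DataFrame(dict(keys, VWAP=[100.0, 10.0]))
melted_dfs = [
    pd.DataFrame(dict(keys, delta_7=[0.10, -0.05])),
    pd.DataFrame(dict(keys, delta_14=[0.20, 0.30])),
    pd.DataFrame(dict(keys, return_7=[0.07, 0.01])),
]

# Create a list with all the feature dataframes
feature_dfs = [base_df] + melted_dfs

# Reduce-merge features into the analytical base table
abt = reduce(
    lambda left, right: pd.merge(left, right, on=['Date', 'Code']),
    feature_dfs,
)
print(abt)
```

Each step merges the running result with the next dataframe, so the list can grow to any number of features without changing the code.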
Data Dictionary for our Analytical Base Table (ABT):
- Date: The day on which the index values were calculated.
- Code: Which cryptocurrency.
- VWAP: The volume weighted average price traded that day.
- delta_7: Return over the prior 7 days (1.0 = 100% return).
- delta_14: Return over the prior 14 days (1.0 = 100% return).
- delta_21: Return over the prior 21 days (1.0 = 100% return).
- delta_28: Return over the prior 28 days (1.0 = 100% return).
- return_7: Future return over the next 7 days (1.0 = 100% return).
By the way, notice how the last 7 observations don't have values for the 'return_7' feature?
This is expected, as we cannot calculate "future 7-day returns" for the last 7 days of the dataset.
Technically, with this ABT, we can already answer our original objective. For example, if we wanted to pick the coin that had the biggest momentum on September 1st, 2017, we could simply display the rows for that date and look at the 7, 14, 21, and 28-day prior returns:
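A sketch of that lookup on a made-up ABT with only two of the return windows. The idxmax step is an addition that reads the winner off the table programmatically; the returns are placeholders, not real results.

```python
import pandas as pd

# Made-up slice of an ABT
abt = pd.DataFrame({
    'Date': ['2017-08-31', '2017-09-01', '2017-09-01'],
    'Code': ['GWA_BTC', 'GWA_BTC', 'GWA_ETH'],
    'delta_7': [0.01, 0.05, 0.12],
    'delta_28': [0.40, 0.60, 0.30],
})

# Data from Sept 1st, 2017
sept_1 = abt[abt['Date'] == '2017-09-01']
print(sept_1)

# Coin with the largest prior return for each window
winners = sept_1.set_index('Code')[['delta_7', 'delta_28']].idxmax()
print(winners)
```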
# Pandas for managing datasets
# Display floats with 2 decimal places
# Expand display limits
# Read BNC2 sample dataset
# Display first 5 observations
# Unique codes in the dataset
# ['GWA_BTC' 'GWA_ETH' 'GWA_LTC' 'GWA_XLM' 'GWA_XRP' 'MWA_BTC_CNY'
# 'MWA_BTC_EUR' 'MWA_BTC_GBP' 'MWA_BTC_JPY' 'MWA_BTC_USD' 'MWA_ETH_CNY'
# 'MWA_ETH_EUR' 'MWA_ETH_GBP' 'MWA_ETH_JPY' 'MWA_ETH_USD' 'MWA_LTC_CNY'
# 'MWA_LTC_EUR' 'MWA_LTC_GBP' 'MWA_LTC_JPY' 'MWA_LTC_USD' 'MWA_XLM_CNY'
# 'MWA_XLM_EUR' 'MWA_XLM_USD' 'MWA_XRP_CNY' 'MWA_XRP_EUR' 'MWA_XRP_GBP'
# 'MWA_XRP_JPY' 'MWA_XRP_USD']
# Example of GWA and MWA relationship
# Number of observations in dataset
# Before: 31761
# Get all the GWA codes
# Only keep GWA observations
# Number of observations left
# After: 6309
# Pivot dataset
# Display examples from pivoted dataset
# Code         GWA_BTC   GWA_ETH  GWA_LTC  GWA_XLM  GWA_XRP
# 2018-01-21 12,326.23  1,108.90   197.36     0.48     1.55
# 2018-01-22 11,397.52  1,038.21   184.92     0.47     1.43
# 2018-01-23 10,921.00    992.05   176.95     0.47     1.42
# Same rows after shifting by 1
# Code         GWA_BTC   GWA_ETH  GWA_LTC  GWA_XLM  GWA_XRP
# 2018-01-21       nan       nan      nan      nan      nan
# 2018-01-22 12,326.23  1,108.90   197.36     0.48     1.55
# 2018-01-23 11,397.52  1,038.21   184.92     0.47     1.43
# Calculate returns over 7 days prior
# Display examples
# Calculate returns over each window and store them in dictionary
# Melt delta_7 returns
# Melted dataframe examples
# Melt all the delta dataframes and store in list
# Calculate 7-day returns after the date
# Melt the return dataset and append to list
# Merge two dataframes
# Grab features from original dataset
# Create a list with all the feature dataframes
# Reduce-merge features into analytical base table
# Display examples from the ABT
# Data from Sept 1st, 2017