Kaggle.com is an excellent source of information for machine learning and artificial intelligence models. It hosts inference competitions from all sectors using multiple data types, and participants generally share their work. Thus, it is a good source of information for both beginners and veterans.
Optiver, an algorithmic market maker, has launched a competition to predict volatility from 10-minute chunks of market data. We like to participate in these competitions from time to time, as they offer an excellent opportunity to benchmark the latest prediction algorithms in an open environment. Unfortunately, we could not participate fully in the SETI Breakthrough Listen competition, even though we wanted to: operational, day-to-day requirements prevented us from working on it in earnest. This Realized Volatility Prediction competition is closely aligned with our work, so we can devote some time to it and get a feel for what is being done in the outside world.
The objective of the competition is to predict, with the lowest possible root mean square percentage error, the volatility of a traded instrument for the 10 minutes after a 10-minute history of trades and order book data is received. The data consists of several parquet files, with book and trade data stored in separate files for each stock. Each file contains multiple 10-minute data histories. A train CSV file holds the ground truth: the volatility after a given period for a given stock.
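The root mean square percentage error mentioned above is the square root of the mean of the squared relative errors. A minimal sketch follows; the function name rmspe and the sample values are ours, chosen for illustration only, and are not taken from the competition code or data.

```python
import numpy as np


def rmspe(y_true, y_pred):
    """Root mean square percentage error: sqrt(mean(((y_true - y_pred) / y_true) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)))


# Made-up volatility values, for illustration only.
y_true = [0.004, 0.002, 0.005]
y_pred = [0.004, 0.001, 0.005]
score = rmspe(y_true, y_pred)
print(score)
```

Because the error is relative to the true value, a miss of 0.001 on a volatility of 0.002 weighs far more than the same miss on a volatility of 0.02.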
We do not like the shape and order of this data; for a first prediction approach, we would like to have all the data in a single large matrix we can start trimming from. Assuming that we maintain the same data folder structure as in the Kaggle folder, we begin by importing the necessary tools and defining the fixed locations of the files we need:
import pandas as pd
import numpy as np
import os
import glob

ob_dir = '../input/optiver-realized-volatility-prediction/book_train.parquet/'
trade_dir = '../input/optiver-realized-volatility-prediction/trade_train.parquet/'
ob_files = glob.glob(ob_dir + '*')
trade_files = glob.glob(trade_dir + '*')
We are using 'ob' as shorthand for "order book". All the files in the order book and trade folders are now in ob_files and trade_files. We safely create a folder to hold our CSV files:
csv_dir = './optiver_csv'
# Create the folder if it does not exist; do nothing if it is already there.
os.makedirs(csv_dir, exist_ok=True)
If there is no such folder, it will get created; otherwise, we do nothing. Next, we iterate over all our files, finding at the same time both the order book and the trades files corresponding to a given stock:
for f in ob_files:
    # The stock identifier follows the '=' sign in the folder name.
    stock_id = f.split('/')[-1].split('=')[-1]
    book_file = ob_dir + 'stock_id=' + str(stock_id)
    trade_file = trade_dir + 'stock_id=' + str(stock_id)
    book_df = pd.read_parquet(book_file)
    trade_df = pd.read_parquet(trade_file)
The identifier of the stock can be split from the name of the file after the "=" sign. The pandas module has a built-in function to read a parquet file into a data frame; this is quite easy, as the parquet file contains all the necessary column headers. The resulting structures should be similar to these, first the trade data and then the order book data:
Both data sets capture similar events in time. The order book appears to be a continuous stream of order book information, with each "seconds_in_bucket" value having an associated price and order status. Trade data is only present when a trade occurs; hence, it skips many of the seconds in each bucket. In this case, a bucket is a 10-minute interval designated by the "time_id" column. There should be an order book data point for each second in the bucket belonging to each "time_id", but there will not necessarily be a trade for every "time_id" and "seconds_in_bucket" pair. We can join these two data sets on the time identifier and the second in the bucket, and fill in the missing values with stable values. We can reindex both data frames by the combination of time_id and seconds_in_bucket with this code:
# Build a 'time_id:seconds_in_bucket' key and use it as the index of both frames.
book_df['reindex'] = book_df['time_id'].astype(str) + ':' + book_df['seconds_in_bucket'].astype(str)
# time_id and seconds_in_bucket stay in book_df: the per-bucket fill later groups by time_id.
book_df.set_index('reindex', inplace=True)
trade_df['reindex'] = trade_df['time_id'].astype(str) + ':' + trade_df['seconds_in_bucket'].astype(str)
# Dropped from trade_df only, so the join sees no overlapping column names.
trade_df = trade_df.drop(columns=['time_id', 'seconds_in_bucket'])
trade_df.set_index('reindex', inplace=True)
With both data types indexed in the same manner, we can now join the trade data into the book data, assuming that we will always have more book data points than trades data points:
full_data = book_df.join(trade_df)
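To see what this join produces, here is a small sketch on synthetic data. The rows and values are made up, the book columns bid_price1 and ask_price1 are assumed from the competition data description, and the helper add_key is a hypothetical name of ours.

```python
import pandas as pd

# Synthetic rows for a single time bucket; values are made up for illustration.
book_df = pd.DataFrame({
    'time_id': [5, 5, 5, 5],
    'seconds_in_bucket': [0, 1, 2, 3],
    'bid_price1': [1.000, 1.001, 1.001, 1.002],  # assumed column name
    'ask_price1': [1.002, 1.003, 1.003, 1.004],  # assumed column name
})
trade_df = pd.DataFrame({
    'time_id': [5, 5],
    'seconds_in_bucket': [1, 3],
    'price': [1.002, 1.003],
    'size': [100, 50],
    'order_count': [2, 1],
})


def add_key(df):
    """Index a copy of the frame by a 'time_id:seconds_in_bucket' string key."""
    out = df.copy()
    out['reindex'] = out['time_id'].astype(str) + ':' + out['seconds_in_bucket'].astype(str)
    return out.set_index('reindex')


book_keyed = add_key(book_df)
# Drop the key columns on the trade side so the join has no overlapping columns.
trade_keyed = add_key(trade_df).drop(columns=['time_id', 'seconds_in_bucket'])

# DataFrame.join defaults to a left join on the index: every book row is kept.
full_data = book_keyed.join(trade_keyed)
print(full_data)
```

Seconds 0 and 2 have no trade in this toy bucket, so their price, size and order_count come back as NaN.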
The resulting data frame contains multiple missing value entries, and it looks like this:
The approach that appears most sensible is to replace the missing values for the size and the order counts with zero, indicating that no trade happened in that second. For the price, which should ideally be a continuous value, we can fill in the missing values with the most recent value; we will fill it forward first, then backward. Unfortunately, we have to perform this operation per time_id, as the time buckets are not sequential according to the data description:
full_data['size'] = full_data['size'].fillna(0)
full_data['order_count'] = full_data['order_count'].fillna(0)
# Fill prices within each time bucket only: forward first, then backward.
full_data['price'] = full_data.groupby('time_id', sort=False)['price'].transform(lambda x: x.ffill().bfill())
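The effect of filling per bucket can be checked on a toy frame. This is a sketch on made-up values, not the competition data; using transform to keep the original row alignment is our choice here.

```python
import numpy as np
import pandas as pd

# Synthetic joined data for two time buckets; values are made up for illustration.
full_data = pd.DataFrame({
    'time_id': [5, 5, 5, 11, 11, 11],
    'price': [np.nan, 1.002, np.nan, np.nan, np.nan, 0.998],
    'size': [np.nan, 100, np.nan, np.nan, np.nan, 30],
})

# No trade in that second: the traded size is zero.
full_data['size'] = full_data['size'].fillna(0)

# Fill prices inside each bucket only, so values never leak across time_id groups.
full_data['price'] = (
    full_data.groupby('time_id', sort=False)['price']
    .transform(lambda x: x.ffill().bfill())
)
print(full_data)
```

Without the grouping, the leading gap in bucket 11 would be forward filled with 1.002, the last price of bucket 5, even though the two buckets are unrelated in time.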
We are introducing a bit of a bias here: because we do not know the price at the beginning of the time bucket, we fill it with a value from the future. However, in a live environment, we would probably know the latest traded price and would thus have a "most recent" price to work with. Alternatively, we can drop all points with no initial price at the start of the time bucket, replacing the previous code block with:
# Alternatively, fill forward and drop.
full_data['size'] = full_data['size'].fillna(0)
full_data['order_count'] = full_data['order_count'].fillna(0)
full_data['price'] = full_data.groupby('time_id', sort=False)['price'].transform(lambda x: x.ffill())
# Rows before the first trade of each bucket still have no price: drop them.
full_data.dropna(inplace=True)
We drop 22,000 data entries for this trial stock with this method: 2% of the data. This is possibly the better method for joining the data in a single giant set. We can always revisit this step in the future by marking it as a critical data modification point. The final form of the joined and filled data is this:
As we can see, the first bucket no longer starts with the 0 seconds in bucket value, and the index shows the original position of the time-second pair before dropping any values. We can therefore save this data frame as a CSV file whether or not we reset the index; the choice probably does not matter, as the index has lost its meaning after this transformation.
full_data.reset_index(inplace=True)
# Save csv
file_name = csv_dir + '/' + stock_id + '.csv'
full_data.to_csv(file_name)
The created CSV is 100MB in size, a considerable increase over the original parquet files, which were 15MB and 1MB in size. With 127 separate stocks of similar data quantities, we are looking at 12GB of data instead of 2GB. If we save the complete data frame as a parquet file instead, it will be just 25MB in size, giving us a more manageable total of about 3GB:
parquet_dir = './optiver_parquet'
os.makedirs(parquet_dir, exist_ok=True)
# Write into parquet_dir so that the archive step picks the file up.
parquet_name = parquet_dir + '/' + stock_id + '.parquet'
full_data.to_parquet(parquet_name)
And finally, we store it into a compressed file for easy downloading:
import shutil
shutil.make_archive('optiver_parquet', 'zip', parquet_dir)
The next step is to create the most straightforward model that uses all this data. There are several quick inference examples on the competition page, using partial data from the whole set. We think it will be interesting to check how a learning model (deep or shallow) performs without much data or hyperparameter tweaking, and to gauge how predictable the volatility of these instruments is.
At the date of posting, the competition will have entered the private data testing phase. Leaderboard standings should be visible on this page.
Information in ostirion.net does not constitute financial advice; we do not hold positions in any of the companies or assets that we mention in our posts at the time of posting. Do not hesitate to contact us if you require quantitative model development, deployment, verification, or validation. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, trading, or risk evaluations.
The notebook for this post is located here, in Kaggle.