In this second part of our search for a model that can identify the largest moving stocks we are going to extract and analyze the data that will act as our predictor.

Before throwing ourselves at the full market data, we are going to build a data prototype from a small subset of it, limited both in the number of companies (width) and in the amount of past data (depth). We take 10 randomly selected stocks from the long pseudo-SP500 list shown __in the previous post__ in this series:

```
from datetime import datetime

self = QuantBook()  # QuantConnect research environment object
start = datetime(2010, 1, 1)
end = datetime(2013, 1, 1)
# Random choices from the SP500 tickers
tickers = ['EQIX', 'AEP', 'BMY', 'HRB', 'MTB', 'BKNG', 'IP', 'BXP', 'INTC', 'TWTR']
symbols = {}
for ticker in tickers:
    symbols[ticker] = self.AddEquity(ticker).Symbol
```

The stocks are selected randomly, so __Twitter__ (TWTR) ends up on the list. This company went public at the end of 2013, after the 2010 to 2013 window requested in the code above, so it will contribute no usable data. No problem: we keep our random sample and use the stocks that yield sufficient data for this period. Unusable companies will be dropped automatically once we start cleaning up the data.

The following code replicates the target acquisition procedure from the previous post in condensed form; all explanations are __in the previous post__:

```
import numpy as np
import pandas as pd

daily_history = self.History(list(symbols.values()), start, end, Resolution.Daily).round(2)
# Daily close-to-close return, computed per symbol
daily_history['1D_Return'] = daily_history.groupby('symbol')['close'].pct_change(1)
daily_returns = daily_history[['1D_Return']].dropna()
# Pivot to one row per date and one column per symbol
daily_returns = daily_returns.unstack(0)
daily_returns.columns = daily_returns.columns.droplevel()
# Symbol with the largest and the smallest return on each day
tops = daily_returns.idxmax(axis=1)
flops = daily_returns.idxmin(axis=1)
# Map the full symbol ID back to its ticker
inv_map = {str(v.ID): k for k, v in symbols.items()}
symbol_idxs = set(daily_history.index.get_level_values(0))
targets = pd.DataFrame(0, index=tops.index, columns=list(symbol_idxs))
# 1 marks the daily top (if it rose), -1 the daily flop (if it fell)
for date in targets.index:
    if daily_returns.loc[date, tops[date]] > 0:
        targets.loc[date, tops[date]] = 1
    if daily_returns.loc[date, flops[date]] < 0:
        targets.loc[date, flops[date]] = -1
targets.rename(columns=inv_map, inplace=True)
flops = targets.replace(1, 0).replace(-1, 1)
tops = targets.replace(-1, 0)
```
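To make the target encoding concrete, here is a minimal sketch on made-up returns for three hypothetical tickers (AAA, BBB, CCC). The numbers are illustrative only and the variable names are not part of the original notebook:

```python
import pandas as pd

# Hypothetical daily returns: rows are dates, columns are tickers.
toy_returns = pd.DataFrame(
    {"AAA": [0.02, -0.01, -0.03],
     "BBB": [-0.01, 0.04, -0.01],
     "CCC": [0.00, -0.02, -0.02]},
    index=pd.to_datetime(["2012-01-03", "2012-01-04", "2012-01-05"]),
)

toy_tops = toy_returns.idxmax(axis=1)    # best performer per day
toy_flops = toy_returns.idxmin(axis=1)   # worst performer per day

# Start from zeros, then mark the top with 1 and the flop with -1,
# but only when the move is actually positive (top) or negative (flop).
toy_targets = pd.DataFrame(0, index=toy_returns.index, columns=toy_returns.columns)
for date in toy_targets.index:
    if toy_returns.loc[date, toy_tops[date]] > 0:
        toy_targets.loc[date, toy_tops[date]] = 1
    if toy_returns.loc[date, toy_flops[date]] < 0:
        toy_targets.loc[date, toy_flops[date]] = -1

print(toy_targets)
```

On the third toy day every return is negative, so no stock is marked as a top and only the flop receives a -1.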

As predictor data we need a set at least one time scale below the target scale: since our targets are daily tops and flops, we will use the previous day's hourly or minute data in an effort to find the elusive price action structures that generate large, market-shaking movements. This requires a new history request at a different resolution. To make the data easier to handle later, it is convenient to create separate columns for date and time, as we will not need the complete datetime object (at least initially). Using just dates simplifies index calls on this dataframe. The date component of the complete index can be extracted with the "itemgetter" function from the operator library, imported here as "at":

```
from operator import itemgetter as at
history = self.History(list(symbols.values()), start, end, Resolution.Hour).round(2)
history['index_values'] = history.index
raw_data = history
raw_data['date'] = history['index_values'].apply(at(1)).dt.date
raw_data['hour'] = history['index_values'].apply(at(1)).dt.time
```
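The index handling can be tried outside the research environment. The sketch below assumes the history index is a two-level (symbol, timestamp) MultiIndex and uses a hypothetical ticker and made-up prices. It applies the same "at" helper directly to the index entries instead of going through a helper column:

```python
import datetime as dt
from operator import itemgetter as at

import pandas as pd

# Hypothetical two-level index (symbol, timestamp), mimicking the history layout.
idx = pd.MultiIndex.from_tuples(
    [("AAA", pd.Timestamp("2012-12-31 10:00")),
     ("AAA", pd.Timestamp("2012-12-31 11:00"))],
    names=["symbol", "time"],
)
toy = pd.DataFrame({"close": [20.1, 20.3]}, index=idx)

# Each index entry is a (symbol, timestamp) tuple; at(1) picks the timestamp.
pick_time = at(1)
stamps = pd.DatetimeIndex([pick_time(entry) for entry in toy.index])
toy["date"] = stamps.date   # calendar date of each bar
toy["hour"] = stamps.time   # time stamp of each bar

# The same dates can be read straight from the index level.
same_dates = toy.index.get_level_values("time").date
print(toy)
```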

We obtain a dataframe that is somewhat difficult to read, showing the symbol, time, price and volume values. The dataframe is at hourly resolution (Resolution.Hour), the next native resolution step below daily. This normally yields 7 data points per trading session, stamped from 10:00 to 16:00. Short trading days before holidays have just 4 data points and will have to be dropped when we process the samples.

This dataframe can be reduced to a truly "raw" dataframe that contains only the values we are interested in. Anything that will form part of our analysis has to be in the 'features' list:

```
raw_data.reset_index(inplace = True)
raw_data.set_index(['symbol','date', 'hour'], inplace=True)
features = ['close', 'volume']
raw_data = raw_data[features]
raw_data.reset_index(inplace=True)
```

This generates a simpler dataframe with the information we need:

The dataframe is kept as a source that can easily be modified for different symbols and features. From it we build the actual samples, the data arrays used in our machine learning models. Whatever model we decide to use, this will be the most convenient format to handle:

```
samples = raw_data.set_index(['symbol', 'date', 'hour'])
samples = pd.DataFrame(samples.groupby(['symbol', 'date']).apply(lambda x: np.stack(x.values)))
samples.columns = ['Samples']
```

These are now the samples we have:

Each sample can be accessed by its symbol and date index pair. Looking up the last entry in the raw data dataframe returns the following numpy array:

`samples.loc[('INTC R735QTJ8XC9X','2012-12-31')]['Samples']`

The dimensions of this array are 7x2: 7 hourly data points (a normal trading day) and two time series, price and volume. The price and volume values for the last available day for __Intel Corporation__ (INTC) match the raw data values.
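The grouping step can be sketched on toy data. The example below is a dictionary-based variation of the same idea rather than the exact groupby-apply call used above. It uses hypothetical tickers and a shortened 3-hour session, so each sample is 3x2 instead of 7x2:

```python
import pandas as pd

hours = ["10:00", "11:00", "12:00"]  # shortened toy session
toy_raw = pd.DataFrame({
    "symbol": ["AAA"] * 3 + ["BBB"] * 3,
    "date": ["2012-12-31"] * 6,
    "hour": hours * 2,
    "close": [20.1, 20.3, 20.2, 55.0, 54.8, 55.4],
    "volume": [1000.0, 800.0, 1200.0, 300.0, 350.0, 500.0],
})

indexed = toy_raw.set_index(["symbol", "date", "hour"])

# One 2D array per (symbol, date) pair: rows are hours, columns are features.
toy_samples = {
    key: group.to_numpy()
    for key, group in indexed.groupby(level=["symbol", "date"])
}

print(toy_samples[("AAA", "2012-12-31")])
```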

We can apply transformations to this data. In this case we use __sklearn preprocessing tools__ to scale and normalize each sample independently, along its columns. A list of lambda functions applies the transformations in sequence, so modifying the "scalers" list changes the data transformations, and "use_n" slices the list to apply only the first n of them. The precision we obtain from the scalers may also be unnecessarily high, so we round all the dataframe values to 4 decimal places using "applymap":

```
from sklearn.preprocessing import normalize, scale, MinMaxScaler
minmaxscaler = MinMaxScaler()
# Candidate transformations, each applied column-wise to one sample
scalers = [lambda x: minmaxscaler.fit_transform(x),
           lambda x: scale(x, axis=0),
           lambda x: normalize(x, axis=0)]
scaled_samples = samples
use_n = 1  # apply only the first use_n transformations
for lambda_f in scalers[0:use_n]:
    scaled_samples = pd.DataFrame(scaled_samples.Samples.apply(lambda_f))
# Round every array in the dataframe to 4 decimal places
scaled_samples = scaled_samples.applymap(lambda x: np.around(x, 4))
```
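With use_n set to 1, only the min-max step is applied. What that step does to a single sample can be sketched in plain NumPy. The helper below, "minmax_columns", is a hypothetical hand-rolled stand-in for the default behaviour of MinMaxScaler (each column rescaled to the 0 to 1 range), not the sklearn implementation itself:

```python
import numpy as np

def minmax_columns(sample):
    """Scale each column of a 2D array to the [0, 1] range independently."""
    sample = np.asarray(sample, dtype=float)
    col_min = sample.min(axis=0)
    col_range = sample.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant column: avoid division by zero
    return (sample - col_min) / col_range

# Toy sample: 3 hourly bars, columns are (price, volume).
toy_sample = np.array([[20.0, 1000.0],
                       [21.0, 3000.0],
                       [20.5, 2000.0]])
scaled = np.around(minmax_columns(toy_sample), 4)
print(scaled)
```

Because every sample is scaled on its own, price levels and volume levels from different days or companies stop being comparable. Only the shape of the movement within the day survives.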

The shape of this dataframe is not very convenient, as we need to match the features in the "samples" dataframe with the "targets" dataframe. The "samples" dataframe can be pivoted so that it is indexed by date:

`samples = scaled_samples.reset_index().pivot(columns = 'symbol', index='date', values='Samples')`
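As a small illustration of what the pivot does, the sketch below reshapes a hypothetical long-form table with string placeholders standing in for the sample arrays:

```python
import pandas as pd

# Long form: one row per (symbol, date); the strings stand in for sample arrays.
long_form = pd.DataFrame({
    "symbol": ["AAA", "AAA", "BBB", "BBB"],
    "date": ["2012-12-28", "2012-12-31"] * 2,
    "Samples": ["a1", "a2", "b1", "b2"],
})

# Wide form: one row per date, one column per symbol.
wide = long_form.pivot(index="date", columns="symbol", values="Samples")
print(wide)
```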

Inspecting the first three columns of the pivoted dataframe shows us its final shape:

These samples are missing target values. We have to join them and then remove the unusual samples from short trading days, which have fewer than 7 hourly data points. Checking the length of the data for every day of the first company should be sufficient at this point. If this check proves problematic, we can always use the trading calendar and drop all holiday-eve days directly at the history request. Targets have to be shifted up by one day, so that the target tops and flops of the next day are aligned with the current day:

```
samples = pd.concat([samples, targets], axis=1).dropna()
samples['lens'] = samples[samples.columns[0]].apply(lambda x: len(x))
samples=samples[samples['lens'] == 7]
samples.drop(columns = ['lens'], inplace=True)
samples[targets.columns] = samples[targets.columns].shift(-1)
samples.dropna(inplace = True)
```

It is always good to spot check that the features match the next day's targets and not the current day's, as predicting the current day's tops and flops with the full daily data already in our hands has no predictive value.
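One way to run such a spot check is sketched below on a toy dataframe with hypothetical column names. The check compares each row's shifted target against the original target of the following row:

```python
import pandas as pd

dates = pd.to_datetime(["2012-01-03", "2012-01-04", "2012-01-05"])
toy = pd.DataFrame({"feature": ["f_jan03", "f_jan04", "f_jan05"],
                    "target": [1, 0, -1]}, index=dates)

aligned = toy.copy()
aligned["target"] = aligned["target"].shift(-1)  # pull tomorrow's target up one row
aligned = aligned.dropna()                       # the last day has no "tomorrow"

# Spot check: each row's target must equal the original target of the next row.
for today, tomorrow in zip(toy.index[:-1], toy.index[1:]):
    assert aligned.loc[today, "target"] == toy.loc[tomorrow, "target"]

print(aligned)
```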

The dataset is ready to be fed into machine learning models, and we also have something of a pipeline for making scaling or targeting modifications. We should check that this data is valid for at least a very simple neural network that will act as a spot check. Before feeding the data to the mini-model it has to be split into training and testing sets (we aim to validate with backtesting in the final steps), and we can take the opportunity to dimension the data as we wish with a flexible dimensioning and splitting function. For this first spot model let's generate an uncreative 1D vector from each observation date. This 1D row vector contains the hourly price-volume pairs stacked sequentially for each symbol. We are probably destroying some information through the confusing adjacency of the data, information that can be recovered during model definition. There is a high probability that a good dimensional composition of the data, coupled with convolutional or long short-term memory layers in a neural network, will provide better predictive capabilities than a 1D array representation:

```
from sklearn.model_selection import train_test_split
dimensions = [-1]
test_size = 0.2
train, test = train_test_split(samples, test_size=test_size)
train_shape = (len(train),*dimensions)
test_shape = (len(test), *dimensions)
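# Inputs are the per-symbol sample columns, i.e. every column that is not a target.
# This stays correct when a symbol without data (such as TWTR) has been dropped.
inputs = [col for col in samples.columns if col not in targets.columns]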
X_train = np.stack(np.concatenate(np.array(train[inputs].values))).reshape(train_shape)
y_train = np.array(train[targets.columns].values).reshape(len(train),-1)
X_test = np.stack(np.concatenate(np.array(test[inputs].values))).reshape(test_shape)
y_test = np.array(test[targets.columns].values).reshape(len(test),-1)
```
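The resulting layout of each row vector can be sketched with random toy data. The dimensions below (4 days, 3 symbols) are hypothetical; only the 7 hours and 2 features mirror the samples built above:

```python
import numpy as np

n_days, n_symbols, n_hours, n_features = 4, 3, 7, 2
rng = np.random.default_rng(0)

# Hypothetical scaled samples: one (7, 2) array per symbol per day.
toy = rng.random((n_days, n_symbols, n_hours, n_features))

# dimensions = [-1] collapses everything after the day axis into one row vector.
X_flat = toy.reshape(n_days, -1)

# Layout of a row: symbol 0 hour 0 (price, volume), symbol 0 hour 1, ...,
# then all hours of symbol 1, and so on.
print(X_flat.shape)
```

Passing a different "dimensions" list, for example one that keeps the symbol and hour axes separate, would preserve the structure that convolutional or recurrent layers can exploit.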

This will be our "mini-model" for a very quick trial of the usefulness of the data we have prepared so far: a simple __Keras sequential model__ with an 18-unit dense layer, dropout, and an output layer with one neuron for each of our possible output values. The __hyperbolic tangent activation function__ yields values between -1 and 1, so it appears well suited to our target definition. We will modify this model as needed in the future; today we only check whether it works:

```
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
n_out = len(targets.columns)
mini_model = keras.Sequential([
    layers.Dense(18, activation="tanh", name="layer1"),
    layers.Dropout(0.2),
    layers.Dense(n_out, activation='tanh')
])
# Assumption: the compile step is not shown in the post; Adam and mean squared
# error are placeholder choices so that fit() can run.
mini_model.compile(optimizer='adam', loss='mse')
EPOCHS = 50
patience_rate = 0.1
patience = int(EPOCHS * patience_rate)  # stop after 5 epochs without improvement
callbacks = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=patience)
mini_model.fit(X_train, y_train, epochs=EPOCHS, validation_data=(X_test, y_test), callbacks=[callbacks])
```

With 50 epochs and a 10% patience callback it takes two minutes to train this network. The validation error is too large to mention. Predicting on the test dataset yields these vectors:

From these daily results we can locate the positions of the maximum and minimum predicted values to obtain our predicted tops and flops for tomorrow; the tactic can then be to enter market positions at the open accordingly. Let's get a feeling for how this mini-model is performing before we close the data generation phase. This result calculator will be used in the model selection process. Our predictions yielded the following results:

```
prediction = mini_model.predict(X_test)  # one output vector per test day
results = {}
for i in range(len(prediction)):
    results[i] = {}
    # Largest output is the predicted top, smallest the predicted flop
    predicted_top = targets.columns[prediction[i].argmax()]
    true_top = targets.columns[y_test[i].argmax()]
    predicted_flop = targets.columns[prediction[i].argmin()]
    true_flop = targets.columns[y_test[i].argmin()]
    results[i]['Top_Pred'] = predicted_top
    results[i]['Top_True'] = true_top
    results[i]['Flop_Pred'] = predicted_flop
    results[i]['Flop_True'] = true_flop
results = pd.DataFrame.from_dict(results, orient='index')
results.index = test.index
results['Top_Hit'] = results['Top_Pred'] == results['Top_True']
results['Flop_Hit'] = results['Flop_Pred'] == results['Flop_True']
results['Both_Hit'] = results['Flop_Hit'] & results['Top_Hit']
results['None_Hit'] = ~(results['Flop_Hit'] | results['Top_Hit'])
```

This is the results dataframe:

Lots of misses and not many hits. These can be summarized in another dataframe that acts as an accuracy summary for the mini-model:

```
hit_cols = [column for column in results.columns if 'Hit' in column]
acc = {}
for col in hit_cols:
    # Fraction of True values; the mean also works when a column has no hits
    acc[col] = round(results[col].mean(), 2)
acc = pd.DataFrame.from_dict(acc, orient='index', columns=['Accuracy'])
```
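The decode-and-score logic can be checked by hand on a tiny example. The prediction vectors and true targets below are made up, and the tickers are hypothetical:

```python
import numpy as np
import pandas as pd

toy_tickers = ["AAA", "BBB", "CCC"]          # hypothetical tickers
toy_pred = np.array([[0.7, -0.2, 0.1],       # made-up tanh outputs, one row per day
                     [0.1, 0.3, -0.6],
                     [-0.4, 0.2, 0.5]])
toy_true = np.array([[1, -1, 0],             # 1 marks the top, -1 the flop
                     [0, 1, -1],
                     [1, 0, -1]])

toy_results = pd.DataFrame({
    "Top_Pred": [toy_tickers[i] for i in toy_pred.argmax(axis=1)],
    "Top_True": [toy_tickers[i] for i in toy_true.argmax(axis=1)],
    "Flop_Pred": [toy_tickers[i] for i in toy_pred.argmin(axis=1)],
    "Flop_True": [toy_tickers[i] for i in toy_true.argmin(axis=1)],
})
toy_results["Top_Hit"] = toy_results["Top_Pred"] == toy_results["Top_True"]
toy_results["Flop_Hit"] = toy_results["Flop_Pred"] == toy_results["Flop_True"]

# The mean of a boolean column is its hit rate.
hit_rates = toy_results[["Top_Hit", "Flop_Hit"]].mean().round(2)
print(hit_rates)
```

One caveat worth keeping in mind: argmax returns the first position of the maximum, so on a day whose true target row contains no 1 (no stock rose), the "true top" falls back to the first column holding a 0. The same applies to argmin on days with no -1.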

Using the mini-model, on 70% of the days we identify neither the top nor the flop correctly. The accuracy is very, very low and can possibly be improved through modelling. We are still acting on the premise that the previous day's price-volume data contains information about the next day's price moves. There are multiple possibilities to explore with the model, which we will cover in our next posts, hoping to improve these accuracy values and generate usable trading signals.

Remember that the information in __ostirion.net__ does not constitute financial advice; we do not hold positions in any of the companies or assets that we mention in our posts at the time of posting. If you need algorithmic model development, deployment, verification or validation, do not hesitate to __contact us__. We will also be glad to help you with your predictive machine learning or artificial intelligence challenges.
