Recently, I worked on a side project involving stock price prediction. The dataset included over 7,000 stocks, each with volume, open, close, low, and high values over 124 days. The task was to predict future prices for each stock.
The data looks like the following:
| index | date | open | high | low | close | ticker |
|---|---|---|---|---|---|---|
| 0 | 2024-01-30 | 18.8 | 18.87 | 18.67 | 18.77 | EkEuXFUQrK |
| 1 | 2024-04-04 | 20.02 | 20.22 | 20.02 | 20.22 | EkEuXFUQrK |
| 2 | 2024-04-04 | 10.24 | 10.24 | 10.24 | 10.24 | ZL4m1dXqNc |
| 3 | 2024-04-03 | 10.3 | 10.3 | 10.3 | 10.3 | ZL4m1dXqNc |
| 4 | 2024-04-04 | 180.2 | 184.43 | 179.75 | 181.57 | 31cWpkAP9J |
| 5 | 2024-05-20 | 153.97 | 154.95 | 153.465 | 154.64 | KLZI5RbdeM |
| 6 | 2024-01-30 | 10.17 | 10.17 | 10.17 | 10.17 | ZL4m1dXqNc |
Data Cleaning
Steps
This is a typical time series prediction problem. The first step is to clean the data:
- Remove stocks with invalid prices: Delete stocks where the open or close price falls outside the range between the low and high, or where the high price is below the low price. (A sketch of this filter is shown below.)
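A minimal sketch of that filter, assuming the long-format DataFrame `df` shown above (the column names `open`, `high`, `low`, `close`, and `ticker` are taken from the sample table):

```python
# Flag rows whose prices are internally consistent:
# low <= open <= high, low <= close <= high, and low <= high.
valid = (
    (df['low'] <= df['high'])
    & df['open'].between(df['low'], df['high'])
    & df['close'].between(df['low'], df['high'])
)

# Drop every stock (ticker) that has at least one invalid row.
bad_tickers = df.loc[~valid, 'ticker'].unique()
df = df[~df['ticker'].isin(bad_tickers)]
```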
- Handle Missing Dates: Identify and fill in missing business days to ensure continuity. The code for this step is:

```python
df = df.reset_index()

# Build the full range of business days covered by the data.
all_weekdays = pd.date_range(start=df['date'].min(), end=df['date'].max(), freq='B')
df_reset = pd.DataFrame(index=all_weekdays)

# Pivot each field to wide format (one column per ticker)
# and align it to the full business-day index.
columns_of_interest = ['open', 'close', 'low', 'high', 'volume']
for column in columns_of_interest:
    pivoted_data = df.pivot(index='date', columns='ticker', values=column)
    pivoted_data = pivoted_data.reindex(all_weekdays)
    pivoted_data.columns = pivoted_data.columns + f'_{column}'
    df_reset = df_reset.join(pivoted_data)

# Forward-fill gaps from the previous row, then back-fill any leading gaps.
df_reset = df_reset.ffill().bfill()
```

This creates rows for the missing dates and fills their values from the surrounding rows.
- It also reshapes the data into a wide format, indexed by date with one column per ticker-and-field pair. The DataFrame becomes (first and last few columns shown):
| date | 00CqTIPkWB_open | 00rG1ntZCy_open | … | zznnufCQ9n_volume | zzvNRhUkJG_volume |
|---|---|---|---|---|---|
| 2024-01-02 | 67.260002 | 2.75 | … | 0.0 | 418608.0 |
| 2024-01-03 | 65.300003 | 2.77 | … | 0.0 | 625632.0 |
| 2024-01-04 | 64.779999 | 2.72 | … | 0.0 | 317535.0 |
Advantages Compared to the Original Dataframe
Easier Filtering: With the reorganized format, you can easily select stocks by their ticker symbols. This makes it simpler to isolate and analyze individual stocks.
Time Series Visualization: The data is organized by date, making it straightforward to visualize the time series for each stock. This helps in identifying trends and patterns over time.
In short, compared to the original long format with separate date and ticker columns, the wide, date-indexed layout makes it more efficient to filter and analyze individual stocks over time.
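For example, all the columns for a single stock can be pulled out with a simple prefix filter. This is a sketch against the `df_reset` frame built above, using one of the sample tickers:

```python
ticker = '00CqTIPkWB'  # example ticker from the data above
ticker_cols = [c for c in df_reset.columns if c.startswith(f'{ticker}_')]
one_stock = df_reset[ticker_cols]  # open/close/low/high/volume for this ticker
```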
- Visualize with a Candlestick Chart
Following the standard candlestick convention, we can randomly select a few stocks and plot them:
```python
import plotly.graph_objects as go

x = '00CqTIPkWB'
all_columns = ['open', 'close', 'low', 'high', 'volume']
x_pd = df_reset[[f'{x}_{i}' for i in all_columns]]

fig = go.Figure()

# One candlestick trace for the whole series; Plotly colors
# increasing and decreasing candles automatically.
fig.add_trace(
    go.Candlestick(
        x=x_pd.index,
        open=x_pd[f'{x}_open'],
        close=x_pd[f'{x}_close'],
        low=x_pd[f'{x}_low'],
        high=x_pd[f'{x}_high'],
        yaxis='y',
        showlegend=False,
    )
)

# Volume bars on a secondary y-axis.
fig.add_trace(
    go.Bar(
        x=x_pd.index,
        y=x_pd[f'{x}_volume'],
        text=x_pd[f'{x}_volume'],
        texttemplate='%{text:.2s}',
        yaxis='y2',
        showlegend=False,
    )
)

fig.update_layout(
    yaxis=dict(title='Price'),
    yaxis2=dict(title='Volume', overlaying='y', side='right'),
)
fig.show()
```

Grouping Stocks
Methods
With over 7,000 stocks, it is essential to group them and then make predictions for each group. Following a suggestion from Gemini, annualized return and annualized volatility are the two features used for clustering.
```python
import numpy as np
import pandas as pd

# Select every '*_close' column from the wide DataFrame.
selected_columns = [c for c in df_reset.columns if 'close' in c]
close_xtrain = df_reset[selected_columns]

# Daily returns, then annualized mean return and volatility
# (252 trading days per year).
daily_returns = close_xtrain.pct_change().dropna()
annual_returns = daily_returns.mean(axis=0) * 252
annual_volatility = daily_returns.std(axis=0) * np.sqrt(252)

# One row per stock, with the two clustering features.
stock_data = pd.DataFrame({
    'annual_returns': annual_returns,
    'annual_volatility': annual_volatility,
})
```
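A quick note on the annualization factors: the mean daily return scales linearly with the number of trading days (annual return ≈ mean daily return × 252), while volatility scales with the square root of time (annual volatility = daily standard deviation × √252), under the usual assumption that daily returns are independent.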
Here I recommend two methods for grouping: K-means and GMM.
- K-Means Clustering:
- Reason:
- Simple to implement and can handle large datasets
- Can be used to identify outliers
- Method: Use the elbow method to find the optimal number of clusters, then filter out the outliers and group the data. The code is as follows:
```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

features = stock_data[['annual_returns', 'annual_volatility']]

# Within-cluster sum of squares for k = 1..10.
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(features)
    wcss.append(kmeans.inertia_)

# Plotting the Elbow Method graph
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.title('The Elbow Method')
plt.show()
```
We can then filter the outliers based on `annual_returns` and `annual_volatility`:

```python
# Drop stocks with extreme annualized return or volatility.
extreme_annual_returns = 15
extreme_annual_volatility = 30
stock_data_updated = stock_data[
    (stock_data['annual_returns'] <= extreme_annual_returns)
    & (stock_data['annual_volatility'] <= extreme_annual_volatility)
].copy()
```

Then we set `num_clusters = 3` and rerun K-means on the filtered data to get the grouping result:

```python
num_clusters = 3
features = stock_data_updated[['annual_returns', 'annual_volatility']]

kmeans = KMeans(n_clusters=num_clusters, random_state=42)
stock_data_updated['cluster'] = kmeans.fit_predict(features)

# Plot clusters
plt.figure(figsize=(10, 6))
plt.scatter(stock_data_updated['annual_volatility'], stock_data_updated['annual_returns'],
            c=stock_data_updated['cluster'], cmap='viridis')
plt.xlabel('Annualized Volatility')
plt.ylabel('Annualized Returns')
plt.title('Stock Clusters Based on Returns and Volatility')
plt.colorbar(label='Cluster')
plt.show()
```
- Gaussian Mixture Model (GMM):
- Reason: Similar to K-means, but it models each cluster as a Gaussian component with its own mean, covariance, and mixing weight, so clusters become soft ellipses rather than the hard, roughly spherical partitions that K-means produces. A minimal sketch is shown below.
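The saving step at the end of the post writes a `gmm_cluster` column, so here is a minimal sketch of how that column could be produced with scikit-learn's `GaussianMixture`. The three components mirror the K-means choice; treat the specifics as assumptions, since the original GMM code is not shown:

```python
from sklearn.mixture import GaussianMixture

# Fit a 3-component GMM on the same two features used for K-means.
gmm = GaussianMixture(n_components=3, random_state=42)
stock_data_updated['gmm_cluster'] = gmm.fit_predict(features)

# Unlike K-means, the GMM also exposes soft assignments:
# gmm.predict_proba(features) gives per-stock cluster probabilities.
```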
- Result

Key Takeaways
- Data Cleaning: Ensuring data integrity by removing inconsistencies and handling missing dates.
- Data Visualization: Using candlestick charts to visualize stock price movements.
- Clustering: Grouping stocks based on annual returns and volatility to identify patterns and make predictions.
By following these steps, we can better understand stock behavior and make more informed predictions. This approach can be extended and refined for more complex financial analysis and forecasting tasks.
Finally, save all the results and models for future reference and further analysis.
```python
copied = stock_data_updated.copy()
copied.reset_index(inplace=True)

# The index values look like '<ticker>_close'; keep only the ticker part.
copied['index'] = copied['index'].str.split('_').str[0]
copied.set_index('index', inplace=True)

copied[['cluster', 'gmm_cluster']].to_csv('./data/grouped_stocks.csv')
```
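To load the groups back later, a one-liner like this should do (assuming the same path):

```python
grouped = pd.read_csv('./data/grouped_stocks.csv', index_col='index')
```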
Thanks for following along!