I'm currently following along with Andrew Ng's Machine Learning course on Coursera and wanted to implement the gradient descent algorithm in Python 3 using `numpy` and `pandas`. This is what I came up with:
```python
import os

import numpy as np
import pandas as pd


def get_training_data(path):
    # path to read data from
    raw_panda_data = pd.read_csv(path)

    # append a column of ones to the front of the data set
    raw_panda_data.insert(0, 'Ones', 1)
    num_columns = raw_panda_data.shape[1]                       # (num_rows, num_columns)
    panda_X = raw_panda_data.iloc[:,0:num_columns-1]            # [ slice_of_rows, slice_of_columns ]
    panda_y = raw_panda_data.iloc[:,num_columns-1:num_columns]  # [ slice_of_rows, slice_of_columns ]

    X = np.matrix(panda_X.values)  # pandas.DataFrame -> numpy.ndarray -> numpy.matrix
    y = np.matrix(panda_y.values)  # pandas.DataFrame -> numpy.ndarray -> numpy.matrix

    return X, y


def compute_mean_square_error(X, y, theta):
    summands = np.power(X * theta.T - y, 2)
    return np.sum(summands) / (2 * len(X))


def gradient_descent(X, y, learning_rate, num_iterations):
    num_parameters = X.shape[1]                                 # dim theta
    theta = np.matrix([0.0 for i in range(num_parameters)])     # init theta
    cost = [0.0 for i in range(num_iterations)]

    for it in range(num_iterations):
        error = np.repeat((X * theta.T) - y, num_parameters, axis=1)
        error_derivative = np.sum(np.multiply(error, X), axis=0)
        theta = theta - (learning_rate / len(y)) * error_derivative
        cost[it] = compute_mean_square_error(X, y, theta)

    return theta, cost
```
This is how one could use the code:
```python
X, y = get_training_data(os.getcwd() + '/data/data_set.csv')
theta, cost = gradient_descent(X, y, 0.008, 10000)

print('Theta: ', theta)
print('Cost: ', cost[-1])
```
Where `data/data_set.csv` could contain data (model used: `2 + x1 - x2 = y`) looking like this:
```
x1, x2, y
0, 1, 1
1, 1, 2
1, 0, 3
0, 0, 2
2, 4, 0
4, 2, 4
6, 0, 8
```
Output:
```
Theta:  [[ 2.  1. -1.]]
Cost:  9.13586056551e-26
```
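The near-zero final cost suggests an exact fit on this toy data. As a cross-check, the same parameters can be recovered in closed form with numpy's least-squares solver (a sketch with the data set from above hard-coded):

```python
import numpy as np

# The toy data set from above: a leading column of ones, then x1 and x2.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [1, 1, 0],
              [1, 0, 0],
              [1, 2, 4],
              [1, 4, 2],
              [1, 6, 0]], dtype=float)
y = np.array([1, 2, 3, 2, 0, 4, 8], dtype=float)

# Ordinary least squares: minimizes ||X @ theta - y||^2 directly.
theta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # approximately [ 2.  1. -1.]
```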
I'd especially like to get the following aspects of my code reviewed:
- Overall Python style. I'm relatively new to Python, coming from a C background, and I'm not sure if I'm misunderstanding some concepts here.
- `numpy`/`pandas` integration. Do I use these packages correctly?
- Correctness of the gradient descent algorithm.
- Efficiency. How can I further improve my code?
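On the efficiency point: the `np.repeat`/`np.multiply`/`np.sum` combination in the loop computes the same quantity as a single matrix product, since the summed element-wise product is just `X.T @ error`. A minimal sketch of that variant, using plain `ndarray`s instead of `np.matrix` (the function name here is made up for illustration):

```python
import numpy as np

def gradient_descent_vectorized(X, y, learning_rate, num_iterations):
    # X: (m, n) design matrix, y: (m,) target vector.
    m, n = X.shape
    theta = np.zeros(n)
    cost = np.zeros(num_iterations)

    for it in range(num_iterations):
        error = X @ theta - y                          # (m,) residuals
        theta -= (learning_rate / m) * (X.T @ error)   # full gradient step in one product
        cost[it] = np.sum((X @ theta - y) ** 2) / (2 * m)

    return theta, cost
```

The replication of the error column is handled implicitly by the matrix product, and keeping `theta` and `y` as `(n,)`- and `(m,)`-shaped arrays removes the need for the `theta.T` transposes.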
Comments:

- You could use `np.zeros` to initialize `theta` and `cost` in your gradient descent function; in my opinion it is clearer. Also, why uppercase `X` and lowercase `y`? I would make them consistent and perhaps even give them descriptive names, e.g. `input` and `output`. Finally, you could look into exception handling, e.g. for bad input data from `pandas` or invalid values for `learning_rate` or `num_iterations`.
- There is also `theta = np.zeros_like(X)` if you would like to initialize `theta` with an array of zeros with the dimensions of `X`.
- `theta` doesn't have the same dimensions as `X`, though. Regardless, I'll keep the `np.zeros_like(...)` function in the back of my head.
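Translated into code, the comment's `np.zeros` suggestion could look like the sketch below. Note that `np.zeros_like(X)` would produce an array with the full shape of `X`, which is why a plain `np.zeros` over the parameter count is the better fit here:

```python
import numpy as np

num_parameters = 3    # X.shape[1] in the post's code
num_iterations = 10000

# One parameter per column of X, one cost entry per iteration.
theta = np.zeros((1, num_parameters))  # row vector, like np.matrix([0.0, 0.0, 0.0])
cost = np.zeros(num_iterations)
```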