# 5. Gradient Descent

*Those who boast of their descent, brag on what they owe to others.* (Seneca)

Frequently when doing data science, we’ll be trying to find the best model for a certain situation. And usually “best” will mean something like “minimizes the error of the model” or “maximizes the likelihood of the data.” In other words, it will represent the solution to some sort of optimization problem.

This means we’ll need to solve a number of optimization problems. And in particular, we’ll need to solve them from scratch. Our approach will be a technique called *gradient descent*, which lends itself pretty well to a from-scratch treatment. You might not find it super exciting in and of itself, but it will enable us to do exciting things throughout the book, so bear with me.

**The Idea Behind Gradient Descent**

Suppose we have some function f that takes as input a vector of real numbers and outputs a single real number. One simple such function is:

```python
def sum_of_squares(v):
    """computes the sum of squared elements in v"""
    return sum(v_i ** 2 for v_i in v)
```

We’ll frequently need to maximize (or minimize) such functions. That is, we need to find the input v that produces the largest (or smallest) possible value.

For functions like ours, the *gradient* (if you remember your calculus, this is the vector of partial derivatives) gives the input direction in which the function most quickly increases. (If you don’t remember your calculus, take my word for it or look it up on the Internet.)

Accordingly, one approach to maximizing a function is to pick a random starting point, compute the gradient, take a small step in the direction of the gradient (i.e., the direction that causes the function to increase the most), and repeat with the new starting point.

Similarly, you can try to minimize a function by taking small steps in the *opposite* direction, as shown in Figure 8-1.

*Figure 8-1. Finding a minimum using gradient descent*

**Estimating the Gradient**

If f is a function of one variable, its derivative at a point x measures how f(x) changes when we make a very small change to x. It is defined as the limit of the difference quotients:

```python
def difference_quotient(f, x, h):
    return (f(x + h) - f(x)) / h
```

as h approaches zero.
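In mathematical notation, that’s the usual limit definition of the derivative:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$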

(Many a would-be calculus student has been stymied by the mathematical definition of limit. Here we’ll cheat and simply say that it means what you think it means.)

*Figure 8-2. Approximating a derivative with a difference quotient*


The derivative is the slope of the tangent line at (x, f(x)), while the difference quotient is the slope of the not-quite-tangent line that runs through (x, f(x)) and (x + h, f(x + h)). As *h* gets smaller and smaller, the not-quite-tangent line gets closer and closer to the tangent line (Figure 8-2).

For many functions it’s easy to exactly calculate derivatives. For example, the square function:

```python
def square(x):
    return x * x
```

has the derivative:

```python
def derivative(x):
    return 2 * x
```

which you can check — if you are so inclined — by explicitly computing the difference quotient and taking the limit.
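Explicitly, for the square function the difference quotient simplifies to:

$$\frac{(x + h)^2 - x^2}{h} = \frac{2xh + h^2}{h} = 2x + h \;\longrightarrow\; 2x \quad \text{as } h \to 0$$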

What if you couldn’t (or didn’t want to) find the gradient? Although we can’t take limits in Python, we can estimate derivatives by evaluating the difference quotient for a very small h. Figure 8-3 shows the results of one such estimation:

```python
from functools import partial
import matplotlib.pyplot as plt

derivative_estimate = partial(difference_quotient, square, h=0.00001)

# plot to show they're basically the same
x = range(-10, 10)
plt.title("Actual Derivatives vs. Estimates")
plt.plot(x, list(map(derivative, x)), 'rx', label='Actual')            # red x
plt.plot(x, list(map(derivative_estimate, x)), 'b+', label='Estimate') # blue +
plt.legend(loc=9)
plt.show()
```

*Figure 8-3. Goodness of difference quotient approximation*

When f is a function of many variables, it has multiple *partial derivatives*, each indicating how f changes when we make small changes in just one of the input variables.
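Each partial derivative has the same difference-quotient form, except that only the *i*th coordinate gets nudged:

$$\frac{\partial f}{\partial v_i}(v) = \lim_{h \to 0} \frac{f(v_1, \ldots, v_i + h, \ldots, v_n) - f(v_1, \ldots, v_n)}{h}$$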

We calculate its *i*th partial derivative by treating it as a function of just its *i*th variable, holding the other variables fixed:

```python
def partial_difference_quotient(f, v, i, h):
    """compute the ith partial difference quotient of f at v"""
    w = [v_j + (h if j == i else 0)    # add h to just the ith element of v
         for j, v_j in enumerate(v)]
    return (f(w) - f(v)) / h
```

after which we can estimate the gradient the same way:

```python
def estimate_gradient(f, v, h=0.00001):
    return [partial_difference_quotient(f, v, i, h)
            for i, _ in enumerate(v)]
```
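As a quick sanity check, the estimated gradient of sum_of_squares should land very close to the exact gradient, which is just 2 * v_i in each coordinate:

```python
print(estimate_gradient(sum_of_squares, [1, 2, 3]))
# approximately [2, 4, 6]
```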

**Using the Gradient**

It’s easy to see that the sum_of_squares function is smallest when its input v is a vector of zeroes. But imagine we didn’t know that. Let’s use gradients to find the minimum among all three-dimensional vectors. We’ll just pick a random starting point and then take tiny steps in the opposite direction of the gradient until we reach a point where the gradient is very small:

```python
def step(v, direction, step_size):
    """move step_size in the direction from v"""
    return [v_i + step_size * direction_i
            for v_i, direction_i in zip(v, direction)]

def sum_of_squares_gradient(v):
    return [2 * v_i for v_i in v]
```

```python
import random
from linear_algebra import distance   # from the linear algebra chapter

# pick a random starting point
v = [random.randint(-10, 10) for i in range(3)]
tolerance = 0.0000001

while True:
    gradient = sum_of_squares_gradient(v)   # compute the gradient at v
    next_v = step(v, gradient, -0.01)       # take a negative gradient step
    if distance(next_v, v) < tolerance:     # stop if we're converging
        break
    v = next_v                              # continue if we're not
```

If you run this, you’ll find that it always ends up with a v that’s very close to [0,0,0]. The smaller you make the tolerance, the closer it will get.

**Choosing the Right Step Size**

Although the rationale for moving against the gradient is clear, how far to move is not. Indeed, choosing the right step size is more of an art than a science. Popular options include:

- Using a fixed step size
- Gradually shrinking the step size over time
- At each step, choosing the step size that minimizes the value of the objective function

The last sounds optimal but is, in practice, a costly computation. We can approximate it by trying a variety of step sizes and choosing the one that results in the smallest value of the objective function:

```python
step_sizes = [100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]
```

It is possible that certain step sizes will result in invalid inputs for our function. So we’ll need to create a “safe apply” function that returns infinity (which should never be the minimum of anything) for invalid inputs:

```python
def safe(f):
    """return a new function that's the same as f,
    except that it outputs infinity whenever f produces an error"""
    def safe_f(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except:
            return float('inf')    # this means "infinity" in Python
    return safe_f
```
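For instance, wrapping math.log (which raises an error on nonpositive inputs) gives a version that quietly returns infinity instead:

```python
import math

safe_log = safe(math.log)
print(safe_log(100))   # 4.605170185988092
print(safe_log(-1))    # inf, instead of a ValueError
```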

**Putting It All Together**

In the general case, we have some target_fn that we want to minimize, and we also have its gradient_fn. For example, the target_fn could represent the errors in a model as a function of its parameters, and we might want to find the parameters that make the errors as small as possible.

Furthermore, let’s say we have (somehow) chosen a starting value for the parameters theta_0. Then we can implement gradient descent as:

```python
def minimize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
    """use gradient descent to find theta that minimizes target function"""

    step_sizes = [100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]

    theta = theta_0                # set theta to initial value
    target_fn = safe(target_fn)    # safe version of target_fn
    value = target_fn(theta)       # value we're minimizing

    while True:
        gradient = gradient_fn(theta)
        next_thetas = [step(theta, gradient, -step_size)
                       for step_size in step_sizes]

        # choose the one that minimizes the error function
        next_theta = min(next_thetas, key=target_fn)
        next_value = target_fn(next_theta)

        # stop if we're "converging"
        if abs(value - next_value) < tolerance:
            return theta
        else:
            theta, value = next_theta, next_value
```

We called it minimize_batch because, for each gradient step, it looks at the entire data set (because target_fn returns the error on the whole data set). In the next section, we’ll see an alternative approach that only looks at one data point at a time.
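For example, running it on sum_of_squares from a random starting point recovers (approximately) the zero vector:

```python
v = [random.randint(-10, 10) for i in range(3)]
v = minimize_batch(sum_of_squares, sum_of_squares_gradient, v)

print("minimum v", v)                       # very close to [0, 0, 0]
print("minimum value", sum_of_squares(v))   # very close to 0
```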

Sometimes we’ll instead want to *maximize* a function, which we can do by minimizing its negative (which has a corresponding negative gradient):

```python
def negate(f):
    """return a function that for any input x returns -f(x)"""
    return lambda *args, **kwargs: -f(*args, **kwargs)

def negate_all(f):
    """the same when f returns a list of numbers"""
    return lambda *args, **kwargs: [-y for y in f(*args, **kwargs)]

def maximize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
    return minimize_batch(negate(target_fn),
                          negate_all(gradient_fn),
                          theta_0,
                          tolerance)
```
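And to see maximize_batch in action, here’s a toy function (of my own invention, not one we’ll use again) whose maximum is at a vector of all ones:

```python
def upside_down_bowl(v):
    """maximized when every v_i equals 1"""
    return -sum((v_i - 1) ** 2 for v_i in v)

def upside_down_bowl_gradient(v):
    return [-2 * (v_i - 1) for v_i in v]

v = [random.randint(-10, 10) for i in range(3)]
v = maximize_batch(upside_down_bowl, upside_down_bowl_gradient, v)
print(v)   # very close to [1, 1, 1]
```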

**Stochastic Gradient Descent**

As we mentioned before, often we’ll be using gradient descent to choose the parameters of a model in a way that minimizes some notion of error. Using the previous batch approach, each gradient step requires us to make a prediction and compute the gradient for the whole data set, which makes each step take a long time.

Now, usually these error functions are *additive*, which means that the predictive error on the whole data set is simply the sum of the predictive errors for each data point.
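That is, if the error on the whole data set is the sum of the per-point errors, then (by linearity) so is its gradient:

$$\text{error}(\theta) = \sum_{i} \text{error}_i(\theta), \qquad \nabla\,\text{error}(\theta) = \sum_{i} \nabla\,\text{error}_i(\theta)$$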

When this is the case, we can instead apply a technique called *stochastic gradient descent*, which computes the gradient (and takes a step) for only one point at a time. It cycles over our data repeatedly until it reaches a stopping point.

During each cycle, we’ll want to iterate through our data in a random order:

```python
def in_random_order(data):
    """generator that returns the elements of data in random order"""
    indexes = [i for i, _ in enumerate(data)]   # create a list of indexes
    random.shuffle(indexes)                     # shuffle them
    for i in indexes:                           # return the data in that order
        yield data[i]
```

And we’ll want to take a gradient step for each data point. This approach leaves the possibility that we might circle around near a minimum forever, so whenever we stop getting improvements we’ll decrease the step size and eventually quit:

```python
from linear_algebra import vector_subtract, scalar_multiply   # from the linear algebra chapter

def minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):

    data = list(zip(x, y))

    theta = theta_0                             # initial guess
    alpha = alpha_0                             # initial step size
    min_theta, min_value = None, float("inf")   # the minimum so far
    iterations_with_no_improvement = 0

    # if we ever go 100 iterations with no improvement, stop
    while iterations_with_no_improvement < 100:
        value = sum(target_fn(x_i, y_i, theta) for x_i, y_i in data)

        if value < min_value:
            # if we've found a new minimum, remember it
            # and go back to the original step size
            min_theta, min_value = theta, value
            iterations_with_no_improvement = 0
            alpha = alpha_0
        else:
            # otherwise we're not improving, so try shrinking the step size
            iterations_with_no_improvement += 1
            alpha *= 0.9

        # and take a gradient step for each of the data points
        for x_i, y_i in in_random_order(data):
            gradient_i = gradient_fn(x_i, y_i, theta)
            theta = vector_subtract(theta, scalar_multiply(alpha, gradient_i))

    return min_theta
```

The stochastic version will typically be a lot faster than the batch version. Of course, we’ll want a version that maximizes as well:

```python
def maximize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):
    return minimize_stochastic(negate(target_fn),
                               negate_all(gradient_fn),
                               x, y, theta_0, alpha_0)
```
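As a small illustration (the model and data are toy examples of mine), here’s minimize_stochastic fitting a one-parameter model y ≈ theta * x, with per-point error and gradient functions that take (x_i, y_i, theta) just as the code above expects:

```python
import random

def squared_error(x_i, y_i, theta):
    return (y_i - theta[0] * x_i) ** 2

def squared_error_gradient(x_i, y_i, theta):
    return [-2 * x_i * (y_i - theta[0] * x_i)]

x = [1, 2, 3, 4, 5]
y = [3, 6, 9, 12, 15]                      # generated with theta = 3

theta = minimize_stochastic(squared_error, squared_error_gradient,
                            x, y, [random.random()])
print(theta)                               # should end up very close to [3.0]
```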
