Gradient Descent Into Madness: My PhD Journey

Dr. L. Ocal Minimum

I have been trying to optimize my life using gradient descent for approximately six years. I am writing this piece to report that I have reached a local minimum, and it is not the global one.

The appeal of the metaphor is obvious to anyone who has spent significant time watching a loss curve fail to decrease. Life, like a poorly-initialized neural network, is full of plateaus, saddle points, and that one region where the gradient is technically defined but points directly toward a career in industry.

The Learning Rate Problem

My primary methodological error has been an inconsistent learning rate. In my early twenties, I used an aggressive step size — accepting every opportunity, attending every social event, agreeing to review papers for journals I had never read. This produced rapid movement, but the trajectory was largely random. I overshot optimal configurations repeatedly. I committed to a research area that turned out to be “not what the field was moving toward.” I accepted a postdoc offer based on a supervisor’s enthusiasm during a conference dinner that, in retrospect, may have been attributable to the wine.

Later, chastened, I reduced my learning rate to near zero. I became cautious. I reviewed papers very carefully before declining to take risks on them. I attended only conferences where I was already certain I would enjoy the buffet (see my previous work on conference buffet taxonomy). Convergence slowed to the point where I question whether it is occurring at all.
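For readers who prefer their metaphors executable, a minimal sketch of both failure modes on a toy quadratic loss. The loss function, starting point, and step sizes are invented for illustration and bear no relation to my actual life, which does not have a closed-form gradient.

```python
# Gradient descent on a simple quadratic, illustrating both regimes above.
# All numbers are illustrative.

def loss(x):
    return (x - 3.0) ** 2          # global minimum at x = 3

def grad(x):
    return 2.0 * (x - 3.0)         # derivative of the loss

def descend(x0, lr, steps=10):
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x = x - lr * grad(x)       # standard gradient descent update
        trajectory.append(x)
    return trajectory

# Early twenties: aggressive step size, overshoots the minimum on every step
# and oscillates with growing amplitude.
print(descend(x0=0.0, lr=1.1))

# Later: near-zero step size, technically converging, practically stationary.
print(descend(x0=0.0, lr=0.001))
```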

Momentum and the Problem of Prior Velocity

Momentum-based optimization maintains a weighted average of previous gradients, allowing the optimizer to continue in a consistent direction even when the current gradient is uninformative. I have found this maps precisely onto the experience of maintaining a PhD project for four years after every rational signal indicates the project should be abandoned. The momentum of sunk costs is considerable. It has carried me through several years of experiments that did not work, toward a thesis that was eventually accepted on the grounds that it was long enough to be mistaken for thorough.
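The update rule itself is short enough to fit in a footnote. A minimal sketch, with coefficients chosen for illustration rather than tuned; any resemblance to my actual decision process is coincidental.

```python
# SGD with momentum: the velocity is an exponentially weighted average of
# past gradients, so the optimizer keeps moving in a consistent direction
# even when the current gradient is uninformative. Values are illustrative.

def sgd_momentum(grad_fn, x0, lr=0.1, beta=0.9, steps=50):
    x, velocity = x0, 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad_fn(x)  # accumulate prior gradients
        x = x - lr * velocity                    # step along the accumulated direction
    return x

# Even when grad_fn returns roughly zero on a plateau, a large accumulated
# velocity keeps carrying x forward: the sunk-cost term, formalized.
```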

Conclusion: On Local Minima

I have arrived, through years of noisy stochastic optimization, at a configuration that is stable. My loss has not decreased measurably in 18 months. I attend the same conferences. I write variations on the same papers. I eat variations on the same buffet lunches (see above).

Whether this represents convergence or entrapment, I genuinely cannot tell. The gradient is approximately zero either way. I am considering adding noise and seeing what happens. This strategy is called “accepting a new postdoc offer” and I expect to publish results within three to five years.
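For completeness, the noise-injection strategy in sketch form. This is ordinary perturbed gradient descent; the perturbation scale, much like the postdoc offer, is chosen without strong theoretical justification.

```python
import random

# Perturbed gradient descent: when the gradient is approximately zero,
# inject random noise and hope the new position rolls somewhere better.
# Parameters are illustrative, not tuned.

def perturbed_descent(grad_fn, x0, lr=0.1, noise_scale=1.0, tol=1e-3, steps=1000):
    x = x0
    for _ in range(steps):
        g = grad_fn(x)
        if abs(g) < tol:
            x += random.gauss(0.0, noise_scale)  # stuck: add noise (the "new postdoc" step)
        else:
            x -= lr * g                          # otherwise, ordinary gradient descent
    return x
```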

This commentary was written during a period of reflection described by my institution as “sabbatical” and by my family as “not answering emails.”