
Learning Triton One Kernel at a Time: Softmax

by admin
November 23, 2025
in Artificial Intelligence


In the previous article of this series, we covered one of the most fundamental operations in all fields of computer science: matrix multiplication. It's heavily used in neural networks to compute the activations of linear layers. However, activations on their own are difficult to interpret, since their values and statistics (mean, variance, min-max amplitude) can fluctuate wildly from layer to layer. This is one of the reasons why we use activation functions, for example the logistic function (aka sigmoid), which maps any real number into the [0; 1] range.

The softmax function, also known as the normalised exponential function, is a multi-dimensional generalisation of the sigmoid. It converts a vector of raw scores (logits) into a probability distribution over M classes. We can interpret it as a weighted average that behaves as a smooth function and can be conveniently differentiated. It's a crucial component of dot-product attention, language modeling, and multinomial logistic regression.

In this article, we'll cover:

  1. Implementing an efficient softmax kernel in Triton.
  2. Implementing the backward pass (autograd).
  3. Optimisation: cache modifiers and auto-tuning.

If you aren't familiar with Triton yet, refer to the previous articles!

Disclaimer: all the illustrations and animations are made by the author unless specified otherwise.

Definition

The softmax is defined as follows:
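In standard notation, for a vector of logits $z \in \mathbb{R}^M$ over $M$ classes:

$$\mathrm{softmax}(z)_i \;=\; \frac{e^{z_i}}{\sum_{j=1}^{M} e^{z_j}}, \qquad i = 1, \dots, M.$$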

The normalisation ensures that the vector sums to 1, so that it can be interpreted as a valid probability distribution.

Note that this formulation of the softmax is highly sensitive to numerical overflow. Recall that the maximum value a standard float16 can represent is 65,504, which is roughly exp(11). This means that any input value greater than ~11 will result in exp(z_i) exceeding the representable range, leading to overflow.

A common trick to mitigate this issue is to subtract the maximum value of the input vector from every element, so that the new maximum is 0 before exponentiation and 1 after.
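In symbols, with $m = \max_j z_j$, the numerically stable form reads:

$$\mathrm{softmax}(z)_i \;=\; \frac{e^{z_i - m}}{\sum_{j=1}^{M} e^{z_j - m}},$$

which leaves the result unchanged, since the factor $e^{-m}$ cancels between numerator and denominator.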

Naive Implementation

As you can see, computing the softmax involves two reduction operations, a max and a sum. A naive algorithm requires three separate passes over the input vector: first to compute the maximum, then the sum, and finally the normalised outputs.

Here's what a naive NumPy implementation looks like:
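A minimal sketch of such a three-pass version (written with explicit loops to make the passes visible, so not necessarily identical to the original snippet):

import numpy as np

def softmax_naive(x: np.ndarray) -> np.ndarray:
    """Naive softmax: three separate passes over the input vector."""
    # Pass 1: maximum, for numerical stability
    m = -np.inf
    for xi in x:
        m = max(m, xi)
    # Pass 2: sum of exponentials
    d = 0.0
    for xi in x:
        d += np.exp(xi - m)
    # Pass 3: normalised outputs
    return np.exp(x - m) / d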

A recurring theme in this Triton series is minimising high-latency global memory access. Our current NumPy implementation requires three separate memory reads of the full input vector, which is highly inefficient.

Online Softmax

Fortunately, we can use a clever trick, known as the online softmax, to fuse the max and sum steps, reducing the number of memory reads to two.

First, we define the sum of exponentials recursively. In the following set of equalities, m_i refers to the maximum over x up to the i-th index.
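The recursion in question is the standard online-softmax update; with $m_0 = -\infty$ and $d_0 = 0$, it reads:

$$m_i = \max(m_{i-1},\, x_i), \qquad d_i = d_{i-1}\, e^{\,m_{i-1} - m_i} + e^{\,x_i - m_i},$$

so that after the last element, $d_M = \sum_{j=1}^{M} e^{x_j - m_M}$ with $m_M = \max_j x_j$.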

This equality allows us to compute the sum of exponentials iteratively using the maximum value seen so far. We can leverage it to fuse the first and second loops of the naive implementation and compute the maximum and the sum of exponentials in a single pass.

Our algorithm becomes:

This translates easily to NumPy:
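A minimal sketch of that translation (kept element-wise on purpose, mirroring the tiled loop we will later write in Triton):

import numpy as np

def softmax_online(x: np.ndarray) -> np.ndarray:
    """Online softmax: a fused max/sum pass followed by a normalisation pass."""
    m = -np.inf  # running maximum m_i
    d = 0.0      # running sum of exponentials d_i
    # Pass 1: update the maximum and rescale the running sum on the fly
    for xi in x:
        m_new = max(m, xi)
        d = d * np.exp(m - m_new) + np.exp(xi - m_new)
        m = m_new
    # Pass 2: normalised outputs
    return np.exp(x - m) / d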

Now that we understand the main ideas behind the softmax, we'll implement it in Triton, starting with the simple, single-block version and building up to the online, multi-block formulation. In the end, we want our kernel to behave like a PyTorch module and be compatible with autograd.

Unfortunately, from PyTorch's point of view, Triton kernels behave like black boxes: the operations they perform aren't traced by autograd. This requires us to implement the backward pass ourselves and explicitly specify how gradients should be computed. Let's brush up on our beloved chain rule and derive the softmax gradient.

Gradient

Since the outputs of the softmax are strictly positive, we can use the logarithmic derivative to make the derivation of the gradient easier. Here, we take the derivative of the log of the output and apply the chain rule:

From there, we rearrange the terms and follow these steps:

Now assume that we have some upstream gradient, for example generated by a loss function L (e.g. a cross-entropy loss). We get the following expression for the gradient:

The simplification of the left term in (9) is due to the fact that δ_ij is only equal to 1 for the i-th element, collapsing the sum over j to a single term.
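In compact form, the softmax Jacobian and the resulting input gradient are:

$$\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j), \qquad \frac{\partial L}{\partial z_i} = s_i\left(\frac{\partial L}{\partial s_i} - \sum_j \frac{\partial L}{\partial s_j}\, s_j\right),$$

where $s = \mathrm{softmax}(z)$ and $\delta_{ij}$ is the Kronecker delta. Note that the gradient depends only on the softmax outputs and the upstream gradient, which is why caching the outputs in the forward pass is enough for the backward pass.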

Triton Implementation

Single Block Softmax

Now that we have worked through the derivation of the gradient, we can write the forward and backward softmax kernels. First, let's look at the PyTorch wrapper to understand how the single-block implementation works at a high level. Given a 2D input tensor, the forward and backward kernels process all rows in parallel.

For simplicity, we'll define BLOCK_SIZE to be large enough to handle all columns at once. Specifically, we'll set it to the next power of two greater than or equal to the number of columns, as required by Triton.

Then, we'll define our `grid` to be the number of rows (it could potentially also handle a batch dimension).

The PyTorch wrapper for our SoftmaxSingleBlock is a class inheriting from torch.autograd.Function that implements forward and backward. Both methods take a ctx argument, which we'll use to cache the softmax outputs during the forward pass and reuse them during the backward pass.
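As a rough sketch of what such a wrapper could look like (the kernel names `_softmax_fwd_kernel` and `_softmax_bwd_kernel` and their signatures are placeholders, not the article's exact code; `calculate_settings` is the helper shown just below):

import torch

class SoftmaxSingleBlock(torch.autograd.Function):
    """Single-block softmax with a custom backward pass (sketch)."""

    @staticmethod
    def forward(ctx, x: torch.Tensor) -> torch.Tensor:
        # Assume a contiguous, row-major 2D input for simplicity
        n_rows, n_cols = x.shape
        y = torch.empty_like(x)
        BLOCK_SIZE, num_warps = calculate_settings(n_cols)
        # Launch one program per row
        _softmax_fwd_kernel[(n_rows,)](
            y, x, x.stride(0), n_cols,
            BLOCK_SIZE=BLOCK_SIZE, num_warps=num_warps,
        )
        ctx.save_for_backward(y)  # cache the outputs for the backward pass
        return y

    @staticmethod
    def backward(ctx, dy: torch.Tensor) -> torch.Tensor:
        (y,) = ctx.saved_tensors
        n_rows, n_cols = y.shape
        dy = dy.contiguous()
        dx = torch.empty_like(y)
        BLOCK_SIZE, num_warps = calculate_settings(n_cols)
        _softmax_bwd_kernel[(n_rows,)](
            dx, dy, y, y.stride(0), n_cols,
            BLOCK_SIZE=BLOCK_SIZE, num_warps=num_warps,
        )
        return dx

softmax_sb = SoftmaxSingleBlock.apply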

Both kernels are quite simple: we start by loading the row inputs using the same syntax as in my previous vector-addition article. Notice that BLOCK_SIZE and num_warps are computed using a calculate_settings function. This function comes from the Unsloth library and has been reused in other kernel libraries such as LigerKernel (which the kernels in this article are loosely based on); it provides heuristics to tune both variables:

import triton

next_power_of_2 = triton.next_power_of_2

def calculate_settings(n: int) -> tuple[int, int]:
    MAX_FUSED_SIZE = 65536  # maximum block size on Nvidia GPUs
    BLOCK_SIZE = next_power_of_2(n)
    if BLOCK_SIZE > MAX_FUSED_SIZE:
        # we remove this assertion in this article
        raise RuntimeError(
            f"Cannot launch Triton kernel since n = {n} exceeds "
            f"the maximum CUDA blocksize = {MAX_FUSED_SIZE}."
        )
    num_warps = 4
    if BLOCK_SIZE >= 32768:
        num_warps = 32
    elif BLOCK_SIZE >= 8192:
        num_warps = 16
    elif BLOCK_SIZE >= 2048:
        num_warps = 8
    return BLOCK_SIZE, num_warps

Then, we implement the regular softmax for the forward pass and equation (10) for the backward pass. The only novelty here compared to the previous articles is the use of cache modifiers, which tell the compiler how to cache and evict data. For now, we'll only focus on three cache modifiers:

  • .ca (Cache at all levels): Tells the compiler to load the data into both the L1 and L2 caches, suggesting that it may be reused soon. This modifier should be used when the data is small enough to fit into L1 (~128–192 KB per SM on an A100) and will likely be accessed repeatedly.
  • .cs (Streaming): Treat the data as streaming: it will be used once and then discarded, to free up space in L1.
  • .wb (Write-back): Normal cached write; the data will remain in the cache hierarchy, which is good if the output may be reused.

In the following kernels, we'll use the .ca modifier for loads, since we perform several operations on the loaded data. For stores, we'll use .cs in the forward pass, since the outputs won't be immediately reused, and .wb in the backward pass since, in the context of autograd (i.e. the chain rule), gradient outputs will be consumed by downstream kernels.
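A minimal sketch consistent with the wrapper above (the names and exact signatures are again assumptions), using the cache modifiers just described:

import triton
import triton.language as tl

@triton.jit
def _softmax_fwd_kernel(y_ptr, x_ptr, stride, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < n_cols
    # Load the whole row; .ca keeps it in L1/L2 since we reuse it several times
    x = tl.load(x_ptr + row * stride + offs, mask=mask,
                other=-float("inf"), cache_modifier=".ca")
    x = x - tl.max(x, axis=0)            # subtract the max for stability
    num = tl.exp(x)
    y = num / tl.sum(num, axis=0)
    # Outputs are streamed out; .cs hints they won't be reused immediately
    tl.store(y_ptr + row * stride + offs, y, mask=mask, cache_modifier=".cs")


@triton.jit
def _softmax_bwd_kernel(dx_ptr, dy_ptr, y_ptr, stride, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < n_cols
    y = tl.load(y_ptr + row * stride + offs, mask=mask, other=0.0, cache_modifier=".ca")
    dy = tl.load(dy_ptr + row * stride + offs, mask=mask, other=0.0, cache_modifier=".ca")
    # Equation (10): dx_i = y_i * (dy_i - sum_j dy_j * y_j)
    dx = y * (dy - tl.sum(dy * y, axis=0))
    # Gradients feed downstream autograd kernels; .wb keeps them in the cache hierarchy
    tl.store(dx_ptr + row * stride + offs, dx, mask=mask, cache_modifier=".wb")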

Multi-Block Softmax

Now, let's take a look at the online formulation of the softmax. In this section, we implement a multi-block variant of the previous kernel. This version uses BLOCK_SIZE < n_cols; in other words, we only load a tile of BLOCK_SIZE elements at a time, similar to how we handled tiled GEMM in the last tutorial. Now you might ask: "how do we pick the block size?"

This is a great occasion to introduce Triton's autotune utility. Provided with a list of configurations, autotune performs a grid search to determine and cache the best configuration for a specific input shape. This process is repeated every time a new input shape is passed to the kernel.

Here, we perform a grid search over the block size and number of warps using the following utility function:

from itertools import product

import triton

# --- Multi Block Tuning ---
BLOCK_SIZES = [256, 512, 1024, 2048, 4096, 8192]
NUM_WARPS = [2, 4, 8, 16]

def get_autotune_config(
    block_sizes: list[int], num_warps: list[int]
) -> list[triton.Config]:
    # Cartesian product of block sizes and warp counts
    return [
        triton.Config(kwargs={"BLOCK_SIZE": bs}, num_warps=nw)
        for (bs, nw) in product(block_sizes, num_warps)
    ]

We can now decorate our multi-block kernels with autotune and pass the list of configs; key="n_cols" indicates that the optimal config depends on the number of columns of the input.

The implementation of these kernels is conceptually very close to the online softmax we covered before; the main difference is that we iterate over tiles (not over single elements as in NumPy), which requires some adjustments. For instance, we add a sum over the tile in the d update, and the backward kernel now requires two iterations as well, as sketched below.
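As a rough sketch of the idea (forward pass only; the kernel name and signature are assumptions, and BLOCK_SIZES, NUM_WARPS and get_autotune_config are the tuning helpers defined above):

import triton
import triton.language as tl

@triton.autotune(configs=get_autotune_config(BLOCK_SIZES, NUM_WARPS), key=["n_cols"])
@triton.jit
def _softmax_fwd_kernel_mb(y_ptr, x_ptr, stride, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    x_row = x_ptr + row * stride
    y_row = y_ptr + row * stride

    # First pass over the tiles: online update of the running max m and sum d
    m = tl.zeros((1,), dtype=tl.float32) - float("inf")
    d = tl.zeros((1,), dtype=tl.float32)
    for start in range(0, n_cols, BLOCK_SIZE):
        offs = start + tl.arange(0, BLOCK_SIZE)
        mask = offs < n_cols
        x = tl.load(x_row + offs, mask=mask, other=-float("inf"), cache_modifier=".ca")
        m_new = tl.maximum(m, tl.max(x, axis=0))
        # Rescale the running sum, then add the contribution of the whole tile
        d = d * tl.exp(m - m_new) + tl.sum(tl.exp(x - m_new), axis=0)
        m = m_new

    # Second pass: normalise each tile and store it
    for start in range(0, n_cols, BLOCK_SIZE):
        offs = start + tl.arange(0, BLOCK_SIZE)
        mask = offs < n_cols
        x = tl.load(x_row + offs, mask=mask, other=-float("inf"), cache_modifier=".ca")
        tl.store(y_row + offs, tl.exp(x - m) / d, mask=mask, cache_modifier=".cs")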

Note: the PyTorch wrapper is exactly the same, except we delete the line where BLOCK_SIZE and num_warps are declared (since they're picked by autotune).

Testing and Benchmarking

We can now execute a forward and backward pass with both kernels and check that they match the PyTorch baselines:

from copy import deepcopy

import torch

def validate_kernel(kernel_fn: callable) -> None:
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch.random.manual_seed(0)

    # Generate inputs
    x = torch.randn((256, 512), device=device)  # triton input
    x.requires_grad = True
    xt = deepcopy(x)  # torch input

    triton_output = kernel_fn(x)
    torch_output = torch.softmax(xt, dim=1)
    torch.testing.assert_close(triton_output, torch_output)  # test fwd kernel

    # Set up fake labels
    y = torch.zeros_like(x)
    inds = (torch.arange(0, y.shape[0]), torch.randint(0, 3, (y.shape[0],)))
    y[inds] = 1

    # Define loss and run backward pass
    loss_fn = torch.nn.CrossEntropyLoss()
    loss = loss_fn(torch_output, y)
    loss.backward()

    # Save gradient tensor for later
    torch_xgrad = xt.grad.detach().clone()
    triton_loss = loss_fn(triton_output, y)
    triton_loss.backward()
    torch.testing.assert_close(x.grad, torch_xgrad)  # test grad outputs

validate_kernel(softmax_sb)
validate_kernel(softmax_mb)

Finally, we benchmark our implementation against the PyTorch baseline using the following snippet:

# --- Source: Triton softmax tutorial ---
@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["N"],  # argument names to use as an x-axis for the plot
        x_vals=[
            128 * i for i in range(2, 100)
        ],  # different possible values for `x_name`
        line_arg="provider",  # argument name whose value corresponds to a different line in the plot
        line_vals=[
            "triton_single_block",
            "triton_multi_block",
            "torch",
        ],  # possible values for `line_arg`
        line_names=[
            "Triton_single_block",
            "Triton_multi_block",
            "Torch",
        ],  # label name for the lines
        styles=[("blue", "-"), ("green", "-"), ("red", "-")],
        ylabel="GB/s",  # label name for the y-axis
        plot_name="softmax-performance",  # name for the plot, also used as a file name when saving it
        args={"M": 4096},  # values for function arguments not in `x_names` and `y_name`
    )
)
def benchmark(M, N, provider):
    x = torch.randn(M, N, device=DEVICE, dtype=torch.float32)
    stream = getattr(torch, DEVICE.type).Stream()
    getattr(torch, DEVICE.type).set_stream(stream)
    if provider == "torch":
        ms = triton.testing.do_bench(lambda: torch.softmax(x, axis=-1))
    if provider == "triton_single_block":
        torch.cuda.synchronize()
        ms = triton.testing.do_bench(lambda: softmax_sb(x))
        torch.cuda.synchronize()
    if provider == "triton_multi_block":
        torch.cuda.synchronize()
        ms = triton.testing.do_bench(lambda: softmax_mb(x))
        torch.cuda.synchronize()
    gbps = lambda ms: 2 * x.numel() * x.element_size() * 1e-9 / (ms * 1e-3)
    return gbps(ms)

benchmark.run(show_plots=True, print_data=True)

Good news! Our single-block kernel consistently outperforms the PyTorch baseline, while the multi-block variant falls off for inputs with more than 6k columns:

Considering larger inputs, we can make several observations:

  1. The multi-block kernel eventually stabilises around 900 GB/s of throughput, surpassing the PyTorch baseline for inputs with more than 30k columns.
  2. Interestingly, it looks like the multi-block variant will dominate for inputs with more than 60k columns.
  3. Even though we exceed the maximum block size with the single-block variant, the kernel still runs smoothly for some reason. Indeed, Triton automatically manages the block size under the hood: when n_cols is larger than the hardware limit, Triton breaks the input down and iterates over it. However, this appears to be slower than the multi-block approach.

To go further, we could combine both approaches in a single kernel that explicitly selects the optimal kernel based on the input size. This way, we'd benefit from the high performance of the single-block kernel for small inputs and the higher throughput of the multi-block variant for inputs with more than 60k columns.

This concludes the third episode of this Triton series, thanks again for your support!

In the next article, we'll leverage the online softmax formulation in the context of Flash Attention.

Until next time! 👋

