This is a chapter of the in-progress book on linear algebra, “A bird’s eye view of linear algebra”. The table of contents so far:
Stay tuned for future chapters.
Here, we will describe the operations we can do with two matrices, keeping in mind that they are just representations of linear maps.
I) Why care about matrix multiplication?
Almost any information can be embedded in a vector space. Images, video, language, speech, biometric information and whatever else you can imagine. And all the applications of machine learning and artificial intelligence (like the recent chat-bots, text to image, etc.) work on top of these vector embeddings. Since linear algebra is the science of dealing with high dimensional vector spaces, it is an indispensable building block.

A lot of the techniques involve taking some input vectors from one space and mapping them to other vectors from some other space.
But why the focus on “linear” when most interesting functions are non-linear? It’s because the problem of making our models high dimensional and that of making them non-linear (general enough to capture all kinds of complex relationships) turn out to be orthogonal to each other. Many neural network architectures work by using linear layers with simple one dimensional non-linearities in between them. And there is a theorem (the universal approximation theorem) that says this kind of architecture can approximate any continuous function on a bounded domain to arbitrary accuracy, given enough hidden units.
Since the way we manipulate high-dimensional vectors is primarily matrix multiplication, it isn’t a stretch to say it is the bedrock of the modern AI revolution.

II) Algebra on maps
In chapter 2, we learnt how to quantify linear maps with determinants. Now, let’s do some algebra with them. We’ll need two linear maps and a basis.

II-A) Addition
If we can add matrices, we can add linear maps since matrices are the representations of linear maps. And matrix addition is not very interesting if you know scalar addition. Just as with vectors, it is only defined if the two matrices are the same size (same number of rows and columns) and involves lining them up and adding element by element.

So, we’re just doing a bunch of scalar additions. Which means that the properties of scalar addition logically extend.
Commutative: if you switch, the result won’t twitch
A+B = B+A
But commuting to work might not be commutative since going from A to B might take longer than B to A.
Associative: in a chain, don’t refrain, take any 2 and continue
A+(B+C) = (A+B)+C
Identity: And here I am where I began! That’s no way to treat a man!
The presence of a special element that when added to anything results in the same thing. In the case of scalars, it is the number 0. In the case of matrices, it is a matrix full of zeros.
A + 0 = A or 0 + A = A
Also, it is possible to start at any element and end up at any other via addition. So it must be possible to start at A and end up at the additive identity, 0. The thing that must be added to A to achieve this is the additive inverse of A and it is called -A.
A + (-A) = 0
For matrices, you just go to each scalar element in the matrix and replace it with its additive inverse (switching the sign if the scalars are numbers) to get the additive inverse of the matrix.
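The four properties above can be checked numerically. This is only a small sketch using NumPy; the matrices A, B and C are arbitrary values chosen for illustration.

```python
import numpy as np

# Arbitrary 2x3 matrices, chosen only for illustration.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
B = np.array([[0.5, -1.0, 2.0],
              [3.0, 0.0, -4.0]])
C = np.ones((2, 3))
Z = np.zeros((2, 3))  # additive identity: a matrix full of zeros

commutative = np.allclose(A + B, B + A)
associative = np.allclose(A + (B + C), (A + B) + C)
identity = np.allclose(A + Z, A) and np.allclose(Z + A, A)
inverse = np.allclose(A + (-A), Z)  # -A flips the sign of every entry

print(commutative, associative, identity, inverse)
```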
II-B) Subtraction
Subtraction is just addition with the additive inverse of the second matrix instead.
A-B = A+(-B)
II-C) Multiplication
We could have defined matrix multiplication just as we defined matrix addition. Just take two matrices that are the same size (rows and columns) and then multiply the scalars element by element. There is a name for that kind of operation, the Hadamard product.
But no, we defined matrix multiplication as a far more convoluted operation, more “exotic” than addition. And it isn’t complex just for the sake of it. It is the most important operation in linear algebra by far.
It enjoys this special status because it is the means by which linear maps are applied to vectors, building on top of dot products.
The way it actually works requires a dedicated section, so we’ll cover that in section III. Here, let’s list some of its properties.
Commutative
Unlike addition, matrix multiplication is not always commutative. Which means that the order in which you apply linear maps to your input vector matters.
A.B != B.A
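A tiny numerical illustration of this, using NumPy with two 2x2 matrices picked only as an example (B swaps coordinates, so applying it before or after A gives different results):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])  # swaps the two coordinates

AB = A @ B  # swaps the columns of A
BA = B @ A  # swaps the rows of A

print(AB)
print(BA)
print(np.array_equal(AB, BA))
```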
Associative
It is still associative
A.B.C = A.(B.C) = (A.B).C
And there is a lot of depth to this property, as we will see in section IV.
Identity
Just like addition, matrix multiplication also has an identity element, I: an element that, when any matrix is multiplied by it, results in the same matrix. The big caveat being that this element only exists for square matrices and is itself square.
Now, because of the importance of matrix multiplication, “the identity matrix” in general is defined as the identity element of matrix multiplication (not that of addition or the Hadamard product for example).
The identity element for addition is a matrix composed of 0’s and that of the Hadamard product is a matrix composed of 1’s. The identity element of matrix multiplication is:
$$I = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$$
So, 1’s on the main diagonal and 0’s everywhere else. What kind of definition for matrix multiplication would lead to an identity element like this? We’ll need to describe how it works to see, but first let’s go to the final operation.
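To see the three identity elements side by side, here is a short NumPy sketch (the matrix A is an arbitrary example). Note that in NumPy, `*` is the element-wise (Hadamard) product while `@` is matrix multiplication.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

zeros = np.zeros((2, 2))  # identity element for addition
ones = np.ones((2, 2))    # identity element for the Hadamard product
I = np.eye(2)             # identity element for matrix multiplication

add_ok = np.array_equal(A + zeros, A)
hadamard_ok = np.array_equal(A * ones, A)
matmul_ok = np.array_equal(A @ I, A) and np.array_equal(I @ A, A)

# The matrix of ones is NOT an identity for matrix multiplication.
ones_is_matmul_identity = np.array_equal(A @ ones, A)

print(add_ok, hadamard_ok, matmul_ok, ones_is_matmul_identity)
```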
II-D) Division
Just as with addition, the presence of an identity matrix suggests that a matrix, A can be multiplied with another matrix, A^-1 and taken to the identity. This is called the inverse. It exists only for square matrices whose determinant is non-zero. Since matrix multiplication isn’t commutative, there are two ways to do this. Thankfully, both lead to the identity matrix.
A.(A^-1) = (A^-1).A = I
So, “dividing” a matrix by another is simply multiplication with the second one’s inverse, A.B^-1. If matrix multiplication is important, then this operation is as well since it is the inverse. It is also related to how we historically developed (or maybe stumbled upon) linear algebra. But more on that in the next chapter (4).
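As a quick sanity check of the inverse and of “division”, here is a sketch with NumPy’s `np.linalg.inv`. The matrices A and B are arbitrary invertible examples, not taken from the text.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [2.0, 1.0]])   # determinant 1, so it is invertible
B = np.array([[2.0, 0.0],
              [0.0, 4.0]])   # determinant 8, so it is invertible

A_inv = np.linalg.inv(A)

# Both orders of multiplication lead to the identity matrix.
left_ok = np.allclose(A_inv @ A, np.eye(2))
right_ok = np.allclose(A @ A_inv, np.eye(2))

# "Dividing" A by B: multiply by the inverse of B.
quotient = A @ np.linalg.inv(B)
undo_ok = np.allclose(quotient @ B, A)  # multiplying by B undoes the division

print(left_ok, right_ok, undo_ok)
```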
Another property we’ll be using, one that combines addition and multiplication, is the distributive property. It applies to all kinds of matrix multiplication from the traditional one to the Hadamard product:
A.(B+C) = A.B + A.C
III) Why is matrix multiplication defined this way?
We have arrived at last at the section where we will answer the question in the title, the meat of this chapter.
Matrix multiplication is the way linear maps act on vectors. So, we get to motivate it that way.
III-A) How are linear maps applied in practice?
Consider a linear map that takes m dimensional vectors (from R^m) as input and maps them to n dimensional vectors (in R^n). Let’s call the m dimensional input vector, v.
At this point, it might be helpful to imagine yourself actually coding up this linear map in some programming language. It should be a function that takes the m-dimensional vector, v as input and returns the n dimensional vector, u.
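A naive first attempt at such a function might look like the sketch below. The name `first_attempt` and the default n=3 are assumptions made purely for illustration; the only point is that it has the right input and output shapes.

```python
import numpy as np

def first_attempt(v, n=3):
    """Take an m-dimensional vector v and return some n-dimensional vector u.

    This version just draws u at random and never looks at v.
    """
    rng = np.random.default_rng()
    u = rng.normal(size=n)
    return u

u = first_attempt(np.array([1.0, 2.0]))
print(u.shape)
```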
The linear map has to take this vector and turn it into an n dimensional vector somehow. In the function above, you’ll notice we just generated some vector at random. But this completely ignored the input vector, v. That’s unreasonable, v should have some say. Now, v is just an ordered list of m scalars v = [v1, v2, v3, …, vm]. What do scalars do? They scale vectors. And the output vector we need should be n dimensional. How about we take some (fixed) m vectors (pulled out of thin air, each n dimensional), w1, w2, …, wm. Then, scale w1 by v1, w2 by v2 and so on and add them all up. This leads to an equation for our linear map (with the output on the left).
$$u = f(v) = v_1 w_1 + v_2 w_2 + \cdots + v_m w_m \tag{1}$$
Make note of the equation (1) above since we’ll be using it again.
Since the w1, w2,… are all n dimensional, so is u. And all the elements of v=[v1, v2, …, vm] have an influence on the output, u. The idea in equation (1) is implemented below. We take some randomly generated vectors for the w’s but with fixed seeds (ensuring that the vectors are the same across every call of the function).
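Here is one way this could be sketched in Python with NumPy. The function name `linear_map`, the default n=3 and the single fixed seed are illustrative assumptions; the essential part is that every w vector is the same on every call and gets scaled by the matching element of v.

```python
import numpy as np

def linear_map(v, n=3, seed=0):
    """Equation (1): u = v1.w1 + v2.w2 + ... + vm.wm."""
    v = np.asarray(v, dtype=float)
    # A fixed seed means the same w1, w2, ... are drawn on every call.
    rng = np.random.default_rng(seed)
    u = np.zeros(n)
    for v_i in v:
        w_i = rng.normal(size=n)  # the i-th fixed n-dimensional vector
        u += v_i * w_i            # scale w_i by v_i and accumulate
    return u

a = np.array([1.0, -2.0, 0.5])
b = np.array([3.0, 0.0, 4.0])

# The two defining properties of a linear map, checked numerically.
print(np.allclose(linear_map(a + b), linear_map(a) + linear_map(b)))
print(np.allclose(linear_map(2.5 * a), 2.5 * linear_map(a)))
```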
We have a way now to “map” m dimensional vectors (v) to n dimensional vectors (u). But does this “map” satisfy the properties of a linear map? Recall from chapter 1, section II the properties of a linear map, f (here, a and b are vectors and c is a scalar):
f(a+b) = f(a) + f(b)
f(c.a) = c.f(a)
It is clear that the map specified by equation (1) satisfies the above two properties of a linear map.
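Spelled out, the check is a short derivation. It uses the form of the map described above, f(v) = v1.w1 + … + vm.wm, with a = [a1, …, am] and b = [b1, …, bm]:

```latex
f(a + b) = \sum_{i=1}^{m} (a_i + b_i)\, w_i
         = \sum_{i=1}^{m} a_i w_i + \sum_{i=1}^{m} b_i w_i
         = f(a) + f(b)

f(c \cdot a) = \sum_{i=1}^{m} (c\, a_i)\, w_i
             = c \sum_{i=1}^{m} a_i w_i
             = c \cdot f(a)
```

Both steps rely only on scalar multiplication distributing over vector addition, which is why the choice of the w vectors does not matter.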


The m vectors, w1, w2, …, wm are arbitrary and no matter what we choose for them, the function, f defined in equation (1) is a linear map. So, different choices for these w vectors result in different linear maps. Moreover, for any linear map you can imagine, there will be some vectors w1, w2,… that can be used in conjunction with equation (1) to represent it.
Now, for a given linear map, we can collect the vectors w1, w2,… into the columns of a matrix. Such a matrix will have n rows and m columns. This matrix represents the linear map, f and its multiplication with an input vector, v represents the application of the linear map, f to v. And this application is where the definition of matrix multiplication comes from.
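A small NumPy sketch of this idea, with n=3 and m=2 and made-up values for w1, w2 and v: stacking the w vectors as columns and multiplying by v gives the same result as scaling and adding them by hand.

```python
import numpy as np

# Two fixed 3-dimensional vectors (illustrative values), so n=3 and m=2.
w1 = np.array([1.0, 0.0, 2.0])
w2 = np.array([-1.0, 3.0, 1.0])

# Collect them into the columns of a matrix with n rows and m columns.
W = np.column_stack([w1, w2])

v = np.array([2.0, -1.0])

u_sum = v[0] * w1 + v[1] * w2  # scale each column and add them up
u_mat = W @ v                  # matrix-vector multiplication

print(W.shape)
print(u_sum, u_mat)
```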

We can now see why the identity element for matrix multiplication is the way it is: its columns are the unit vectors [1, 0, …, 0], [0, 1, …, 0] and so on, so scaling the i-th column by vi and adding everything up simply reassembles v.

We start with a column vector, v and end with a column vector, u (so just one column for each of them). And since the elements of v must align with the column vectors of the matrix representing the linear map, the number of columns of the matrix must equal the number of elements in v. More on this in section III-C.
III-B) Matrix multiplication as a composition of linear maps
Now that we have described how a matrix is multiplied with a vector, we can move on to multiplying a matrix with another matrix.
The definition of matrix multiplication is much more natural when we consider the matrices as representations of linear maps.
Linear maps are functions that take a vector as input and produce a vector as output. Let’s say the linear maps corresponding to two matrices are f and g. How would you think of adding these maps (f+g)?
(f+g)(v) = f(v)+g(v)
This is reminiscent of the distributive property, where the argument goes inside the bracket to both the functions and we add the results. And if we fix a basis, this corresponds to applying both linear maps to the input vector and adding the results. By the distributive property of matrix and vector multiplication, this is the same as adding the matrices corresponding to the linear maps and applying the result to the vector.
Now, let’s think of multiplication (f.g).
(f.g)(v) = f(g(v))
Since linear maps are functions, the most natural interpretation of multiplication is to compose them (apply them one at a time, in sequence, to the input vector).
When two matrices are multiplied, the resulting matrix represents the composition of the corresponding linear maps. Consider matrices A and B; the product AB embodies the transformation achieved by applying the linear map represented by B to the input vector first and then applying the linear map represented by A.
So we have a linear map corresponding to the matrix, A and a linear map corresponding to the matrix, B. We’d like to know the matrix, C corresponding to the composition of the two linear maps. So, applying B to any vector first and then applying A to the result should be equivalent to just applying C.
A.(B.v) = C.v = (A.B).v
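This requirement is easy to test numerically. The sketch below uses NumPy with randomly generated matrices of compatible (but otherwise arbitrary) sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))  # maps 3-dimensional vectors to 4 dimensions
B = rng.normal(size=(3, 5))  # maps 5-dimensional vectors to 3 dimensions
v = rng.normal(size=5)

step_by_step = A @ (B @ v)   # apply B first, then A
C = A @ B                    # the single matrix for the composed map
all_at_once = C @ v

print(C.shape)
print(np.allclose(step_by_step, all_at_once))
```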
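In the last section, we learnt how to multiply a matrix and a vector. Let’s do that twice for A.(B.v). Say the columns of B are the column vectors, b1, b2, …, bm. From equation (1) in the previous section, and using the linearity of the map represented by A,
$$A.(B.v) = A.(v_1 b_1 + v_2 b_2 + \cdots + v_m b_m) = v_1 (A.b_1) + v_2 (A.b_2) + \cdots + v_m (A.b_m)$$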
And what if we applied the linear map corresponding to C=A.B directly to the vector, v? Since C has as many columns as B, the column vectors of the matrix C are c1, c2, …, cm.
$$C.v = v_1 c_1 + v_2 c_2 + \cdots + v_m c_m$$
Comparing the two equations above (they must agree for every choice of v) we get,
$$c_i = A.b_i \quad \text{for } i = 1, 2, \ldots, m$$
So, the columns of the product matrix, C=AB are obtained by applying the linear map corresponding to matrix A to each of the columns of the matrix B. And collecting these resulting vectors into a matrix gives us C.
We have just extended our matrix-vector multiplication result from the previous section to the multiplication of two matrices. We just break the second matrix into a collection of vectors, multiply the first matrix with each of them and collect the resulting vectors into the columns of the result matrix.
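The column-by-column recipe can be written down directly. In this sketch, `matmul_by_columns` is a hypothetical helper name, and NumPy’s `@` is used only for the matrix-vector products and for comparison.

```python
import numpy as np

def matmul_by_columns(A, B):
    """Build C = A.B by applying A to each column of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if A.shape[1] != B.shape[0]:
        raise ValueError("columns of A must match rows of B")
    # Each column of C is A applied to the matching column of B.
    columns = [A @ B[:, j] for j in range(B.shape[1])]
    return np.column_stack(columns)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(3, 5))

C = matmul_by_columns(A, B)
print(C.shape)
print(np.allclose(C, A @ B))
```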

So the entry in the first row and first column of the result matrix, C is the dot product of the first row of A and the first column of B. And in general the entry in the i-th row and j-th column of C is the dot product of the i-th row of A and the j-th column of B. This is the definition of matrix multiplication most of us first learn.

Associative proof
We can also show that matrix multiplication is associative now. Instead of the single vector, v, let’s apply the product C=AB individually to a group of vectors, w1, w2, …, wl. Let’s say the matrix that has these as column vectors is W. We can use the exact same trick as above to show:
(A.B).W = A.(B.W)
It is because (A.B).w1 = A.(B.w1) and the same holds for all the other w vectors.
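The same argument can be followed numerically. This is a sketch with NumPy and random matrices of arbitrary compatible sizes: the check is done one column of W at a time, and then for W as a whole.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 5))  # its columns play the role of w1, w2, ...

# Column by column: (A.B).w_j equals A.(B.w_j) for every column w_j of W.
columns_match = all(
    np.allclose((A @ B) @ W[:, j], A @ (B @ W[:, j]))
    for j in range(W.shape[1])
)

# Hence the same holds for the whole matrix W.
left = (A @ B) @ W
right = A @ (B @ W)

print(columns_match, np.allclose(left, right))
```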
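Sum of outer products
Say we’re multiplying two matrices A (with n rows and k columns) and B (with k rows and m columns):
$$A = \begin{bmatrix} a_{1,1} & \cdots & a_{1,k} \\ \vdots & \ddots & \vdots \\ a_{n,1} & \cdots & a_{n,k} \end{bmatrix}, \quad B = \begin{bmatrix} b_{1,1} & \cdots & b_{1,m} \\ \vdots & \ddots & \vdots \\ b_{k,1} & \cdots & b_{k,m} \end{bmatrix}$$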
Equation (3) can be generalized to show that the i,j element of the resulting matrix, C is:
$$c_{i,j} = \sum_{l=1}^{k} a_{i,l}\, b_{l,j} \tag{4}$$
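We have a sum over k terms. What if we took each of those terms and created k individual matrices out of them? For example, the first matrix will have as its i,j-th entry: a_{i,1}.b_{1,j}. The k matrices and their relationship to C:
$$C = \begin{bmatrix} a_{1,1} b_{1,1} & \cdots & a_{1,1} b_{1,m} \\ \vdots & \ddots & \vdots \\ a_{n,1} b_{1,1} & \cdots & a_{n,1} b_{1,m} \end{bmatrix} + \cdots + \begin{bmatrix} a_{1,k} b_{k,1} & \cdots & a_{1,k} b_{k,m} \\ \vdots & \ddots & \vdots \\ a_{n,k} b_{k,1} & \cdots & a_{n,k} b_{k,m} \end{bmatrix}$$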
This process of summing over k matrices can be visualized as follows (reminiscent of the animation in section III-A that visualized a matrix multiplied with a vector):

We see here the sum over k matrices, all of the same size (n x m), which is the same size as the result matrix, C. Notice in equation (4) how, within each term, the column index of the first matrix, A stays the same while for the second matrix, B, the row index stays the same. So the k matrices we’re getting are the matrix products of the l-th column of A and the l-th row of B, for each l from 1 to k.
Matrix multiplication as a sum of outer products. Image by author.
Inside the summation, two vectors are multiplied to produce matrices. It is a special case of matrix multiplication when applied to vectors (special cases of matrices) and is called the “outer product”. Here is yet another animation to show this sum of outer products process:

This tells us why the number of row vectors in B should be the same as the number of column vectors in A. Because they have to be paired up to get the individual matrices.
We’ve seen a lot of visualizations and some math, now let’s see the same thing via code for the special case where A and B are square matrices. This is based on section 4.2 of the book “Introduction to Algorithms”, [2].
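What follows is a minimal Python sketch in that spirit, not a reproduction of the textbook’s pseudocode: a plain triple loop for square matrices (the row-times-column definition), followed by the sum of outer products view using NumPy’s `np.outer`. The function names are illustrative.

```python
import numpy as np

def square_matrix_multiply(A, B):
    """Triple-loop multiplication of two n x n matrices given as nested lists."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):          # row of A
        for j in range(n):      # column of B
            for k in range(n):  # running index of the dot product
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_by_outer_products(A, B):
    """Sum, over l, of the outer product of column l of A and row l of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    C = np.zeros((A.shape[0], B.shape[1]))
    for l in range(A.shape[1]):
        C += np.outer(A[:, l], B[l, :])
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(square_matrix_multiply(A, B))
print(matmul_by_outer_products(A, B))
```

Both functions compute the same numbers; they only differ in the order in which the individual products a_{i,l}.b_{l,j} are added up.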
III-C) Matrix multiplication: the structural choices

Matrix multiplication seems to be structured in a weird way. It’s clear that we need to take a bunch of dot products. So, one of the dimensions has to match. But why make the number of columns of the first matrix equal to the number of rows of the second?
Wouldn’t it make things more straightforward if we redefined it so that the number of rows of the two matrices should be the same (or the number of columns)? This would make it much easier to identify when two matrices can be multiplied.
The traditional definition, where we require the rows of the first matrix to align with the columns of the second, has more than one advantage. Let’s go first to matrix-vector multiplication. Animation (1) in section III-A showed us how the traditional version works. Let’s visualize what it would look like if we required the rows of the matrix to align with the number of elements in the vector instead. Now, the n rows of the matrix will need to align with the n elements of the vector.

We see that we’d have to start with a column vector, v with n rows and one column and end up with a row vector, u with 1 row and m columns. This is awkward and makes defining an identity element for matrix multiplication challenging since the input and output vectors can never have the same shape. With the traditional definition, this isn’t an issue since the input is a column vector and the output is also a column vector (see animation (1)).
Another consideration is multiplying a chain of matrices. In the traditional method, it is easy to see right away, based on their dimensions, that the chain of matrices below can be multiplied together.

Further, we can tell that the output matrix will have l rows and p columns.
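The bookkeeping for a chain under the traditional definition is simple enough to put in a few lines of Python. In this sketch, `chain_shape` is a hypothetical helper and the dimensions are made up for illustration: adjacent inner dimensions must match, and the result keeps the rows of the first matrix and the columns of the last.

```python
def chain_shape(shapes):
    """Return (rows, cols) of the product of a chain of matrices.

    shapes is a list of (rows, cols) pairs, in the order of multiplication.
    """
    rows, cols = shapes[0]
    for next_rows, next_cols in shapes[1:]:
        if cols != next_rows:
            raise ValueError(
                f"cannot multiply: {cols} columns followed by {next_rows} rows"
            )
        cols = next_cols  # the result inherits the columns of the newest matrix
    return rows, cols

# Illustrative chain: (2x3).(3x4).(4x5).(5x6) gives a 2x6 matrix.
print(chain_shape([(2, 3), (3, 4), (4, 5), (5, 6)]))
```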
In the framework where the rows of the two matrices should line up, this quickly becomes a mess. For the first two matrices, we can tell that the rows should align and that the result will have n rows and l columns. But visualizing how many rows and columns the result will have and then reasoning about whether it’ll be compatible with C, etc. becomes a nightmare.

And that is why we require the rows of the first matrix to align with the columns of the second matrix. But maybe I missed something. Maybe there is an alternate definition that is “cleaner” and manages to side-step these two challenges. Would love to hear ideas in the comments 🙂
III-D) Matrix multiplication as a change of basis
So far, we’ve thought of matrix multiplication with vectors as a linear map that takes a vector as input and returns some other vector as output. But there is another way to think of matrix multiplication: as a way to change perspective.
Let’s consider two-dimensional space, R². We represent any vector in this space with two numbers. What do these numbers represent? The coordinates along the x-axis and y-axis. A unit vector that points just along the x-axis is [1,0] and one that points along the y-axis is [0,1]. These are our basis for the space. Every vector now has an address. For example, the vector [2,3] means we scale the first basis vector by 2 and the second by 3.
But this isn’t the only basis for the space. Someone else (say, he who shall not be named) might want to use two other vectors as their basis. For example, the vectors e1=[3,2] and e2=[1,1]. Any vector in the space R² can also be expressed in their basis. The same vector would have different representations in our basis and their basis. Like different addresses for the same house (perhaps based on different postal systems).
When we’re in the basis of he who shall not be named, the vector e1 = [1,0] and the vector e2 = [0,1] (which are the basis vectors from his perspective, by definition of basis vectors). And the functions that translate vectors from our basis system to that of he who shall not be named and vice versa are linear maps. And so the translations can be represented as matrix multiplications. Let’s call the matrix that takes vectors from our basis to that of he who shall not be named, M1 and the matrix that does the opposite, M2. How do we find the matrices for these maps?

We know that the vectors we call e1=[3,2] and e2=[1,1], he who shall not be named calls e1=[1,0] and e2=[0,1]. Let’s collect our version of the vectors into the columns of a matrix.
$$\begin{bmatrix} 3 & 1 \\ 2 & 1 \end{bmatrix}$$
And also collect the vectors, e1 and e2 of he who shall not be named into the columns of another matrix. This is just the identity matrix.
$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
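Since matrix multiplication operates independently on the columns of the second matrix, M1 translates both of our vectors at once:
$$M_1 \begin{bmatrix} 3 & 1 \\ 2 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$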
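Multiplying both sides on the right by the inverse of the matrix that holds our version of the vectors gives us M1:
$$M_1 = \begin{bmatrix} 3 & 1 \\ 2 & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 1 & -1 \\ -2 & 3 \end{bmatrix}$$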
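Doing the same thing in reverse gives us M2:
$$M_2 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 3 & 1 \\ 2 & 1 \end{bmatrix} \implies M_2 = \begin{bmatrix} 3 & 1 \\ 2 & 1 \end{bmatrix}$$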
This can all be generalized into the following statement: A matrix with column vectors w1, w2, …, wn translates vectors expressed in a basis where w1, w2, …, wn are the basis vectors to our basis.
And the inverse of that matrix translates vectors from our basis to the one where w1, w2, …, wn are the basis.
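The example basis from this section, e1=[3,2] and e2=[1,1], can be run through NumPy as a check. The vector [2,3] used for the round trip is an arbitrary choice.

```python
import numpy as np

# Columns are the basis vectors e1=[3,2] and e2=[1,1], written in our basis.
M2 = np.array([[3.0, 1.0],
               [2.0, 1.0]])   # their basis -> our basis
M1 = np.linalg.inv(M2)        # our basis -> their basis

theirs = np.array([2.0, 3.0])  # "2 of e1 and 3 of e2" in their coordinates
ours = M2 @ theirs             # the same vector in our coordinates
back = M1 @ ours               # translate it back again

print(M1)
print(ours, back)
print(M1 @ np.array([3.0, 2.0]))  # our [3,2] is their first basis vector
```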
All invertible square matrices can hence be thought of as “basis changers”.
Note: In the special case of an orthonormal matrix (where every column is a unit vector and orthogonal to every other column), the inverse becomes the same as the transpose. So, changing to the basis of the columns of such a matrix becomes equivalent to taking the dot product of a vector with each of those columns (which are the rows of the transpose).
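A rotation matrix is a convenient orthonormal example. In this NumPy sketch, the 30 degree angle and the vector v are arbitrary choices.

```python
import numpy as np

theta = np.pi / 6  # 30 degrees, chosen arbitrarily
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # columns are orthonormal

inverse_is_transpose = np.allclose(np.linalg.inv(Q), Q.T)

v = np.array([2.0, 3.0])
coords = Q.T @ v                             # v expressed in the basis of Q's columns
dots = np.array([Q[:, 0] @ v, Q[:, 1] @ v])  # dot product with each column of Q

print(inverse_is_transpose)
print(coords, dots)
```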
For more on this, see the 3B1B video, [1].
Conclusion
Matrix multiplication is arguably one of the most important operations in modern computing and also in almost any data science field. Understanding deeply how it works is important for any data scientist. Most linear algebra textbooks describe the “what” but not why it’s structured the way it is. Hopefully this blog filled that gap.
[1] 3B1B video on change of basis: https://www.youtube.com/watch?v=P2LTAUO1TdA&t=2s
[2] Introduction to Algorithms by Cormen et al., third edition
[3] Matrix multiplication as sum of outer products: https://math.stackexchange.com/questions/2335457/matrix-at-a-as-sum-of-outer-products
[4] Catalan numbers Wikipedia article: https://en.wikipedia.org/wiki/Catalan_number