Ever since I was a child, I've been fascinated by drawing. What struck me was not only the act of drawing itself, but also the idea that every drawing could be improved more and more. I remember reaching a very high level with my drawing style. However, once I reached the peak of perfection, I'd try to see how I could improve the drawing even further, alas with disastrous results.
Since then I have always kept the same mantra in mind: "refine and iterate and you'll reach perfection". At university, my approach was to read books many times, expanding my knowledge by searching for other sources and finding hidden layers of meaning in each concept. Today, I apply this same philosophy to AI/ML and coding.
We know that matrix multiplication (matmul for simplicity here) is the core part of any AI process. Some time ago I developed LLM.rust, a Rust mirror of Karpathy's LLM.c. The hardest point in the Rust implementation was the matrix multiplication. Since we have to perform thousands of iterations to fine-tune a GPT-based model, we need an efficient matmul operation. For this purpose, I had to use the BLAS library, resorting to unsafe code to overcome the limits and obstacles. The use of unsafe goes against Rust's philosophy, which is why I'm always looking for safer ways to improve matmul in this context.
So, taking inspiration from Sam Altman's statement ("ask GPT how to create value"), I decided to ask local LLMs to generate, benchmark, and iterate on their own algorithms to create a better, native Rust matmul implementation.
The challenge has some constraints:
- We need to use our local environment. In my case, a MacBook Pro, M3, 36GB RAM;
- Overcome the token limits;
- Time and benchmark the code within the generation loop itself.
I know that reaching BLAS-level performance with this approach is almost impossible, but I want to highlight how we can leverage AI for custom needs, even with our "tiny" laptops, so that we can unblock ideas and push boundaries in any field. This post is meant to be an inspiration for practitioners, and for people who want to get more familiar with Microsoft Autogen and local LLM deployment.
All the code can be found in this GitHub repo. This is an ongoing experiment, and many changes/improvements will be committed.
General idea
The overall idea is to have a roundtable of agents. The starting point is the MrAderMacher Mixtral 8x7B Q4_K_M local model. From the model we create 5 entities:
- the Proposer comes up with a new Strassen-like algorithm, to find a better and more efficient way to perform matmul;
- the Verifier reviews the matmul formulation through symbolic math;
- the Coder creates the underlying Rust code;
- the Tester executes it and saves all the information to the vector database;
- the Supervisor acts silently, controlling the overall workflow.
| Agent | Role |
| Proposer | Analyses benchmark timings and proposes new tuning parameters and matmul formulations. |
| Verifier | (Currently disabled in the code.) Verifies the Proposer's mathematical formulation through symbolic verification. |
| Coder | Takes the parameters and works out the Rust template code. |
| Tester | Runs the Rust code, saves the code and computes the benchmark timing. |
| Supervisor | Overall control of the workflow. |
The overall workflow can be orchestrated through Microsoft Autogen, as depicted in fig.1.

Prepare the input data and vector database
The input data is collected from academic papers focused on matrix multiplication optimisation. Many of these papers are referenced in, and related to, DeepMind's Strassen paper. I wanted to start simply, so I collected 50 papers, published from 2020 to 2025, that specifically address matrix multiplication.
Next, I used chroma to create the vector database. The critical aspect in generating a new vector database is how the PDFs are chunked. In this context, I used a semantic chunker. Unlike plain text-splitting methods, the semantic chunker uses the actual meaning of the text to determine where to cut. The goal is to keep related sentences together in one chunk, making the final vector database more coherent and accurate. This is done using the local model BAAI/bge-base-en-v1.5. The GitHub gist below shows the full implementation.
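To make the chunking idea concrete, here is a minimal sketch of a semantic chunker. It is not the code from the gist: the `embed` function is a toy bag-of-words stand-in for a real sentence-embedding model such as BAAI/bge-base-en-v1.5, and the names `semantic_chunks` and `threshold` are assumptions made purely for illustration. The idea is simply to start a new chunk whenever two consecutive sentences are not similar enough.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group consecutive sentences; cut where similarity drops below threshold."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(embed(chunks[-1][-1]), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
        else:
            chunks.append([sentence])  # topic shift: start a new chunk
    return chunks


sentences = [
    "Strassen multiplies two matrices with seven block products.",
    "The seven block products replace the eight of the naive method.",
    "Cache blocking improves memory locality on modern CPUs.",
]
chunks = semantic_chunks(sentences)
```

Here the first two sentences share enough vocabulary to stay together, while the third starts a new chunk. With a real embedding model the similarity would capture meaning rather than word overlap, and the threshold would need tuning on the actual papers.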
The core code: autogen-core and GGML models
I used Microsoft Autogen, specifically the autogen-core variant (version 0.7.5). Unlike the higher-level chat API, autogen-core gives us access to low-level, event-driven building blocks, which are necessary to create the state-machine-driven workflow we need. As a matter of fact, the challenge is to maintain a strict workflow. All the agents must act in a specific order: Proposer -> Verifier -> Coder -> Tester.
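The strict ordering can be pictured as a tiny state machine. The sketch below is plain Python rather than autogen-core code, and the names `Step`, `ORDER`, `next_step`, and `run_round` are hypothetical. It only illustrates that each agent hands over to exactly one successor, and that a disabled agent (such as the Verifier) can be skipped without breaking the order.

```python
from enum import Enum
from typing import Optional


class Step(Enum):
    PROPOSER = "Proposer"
    VERIFIER = "Verifier"
    CODER = "Coder"
    TESTER = "Tester"


# The only legal order of speakers in one round.
ORDER = [Step.PROPOSER, Step.VERIFIER, Step.CODER, Step.TESTER]


def next_step(current: Step, disabled: frozenset = frozenset()) -> Optional[Step]:
    """Return the agent that must act after `current`, or None when the round ends."""
    for candidate in ORDER[ORDER.index(current) + 1:]:
        if candidate not in disabled:
            return candidate
    return None


def run_round(disabled: frozenset = frozenset()) -> list:
    """Walk one round from the Proposer and return the agent names in acting order."""
    step: Optional[Step] = ORDER[0]
    visited = []
    while step is not None:
        visited.append(step.value)
        step = next_step(step, disabled)
    return visited
```

For example, `run_round(frozenset({Step.VERIFIER}))` walks Proposer, Coder, Tester. In an event-driven runtime, the orchestrating agent would consult something like `next_step` each time an agent finishes, instead of looping.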
The core part is the BaseMatMulAgent, which inherits from AutoGen's RoutedAgent. This base class allows us to standardise how the LLM agents take part in the chat and how they behave.
From the code above, we can see the class is designed to participate in an asynchronous group chat, handling conversation history, calls to external tools, and response generation through the local LLM.
The core component is @message_handler, a decorator that registers a method as a listener or subscriber, based on the message type. The decorator automatically detects the type hint of the method's first argument (in our case, message: GroupChatMessage). It then subscribes the agent to receive any events of that type sent to the agent's topic. The handle_message async method is then responsible for updating the agent's internal memory, without producing a response.
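The routing-by-type-hint idea can be illustrated without autogen-core at all. The following is a toy sketch, not AutoGen's implementation: `toy_message_handler`, `ToyAgent`, and the two message classes defined here are hypothetical names, and the real `@message_handler` additionally deals with async delivery, message contexts, and topics.

```python
import inspect
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class GroupChatMessage:
    source: str
    content: str


@dataclass
class RequestToSpeak:
    pass


def toy_message_handler(func):
    """Tag a method with the message type found on its first non-self parameter."""
    params = list(inspect.signature(func).parameters)
    message_param = params[1]  # params[0] is `self`
    func._handles = get_type_hints(func)[message_param]
    return func


class ToyAgent:
    def __init__(self) -> None:
        self.history: list[str] = []

    @toy_message_handler
    def handle_message(self, message: GroupChatMessage) -> None:
        # Only update internal memory; no reply is produced here.
        self.history.append(f"{message.source}: {message.content}")

    def dispatch(self, message: object) -> bool:
        """Deliver `message` to the handler registered for its exact type, if any."""
        for _, method in inspect.getmembers(self, inspect.ismethod):
            if getattr(method, "_handles", None) is type(message):
                method(message)
                return True
        return False


agent = ToyAgent()
delivered = agent.dispatch(GroupChatMessage("Proposer", "try 64x64 blocks"))
ignored = agent.dispatch(RequestToSpeak())
```

The `GroupChatMessage` is routed to `handle_message` and stored in `history`, while the `RequestToSpeak` message finds no matching handler and is dropped.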
With the listener-subscriber mechanism in place, we can focus on the Supervisor class. The MatMulManager inherits from RoutedAgent and orchestrates the overall flow of the agents.
The code above handles all the agents. We're skipping the Verifier part for the moment. The Coder publishes the final code, and the Tester takes care of saving both the code and the whole context to the vector database. In this way, we can avoid consuming all the tokens of our local model. At each new run, the model catches up on the latest generated algorithms from the vector database and proposes a new solution.
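The token-saving effect of this design can be sketched as follows. This is a simplified, in-memory stand-in for the vector database, not the Chroma-based code of the project: `RunStore`, `save`, and `latest_context` are hypothetical names, the stored code strings and timings are placeholders, and a character budget is used as a crude proxy for a token budget.

```python
from dataclasses import dataclass, field


@dataclass
class RunStore:
    """In-memory stand-in for the vector database that keeps past runs."""
    runs: list = field(default_factory=list)

    def save(self, code: str, timing_ms: float) -> None:
        self.runs.append({"run": len(self.runs), "code": code, "ms": timing_ms})

    def latest_context(self, max_chars: int) -> str:
        """Build a prompt context from the newest runs that fit in the budget."""
        selected = []
        used = 0
        for run in reversed(self.runs):  # newest first
            entry = f"Run {run['run']} ({run['ms']} ms):\n{run['code']}"
            if used + len(entry) > max_chars:
                break
            selected.append(entry)
            used += len(entry)
        return "\n\n".join(reversed(selected))  # oldest of the kept runs first


store = RunStore()
# Placeholder code and timings, for illustration only.
store.save("fn matmul_v0() { /* ... */ }", 10.0)
store.save("fn matmul_v1() { /* ... */ }", 8.0)
context = store.latest_context(max_chars=60)
```

With a 60-character budget only the most recent run fits, so the prompt stays bounded no matter how many runs have been stored. A real vector database would additionally rank the stored runs by similarity to the current query rather than only by recency.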
A crucial caveat: to make sure autogen-core can work with llama models on macOS, use the following snippet:
#!/bin/bash
CMAKE_ARGS="-DGGML_METAL=on" FORCE_CMAKE=1 pip install --upgrade --verbose --force-reinstall llama-cpp-python --no-cache-dir
Fig.2 summarises the entire code. We can roughly subdivide the code into 3 main blocks:
- The BaseAgent, which handles messages through the LLM agents, evaluating the mathematical formulation and producing code;
- The MatMulManager, which orchestrates the entire flow of the agents;
- autogen_core.SingleThreadedAgentRuntime, which allows us to make the entire workflow a reality.

autogen_core.SingleThreadedAgentRuntime makes all of this work on our MacBook Pro. [Image created with Nano Banana Pro.]
Results and benchmark
All the Rust code has been reviewed and re-run manually. While the workflow is robust, working with LLMs requires a critical eye. Several times the model confabulated*, producing code that looked optimised but failed to perform the actual matmul work.
The very first iteration generates a kind of Strassen-like algorithm ("Run 0" code in fig.3):
The model then comes up with better, more Rust-NEON-like implementations, so that after 4 iterations it gives the following code ("Run 3" in fig.3):
We can see the use of functions like vaddq_f32, a NEON intrinsic from std::arch::aarch64 that maps to a specific CPU instruction on ARM processors. The model manages to use rayon to split the workload across multiple CPU cores, and inside the parallel threads it uses NEON intrinsics. The code itself is not completely correct; moreover, I noticed that we run into an out-of-memory error when dealing with 1024×1024 matrices. I had to rework the code manually to make it work.
This brings us back to my mantra, "iterating to perfection", and we can ask ourselves: 'can a local agent autonomously refine Rust code to the point of mastering complex NEON intrinsics?'. The findings show that yes, even on consumer hardware, this level of optimisation is achievable, although the generated code still needed manual fixes.
Fig.3 shows the final results I obtained after each iteration.

The 0th and 2nd benchmarks have some errors, as it is physically impossible to achieve such results on a 1024×1024 matmul on a CPU:
- the first code suffers from a diagonal fallacy: it computes only the diagonal blocks of the matrix and ignores the rest;
- the second code has a broken buffer, as it repeatedly overwrites a small, cache-hot buffer of 1028 floats, rather than traversing the full 1 million elements.
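Both failure modes above can be caught automatically by comparing every candidate against a slow but trustworthy reference before its timing is accepted. The sketch below shows the idea in Python on small matrices (the project itself benchmarks Rust code). The names `naive_matmul`, `diagonal_blocks_only`, and `matches_reference` are hypothetical, and the faulty function merely mimics the diagonal-block mistake described above rather than reproducing the generated code.

```python
import random


def naive_matmul(a, b):
    """Reference O(n^3) product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diagonal_blocks_only(a, b, block=2):
    """Deliberately wrong: only multiplies matching diagonal blocks of a and b."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for start in range(0, n, block):
        idx = range(start, min(start + block, n))
        for i in idx:
            for j in idx:
                c[i][j] = sum(a[i][k] * b[k][j] for k in idx)
    return c


def matches_reference(candidate, n=4, tol=1e-9, seed=0):
    """Check a candidate matmul against the reference on random dense inputs."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    expected, got = naive_matmul(a, b), candidate(a, b)
    return all(abs(expected[i][j] - got[i][j]) <= tol
               for i in range(n) for j in range(n))
```

Using random dense inputs matters here: with identity-like or block-diagonal test matrices, a diagonal-only implementation could pass by accident. A benchmark result would then only be recorded when `matches_reference` returns `True` for the candidate.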
Nevertheless, the workflow produced two genuine results, run 1 and run 3. Run 1 takes 760 ms and constitutes a real baseline. It suffers from cache misses and a lack of SIMD vectorisation. Run 3 records 359 ms; the improvement comes from the use of NEON SIMD and Rayon parallelism.
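For reference, the two valid timings translate into the following figures. The helper name `speedup` is just illustrative; the 760 ms and 359 ms values are the ones reported above.

```python
def speedup(baseline_ms: float, optimised_ms: float) -> tuple:
    """Return (speed-up factor, percentage of run time saved)."""
    factor = baseline_ms / optimised_ms
    saved_pct = (1.0 - optimised_ms / baseline_ms) * 100.0
    return factor, saved_pct


factor, saved_pct = speedup(760.0, 359.0)
print(f"{factor:.2f}x faster, {saved_pct:.1f}% less time")
# prints "2.12x faster, 52.8% less time"
```

In other words, run 3 needs a little less than half the time of run 1.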
*: I wrote "the model confabulates" on purpose. From a medical point of view, LLMs are not hallucinating, but confabulating. Hallucinations are a totally different situation from what LLMs are doing when babbling and producing "wrong" answers.
Conclusions
This experiment started with a question that seemed an impossible challenge: "can we use consumer-grade local LLMs to discover high-performance Rust algorithms that can compete with BLAS implementations?".
We can say yes, or at least that we now have a valid and solid foundation on which we can build better code to achieve a full BLAS-like implementation in Rust.
The post showed how to work with Microsoft Autogen, in particular autogen-core, and how to create a roundtable of agents.
The base model in use comes in GGUF format, and it can run on a MacBook Pro M3, 36GB.
Of course, we didn't find (yet) anything better than BLAS in a single simple code. However, we showed that a local agentic workflow, on a MacBook Pro, can achieve what was previously thought to require a huge cluster and large models. In the end, the model managed to find a reasonable Rust-NEON implementation, "Run 3" above, that has a speed-up of over 50% over the standard Rayon implementation. We must highlight that the backbone implementation was AI-generated.
The frontier is open. I hope this blog post can inspire you to try and see what limits we can overcome with local LLM deployment.
I'm writing this in a personal capacity; these views are my own.


