Where to Find a Dataset of High-Quality MATLAB Code for Training an LLM?

Hello,
I’m working on fine-tuning an open-source LLM for MATLAB code generation, aiming to reach a performance level similar to ChatGPT. I haven’t been impressed with the results of existing tools so far.
Could anyone point me toward quality datasets or resources specifically for training an LLM on MATLAB code? I’m particularly interested in datasets that cover a wide range of MATLAB applications, from basic scripts to more advanced numerical computations, optimization, and data analysis.
Any guidance or pointers would be greatly appreciated!

Answers (1)

John D'Errico on 22 Apr 2025
Looking at your question a second time, it is about MATLAB in a sense, in that you are looking for a repository of code on which to train an LLM.
You might look at the File Exchange (FEX), which is probably the largest repository of MATLAB code out there besides MATLAB itself. The problem is, the FEX tends to include a lot of poorly written code. Sorry, but it does. It has some truly great code too, written by many superb authors. But there is a lot of novice code too. And some of the code there is pretty old. I'll admit that some of my own FEX contributions are at least 25 years old. And that makes them somewhat less useful for training purposes, since MATLAB has grown in that time.
You might also need to consider licensing issues if you decide to use code from the FEX, or from any similar source, to train an LLM. In my case, for example, while I am quite happy to see my code used with attribution, I'm not so sure how happy I would be at the idea of an LLM effectively using my code with no attribution at all.
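As a rough illustration of the licensing point, here is a minimal Python sketch of one way to keep attribution attached to the code. It assumes the code has already been downloaded into one folder per project under a single root directory. The function names and the list of license file names are hypothetical, chosen only for this example. The sketch keeps only projects that ship a license file and pairs every .m file with that license, so the terms travel with the code. Finding a license file says nothing about whether its terms actually allow training use; that still has to be read and checked.

```python
from pathlib import Path

# Hypothetical set of license file names; adjust to what your sources actually ship.
LICENSE_NAMES = {"license", "license.txt", "license.md"}


def find_license(project_dir: Path):
    """Return the first license-like file at the top level of a project, or None."""
    for entry in sorted(project_dir.iterdir()):
        if entry.is_file() and entry.name.lower() in LICENSE_NAMES:
            return entry
    return None


def collect_licensed_matlab_files(root: Path):
    """Yield (m_file, license_file) pairs for projects that ship a license file."""
    for project_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        license_file = find_license(project_dir)
        if license_file is None:
            continue  # no license found: skip rather than guess at the terms
        for m_file in sorted(project_dir.rglob("*.m")):
            yield m_file, license_file
```

Each pair can then be written to a manifest, so that every training sample can be traced back to its project and license text.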

Release

R2024b
