MATLAB's inefficient copy-on-write implementation

MATLAB's copy-on-write memory management seems to have a serious defect, which I think is the reason behind the abysmal performance of subsasgn overloading. (The same problem probably occurs with parenAssign in the new R2021b RedefinesParen class -- I haven't yet experimented with it.) Normally, an array assignment like b = a simply does a pointer copy; the array data is not copied until b is modified (e.g. b(1) = 1). Thereafter, subsequent modifications of b (e.g. b(2) = 1) do not copy the full array; they just modify it in place as long as the reference count is 1. For example,
clear, a = zeros(1e8,1);
memory % 2764 MB used by MATLAB
b = a;
memory % 2764 MB
tic, b(1) = 1; toc, memory % 0.329099 seconds, 3540 MB
tic, b(2) = 1; toc, memory % 0.000123 seconds, 3541 MB
However, the benefit of copy-on-write is lost when the variable is changed in a function, e.g.
% test.m
function x = test(x)
x(1) = 1;
In this case, the x reference count is apparently incremented in test before the assignment is made, so this will always result in a full array copy. For example,
clear, a = zeros(1e8,1);
tic, a = test(a); toc % 0.337475 seconds
tic, a = test(a); toc % 0.310373 seconds
To see what's happening with copy-on-write, test.m is modified as follows:
function x = test(x)
memory
x(1) = 1;
memory
return
The array modification inside the function forces a full array copy, even though the original array is immediately discarded:
clear, a = zeros(1e8,1);
memory % 2748 MB
a = test(a); % 2748 MB, 3503 MB
memory % 2740 MB
I would think this problem could be easily avoided by treating any variable that appears as both an input and output argument in a function (e.g. function x = test(x)) as a reference variable, i.e. its reference count is not incremented on entering the function and is not decremented upon exiting. If the function is called with different input and output arguments, e.g. y = test(x), then the interpreter would implement this as y = x; y = test(y).
Is there any particular reason why MATLAB does not or cannot do this? There are many applications, such as subsasgn overloading, that could see a big performance boost if this problem were fixed.
  1 Comment
James Tursa on 31 Jan 2022
A slight point of confusion about the terms in your description. In the past, MATLAB has passed shared data copies of arguments to functions rather than bumping up reference counts. Do you have evidence, or know of documentation, showing that this behavior has changed and that a bumped-up reference count is now used for arguments? Why do you write that MATLAB uses this method?


Accepted Answer

James Tursa on 31 Jan 2022
Edited: James Tursa on 31 Jan 2022
See Loren's Blog on this topic. Basically, to write functions that can modify a variable in place, you need to call that function from within another function and follow some syntax rules. Then you can avoid the deep data copy.
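As a rough illustration of that calling pattern (a sketch, not taken from the blog post): the function declares the same name for its input and output, the caller assigns the result back to the same variable, and the call is made from inside another function. The names inplaceDemo and setFirst are hypothetical, and whether MATLAB actually avoids the copy depends on the release and on the variable not already being shared.

```matlab
% inplaceDemo.m  (hypothetical file name, for illustration only)
function a = inplaceDemo(n)
    % Allocate the array inside a function rather than at the command line.
    a = zeros(n, 1);
    % Same variable on both sides of the call, so the call is a
    % candidate for an in-place update instead of a deep copy.
    tic, a = setFirst(a); toc
end

function x = setFirst(x)
    % Same name for input and output in the function signature.
    x(1) = 1;
end
```

Calling inplaceDemo(1e8) and comparing the timing with the a = test(a) calls made from the command line in the question should indicate whether the copy was avoided on a given release.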
There is a subtle caveat to this. If the variable is already shared, then the function will be forced to make a deep copy regardless of how you call it or what syntax you use. And there are no official MATLAB functions that can tell you the sharing status of a variable ahead of time, so it can be hard to predict when a deep copy will be forced and when it will not be forced. E.g.,
X = rand(10); % X will not be shared with anything at this point
Y = 1:10; % Y will be shared with a background variable that is hidden from you
It is not obvious that the simple assignment for Y above should result in shared variables, but that is exactly what happens in later versions of MATLAB for variables of certain sizes (it will be a reference copy). In this case any attempt to modify Y in place will result in a deep data copy first.
  11 Comments
Paul on 2 Feb 2022
Thanks for the response. Frankly, I don't see how "pass by value with lazy copy" adds any clarity to that portion of the doc page, which is specifically explaining how f1() works, where I don't see any kind of pass by value at all.
Regardless of what the specific wording should be, I appreciate your response and initiative to pass along the concern to the doc writers.


More Answers (1)

Matt J on 31 Jan 2022
Edited: Matt J on 31 Jan 2022
(1) The variable must be allocated within a function.
A workaround to this rule is to wrap the data in a handle object:
a = 1:1e8;
tic,
obj=refwrap(a); clear a
testFn(obj);
a=obj.data;
toc %Elapsed time is 0.000460 seconds.
function testFn(obj)
obj.data(1) = 1;
end
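The refwrap class is not shown in the answer. A minimal sketch of what it might look like, assuming it is simply a handle class with a single public data property (the class body here is a guess, not the author's code):

```matlab
% refwrap.m  (hypothetical; the original post does not show this class)
classdef refwrap < handle
    properties
        data  % wrapped array, modified through the handle
    end
    methods
        function obj = refwrap(data)
            % Store the array in the handle object.
            obj.data = data;
        end
    end
end
```

Because refwrap derives from handle, testFn receives a reference to the same object, so obj.data(1) = 1 inside the function changes the data seen by the caller without returning anything.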
