Hello friends!
I am working on parsing a large library (tens of thousands) of XML files. My intention is to parse through all of them and save the information I need in a single variable for post processing (and to save it in another format that isn't as nested).
The files are not custom made by me or my team, and they have a rather convoluted nested structure. In pseudocode, my current parfor loop looks like this:
data_out = ["column_header_1", "column_header_2", "column_header_n"]
parfor z = 1:numFiles
file = xml2struct(fullFileNames{z});
* for i = 1:length(file.logfile.scan)
%Header Info
var_1 = convertCharsToStrings(file.logfile.Attributes.var1)
var_2 = convertCharsToStrings(file.logfile.Attributes.var2)
var_n = convertCharsToStrings(file.logfile.Attributes.var3)
%Now, sometimes the scan will be singular and sometimes there will be multiple, so I have an if case to filter that out and prevent an indexing error. Omitted here; showing only the multiple-scan case.
first_section_file = file.logfile.scan{1,i}
try
** for j = 1:length(first_section_file)
%Here I need some data from, let's say, first_section_file.info_1.Attributes. Additionally, there is another structure at this point, let's say info_2, that I also have to get data out of. However, as with scan, it can have a single reading or multiple readings, so I have an if/else.
second_var_1 = first_section_file.info_1.Attributes.var1
second_var_2 = first_section_file.info_1.Attributes.var2
second_var_3 = first_section_file.info_1.Attributes.var3
if reading == 1
third_var_1 = convertCharsToStrings(first_section_file.info_2.Attributes.var1)
third_var_2 = convertCharsToStrings(first_section_file.info_2.Attributes.var2)
third_var_n = convertCharsToStrings(first_section_file.info_2.Attributes.var3)
else
%Same code, except that I get the information out of the given reading while iterating over the readings.
end
data_out = variables
end %Here I end the ** for loop
catch
fprintf('No data\n')
end % End of the try
end %Here I end the * for loop. End of scan
end %End of code.
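On the singular-versus-multiple point noted in the comments above, here is a minimal sketch of one way the two cases might be folded into a single code path. It assumes that the parser returns a plain struct when there is one child and a cell array of structs when there are several, which matches the scan{1,i} indexing above but may not hold for every xml2struct version. The variables single_scan, multi_scan, inputs, and counts are hypothetical stand-ins, not real file contents.

```matlab
% Hypothetical stand-ins for what the parser might return:
% one child -> a struct, several children -> a cell array of structs.
single_scan = struct('Attributes', struct('var1', 'a'));
multi_scan  = {struct('Attributes', struct('var1', 'b')), ...
               struct('Attributes', struct('var1', 'c'))};

inputs = {single_scan, multi_scan};
counts = zeros(1, numel(inputs));

for n = 1:numel(inputs)
    scans = inputs{n};
    if ~iscell(scans)
        scans = {scans};            % wrap the lone struct so {k} indexing works
    end
    for k = 1:numel(scans)
        % Same access pattern for both cases from here on.
        v = string(scans{k}.Attributes.var1);
    end
    counts(n) = numel(scans);       % number of scans seen for this input
end

disp(counts)
```

With the wrapping done once at the top, the inner loop would not need a separate branch for the single-scan case; the same idea could be applied to the readings under info_2.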
My intention is to make the first loop the parfor loop, so that each file is handled by a different worker. The problem I have is setting up the "data_out" variable so that the data I need actually gets saved in it. As the code stands, the parfor itself runs without problems, but I think there is a "race condition" of sorts: since each iteration resets the value, nothing is ever saved.
I tried setting it up as data_out = [data_out ; variables], but that results in an error from the parfor, and using cat doesn't work either. I also tried moving the loops into a separate function, but that gave more problems than solutions (granted, I could have made a couple of mistakes trying it). Another issue is that the iteration indices are not a good way of indexing into data_out, since the inner indices reset for every file as I iterate over all the scans. That would overwrite values already stored, and the z index moves very slowly, as it is the file counter.
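For reference, one pattern that is often suggested for this kind of situation is to give each parfor iteration its own slot in a cell array indexed by z, accumulate rows in a local variable inside the iteration, and concatenate everything after the loop. The sketch below is only an illustration under assumptions: numFiles is a made-up count, the parsing is replaced by dummy rows, and the names results, rows, and newRow are hypothetical.

```matlab
numFiles = 4;                        % hypothetical file count
results  = cell(numFiles, 1);        % one slot per file, indexed only by z

parfor z = 1:numFiles
    % Stand-in for parsing file z; a real loop would call xml2struct here.
    nRows = z;                       % pretend file z yields z rows
    rows  = strings(0, 3);           % local accumulator, private to this iteration
    for r = 1:nRows
        newRow = ["file" + z, "scan" + r, "value"];
        rows = [rows; newRow];       %#ok<AGROW> growth is per file, not global
    end
    results{z} = rows;               % each iteration writes only its own slot
end

% Concatenate once, after the loop, and put the header row on top.
header   = ["column_header_1", "column_header_2", "column_header_n"];
data_out = [header; vertcat(results{:})];
disp(size(data_out))
```

Because every iteration writes only to results{z}, no iteration overwrites another, and the number of rows per file does not need to be known in advance. Files that fail to parse could simply leave their slot empty, since vertcat skips empty cells.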
Maybe someone has dealt with an issue like this and can shed some light? In case anyone has heard of them, I am working with OpenBMap files. I have a working for loop version, but as it stands, the duration of each iteration grows as more data is saved into the data_out array (as one would expect). I could preallocate, but I don't know how many data points there will be until all the files have been read.
Oh, as a side note: in another test, I made a separate parfor loop to convert all the XML files into structures, since the profiler pointed to that function as the bottleneck of the code, but the performance gain from doing so wasn't as great as expected.