Convert a table in a pdf to a MATLAB cell structure
I have a PDF file that contains an Nx9 table of data that I need to turn into a MATLAB cell structure or an Excel file. Some of the (row,column) entries are blank.
So far, I have tried reading the pdf using:
txt = extractFileText('filename.pdf');
This produces a 1x1 string in which runs of spaces break up the rows in a seemingly random order. The (row,column) entries do not appear in a logical position in txt. Is there another command that can read a PDF table?
4 Comments
dpb
on 30 Jan 2021
Unfortunately, the PDF format has no internal representation of a table structure -- a table is simply text. What looks like a table in a document is only a simulation done by placing the words/numbers as they would appear in a spreadsheet.
There apparently are some scraping utilities on the web; many seem to be in Python, utilizing Ghostscript underneath (since PDF is built on top of PostScript, that's only natural).
A quick Google search didn't uncover any MATLAB-specific solutions, although there may well be some. I didn't search Yair Altman's site, for example; it seems like something he might have at least talked about at some time.
Stephen23
on 30 Jan 2021
Not only that, but because the PDF standard allows for all sorts of vertical and horizontal offsets and lots of other tricks, the order in which data is stored in the file is in no way guaranteed to match the order in which it is displayed when the PDF is rendered. The only way to know how the data is arranged on the rendered page is to render that page.
dpb
on 30 Jan 2021
Which is, it seems, what the scraping utilities do...get the boundaries of the table as rendered and then suck that area up.
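To illustrate the idea in the comments above, here is a minimal sketch of rebuilding a table from rendered word positions rather than from the order of the text. It assumes some external renderer has already reported each word with its left edge and baseline; the words, the coordinates, and the tolerances rowTol and colTol are all made up for illustration, and nothing here reads an actual PDF.

```matlab
% Hypothetical word positions, as a page renderer might report them.
% These values are invented for illustration; they are not read from a PDF.
words = ["Name" "Qty" "Price" "bolt" "4" "0.25" "nut" "0.10"];
x     = [ 50    150   250     50     150 250    50    250 ];  % left edge of each word
y     = [700    700   700     680    680 680    660   660 ];  % baseline (larger = higher on page)

rowTol = 5;    % assumed: baselines closer than this belong to the same row
colTol = 20;   % assumed: left edges closer than this belong to the same column

% Rows: sort baselines top-down and start a new row at every large gap.
[ySorted, yOrder] = sort(y, 'descend');
rowIdx = zeros(size(y));
rowIdx(yOrder) = cumsum([1, abs(diff(ySorted)) > rowTol]);

% Columns: sort left edges left-to-right and start a new column at every large gap.
[xSorted, xOrder] = sort(x);
colIdx = zeros(size(x));
colIdx(xOrder) = cumsum([1, diff(xSorted) > colTol]);

% Place each word in its (row, column) slot; slots with no word stay empty.
tbl = cell(max(rowIdx), max(colIdx));
for k = 1:numel(words)
    tbl{rowIdx(k), colIdx(k)} = char(words(k));
end
disp(tbl)
```

Because each word is placed by position, the missing Qty entry in the last row leaves an empty cell instead of shifting Price into the wrong column. Getting real coordinates out of a PDF still needs a renderer or scraping utility, which this sketch does not cover.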
Sim
on 12 Mar 2023
The following function is not really helpful when a PDF contains tables with blank cells:
txt = extractFileText('filename.pdf');
Has a new tool been created in the meantime, i.e. between January 2021 and today (mid-March 2023)?
Answers (2)
the cyclist
on 12 Mar 2023
2 Comments
the cyclist
on 13 Mar 2023
I haven't used it for data that I would have privacy concerns about, but I think there are strong reasons to believe it is safe:
- It's open-source, so you can see all the code on github
- It doesn't seem to send your data anywhere else. Although it might seem like it is sending your data to a web site, it looks to me like it only opens a local browser window.
- It was first built by journalists, who tend to care about privacy (at least of their own data!)
Suraj
on 29 Mar 2023
Hi Charles
Hope this helps.