-
Notifications
You must be signed in to change notification settings - Fork 3.6k
simplexml_load_string / Opening and ending tag mismatch in XLSX-Reader #3125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
See PR #1181, which seems to be an attempt to fix the same problem as here (br without end tag in vml). Solution proposed was deemed too risky and the PR was closed. Perhaps someone can suggest a less risky approach. In the meantime, it was suggested in that PR that problem would go away with Excel releases newer than 2013. Is the file that demonstrates this problem produced by a version of Excel newer than 2013? |
The same problem was also reported in closed issue #170, with a possible solution in emeraldjava@8e390bc |
currently 2019/365 is used, previously 2016. I don't know wich version the file was originally created with in the past. |
Fix PHPOffice#3125 (rejected previous PR PHPOffice#1181 as too risky, issue was also reported as PHPOffice#170). Vml file should be valid Xml, but Excel (possibly just Excel 2013) can generate unclosed `<br>` tags, preventing Xlsx reader from reading file correctly. I believe a very narrowly targeted fix, changing `<br>` to `<br/>`, and only when reading vml files, probably mitigates the risk. I am going to leave this PR as a draft for a while to see if others disagree. Another reason to leave it as a draft is the absence of a decent test file. I am able to copy an existing test spreadsheet which contains a vml file, and add some text and the problematic tag, and I can confirm that it fails to load correctly without the fix but loads correctly with the fix. However, it would be better if I had a real file. I cannot figure out how to generate a file like this "naturally"; when I add a textbox to a spreadsheet, it is stored as a regular xml file, not as vml, in both Excel 365 and 2007. Also, the text that I added in the vml file doesn't show anywhere when I open the file in Excel, so I don't know whether my file is a normal case for this condition. And, incidentally, when I load the file in PhpSpreadsheet and save it, the textbox data is not in the resulting output file; this is clearly not a problem in the particular case I constructed, but I don't know if that's true in general. If not, that would be a different problem than the one I'm fixing, but I would prefer to resolve them both at the same time.
PR 3127 probably fixes this. I've left it as a draft because it's very hard to come to a conclusion without a suitable test file. Is there any way you can upload one? Failing that, it might (not certain) be sufficient if you can at least test against that PR with your test file. |
I can't change files that a user created, so it look that office create not 100% xml compatible files in some cases.
So I don't need to change phpSpreadsheet files. |
That's an interesting workaround; I'm glad you found it. I'm going to continue to investigate anyhow. |
This is a replacement for draft PR PHPOffice#2455 and draft PR PHPOffice#3127. There is some useful commentary in those PRs which I have mostly, but not entirely, duplicated below. Fix PHPOffice#2396. Fix PHPOffice#1770. Fix PHPOffice#2388. A related problem is that the vml files used for the form controls sometimes contain invalid xml. Fix PHPOffice#3125 (rejected previous PR PHPOffice#1181 as too risky, issue was also reported as PHPOffice#170). Vml file should be valid Xml, but Excel can generate unclosed `<br>` tags, preventing Xlsx reader from reading file correctly. I believe a very narrowly targeted fix, changing `<br>` to `<br/>`, and only when reading vml files, probably mitigates the risk. The sample file formscomments.xlsx which is part of this change shows this problem with `Button 1` on sheet `Forms`; the spreadsheet was created with Excel 365, so the problem is not restricted to Excel 2013 as originally reported. A comment on PR 3127 indicates that other tags might be involved, but, without a file demonstrating that, I will restrict this change to br tags for now. I am starting this out in draft status, and will probably leave it that way for some time. I'm not sure where we want to go with this. It fixes some problems, but in a limited manner, and creates some others. I'm not sure the pain of the others is balanced considering the limitations of the fix. If enough interest is generated as a result of this ticket being out there, we can proceed; if not, it probably isn't worth it. This fix allows form control elements to be read in and written out. It does not allow you to add such elements, nor even to locate them or determine their properties (so you can't modify or delete them). Although it handles reading and writing of sheets containing both form controls and comments, it will probably create a corrupt spreadsheet if you try adding a new comment to a sheet with form controls - probably quite difficult to solve. Cloning the sheet probably won't work either - probably easier than the other. It is conceivable that we want to add a new property to the Xlsx Reader which turns the reading of form elements on or off (default=off), so that negative effects will be limited to those who have explictly opted in. The change in its current form does not implement such a property. Because of its limitations, the change isn't really testable. As in some other recent installs, I have added a sample to demonstrate that it works correctly. As it turns out, if we have a worksheet which contains both form controls and comments (see formscomments.xlsx which is part of this PR), PhpSpreadsheet already creates a corrupt file when it tries to load and save the spreadsheet with such a worksheet. With this change, the file is saved without corruption. This tilts things in favor of proceeding. I'm still not ready, but this will be an important consideration. A sample file for issue PHPOffice#2621 illustrated a problem with shape files. Since they are involved here, I took a look at how the sample worked with this code. In master, and with this change, a corrupt file results. Fixing that is probably easier than the general problem of handling shape files, but it's an argument against moving this forward until the corruption problem can be addressed. Fix PHPOffice#2661. A template including checkboxes was leading to file corruption solved by this PR. Another argument for moving forward.
The callback was provided so that customised changes like this could be made to the markup, and they are often specific to different situations and files; but they are always the responsibility of the developer and not of the PhpSpreadsheet project itself |
Yes, the change to the XML comes from Microsoft - Excel though. So I think that a central code adjustment is necessary. I'm not sure if tidy is the perfect solution for this problem. Possibly there is also a parameter for simplexml that doesn't take it so precisely with closing tags. :-) In the meantime I also had other files in which not the |
* WIP Limited Support for Form Controls V2 (ListBox, Buttons, etc.) This is a replacement for draft PR #2455 and draft PR #3127. There is some useful commentary in those PRs which I have mostly, but not entirely, duplicated below. Fix #2396. Fix #1770. Fix #2388. A related problem is that the vml files used for the form controls sometimes contain invalid xml. Fix #3125 (rejected previous PR #1181 as too risky, issue was also reported as #170). Vml file should be valid Xml, but Excel can generate unclosed `<br>` tags, preventing Xlsx reader from reading file correctly. I believe a very narrowly targeted fix, changing `<br>` to `<br/>`, and only when reading vml files, probably mitigates the risk. The sample file formscomments.xlsx which is part of this change shows this problem with `Button 1` on sheet `Forms`; the spreadsheet was created with Excel 365, so the problem is not restricted to Excel 2013 as originally reported. A comment on PR 3127 indicates that other tags might be involved, but, without a file demonstrating that, I will restrict this change to br tags for now. I am starting this out in draft status, and will probably leave it that way for some time. I'm not sure where we want to go with this. It fixes some problems, but in a limited manner, and creates some others. I'm not sure the pain of the others is balanced considering the limitations of the fix. If enough interest is generated as a result of this ticket being out there, we can proceed; if not, it probably isn't worth it. This fix allows form control elements to be read in and written out. It does not allow you to add such elements, nor even to locate them or determine their properties (so you can't modify or delete them). Although it handles reading and writing of sheets containing both form controls and comments, it will probably create a corrupt spreadsheet if you try adding a new comment to a sheet with form controls - probably quite difficult to solve. Cloning the sheet probably won't work either - probably easier than the other. It is conceivable that we want to add a new property to the Xlsx Reader which turns the reading of form elements on or off (default=off), so that negative effects will be limited to those who have explictly opted in. The change in its current form does not implement such a property. Because of its limitations, the change isn't really testable. As in some other recent installs, I have added a sample to demonstrate that it works correctly. As it turns out, if we have a worksheet which contains both form controls and comments (see formscomments.xlsx which is part of this PR), PhpSpreadsheet already creates a corrupt file when it tries to load and save the spreadsheet with such a worksheet. With this change, the file is saved without corruption. This tilts things in favor of proceeding. I'm still not ready, but this will be an important consideration. A sample file for issue #2621 illustrated a problem with shape files. Since they are involved here, I took a look at how the sample worked with this code. In master, and with this change, a corrupt file results. Fixing that is probably easier than the general problem of handling shape files, but it's an argument against moving this forward until the corruption problem can be addressed. Fix #2661. A template including checkboxes was leading to file corruption solved by this PR. Another argument for moving forward. * Improved Sample File, and Documentation Add more realistic worksheet to spreadsheet. Document new feature, adding caveats to how it can be used.
This is:
What is the expected behavior?
no warning
What is the current behavior?
simplexml_load_string(): Entity: line 59: parser error : Opening and ending tag mismatch: br line 58 and font
What are the steps to reproduce?
Sorry, can't distribute xlsx-file
What features do you think are causing the issue
Does an issue affect all spreadsheet file formats? If not, which formats are affected?
Which versions of PhpSpreadsheet and PHP are affected?
newest
The text was updated successfully, but these errors were encountered: