Skip to content

Issue with reading this xlsx file #2067

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Lambik opened this issue May 6, 2021 · 6 comments
Closed

Issue with reading this xlsx file #2067

Lambik opened this issue May 6, 2021 · 6 comments

Comments

@Lambik
Copy link

Lambik commented May 6, 2021

This is:

- [x] a bug report

What is the expected behavior?

The xlsx file can be opened and used

What is the current behavior?

Loading the file yields the error Unable to identify a reader for this file

What are the steps to reproduce?

The file specified is attached to the issue: TRAINING.xlsx

<?php

use PhpOffice\PhpSpreadsheet\IOFactory;

require __DIR__ . '/vendor/autoload.php';

$excel = IOFactory::load('TRAINING.xlsx');

Which versions of PhpSpreadsheet and PHP are affected?

I'm on PHP 7.4 with PhpSpreadsheet 1.17.1 (latest at the time of the issue).

Cause

I've investigated myself, and the issue is that the xlsx file seems to use a different xml namespace than expected. It's all based on http://purl.oclc.org/ooxml/officeDocument/relationships instead of the usual http://schemas.openxmlformats.org/officeDocument/2006/relationships. This makes the function getWorkbookBaseName and the subsequent load fail.

I don't know how the author of the file created the file, they haven't told me yet (if/when I know, I'll update the issue).

@Lambik Lambik changed the title Issue with an xlsx file Issue with reading this xlsx file May 6, 2021
@oleibman
Copy link
Collaborator

oleibman commented May 6, 2021

Ouch. I am just starting to look at namespace prefixes, an effort which is not quite ready yet. However, my work on that area depends on having known constant namespaces, and your sample file (thank you for making it available) still won't work for that. I see that you have opened PR #2068. I will study it to see if I can integrate your changes and mine.

@Lambik
Copy link
Author

Lambik commented May 6, 2021

Glad to have helped, unfortunately I don't know if my PR covers all advanced cases in the new namespace (since I can't create a new file myself). All I know is that my changes work on the linked file, which is a good start..

@oleibman
Copy link
Collaborator

oleibman commented May 6, 2021

Agreed about "good start". My changes are already so complex that they will almost certainly happen incrementally in several stages, so I too will have "advanced cases which might not be covered". For now, I have placed your PR in draft status. If it looks like my work will take too long, I will return yours to ready status.

@oleibman
Copy link
Collaborator

oleibman commented May 9, 2021

Good news - I believe I have been able to integrate your chances with mine. I will push that change within the next few days.

oleibman added a commit to oleibman/PhpSpreadsheet that referenced this issue May 11, 2021
There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment PHPOffice#860 (comment) in issue PHPOffice#860 by @IMSoP has triggered an idea about how to proceed.

Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of:
- listWorkSheetNames
- listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected.
- active sheet
- selected cell and top left cell
- cell content (formulas, numbers, text)
- hyperlinks
- comments (partial - see below)

This PR does not address:
- styles
- images and charts
- VBA and ribbons
- many other items, I'm sure

The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue PHPOffice#2067 PR PHPOffice#2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR.

While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult.

Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts.

I will add more detail as comments after I push this change.
oleibman added a commit to oleibman/PhpSpreadsheet that referenced this issue Jun 19, 2021
This is a replacement for PHPOffice#2088, which has run into merge conflicts. I will close that PR in the near future, however the comments in that PR may prove useful for this one. While that PR has been in draft status all along, I am marking this one as ready. I will gladly add additional tests (and, of course, make code changes) that anyone has to suggest, but, with my most recent test files which I will describe in a separate comment, I have no further ideas on useful additions.

As mentioned in the earlier ticket, this is a risky change. But, as has been demonstrated, delaying it comes with its own set of risks. It would be helpful to have a temporary moratorium on changes to Reader/Xlsx until this change is merged.

The original commit message follows.

There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment PHPOffice#860 (comment) in issue PHPOffice#860 by @IMSoP has triggered an idea about how to proceed.

Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of:
- listWorkSheetNames
- listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected.
- active sheet
- selected cell and top left cell
- cell content (formulas, numbers, text)
- hyperlinks
- comments (partial - see below)

This PR does not address:
- styles
- images and charts
- VBA and ribbons
- many other items, I'm sure

The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue PHPOffice#2067 PR PHPOffice#2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR.

While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult.

Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts.

I will add more detail as comments after I push this change.
MarkBaker pushed a commit that referenced this issue Jun 25, 2021
* Xlsx Reader Better Namespace Handling Phase 1 Try2

This is a replacement for #2088, which has run into merge conflicts. I will close that PR in the near future, however the comments in that PR may prove useful for this one. While that PR has been in draft status all along, I am marking this one as ready. I will gladly add additional tests (and, of course, make code changes) that anyone has to suggest, but, with my most recent test files which I will describe in a separate comment, I have no further ideas on useful additions.

As mentioned in the earlier ticket, this is a risky change. But, as has been demonstrated, delaying it comes with its own set of risks. It would be helpful to have a temporary moratorium on changes to Reader/Xlsx until this change is merged.

The original commit message follows.

There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment #860 (comment) in issue #860 by @IMSoP has triggered an idea about how to proceed.

Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of:
- listWorkSheetNames
- listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected.
- active sheet
- selected cell and top left cell
- cell content (formulas, numbers, text)
- hyperlinks
- comments (partial - see below)

This PR does not address:
- styles
- images and charts
- VBA and ribbons
- many other items, I'm sure

The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue #2067 PR #2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR.

While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult.

Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts.
@oleibman
Copy link
Collaborator

oleibman commented Jul 7, 2021

@Lambik - have you been able to run tests against master to confirm that I have implemented your changes correctly?

@PowerKiKi
Copy link
Member

Should be solved in #2173

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants