Reoccurrence of 16023 - 403 Failure Performing file_get_contents on URL With No Restrictions #17121

Closed
oleibman opened this issue Dec 11, 2024 · 15 comments

Comments

@oleibman

oleibman commented Dec 11, 2024

Description

This is a repeat of issue #16023. The solution suggested when that issue was closed worked from Sept. 24 through Dec. 10. As of Dec. 11 it no longer works on GitHub (see https://github.com/PHPOffice/PhpSpreadsheet/actions/runs/12277531354/job/34257128156?pr=4272). I have tried TLSv1_3 as well as TLSv1_2.

I cannot reproduce the problem on a local Windows or Linux machine.

The following code:

<?php
$ctx = stream_context_create(["ssl" => ["crypto_method"=>STREAM_CRYPTO_METHOD_TLSv1_2_CLIENT]]);
file_get_contents('https://phpspreadsheet.readthedocs.io/en/latest/topics/images/01-03-filter-icon-1.png', false, $ctx);

Resulted in this output:

Exception: file_get_contents(https://phpspreadsheet.readthedocs.io/en/latest/topics/images/01-03-filter-icon-1.png): Failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden

But I expected this output instead:

// no warning/error message

PHP Version

PHP 8.3.14, PHP 8.1.31, PHP 8.4.1

Operating System

Ubuntu 22.04.5 LTS

@cmb69
Member

cmb69 commented Dec 11, 2024

Have you contacted readthedocs.io? Possibly they limit automated downloads in some way.

@oleibman
Author

I will, but last time they said they didn't do anything and sent me to you ... which worked. The fact that I can't get it to fail locally is worrisome.

@oleibman
Author

FWIW, I created a PR which uses curl rather than file_get_contents for https. That PR was successful. I don't know that adding a requirement for curl is something we want to do. PHPOffice/PhpSpreadsheet#4274

@cmb69
Member

cmb69 commented Dec 11, 2024

That 403 hints at something that is actively blocked by the server; possibly too many accesses from a certain IP, or generally they block some IP range. Or maybe it's the missing user_agent.
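One way to see what the server actually sent back with the 403 is to let the http wrapper return the error response instead of failing. The following is a minimal sketch, not anything proposed in this thread: `lastHttpStatus` and `fetchForDiagnosis` are hypothetical helper names, and it assumes the `ignore_errors` context option and the `$http_response_header` variable available in the PHP versions listed above (PHP 8.4 also offers `http_get_last_response_headers()`).

```php
<?php
// Hypothetical helper: extract the final HTTP status code from the header
// lines exposed by PHP's http wrapper (e.g. via $http_response_header).
function lastHttpStatus(array $headerLines): ?int
{
    $status = null;
    foreach ($headerLines as $line) {
        // Each response in a redirect chain starts with its own status line,
        // so the last match is the status of the final response.
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: fetch a URL without treating 4xx/5xx as a failure,
// and return status, headers and body for inspection.
function fetchForDiagnosis(string $url): array
{
    $ctx = stream_context_create(['http' => ['ignore_errors' => true]]);
    $body = @file_get_contents($url, false, $ctx);
    // The http wrapper populates $http_response_header in the local scope.
    $headers = $http_response_header ?? [];
    return [
        'status' => lastHttpStatus($headers),
        'headers' => $headers,
        'body' => $body,
    ];
}
```

Calling `fetchForDiagnosis()` on the failing URL from the CI runner and printing the `headers` entry would show which server produced the 403 response.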

@oleibman
Author

I have opened readthedocs/readthedocs.org#11845 with readthedocs.

@cmb69
Member

cmb69 commented Dec 11, 2024

Okay, let's see what comes out of that report. I'm switching to "need feedback", so this ticket will be open for at least two weeks.

@lucasnetau
Contributor

lucasnetau commented Dec 11, 2024

The readthedocs.org website is behind the Cloudflare CDN. Typically I see the Cloudflare WAF blocking requests that look like bots.

I find the following helps to make your request look more like a web browser:

  • Set a user agent for the request, either via the ini option or via a stream context passed to the file_get_contents call. For example, ini_set('user_agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'); sets it to the value used by the current Chrome browser. The production php.ini sets no user_agent by default.

  • Set the Connection header to 'Connection: keep-alive' (even if you close the connection straight away).

  • Set an Accept header (copy the value Chrome uses). Most bots do not send this header.
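The ini route from the first bullet changes the user agent for every later http request in the process. A minimal sketch of limiting that to one call follows. `withUserAgent` is a hypothetical helper name and the agent string in the usage note is a placeholder. It assumes only that `user_agent` can be changed at runtime with `ini_set`.

```php
<?php
// Hypothetical helper: run $fn with the user_agent ini option set,
// then restore whatever value was in effect before.
function withUserAgent(string $userAgent, callable $fn): mixed
{
    // ini_get() returns an empty string when no user_agent is configured.
    $previous = ini_get('user_agent');
    ini_set('user_agent', $userAgent);
    try {
        return $fn();
    } finally {
        ini_set('user_agent', (string) $previous);
    }
}
```

Usage would look like `withUserAgent('ExampleAgent/1.0', fn () => file_get_contents($url))`. Passing `user_agent` in a stream context instead avoids touching global state at all.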

@oleibman
Author

Adding user_agent seems to work. I am studying what is needed for Connection header and Accept header.

@oleibman
Author

Adding Connection and Accept headers did no harm.

@oleibman
Author

Adding Connection and Accept headers correctly (thanks to comment from @lucasnetau) also did no harm.

@oleibman
Author

The Accept header that Chrome uses has an interesting consequence: an image which I expected to be downloaded as png was instead downloaded as webp. Removing image/webp from the Accept header gets the expected result.
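Because the server may negotiate a different format than the URL suggests, it can help to check the downloaded bytes rather than trust the file extension. This is a minimal sketch and not part of the PhpSpreadsheet change. `sniffImageFormat` is a hypothetical helper that only distinguishes the two formats mentioned here, using their standard file signatures.

```php
<?php
// Hypothetical helper: identify PNG vs WebP from the leading bytes of a download.
function sniffImageFormat(string $bytes): string
{
    // PNG files start with this fixed 8-byte signature.
    if (strncmp($bytes, "\x89PNG\r\n\x1a\n", 8) === 0) {
        return 'png';
    }
    // WebP is a RIFF container: "RIFF", a 4-byte size field, then "WEBP".
    if (
        strlen($bytes) >= 12
        && substr($bytes, 0, 4) === 'RIFF'
        && substr($bytes, 8, 4) === 'WEBP'
    ) {
        return 'webp';
    }
    return 'unknown';
}
```

A caller could compare the result against the expected extension and log a warning on mismatch.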

@oleibman
Author

I have a PR ready to go. I will wait a day or two in case anyone thinks of something else. Here is my new code:

                $ctx = stream_context_create([
                    'ssl' => ['crypto_method' => STREAM_CRYPTO_METHOD_TLSv1_3_CLIENT],
                    'http' => [
                        'user_agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
                        'header' => [
                            'Connection: keep-alive',
                            // accept header used by chrome without image/webp
                            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
                        ],
                    ],
                ]);

@oleibman
Author

Connection: keep-alive seems to add an unacceptable performance lag here. When I read the file without that option, it completes in about 0.2 seconds. When I add that option, it takes 2 minutes to complete. Is that expected? If so, what do you think the adverse consequences might be if I omitted that parameter?

@cmb69
Member

cmb69 commented Dec 13, 2024

I don't know. You have to ask the readthedocs people, or possibly Cloudflare. Maybe check what a successful cURL connection sends (CURLOPT_VERBOSE), and reduce to the bare minimum to make it work for plain PHP HTTP requests. And be prepared that you may need to readjust if they increase/change their blocking in the future. Maybe they will expect HTTP/2 requests, and these are unlikely to be supported by HTTP wrappers anytime soon.

Anyhow, this is not a bug in php-src.
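The CURLOPT_VERBOSE suggestion above could be sketched roughly as follows. This is an illustration under assumptions, not code from this thread: `fetchWithVerboseLog` and `sentRequestLines` are hypothetical helper names, the curl extension must be loaded, and the parsing relies on libcurl's convention of prefixing outgoing request lines with "> " in its verbose log.

```php
<?php
// Hypothetical helper: fetch $url with cURL and capture cURL's verbose log.
function fetchWithVerboseLog(string $url): array
{
    $log = fopen('php://temp', 'w+');
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_VERBOSE => true,        // emit connection and header details
        CURLOPT_STDERR => $log,         // send the verbose log to our temp stream
    ]);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    rewind($log);
    $verbose = stream_get_contents($log);
    fclose($log);
    return ['status' => $status, 'body' => $body, 'verbose' => $verbose];
}

// Hypothetical helper: keep only the request lines cURL reports as sent.
function sentRequestLines(string $verboseLog): array
{
    $lines = [];
    foreach (preg_split('/\r\n|\n|\r/', $verboseLog) as $line) {
        // Skip the bare "> " line that marks the end of the request headers.
        if (str_starts_with($line, '> ') && strlen($line) > 2) {
            $lines[] = substr($line, 2);
        }
    }
    return $lines;
}
```

Passing the `verbose` entry of a successful fetch to `sentRequestLines()` gives the request line and headers cURL sent, which can then be trimmed one at a time in the stream context until the request starts failing.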

@cmb69 closed this as not planned on Dec 13, 2024

@oleibman
Author

Here is the final version of the code as I have implemented it:

            $ctx = null;
            // https://github.com/php/php-src/issues/16023
            // https://github.com/php/php-src/issues/17121
            if (str_starts_with($path, 'https:') || str_starts_with($path, 'http:')) {
                $ctxArray = [
                    'http' => [
                        'user_agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
                        'header' => [
                            //'Connection: keep-alive', // unacceptable performance
                            'Accept: image/*;q=0.9,*/*;q=0.8',
                        ],
                    ],
                ];
                if (str_starts_with($path, 'https:')) {
                    $ctxArray['ssl'] = ['crypto_method' => STREAM_CRYPTO_METHOD_TLSv1_3_CLIENT];
                }
                $ctx = stream_context_create($ctxArray);
            }
            $imageContents = @file_get_contents($path, false, $ctx);
