Skip to content

File.readAsBytesSync() doesn't fully read all data #51071

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mkustermann opened this issue Jan 19, 2023 · 15 comments
Closed

File.readAsBytesSync() doesn't fully read all data #51071

mkustermann opened this issue Jan 19, 2023 · 15 comments
Assignees
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. library-io P1 A high priority bug; for example, a single project is unusable or has many test failures triaged Issue has been triaged by sub team type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)

Comments

@mkustermann
Copy link
Member

Running:

% dd if=/dev/zero of=8gb.bin bs=1073741824 count=8
% echo -n 'foobar' >> 8gb.bin
% import 'dart:typed_data';
import 'dart:convert';
import 'dart:io';

main() {
  final bytes = File('8gb.bin').readAsBytesSync();
  print(bytes.length);
  print(utf8.decode(Uint8List.sublistView(bytes, bytes.length - 7)));
}

should print

8589934598
foobar

but instead prints

2147479552

The issue is pretty obvious runtime/bin/file.cc

void FUNCTION_NAME(File_Read)(Dart_NativeArguments args) {
  ...
  int64_t bytes_read = file->Read(reinterpret_cast<void*>(buffer), length);
  if (bytes_read < 0) {
    Dart_SetReturnValue(args, DartUtils::NewDartOSError());
    return;
  }
  if (bytes_read < length) {
    ...
    Dart_Handle array_view = Dart_Invoke(io_lib, DartUtils::NewString("_makeUint8ListView"), kNumArgs, dart_args);
    Dart_SetReturnValue(args, array_view);
  }
  ...
}

The function assumes that file->Read(...) reads all that is possible, and if it reads less, the file may have been truncated (i.e. file size changed from length to something less).

Though file->Read(...) only issues a single read() call instead of a loop until everything is read:

int64_t File::Read(void* buffer, int64_t num_bytes) {
  ASSERT(handle_->fd() >= 0);
  return TEMP_FAILURE_RETRY(read(handle_->fd(), buffer, num_bytes));
}
@mkustermann mkustermann added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. library-io P1 A high priority bug; for example, a single project is unusable or has many test failures type-bug Incorrect behavior (everything from a crash to more subtle misbehavior) labels Jan 19, 2023
@mkustermann
Copy link
Member Author

@aam Interested in fixing this?

@brianquinlan
Copy link
Contributor

I'm interested in fixing this and @aam is away.

@brianquinlan brianquinlan self-assigned this Jan 19, 2023
@brianquinlan brianquinlan added the triaged Issue has been triaged by sub team label Jan 19, 2023
@brianquinlan brianquinlan changed the title File.readBytesSync() doesn't fully read all data File.readAsBytesSync() doesn't fully read all data Jan 19, 2023
@brianquinlan
Copy link
Contributor

Looping until everything is read e.g.

int64_t File::Read(void* buffer, int64_t num_bytes) {
  ASSERT(handle_->fd() >= 0);

  int64_t num_bytes_read = 0;
  while (num_bytes_read < num_bytes) {
    // The behavior of `read(fildes, buf, nbyte)` where nbyte >= SSIZE_MAX is
    // implementation-defined by the POSIX standard. On Linux, up to SSIZE_MAX
    // bytes will be read.
    ssize_t result = TEMP_FAILURE_RETRY(
        read(handle_->fd(), reinterpret_cast<char*>(buffer) + num_bytes_read,
             num_bytes - num_bytes_read));
    if (result < 0) {
      return result;
    } else if (result == 0) {
      return num_bytes_read + result;
    } else {
      num_bytes_read += result;
    }
  }
  return num_bytes_read;
}

has the unintended effect of breaking our support for pipes i.e.

main() async {
  final pipe = await Pipe.create();
  int count = 0;
  pipe.write.add([1, 2, 3]);
  // Currently this will read the available bytes, generate the
  // listen event and exit cleanly.
  // With the above change, the call to `File::Read` will
  // spin forever waiting for data to appear.
  await pipe.read.listen((event) {
    print(event);
    pipe.write.close();
  });
}

There are other ways to fix this problem...

  1. we could always read the file in chunks in Dart code.
  2. we could add a allow_partial_reads parameter to File::Read()

For (1), we already do that for the async case:

@mkustermann
Copy link
Member Author

I'm interested in fixing this and @aam is away.

👍 Excellent!

we could always read the file in chunks in Dart code.

What's important for performance is that we avoid the extra step of combining several Uint8List chunks into a single one (extra copies). Ideally we also prefer as little Dart -> C++ calls as possible.

has the unintended effect of breaking our support for pipes i.e.

Could the caller of File::Read() do the repeated calls instead (it is given the length of bytes it's supposed to read from Dart)? Would that avoid the issue? (We do have File::ReadFully() which is different from File::Read())

we could add a allow_partial_reads parameter to File::Read()

Dart first gets the length of file and then instructs C++ to read that number of bytes. For pipes we don't know how many bytes are in the pipe, so distinguishing the two cases sounds reasonable.

@brianquinlan
Copy link
Contributor

Hmmm, I fixed this by modifying readAsBytesSync:

// The maximum number of bytes to read in a single call to `readSync`.
//
// On Windows and macOS, it is an error to call
// `read/_read(fildes, buf, nbyte)` with `nbyte >= INT_MAX`.
//
// The POSIX specification states that the behavior of `read` is
// implementation-defined if `nbyte > SSIZE_MAX`. On Linux, the `read` will
// transfer at most 0x7ffff000 bytes and return the number of bytes actually.
// transfered.
const int _maxReadSize = 2147483647;

  Uint8List readAsBytesSync() {
    var opened = openSync();
    try {
      var length = opened.lengthSync();
      var builder = new BytesBuilder(copy: false);
      if (length == 0) {
        // May be character device, try to read it in chunks.
        Uint8List data;
        do {
          data = opened.readSync(_blockSize);
          if (data.length > 0) {
            builder.add(data);
          }
        } while (data.length > 0);
      } else {
        // `readSync(bytes)` will over-allocate memory if `bytes` is greater
        // than the length of the file.
        //
        // `BytesBuilder` has an optimization where, if it contains a single
        // Uint8List, it does not copy memory in `takeBytes`.
        //
        // So the most efficient approach is to try to read the entire file
        // at once with a `bytes` not larger than the size of the file.
        Uint8List data;
        var bytesRemaining = length;
        do {
          data = opened.readSync(min(bytesRemaining, _maxReadSize));
          if (data.length > 0) {
            bytesRemaining -= data.length;
            builder.add(data);
          }
        } while (data.length > 0 && bytesRemaining > 0);
      }
      return builder.takeBytes();
    } finally {
      opened.closeSync();
    }
  }

But I didn't notice File::ReadFully(). Using File::ReadFully() would be more efficient than my approach in the case where length > _maxReadSize.

But File::ReadFully would need to be exposed to Dart code and ideally would need to be implemented per-platform to use the correct maximum read size. And I think that the use case is pretty niche.

OTOH, File::ReadFully is currently incorrect and using it would have the side effect of fixing that.

Unless you disagree, I'll try to fix/use File::ReadFully.

@aam
Copy link
Contributor

aam commented Jan 20, 2023

Isn't the issue actually with https://github.com/dart-lang/sdk/blob/main/runtime/platform/utils_linux.cc#L46

size_t Utils::Read(int filedes, void* buf, size_t nbyte) {
  return read(filedes, buf, nbyte);
}

and similar posix implementations where it won't read more than 2,147,479,552 bytes at a time(https://man7.org/linux/man-pages/man2/read.2.html)?

I don't see similar limitation to the _read for Windows implementation of Utils::Read (https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/read?view=msvc-170)

@mkustermann
Copy link
Member Author

If it's helpful you could also have two natives - one to read certain number of bytes (or less if EOF comes before that) and another that reads whatever is available in one read operation (pipe situation).

builder.add(data)

This may cause copies. A dart-only solution could allocate one Uint8List and use repeated calls to RandomAccessFile.readIntoSync(bytes, offset, length) to fill it.

@brianquinlan
Copy link
Contributor

@aam, yes the issue is that there is a max read size.

@mkustermann I was aware of the copy but it would only occur if the file is >2GB in length so I wasn't really worried about it. But readIntoSync is better. Still, I'll probably use ReadFully because that needs fixed anyway.

@aam
Copy link
Contributor

aam commented Jan 20, 2023

@aam, yes the issue is that there is a max read size.

Could we perhaps then adjust Utils::Read function on posix so it loops if it reads that "max read" number of bytes, less than nbyte value?

@brianquinlan
Copy link
Contributor

brianquinlan commented Jan 20, 2023

No, as I said in #51071 (comment), that will break files with no length so as pipes.

I think that fixing and then using ReadFully is a reasonable approach so I'm going to go with that.

@aam
Copy link
Contributor

aam commented Jan 20, 2023

No, as I said in #51071 (comment), that will break files with no length so as pipes.

I was thinking specifically about handling 2,147,479,552 as number of bytes read as it indicates that the internal posix read limit is hit.

@brianquinlan
Copy link
Contributor

ReadFully has slightly different semantics than readAsBytesSync:

  • with ReadFully, it is an error if the full length cannot be read.
  • with readAsBytesSync, just the readable bytes will be returned i.e. it is not an error if the file is truncated.

Someone went to a bunch of effort to make readAsBytesSync so I don't want to change it.

@brianquinlan
Copy link
Contributor

I have a change out for review here: https://dart-review.googlesource.com/c/sdk/+/279204

copybara-service bot pushed a commit that referenced this issue Jan 31, 2023
Bug: #51071
Change-Id: Ia64d803c9709b106e52a1c671c1c3288c051bd85
Tested: ci + new test
CoreLibraryReviewExempt: bug fix only for vm
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279204
Reviewed-by: Alexander Aprelev <[email protected]>
Reviewed-by: Martin Kustermann <[email protected]>
Commit-Queue: Brian Quinlan <[email protected]>
@brianquinlan
Copy link
Contributor

Fixed in 252015b

@vidhi002
Copy link

@aam, yes the issue is that there is a max read size.

@mkustermann I was aware of the copy but it would only occur if the file is >2GB in length so I wasn't really worried about it. But readIntoSync is better. Still, I'll probably use ReadFully because that needs fixed anyway.

can you give me example of ReadFully

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. library-io P1 A high priority bug; for example, a single project is unusable or has many test failures triaged Issue has been triaged by sub team type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)
Projects
None yet
Development

No branches or pull requests

4 participants