Skip to content

csv: remove old workaround in UTF8Reader #546

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 12, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,16 @@ public final class UTF8Reader

private byte[] _inputBuffer;

/**
* Flag set to indicate {@code inputBuffer} is read-only, and its
* content should not be modified. This is the case when caller
* has passed in a buffer of contents already read, instead of Jackson
* allocating read buffer.
*
* @since 2.19
*/
private final boolean _inputBufferReadOnly;

/**
* Pointer to the next available byte (if any), iff less than
* <code>mByteBufferEnd</code>
Expand Down Expand Up @@ -73,7 +83,10 @@ public UTF8Reader(IOContext ctxt, InputStream in, boolean autoClose,
_inputBuffer = buf;
_inputPtr = ptr;
_inputEnd = ptr+len;
_autoClose = autoClose;
_autoClose = autoClose;
// Unmodifiable if there is no stream to actually read from
// (ideally caller should pass explicitly)
_inputBufferReadOnly = (in == null);
}

public UTF8Reader(IOContext ctxt, byte[] buf, int ptr, int len)
Expand All @@ -85,6 +98,8 @@ public UTF8Reader(IOContext ctxt, byte[] buf, int ptr, int len)
_inputPtr = ptr;
_inputEnd = ptr+len;
_autoClose = true;
// This is the case when we have a buffer of contents already read
_inputBufferReadOnly = true;
}

public UTF8Reader(IOContext ctxt, InputStream in, boolean autoClose)
Expand All @@ -96,15 +111,8 @@ public UTF8Reader(IOContext ctxt, InputStream in, boolean autoClose)
_inputPtr = 0;
_inputEnd = 0;
_autoClose = autoClose;
}

/**
* Method that can be used to see if we can actually modify the
* underlying buffer. This is the case if we are managing the buffer,
* but not if it was just given to us.
*/
protected final boolean canModifyBuffer() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although protected, not usable from outside, so can just remove.

return (_ioContext != null);
// Buffer allocated above, modifiable as needed
_inputBufferReadOnly = false;
}

/*
Expand Down Expand Up @@ -400,27 +408,17 @@ private boolean loadMore(int available) throws IOException
{
_byteCount += (_inputEnd - available);

// Bytes that need to be moved to the beginning of buffer?
if (available > 0) {
// Should we move bytes to the beginning of buffer?
if (_inputPtr > 0) {
if (!canModifyBuffer()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you feel this is no longer needed? This is to prevent reader from modifying buffer caller gave, in case moving would occur.

If anything, looks like it should avoid copying even if _inputSource was not null.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. It'd solve #497?

We do need to avoid modifying buffer, however, I don't think this PR quite works in that sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test pass without this check. The existing tests and we gain an extra working test (the 497 test).

I don't know how to rewrite this because I don't have a broken test to work with. My main aim is not to have the 497 test (which is a valid test) fail. I'm not against revisiting this but since this change breaks no tests (other than minor woring diff in exception message on 1 test) and fixes one broken test, I'd prefer to remove this check for now. If this was merged, I would follow up and remove the same check in the TOML and YAML code bases because there is no proof that this check helps there either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to have a test to verify underlying buffer is not modified when it shouldn't be -- which is what check does -- but even without that we should not make change that will now unexpectedly do that.

Specifically this is the case where caller passes byte[] containing UTF-8 content: expectation is that this buffer is not modified. And as such it should not be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • ok, so I extended the tests to see if array input gets modified and it does
  • is there anyway that the parser could be modified to clone the array when we hit this case and to work with the cloned array - to avoid modifying the input array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added back the code check for whether the buffer is modifiable - but instead of throwing an exception, I clone the buffer to avoid modifying the original. The cloning only happens in rare cases but I think it is worth the runtime hit to fix the broken test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and used the idea, just rewrote slightly. Thank you for test to verify buffer not modified!

Ideally wouldn't need to move things around, will see what I can do there. But at least I can go ahead and merge this now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, can just remove copying action altogether.

// 15-Aug-2022, tatu: Occurs (only) if we have half-decoded UTF-8
// characters; uncovered by:
//
// https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=50036
//
// and need to be reported as IOException
if (_inputSource == null) {
throw new IOException(String.format(
"End-of-input after first %d byte(s) of a UTF-8 character: needed at least one more",
available));
// Can only do so if buffer mutable
if (!_inputBufferReadOnly) {
for (int i = 0; i < available; ++i) {
_inputBuffer[i] = _inputBuffer[_inputPtr+i];
}
_inputPtr = 0;
_inputEnd = available;
}
for (int i = 0; i < available; ++i) {
_inputBuffer[i] = _inputBuffer[_inputPtr+i];
}
_inputPtr = 0;
_inputEnd = available;
}
} else {
// Ok; here we can actually reasonably expect an EOF, so let's do a separate read right away:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,16 +6,19 @@
import org.junit.jupiter.api.Test;

import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.ModuleTestBase;

import static org.assertj.core.api.Assertions.fail;
import static org.junit.jupiter.api.Assertions.assertArrayEquals;

/**
* Collection of OSS-Fuzz found issues for CSV format module.
*/
public class FuzzCSVReadTest extends StreamingCSVReadTest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why base class was chosen that way; was wrong. :)

public class FuzzCSVReadTest extends ModuleTestBase
{
private final CsvMapper CSV_MAPPER = mapperForCsv();
private final byte[] INPUT = new byte[] { 0x20, (byte) 0xCD };
private final byte[] CLONED = INPUT.clone();

// https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=50036
@Test
Expand All @@ -25,8 +28,9 @@ public void testUTF8Decoding50036() throws Exception
CSV_MAPPER.readTree(INPUT);
fail("Should not pass");
} catch (IOException e) {
verifyException(e, "End-of-input after first 1 byte");
verifyException(e, "of a UTF-8 character");
verifyException(e, "Unexpected EOF in the middle of a multi-byte UTF-8 character");
// check input was not modified
assertArrayEquals(CLONED, INPUT);
}
}

Expand All @@ -38,6 +42,8 @@ public void testUTF8Decoding50036Stream() throws Exception
fail("Should not pass");
} catch (IOException e) {
verifyException(e, "Unexpected EOF in the middle of a multi-byte UTF-8 character");
// check input was not modified
assertArrayEquals(CLONED, INPUT);
}
}
}
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
package com.fasterxml.jackson.dataformat.csv.tofix;
package com.fasterxml.jackson.dataformat.csv.deser;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
Expand All @@ -8,7 +8,6 @@
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.ModuleTestBase;
import com.fasterxml.jackson.dataformat.csv.testutil.failure.JacksonTestFailureExpected;

import static org.junit.jupiter.api.Assertions.*;

Expand All @@ -18,7 +17,6 @@ public class UnicodeCSVRead497Test extends ModuleTestBase
private final CsvMapper MAPPER = mapperForCsv();

// [dataformats-text#497]
@JacksonTestFailureExpected
@Test
public void testUnicodeAtEnd() throws Exception
{
Expand All @@ -35,12 +33,15 @@ public void testUnicodeAtEnd() throws Exception
public void testUnicodeAtEnd2() throws Exception
{
String doc = buildTestString2();
final byte[] bytes = doc.getBytes(StandardCharsets.UTF_8);
JsonNode o = MAPPER.reader() //.with(schema)
.readTree(doc.getBytes(StandardCharsets.UTF_8));
.readTree(bytes);
assertNotNull(o);
assertTrue(o.isArray());
assertEquals(1, o.size());
assertEquals(o.get(0).textValue(), doc);
// check byte array was not modified
assertArrayEquals(doc.getBytes(StandardCharsets.UTF_8), bytes);
}

@Test
Expand Down