ExtractMetadata becomes non-responsive after individual error #52

jnathanh · 2022-03-26T23:14:14Z

Hi, thanks for sharing this package. I ran into an error case trying to do a filepath.Walk over my photos directory. Manually skipping directories does avoid this issue, but it seems like the wrong behavior for ExtractMetadata to silently start giving incorrect results as a result of a previous call. I'd like to incorporate this package into some of my workflows, so I'm raising the issue to help improve the package's usability/reliability. Here is a test that reproduces the behavior:

func TestNonResponsiveExiftool(t *testing.T) {
	t.Parallel()

	// create test directory with about 200MB worth of photos
	// 10,000 of the included jpg or 6 30MB CR2's both repro
	dirPath := "./testdata/extractkiller"

	err := os.Mkdir(dirPath, 0744)
	require.NoError(t, err)
	defer os.RemoveAll(dirPath)

	for i:=0;i < 10_000;i++ {
		err := copyFile("./testdata/20190404_131804.jpg", path.Join(dirPath, fmt.Sprintf("%d.jpg", i)))
		require.NoError(t, err)
	}


	// initialize exiftool
	e, err := NewExiftool(Debug())
	require.NoError(t, err)
	defer e.Close()


	// control case, everything working normally
	f := e.ExtractMetadata("./testdata/20190404_131804.jpg")
	assert.Equal(t, 1, len(f))
	assert.NoError(t, f[0].Err)

	// case that breaks the session, reading a directory does not seem to be supported by this package, which is okay, and I would expect it to throw an error in that case, but it should not make future reads invalid.
	f = e.ExtractMetadata(dirPath)
	assert.Error(t, f[0].Err)

	// control case no longer works
	f = e.ExtractMetadata("./testdata/20190404_131804.jpg")
	assert.Equal(t, 1, len(f))
	assert.NoError(t, f[0].Err)

}

I suspect the issue is either a race (scanning output before it's finished), or improper buffer usage (overflow maybe?). I ran the same test with exiftool directly and got output right away without issue.

One other note, it looks like exiftool supports paths to a directory as an input, or even a list of multiple files. It might be worth reworking the design to let exiftool handle the work of managing multiple paths (directories or files), and add support for parsing the multi-file output.

I'll try to work on a PR for this, but posting the issue now as a heads up, and to see if you have any suggestions on the right design to fix the issue.

barasher · 2022-03-28T06:44:57Z

Hi @jnathanh !

Since providing a folder makes the library bug, I think that the first thing to do is to reject folder paths.

I'm not convinced that handling folders is usefull. As a CLI it is (for exiftool), but as a library, if it does not increase performance, I think that keeping the library as simple as possible is better. What to you think ?

jnathanh · 2022-03-29T06:46:42Z

@barasher I agree it's best to stick with the simplest fix, and I don't personally need to pass the directory so I think your first instinct to leave out directory support for now is probably the right call.

I did a little more digging and found the root issue, the buffer is getting exhausted. In ExtractMetadata, when the Scan() returns false, we currently assume there are no errors and continue (not reporting the error and leaving the Scanner in a bad state). I happened to find it via loading multiple files (from a directory), but it could theoretically manifest for a single large metadata file.

I started on a fix, but it gets messy without breaking changes to the api. The obvious change is to report actual scan errors first before the "no content" error. What is not as straight-forward is how to address the bad scanner state. Here are the different options I have explored:

replace bufio.Scanner with bufio.Reader (or really any io.Reader implementation)
- The bufio package recommends using bufio.Reader for the sequential read use case (see the end of the description for Scanner). We can see why with the scanner state issue we are running into here.
- This breaks the Buffer() option (how important is the use case for this option?)
Reset the scanner
- Breaks the Buffer() option, or you have to add a bunch of logic to check if the buffer was configured and somehow clear that byte array
Exit early and report a bad client state
- panic (kind of heavy-handed)
- add an error return value to the signature (breaking change)
- depend on the file-level error (this does not make it clear the client is no longer usable)

I suggest option 1 for the more reliable design, or option 3 for the simplest solution (and least api disruption).

(#52 fix)

barasher · 2022-03-29T20:02:51Z

Thanks for the analysis !

Backward compatibility is very important to me. I hate when libraries I use have breaking changes in their API :p
Since I think that handling a directory is not that useful, I've introduced a new sentinel error ErrNotFile that can be return when invoking extractMetadata and I've chosen to implement the option 3 you proposed.

Here is the PR : #55

What do you think ?

jnathanh · 2022-03-30T05:46:57Z

@barasher the directory check seems like a sensible feature given this library's lack of support for directories. However, it only addresses a symptom of the issue I was trying to raise. You could also trigger the same issue with a single file, if it is big enough to fill the scanner's buffer (see test below). This is likely an edge case since metadata files are usually small (at least the photo metadata I'm used to working with). So, you may want to leave it unresolved for now, however, I want to make sure you're aware that is the decision you are making by only addressing the issue with the directory check.

func TestLargeMetadata(t *testing.T) {
	t.Parallel()

	bigFile, cleanup := createFileLargerThan(t, bufio.MaxScanTokenSize)
	defer cleanup()

	e, err := NewExiftool()
	require.NoError(t, err)
	defer e.Close()

	// big file should not error, but if it does, it should output the actual source of the error
	bigResults := e.ExtractMetadata(bigFile)
	require.Len(t, bigResults, 1)
	assert.NoError(t, bigResults[0].Err)

	// subsequent small file should also not be affected by previous file reads
	smallResults := e.ExtractMetadata("testdata/20190404_131804.jpg")
	require.Len(t, smallResults, 1)
	assert.NoError(t, smallResults[0].Err)
}

func createFileLargerThan(t *testing.T, byteCount int) (path string, cleanup func()) {
	t.Helper()

	dirPath := t.TempDir()
	cleanup = func() { os.RemoveAll(dirPath) }

	path = filepath.Join(dirPath, "20190404_131804.jpg")
	require.Nil(t, copyFile("testdata/20190404_131804.jpg", path))

	e, err := NewExiftool()
	require.NoError(t, err)
	defer e.Close()

	mds := e.ExtractMetadata(path)
	require.Len(t, mds, 1)
	require.NoError(t, mds[0].Err)

	valueLargerThanBuffer := ""
	for i := 0; i < byteCount+1; i++ {
		valueLargerThanBuffer += "x"
	}

	mds[0].SetString("Comment", valueLargerThanBuffer)
	mds[0].Err = nil

	e.WriteMetadata(mds)
	require.NoError(t, mds[0].Err)

	return
}

barasher · 2022-03-30T20:29:22Z

I totally agree.

Nevertheless, currently, the Buffer(...) option can be used to customize go-exiftool's buffer. This way, we can extract any "amount" of metadata.

The only advantage I see using a bufio.Reader instead of a bufio.Scanner is the "streaming" (gradually) consumption of the buffer. This way, we might need a smaller buffer but the implementation will be really tricky.

jnathanh · 2022-03-31T04:19:15Z

Ya, I think using a larger buffer is an acceptable solution. The problem is that a new user, using the program defaults, would never find out that is the issue without extensive debugging. Ideally the program should report that the buffer filled up so a user know they need to use a larger buffer. Even worse, the output actually reports that the metadata was empty, so without investigating further, a user would think that there was just not metadata in the file.

Based on your feedback so far, I suggest adding a panic. That would make it clear what the issue is and not report back incorrect metadata results.

* Error ErrNotFile returned when extracting metadata from folder (#52 fix) * Test simplification

barasher · 2022-04-10T20:05:26Z

I created #56 to migrate from Scanner to Reader (and to return a cleaner response).
Anyway, thanks a lot for the bug report !
Fix will be released in v1.7.1

barasher added the bug Something isn't working label Mar 28, 2022

barasher added this to the v1.7.1 milestone Mar 28, 2022

barasher added a commit that referenced this issue Mar 29, 2022

Error ErrNotFile returned when extracting metadata from folder

9b78882

(#52 fix)

barasher mentioned this issue Mar 29, 2022

Error ErrNotFile returned when extracting metadata from folder #55

Merged

barasher added a commit that referenced this issue Apr 10, 2022

Error ErrNotFile returned when extracting metadata from folder (#55)

c725199

* Error ErrNotFile returned when extracting metadata from folder (#52 fix) * Test simplification

barasher mentioned this issue Apr 10, 2022

Migrate from bufio.Scanner to bufio.Reader #56

Open

barasher closed this as completed Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ExtractMetadata becomes non-responsive after individual error #52

ExtractMetadata becomes non-responsive after individual error #52

jnathanh commented Mar 26, 2022

barasher commented Mar 28, 2022

jnathanh commented Mar 29, 2022

barasher commented Mar 29, 2022

jnathanh commented Mar 30, 2022

barasher commented Mar 30, 2022

jnathanh commented Mar 31, 2022

barasher commented Apr 10, 2022

ExtractMetadata becomes non-responsive after individual error #52

ExtractMetadata becomes non-responsive after individual error #52

Comments

jnathanh commented Mar 26, 2022

barasher commented Mar 28, 2022

jnathanh commented Mar 29, 2022

barasher commented Mar 29, 2022

jnathanh commented Mar 30, 2022

barasher commented Mar 30, 2022

jnathanh commented Mar 31, 2022

barasher commented Apr 10, 2022