Convolutional net gives different results under ML.NET and Keras/TensorFlow #2022

Closed
mareklinka opened this issue Jan 4, 2019 · 11 comments
Labels
bug Something isn't working

Comments

@mareklinka
Contributor

System information

  • OS version/distro: Windows 10 1809
  • .NET Version (eg., dotnet --info):
    .NET Core SDK (reflecting any global.json):
    Version: 2.2.100
    Commit: b9f2fa0ca8

Runtime Environment:
OS Name: Windows
OS Version: 10.0.17763
OS Platform: Windows
RID: win10-x64
Base Path: C:\Program Files\dotnet\sdk\2.2.100\

Host (useful for support):
Version: 2.2.0
Commit: 1249f08fed

.NET Core SDKs installed:
2.1.202 [C:\Program Files\dotnet\sdk]
2.1.500 [C:\Program Files\dotnet\sdk]
2.1.502 [C:\Program Files\dotnet\sdk]
2.2.100 [C:\Program Files\dotnet\sdk]

.NET Core runtimes installed:
Microsoft.AspNetCore.All 2.1.6 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.2.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.6 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.2.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.0.9 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.6 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.2.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]

Issue

I'm facing an issue while trying to score my Keras-trained, TensorFlow-backed model using ML.NET 0.8.

The ML task

This is a rather small convolutional neural net that inputs an image and attempts to identify two areas of interest in the image. We are attempting to read some data off a hardware device.

The model

We have a trained Keras model (.h5) and we converted it into a frozen TensorFlow model (.pb). The net contains 4 convolutional layers and 1 dense layer (the output), and uses ReLU activations. The output of the model is an array of 8 numbers: [X1_1, Y1_1, X2_1, Y2_1, X1_2, Y1_2, X2_2, Y2_2]. These are the top-left and bottom-right corners of the two areas of interest.
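
To make the output layout concrete, here is a minimal Python sketch that splits the 8-element prediction into two boxes. It assumes the first four values belong to the first area and the last four to the second, each as top-left corner followed by bottom-right corner. The `Box` type and `unpack_boxes` helper are hypothetical names, not part of the repro; the sample values are the first Keras prediction quoted later in this issue.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle given by its top-left and bottom-right corners."""
    x1: float
    y1: float
    x2: float
    y2: float


def unpack_boxes(prediction: List[float]) -> List[Box]:
    """Split a flat 8-element prediction into two boxes.

    Assumption: values 0-3 describe the first area and values 4-7 the
    second, each ordered as (x1, y1, x2, y2).
    """
    if len(prediction) != 8:
        raise ValueError(f"expected 8 values, got {len(prediction)}")
    return [Box(*prediction[i:i + 4]) for i in (0, 4)]


boxes = unpack_boxes([96, 182, 166, 247, 218, 179, 291, 247])
for box in boxes:
    print(box, "width:", box.x2 - box.x1, "height:", box.y2 - box.y1)
```

Drawing these rectangles on the source image is a quick way to check visually whether a prediction is plausible.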

Netron diagram:

[image: current_model_best]

This model works very well for our use case when we execute it using Keras.

The ML.NET scoring setup

Since our main application is written in C#, I tried to use ML.NET to access the predictions from C# code. The pipeline is set up like this:

var pipeline =
    new ImageLoadingEstimator(env, imagePath, (nameof(AreaInput.Path), "ImageData"))
    .Append(new ImageResizingEstimator(env, "ImageData", "ImageResized", Areas.ImageWidth,
        Areas.ImageHeight))
    .Append(new ImagePixelExtractingEstimator(env,
        new ImagePixelExtractorTransform.ColumnInfo(
            "ImageResized", Areas.InputLayerName,
            ImagePixelExtractorTransform.ColorBits.Rgb, interleave: true,
            scale: 1 / 255f)))
    .Append(new TensorFlowEstimator(env, modelPath, new[] {Areas.InputLayerName},
        new[] {Areas.OutputLayerName}));

On the surface, we do the same thing as with Keras:

  1. Load an image from the file system
  2. Resize the image to 400x400 (the model's expected input size)
  3. Convert the image to a vector (dimensions of (batch_size, 400, 400, 3) - we are using RGB mode)
  4. Feed this vector into the TF model
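
As a rough illustration of step 3, the Python sketch below computes where a pixel's channels land in a flattened channels-last vector and applies the 1/255 scaling used in the pipeline. It assumes a row-major layout (NumPy's default order); `flat_index`, `scale_pixel`, and the constants are names made up for this example.

```python
WIDTH, HEIGHT, CHANNELS = 400, 400, 3


def flat_index(row: int, col: int, channel: int) -> int:
    """Offset of one channel value in a row-major, channels-last (H, W, C) vector."""
    return (row * WIDTH + col) * CHANNELS + channel


def scale_pixel(rgb):
    """Map 8-bit channel values to [0, 1], mirroring `scale: 1 / 255f`."""
    return tuple(v / 255.0 for v in rgb)


total_values = WIDTH * HEIGHT * CHANNELS
print(total_values)               # 480000
print(flat_index(0, 1, 0))        # 3: red channel of the second pixel in row 0
print(flat_index(1, 0, 0))        # 1200: red channel of the first pixel in row 1
print(scale_pixel((255, 0, 51)))  # (1.0, 0.0, 0.2)
```

If the producer of the vector and the model disagree on this index formula, the values are all present but land at the wrong positions.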

Then we make a predictor function from this pipeline and use that to score examples.

The issue

There are actually two issues with this. One is a problem with interoperability - the ImageResizingEstimator is doing something very different than Keras' load_img:

  • load_img(x, target_size=(400,400), color_mode="rgb") loads the image in RGB and resizes it to 400x400. The image ends up "squeezed" into a square
  • ImageResizingEstimator, when resizing to 400x400, either pads or crops the source image to 400x400. The result of this operation is therefore very different.
  • Resizing the source images to 400x400 BEFORE feeding them into the ImageResizingEstimator actually leads to an ArgumentException - the resulting bitmap is invalid when it gets to the ImagePixelExtractingEstimator stage.
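
The difference between the two behaviors can be sketched with plain arithmetic. The Python below is an illustration only: it is not ML.NET's or Keras' actual implementation, and the function names and the 800x600 source size are hypothetical. It compares a stretch that ignores aspect ratio with an aspect-preserving resize that is then padded or cropped.

```python
def stretch_size(src_w, src_h, dst_w, dst_h):
    """Ignore aspect ratio: the whole source is squeezed into the target."""
    return dst_w, dst_h


def aspect_preserving_size(src_w, src_h, dst_w, dst_h, mode):
    """Scaled image size before padding ("pad") or cropping ("crop").

    "pad" scales so the image fits inside the target (borders get filled);
    "crop" scales so the image covers the target (overflow gets cut off).
    """
    scale_x = dst_w / src_w
    scale_y = dst_h / src_h
    scale = min(scale_x, scale_y) if mode == "pad" else max(scale_x, scale_y)
    return round(src_w * scale), round(src_h * scale)


# A hypothetical 800x600 source resized to the model's 400x400 input.
print(stretch_size(800, 600, 400, 400))                    # (400, 400)
print(aspect_preserving_size(800, 600, 400, 400, "pad"))   # (400, 300)
print(aspect_preserving_size(800, 600, 400, 400, "crop"))  # (533, 400)
```

In the padded case 100 rows of the 400x400 result are filler, and in the cropped case roughly 133 columns of the scaled image are discarded, so coordinates learned on stretched images no longer line up.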

This is mainly an issue with documentation, I guess - it would be very helpful to declare this behavior explicitly, especially for interoperability scenarios such as this. At any rate, while realizing this, we went and resized our training images to 401x401 (to bypass the third bullet point issue) and retrained the model.

Our theory was that this would both let the resizer do its job without throwing and avoid the difference in resize behavior, since resizing from 401x401 to 400x400 does not change the bitmap much. So the expectation was that feeding the same 401x401 image to Keras and ML.NET would produce the same predictions.

This did not happen. The ML.NET code seems to work and produces predictions, but these predictions are way off from the Keras results (the Keras/TF results are actually correct, which we verified by drawing the resulting coordinates on the images).

So the problem here is: is our TF setup wrong? Or is this some underlying difference in behavior which we can't see? I'm attaching repro code as well. The ML.NET part is a .NET Core 2.2 project. To run the Keras part, you'll need the whole Keras stack installed (Keras + TensorFlow). The models and some sample images are included as well.

Sample output demonstrating the issue:

ML.NET predictions for three images:
78,151,124,196,157,152,208,195
83,160,131,208,167,162,220,207
82,161,136,212,174,163,229,211

Keras output for the same three images:
[96, 182, 166, 247, 218, 179, 291, 247]
[97, 186, 169, 253, 222, 182, 297, 252]
[99, 252, 172, 314, 225, 249, 294, 311]
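
To quantify how far apart the two sets of predictions are, the following Python snippet computes the largest absolute per-coordinate difference for each image, using the numbers above. The variable and function names are chosen for this example only.

```python
mlnet = [
    [78, 151, 124, 196, 157, 152, 208, 195],
    [83, 160, 131, 208, 167, 162, 220, 207],
    [82, 161, 136, 212, 174, 163, 229, 211],
]
keras = [
    [96, 182, 166, 247, 218, 179, 291, 247],
    [97, 186, 169, 253, 222, 182, 297, 252],
    [99, 252, 172, 314, 225, 249, 294, 311],
]


def max_abs_diff(a, b):
    """Largest absolute difference between two equally long coordinate lists."""
    return max(abs(x - y) for x, y in zip(a, b))


diffs = [max_abs_diff(m, k) for m, k in zip(mlnet, keras)]
for i, d in enumerate(diffs, start=1):
    print(f"image {i}: max abs diff = {d}")
# image 1: max abs diff = 83
# image 2: max abs diff = 77
# image 3: max abs diff = 102
```

Differences of this size on a 400x400 input are far larger than rounding noise, which points at a systematic difference in how the input is prepared.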

Some additional details:
We tested this against both TensorFlow and TensorFlow-GPU, and the issue persists, so it's not a matter of running on GPU vs. CPU. One possible explanation would be that I froze the TF graph incorrectly by specifying the wrong final node, but I'm pretty sure TensorBoard is telling me that final_layer/BiasAdd is my final node:
[image: TensorBoard screenshot]

Any help with tracing this down would be greatly appreciated - operationalizing a TF model directly from C# is an awesome capability to have.

Source

MLNetConvNetRepro.zip

@mareklinka
Contributor Author

mareklinka commented Jan 4, 2019

I did some more investigation on this. First, I focused on the ImageResizingEstimator. What I noticed there is that the bitmap's BitmapFormat changes after the resize operation. My JPEGs are 24bpp when loaded and 32bpp after resizing. However, since I'm using RGB mode on the pixel extraction step, this most likely doesn't matter.

Next, I tried reconstructing the bitmap after the ImagePixelExtractingEstimator step using the following code:

var stage3 = stage2.Append(new ImagePixelExtractingEstimator(env,
    new ImagePixelExtractorTransform.ColumnInfo(
        "ImageResized", Areas.InputLayerName,
        ImagePixelExtractorTransform.ColorBits.Rgb, interleave: true,
        scale: 1 / 255f)));
var vector = (VBuffer<float>) stage3.Preview(data).RowView.First().Values.Last().Value;

var reconstruction = new Bitmap(400, 400);
var floats = vector.DenseValues().ToList();

var arrayPointer = 0;
for (var col = 0; col < 400; ++col)
{
    for (var row = 0; row < 400; ++row)
    {
        reconstruction.SetPixel(col, row,
            Color.FromArgb((int)(floats[arrayPointer++] * 255), (int)(floats[arrayPointer++] * 255),
                (int)(floats[arrayPointer++] * 255)));
    }
}

reconstruction.Save("reconstructed.jpg", ImageFormat.Jpeg);

This gave me back pretty much the same image as the original input, no problems there. I even ran the Keras model on this reconstructed image and it gave me correct answers, while the ML.NET run produced the wrong output again. Note that the reconstruction fills the bitmap by columns. This is because, as I noticed while reading the source for ImagePixelExtractorTransform, the extractor fills the vector COLUMN FIRST. (I'm using Interleave=True for channels-last behavior.)

This was interesting, so I went and checked what Keras' img_to_array does. It turns out the two are not the same: img_to_array fills BY ROW:

Keras code and result:

img_array = img_to_array(image)  # shape (400, 400, 3), channels last
# Flatten in NumPy's default row-major order: 400 * 400 * 3 = 480000 values.
print(img_array.reshape((480000,)))
# produces
# [195. 198. 155. 215. 218. 175. 201. 204. 159. 191. ...  48.  91. 100.  97.  98. 107. 104.  93. 102.  99.]

This corresponds directly to pixel color values (195 198 155) (215 218 175) (201 204 159) etc, which form a row in the image, not a column.

Compare that to what ML.NET does:

var vector = (VBuffer<float>) stage3.Preview(data).RowView.First().Values.Last().Value;
var floats = vector.DenseValues().ToList();
// produces
// 195 198 155 191 193 154 179 180 149 208

This corresponds to pixel values (195 198 155) (191 193 154) (179 180 149) etc, which form a column in the image. I confirmed the orientations visually using Paint.Net just to make sure.
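
The two traversal orders can be reproduced on a tiny image in plain Python. This is a sketch, not the ML.NET or Keras code: `flatten_rows_first` and `flatten_columns_first` are illustrative names, and the 2x3 image is made up.

```python
# A 2-row x 3-column "image"; each pixel is an (R, G, B) tuple.
image = [
    [(1, 1, 1), (2, 2, 2), (3, 3, 3)],
    [(4, 4, 4), (5, 5, 5), (6, 6, 6)],
]


def flatten_rows_first(img):
    """Walk each row left to right, keeping channels interleaved."""
    return [v for row in img for pixel in row for v in pixel]


def flatten_columns_first(img):
    """Walk each column top to bottom, keeping channels interleaved."""
    height, width = len(img), len(img[0])
    return [v for c in range(width) for r in range(height) for v in img[r][c]]


print(flatten_rows_first(image)[:9])     # [1, 1, 1, 2, 2, 2, 3, 3, 3]
print(flatten_columns_first(image)[:9])  # [1, 1, 1, 4, 4, 4, 2, 2, 2]
```

Both vectors start with the same pixel and contain the same values, which is why the mismatch only shows up once the second pixel is compared.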

This leads me to conclude that ML.NET is feeding the image into TensorFlow rotated by 90 degrees, leading to wrong results (obviously). To confirm the hypothesis, I went back to the image, flipped it horizontally and rotated it 90 degrees counter-clockwise - this means what was originally a row was now a column. Feeding this flip-rotated image into the ML.NET code, without any other changes, gave me back CORRECT results:

# python: [96, 182, 166, 248, 218, 179, 291, 247]
// ML.NET: 97,184,167,250,219,180,290,247

There are some small differences but I attribute those to GPU vs. CPU precision differences. Good enough.
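
Why the flip-and-rotate workaround helps can be checked on a small grid: flipping horizontally and then rotating 90 degrees counter-clockwise is the same as transposing, so a column-first walk over the transformed image visits pixels in the original row-first order. A minimal Python sketch with made-up helper names and single-value pixels for brevity:

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in img]


def rotate_ccw(img):
    """Rotate 90 degrees counter-clockwise: the last column becomes the first row."""
    height, width = len(img), len(img[0])
    return [[img[r][width - 1 - c] for r in range(height)] for c in range(width)]


def columns_first(img):
    """Column-by-column traversal, as observed from the pixel extractor."""
    height, width = len(img), len(img[0])
    return [img[r][c] for c in range(width) for r in range(height)]


def rows_first(img):
    """Row-by-row traversal, as produced on the Keras side."""
    return [p for row in img for p in row]


original = [
    [1, 2, 3],
    [4, 5, 6],
]
workaround = rotate_ccw(flip_horizontal(original))
print(workaround)                 # [[1, 4], [2, 5], [3, 6]]
print(columns_first(workaround))  # [1, 2, 3, 4, 5, 6]
print(rows_first(original))       # [1, 2, 3, 4, 5, 6]
```

For the square 400x400 input used here the transposed image has the same dimensions, so it passes through the rest of the pipeline unchanged.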

So the bottom line here is the problem is either on my side (training on row-based data when it should have been column-based data) or in the ImagePixelExtractorTransform (extracting by column first). Thoughts? Is there a reason why ImagePixelExtractorTransform works as it does currently?

At any rate, I now have, if not a solution, a workaround at least. Additionally, the issue still remains that ImageResizingEstimator can't resize to the same size as the original image (400x400 -> 400x400 causes an exception when the bitmap reaches the pixel extraction step) and there is no way to just not use ImageResizingEstimator and go directly to ImagePixelExtractorTransform (because of unknown bitmap size and no way to specify the bitmap size manually, AFAIK).

yaeldekel self-assigned this Jan 4, 2019
@yaeldekel

Hi @mareklinka , thanks for opening this issue and for the detailed explanation of your investigation.
Regarding the column-major vs. row-major representation, we opened an issue about this a while ago (#765); I just moved it out of the backlog so we can triage it again.
I will look into the resizing exception and get back to you about it.

@mareklinka
Contributor Author

mareklinka commented Jan 5, 2019

Thanks, @yaeldekel. Speaking of the resize transform, would it be possible to add a third resize mode that does not preserve the aspect ratio and instead squeezes/expands the image into the new dimensions? Something like ResizingKind.IgnoreAspectRatio or ResizingKind.Fill. I think this would greatly help interoperability for various computer vision tasks where the resizing is not done by padding/cropping.

It could be achieved by using something like this SO answer.

This even looks like something I could take a look at implementing, if deemed acceptable.

@yaeldekel

Hi @mareklinka, would it be possible for you to share one of the 400x400 images that gave you an exception? I have tried reproducing this exception with a few images that I have on my machine, but haven't been successful.

yaeldekel added the need info label Jan 7, 2019
@mareklinka
Contributor Author

No problem, @yaeldekel. I reproduced this as a unit test project - just hit dotnet test and you should get one failing test. The message states "Operation is not valid due to the current state of the object." This seems to be unrelated to the image in question - for the repro I just used a random old screenshot.

One test resizes from full-size image to 400x400. This passes.
One test resizes from 400x400 to 401x401. This passes.
One test resizes from 400x400 to 400x400. This fails.

Full stack:

Error Message:
 System.InvalidOperationException : Operation is not valid due to the current state of the object.
Stack Trace:
   at Microsoft.ML.Runtime.ImageAnalytics.ImagePixelExtractorTransform.Mapper.<>c__DisplayClass5_0`1.<GetGetterCore>b__1(VBuffer`1& dst)
   at Microsoft.ML.Data.DataDebuggerPreview.<>c__DisplayClass12_0`1.<MakeSetter>b__0(RowInfo rowInfo, List`1 list)
   at Microsoft.ML.Data.DataDebuggerPreview..ctor(IDataView data, Int32 maxRows)
   at Microsoft.ML.DebuggerExtensions.Preview(IEstimator`1 estimator, IDataView data, Int32 maxRows, Int32 maxTrainingRows)
   at ResizeErrorRepro.UnitTest1.TestCore(MLContext env, IDataView data, Int32 size) in C:\Source\Personal\WinMlOnnxTest\ResizeErrorRepro\UnitTest1.cs:line 86
   at ResizeErrorRepro.UnitTest1.ResizeToSameSizeTest() in C:\Source\Personal\WinMlOnnxTest\ResizeErrorRepro\UnitTest1.cs:line 45

ResizeErrorRepro.zip

@yaeldekel

Hi @mareklinka, I still could not reproduce this error using the latest ML.NET. I think it might have something to do with this commit: 284e02c#diff-b0cd912107c73a762b5a5b0e99942387L486
which was checked in a few days after the release of v0.8. Version 0.9 was released today; could you kindly try running your tests with the new version to verify this?

@mareklinka
Contributor Author

You are right, @yaeldekel, the exception does not occur using the 0.9 nuget packages. Thanks for the investigation!

@mareklinka
Contributor Author

If I understand things correctly, #2130 should resolve the issue with ImagePixelExtractor loading images by column first, right (when Interleave=true)? This means this issue should potentially be resolved as well, I think.

@mareklinka
Contributor Author

Ah, there is one more point to this issue - the resize mode. Having the Fill resize mode would solve the final outstanding point.

@Ivanidzo4ka
Contributor

@mareklinka Any chance you would be willing to add a Fill mode to the transform? It shouldn't be super hard; the main code that does the resizing is here:


Basically, all you need to do is set destX and destY to the desired values for the new enum type.

Most of us are quite busy with the API and documentation before the 1.0 release, so the chance of us implementing any feature request is quite small.

@mareklinka
Contributor Author

@Ivanidzo4ka Of course, I can take a crack at it. As you said, it shouldn't be too hard.

shauheen added the bug label and removed the need info label Feb 4, 2019
ghost locked as resolved and limited conversation to collaborators Mar 25, 2022
4 participants