-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Remove ColumnType.RawKind usages Round 1. #2143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
rawType = type.RawType; | ||
|
||
if (rawType == typeof(bool)) | ||
dataType = TensorProto.Types.DataType.Float; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Float [](start = 54, length = 5)
Unrelated to your changes, I wander if this is correct. @wschin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering the same thing when I made the change...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably is caused by the limited types accepted by old ONNX operators implemented in RS4. As ONNXRuntime supports much more ops than dark-age implementations, we should revise our type mappings and create more tests. This is a reasonable setting but it can be improved in another PR.
In reply to: 247951442 [](ancestors = 247951442)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we have an existing bug to make this better - #1198
|
||
default: | ||
throw Contracts.ExceptParam(nameof(kind), $"Unrecognized type '{kind}'"); | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return null; [](start = 20, length = 12)
curious, why change to return null, rather than throw? If an invalid type gets passed, would it be better to throw? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to preserve the first line's behavior:
case default(DataKind): return null;
When DataKind is default
, it usually means it is a "custom" or "structured" type - like VectorType
or ImageType
. So I was preserving that behavior here - when you see an unrecognized ColumnType
- return null
.
I assume the existing throw
condition was in place for handling invalid enum
values - like 29
and -451
.
In reply to: 247782795 [](ancestors = 247782795)
src/Microsoft.ML.Onnx/OnnxUtils.cs
Outdated
string msg = "Unsupported type: DataKind " + rawKind.ToString(); | ||
Contracts.Check(false, msg); | ||
break; | ||
string msg = "Unsupported type: " + rawType.ToString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rawType.ToString() [](start = 52, length = 18)
It would be better to use the actual type ToString
here, rather than the .NET type. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rawKind = keyType.RawKind; | ||
rawType = vectorType.ItemType.RawType; | ||
else | ||
rawType = type.RawType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RawType [](start = 31, length = 7)
I am not sure operating over RawType
is an appropriate alternative to RawKind
, in this case. I might prefer that we operate over the exact ColumnType
implementations, since that is more descriptive of what we're trying to do I think.
To give a concrete example, I see the code below is conflating uint
as being identical, even thought the same .NET type is used for both raw uint
s and actual key types... same .NET type, very different information being encoded. This might be a sign that the entire function is screwed up (I see @sfilipi found what appears to be a bug below with booleans, so that's well within the realm of possibility), but I'm a little concerned that this is the first thing I see. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure operating over RawType is an appropriate alternative to RawKind, in this case
I don't think I agree. In this case we are mapping from ML.NET's ColumnType
to Onnx's TensorProto.Types.DataType
. It doesn't really matter if we are using a KeyType
or NumberType.U4
. In either case, we need to get a TensorProto.Types.DataType.UInt32
. Actually - what would we return for KeyType
? We wouldn't want to hard-code it to just TensorProto.Types.DataType.UInt32
, would we? What if it was a KeyType that was using UInt16
?
In reply to: 248033212 [](ancestors = 248033212)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case BoolType boolType: | ||
return columnType.RawType; | ||
case TextType textType: | ||
return typeof(string); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI this style of change you have here is more what I'd be more comfortable with in the the OnnxUtils
case. (Now, in that other case, it would also probably benefit from some when
statements to further refine.) #Resolved
@@ -478,7 +478,7 @@ public override SchemaShape GetOutputSchema(SchemaShape inputSchema) | |||
{ | |||
if (!inputSchema.TryFindColumn(colInfo.Input, out var col)) | |||
throw Host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.Input); | |||
if ((col.ItemType.ItemType.RawKind == default) || !(col.ItemType.IsVector || col.ItemType is PrimitiveType)) | |||
if (!(col.ItemType.IsVector || col.ItemType is PrimitiveType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ItemType [](start = 51, length = 8)
I think this is probably going to change anyway with your other change on vectors, but just one thing to keep in mind: While ItemType
on ColumnType
itself was a ColumnType
(which is correct), ItemType
on the vector type itself is and always has been a PrimitiveType
anyway. So this test here seems at least odd, even in its refined form. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this is a bug (or at least confusion) in our API in my opinion. col
is of type SchemaShape.Column
. It has a member on it - public readonly ColumnType ItemType;
. So col.ItemType
isn't actually the VectorType.ItemType
. It could be any ColumnType
(for example, it could be an ImageType
).
We then split out the "vector"-ness to a different public readonly VectorKind Kind;
property.
In reply to: 248034844 [](ancestors = 248034844)
rawKind = vectorType.ItemType.RawKind; | ||
else if (type is KeyType keyType) | ||
rawKind = keyType.RawKind; | ||
rawType = vectorType.ItemType.RawType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rawType = vectorType.ItemType.RawType; [](start = 20, length = 38)
Placeholder note, once your other PR goes in you could probably benefit from your new GetItemType
extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, that will simplify this code. #Resolved
@@ -843,7 +843,7 @@ internal static bool IsColumnTypeValid(ColumnType type) | |||
if (!(type.ItemType is KeyType itemKeyType)) | |||
return false; | |||
// Can only accept key types that can be converted to U4. | |||
if (itemKeyType.Count == 0 && type.ItemType.RawKind > DataKind.U4) | |||
if (itemKeyType.Count == 0 && !NgramUtils.IsValidNgramRawType(itemKeyType.RawType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IsValidNgramRawType(itemKeyType.RawType [](start = 54, length = 39)
If I read the comment, it says, Can only accept key types that can be converted to U4,
in every place that this IsValidNgramRawType
is being called. That basically means U1, U2, and U4 are acceptable, and only U8 is not. So I'd have expected, based on this comment, that this would be a test against whether the raw type is a ulong
or not.
However, if I then go and read this IsValidNgramRawType
, it is doing something else! It is checking against all sorts of types that have nothing to do with key-types at all, including sbyte
, short
, int
... which makes no sense to me, given the comment, the fact that these are definitely meant to operate over keytypes, etc.
I would have expected just making sure it isn't a ulong
would have been sufficient, but... #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just converting the existing logic: type.ItemType.RawKind > DataKind.U4
.
I can change that method to assume that KeyType
s can only be byte
, ushort
, uint
, or ulong
, and only make it check for ulong
.
In reply to: 248037058 [](ancestors = 248037058)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks @eerhardt. I'm not sure why the original code was written this was -- it's certainly a very long winded way of writing "is U8." :) There is a competition, or used to be a competition, to see who could write the most obfuscated code possible, as I recall. (Not part of this codebase, to be clear, just in a university somewhere if I recall correctly, or a newsgroup or something?) Maybe the author was a participant.
In reply to: 248110407 [](ancestors = 248110407,248037058)
@@ -47,7 +47,7 @@ public void SequencePredictorSchemaTest() | |||
Assert.True(scoreColumn.Type is KeyType); | |||
Assert.Equal((scoreColumnType as KeyType).Min, (scoreColumn.Type as KeyType).Min); | |||
Assert.Equal((scoreColumnType as KeyType).Count, (scoreColumn.Type as KeyType).Count); | |||
Assert.Equal((scoreColumnType as KeyType).RawKind, (scoreColumn.Type as KeyType).RawKind); | |||
Assert.Equal((scoreColumnType as KeyType).RawType, (scoreColumn.Type as KeyType).RawType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scoreColumn.Type as KeyType [](start = 64, length = 27)
This was not your code, and given that it's in a test code we might consider it lower priority, but this was so offensive that I have to comment on it. And you're touching it. :D :D
This code has a lot of nonsense like this:
Assert.Equal((scoreColumnType as KeyType).Min, (scoreColumn.Type as KeyType).Min);
That makes no sense. The only conceivable reason to use an as
style cast is if you're anticipating that it might not be this case. But then the code turns around and directly accesses a property without a ?.
, so that if it was in fact not that type we'd throw with a null reference exception. Which is really silly. I don't know if you want to fix all this code so that it is less offensive, but maybe at least the things you're touching directly could be written to be a bit more sane.
Not blocking, if you don't feel like doing it, just, the code in its current form makes my skin crawl. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else if (rawType == typeof(int)) | ||
dataType = TensorProto.Types.DataType.Int32; | ||
else if (rawType == typeof(uint)) | ||
dataType = TensorProto.Types.DataType.Int64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wschin - this one looks incorrect as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears both BL and U4 mapped these values when this repo was created -
machinelearning/src/Microsoft.ML.Data/Model/Onnx/OnnxUtils.cs
Lines 296 to 328 in f0e639a
public static ModelArgs GetModelArgs(ColumnType type, string colName, | |
List<long> dims = null, List<bool> dimsParams = null) | |
{ | |
Contracts.CheckValue(type, nameof(type)); | |
Contracts.CheckNonEmpty(colName, nameof(colName)); | |
TensorProto.Types.DataType dataType = TensorProto.Types.DataType.Undefined; | |
DataKind rawKind; | |
if (type.IsVector) | |
rawKind = type.AsVector.ItemType.RawKind; | |
else if (type.IsKey) | |
rawKind = type.AsKey.RawKind; | |
else | |
rawKind = type.RawKind; | |
switch (rawKind) | |
{ | |
case DataKind.BL: | |
dataType = TensorProto.Types.DataType.Float; | |
break; | |
case DataKind.TX: | |
dataType = TensorProto.Types.DataType.String; | |
break; | |
case DataKind.U4: | |
dataType = TensorProto.Types.DataType.Int64; | |
break; | |
case DataKind.R4: | |
dataType = TensorProto.Types.DataType.Float; | |
break; | |
default: | |
Contracts.Assert(false, "Unknown type."); | |
break; | |
} |
In reply to: 248108591 [](ancestors = 248108591)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we always translate uint to ONNX Int64, this should be fine.
In reply to: 248112875 [](ancestors = 248112875)
Remove all usages of RawKind that are outside of ML.Core and ML.Data assemblies. The next round will completely remove ColumnType.RawKind. Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
91d59a6
to
b9dabfe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @eerhardt !! Thanks for getting all this done. Hopefully we (by "we" which I mean other people that actually did this ONNX stuff) can figure out the ONNX thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove all usages of RawKind that are outside of ML.Core and ML.Data assemblies. The next round will completely remove ColumnType.RawKind.
Part of the work necessary for #1860 and contributes to #1533.