-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Remove ColumnType.RawKind usages Round 2. #2176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Removes the "easy" usages of ColumnType.RawKind. Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
case DataKind.U8: | ||
return MakeScalarHashGetter<ulong, HashU8>(input, srcCol, seed, mask); | ||
case DataKind.U16: | ||
if (srcType.RawType == typeof(byte)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (srcType.RawType == typeof(byte)) [](start = 12, length = 36)
It used to not matter as much since it was a case statement, but now that it's a chained if statement we probably should give some thought to what is the most common case. The most common case for hashing is text, so we should probably check that first, rather than letting it fall through as the absolute last case when all else fails.
Also probably this would probably be a good time to to checks against the ColumnType
instances themselves -- the original author did things in fashion because they considered it to be more concise, but I'm not sure that reasoning holds any longer. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most common case for hashing is text, so we should probably check that first,
Good call. Fixing.
Also probably this would probably be a good time to to checks against the ColumnType instances themselves
In my mind, there are 2 reasons why I like comparing the types here, and not against the ColumnType instances.
- It is nice and easy to match up the
if (.... == typeof(byte))
with the next lineMakeScalarHashGetter<byte,....
. You can match thebyte
andbyte
and easily know you are using the same type. - It is reflective of the code above for
KeyType
, where you can't actually check forNumberType.U4
, since it is aKeyType
.
I don't feel strongly about these reasons, but they do seem to be nice. If you'd like it the other way, I can change it. Just let me know if you feel strongly for it to be changed.
In reply to: 248925374 [](ancestors = 248925374)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, I think your reasoning works.
In reply to: 249119279 [](ancestors = 249119279,248925374)
@@ -767,7 +767,7 @@ public override SchemaShape GetOutputSchema(SchemaShape inputSchema) | |||
{ | |||
if (!inputSchema.TryFindColumn(colInfo.Input, out var col)) | |||
throw Host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.Input); | |||
if ((col.ItemType.GetItemType().RawKind == default) || !(col.ItemType is VectorType || col.ItemType is PrimitiveType)) | |||
if (!col.ItemType.GetItemType().RawType.IsValidDataKindType() || !(col.ItemType is VectorType || col.ItemType is PrimitiveType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
col.ItemType.GetItemType().RawType.IsValidDataKindType() || !(col.ItemType is VectorType || col.ItemType is PrimitiveType)) [](start = 21, length = 123)
Not your code, but this test seems completely wrong. As in, almost nothing about this test looks right, or even sensible. col.ItemType
could never be a VectorType
in any instance, according to its definition. In fact under the current type scheme, discounting any types a user may create themselves and the image type, I think this will succeed on everything??
Actually wait, the error message below is wrong too... it will just claim the column is missing (you see it is identical to the test below)... Many of the metadata tests seem off... And of course it claims that it always produces a vector of known size, but that's definitely not true. Wow. I was going to suggest you fix this when I thought it was just confined to one test, but it seems like literally half the lines in this are flat out wrong. OK. :) I'll file a separate issue...
Filed as #2185. Don't bother fixing of course. :D :D #WontFix
@@ -352,7 +352,7 @@ internal static void SaveType(ModelSaveContext ctx, ColumnType type) | |||
ctx.Writer.Write(vectorType?.Size ?? 0); | |||
|
|||
ColumnType itemType = vectorType?.ItemType ?? type; | |||
var itemKind = itemType.RawKind; | |||
itemType.RawType.TryGetDataKind(out DataKind itemKind); | |||
Contracts.Assert(itemKind == DataKind.R4 || itemKind == DataKind.R8); | |||
ctx.Writer.Write((byte)itemKind); | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What a very, very strange format... if there are only two choices, why not just write a bool? Nonetheless, the die is cast, we must live with the format, such as it is...
Given that, I think if we want to get rid of datakind, I think the best choice would be to write the bytes as constants... so I guess, ctx.Writer.Write((byte)(itemType == NumberType.Double ? 10 : 9))
or something like that. I guess we'd need a similar test during the deserialization. Not sure if you're up for this in this PR... maybe the scope here is more limited. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the scope of my current work is to remove ColumnType.RawKind
, and not necessarily to remove DataKind
altogether. I have a hard time believing we will be able to get rid of it completely. We may be able to internalize it, but there are a few other places that have it exposed publicly - TextLoad, TypeConverting, PartitionedPathParser.
In reply to: 248935774 [](ancestors = 248935774)
@@ -60,7 +60,7 @@ public SchemaShape GetOutputSchema(SchemaShape inputSchema) | |||
if (!inputSchema.TryFindColumn(colInfo.Input, out var col)) | |||
throw _host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.Input); | |||
|
|||
if ((col.ItemType.GetItemType().RawKind == default) || !(col.ItemType is VectorType || col.ItemType is PrimitiveType)) | |||
if (!col.ItemType.GetItemType().RawType.IsValidDataKindType() || !(col.ItemType is VectorType || col.ItemType is PrimitiveType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!col.ItemType.GetItemType().RawType.IsValidDataKindType() || !(col.ItemType is VectorType || col.ItemType is PrimitiveType)) [](start = 16, length = 128)
Oh boy. More similar problems... I'll make a note of it #2185. #WontFix
{ | ||
VectorType vectorType = type as VectorType; | ||
ColumnType itemType = vectorType?.ItemType ?? type; | ||
|
||
if (itemType.RawKind != default && (vectorType != null || type is PrimitiveType)) | ||
if (itemType.RawType.IsValidDataKindType() && (vectorType != null || type is PrimitiveType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&& (vectorType != null || type is PrimitiveType) [](start = 54, length = 49)
I could be wrong, but it seems like many of these IsValidDataKindType
are actually trying to capture the logic of what you've already written in your IsStandardScalar
extension method on ColumnType
. That is, "does it have one of the known datakinds," which is functionally the same as "is it one of the known builtin types." I guess you were trying to capture the original method, but I think the two are redundant. And, I'd favor your extension method you introduced two weeks ago to this new method, since this new method seems kind of gnarly.
(Also, incidentally I do not believe, if this first test succeeds, that it is not possible for the statement after the &&
to be false.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
many of these IsValidDataKindType are actually trying to capture the logic of what you've already written in your IsStandardScalar extension method on ColumnType
That's a good call. I forgot about that method. I've removed IsValidDataKindType
and am using IsStandardScalar
instead.
I do not believe, if this first test succeeds, that it is not possible for the statement after the && to be false.
Agreed. Removed the 2nd expression.
In reply to: 248936996 [](ancestors = 248936996)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this broke a test. There is a test that passes in a Vec<Key<U4>>
into this method. Thus, RawType.IsValidDataKindType
and IsStandardScalar
are not equivalent because the first allows for KeyTypes.
In reply to: 249152590 [](ancestors = 249152590,248936996)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh interesting. Anyway I like your solution to the problem I see below.
In reply to: 249206559 [](ancestors = 249206559,249152590,248936996)
return MakeScalarHashGetter<ushort, HashKey2>(input, srcCol, seed, mask); | ||
else if (srcType.RawType == typeof(uint)) | ||
return MakeScalarHashGetter<uint, HashKey4>(input, srcCol, seed, mask); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By a similar token, I might want to check for uint
first, since that is going to be the key type 99% of the time. :) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
return base.LoadName; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make me feel a little bit better about this? Are we sure this won't break binary format backcompat? It just seems strange that we can just pffft get rid of loadnames like this.
Now granted, this code is absolutely archaic so it's entirely possible, even likely, that some parts have somehow become unnecessary or redundant, but, I'd feel better with an explanation. :) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll attempt to make you feel better 😄.
UnsafeTypeCodec<T>
derives from SimpleCodec<T>
. If you look at SimpleCodec<T>.LoadName
:
machinelearning/src/Microsoft.ML.Data/DataLoadSave/Binary/Codecs.cs
Lines 117 to 118 in b1cc8eb
// For these basic types, the class name will do perfectly well. | |
public virtual string LoadName => typeof(T).Name; |
Now, Type.RawKind
(which this deleted code was using) comes from:
machinelearning/src/Microsoft.ML.Data/DataLoadSave/Binary/Codecs.cs
Lines 178 to 186 in b1cc8eb
// Gatekeeper to ensure T is a type that is supported by UnsafeTypeCodec. | |
// Throws an exception if T is neither a TimeSpan nor a NumberType. | |
private static ColumnType UnsafeColumnType(Type type) | |
{ | |
return type == typeof(TimeSpan) ? (ColumnType)TimeSpanType.Instance : NumberType.FromType(type); | |
} | |
public UnsafeTypeCodec(CodecFactory factory) | |
: base(factory, UnsafeColumnType(typeof(T))) |
And there is even an assert that ensures Type.RawType == typeof(T)
:
machinelearning/src/Microsoft.ML.Data/DataLoadSave/Binary/Codecs.cs
Lines 120 to 127 in b1cc8eb
public SimpleCodec(CodecFactory factory, ColumnType type) | |
{ | |
Contracts.AssertValue(factory); | |
Contracts.AssertValue(type); | |
Contracts.Assert(type.RawType == typeof(T)); | |
Factory = factory; | |
Type = type; | |
} |
So now, looking at the deleted code, it is just returning typeof(T).Name
, which is redundant with what base.LoadName
is doing. I started changing this code to check for these 5 types and return Type.RawType.Name
, but then discovered this is what the base class is already doing. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|| type == typeof(ulong) | ||
|| type == typeof(float) | ||
|| type == typeof(double) | ||
|| type == typeof(ReadOnlyMemory<char>) || type == typeof(string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|| type == typeof(string) [](start = 55, length = 26)
So, can we check this? The raw type should never be a string. I see that above we also somehow map string
to a data kind in TryGetDataKind
, but that would be a bug. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestFriend] | ||
internal static PrimitiveType FromType(Type type) | ||
{ | ||
if (type == typeof(ReadOnlyMemory<char>) || type == typeof(string)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|| type == typeof(string)) [](start = 52, length = 27)
This also seems wrong. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed || type == typeof(string)
from this line. All the callers are guaranteed to do the string
=> ROM<char>
type conversion before making this call.
In reply to: 248938623 [](ancestors = 248938623)
@@ -263,7 +263,7 @@ internal static IEnumerable<int> GetColumnSet(this Schema schema, string metadat | |||
for (int col = 0; col < schema.Count; col++) | |||
{ | |||
var columnType = schema[col].Metadata.Schema.GetColumnOrNull(metadataKind)?.Type; | |||
if (columnType != null && columnType is KeyType && columnType.RawKind == DataKind.U4) | |||
if (columnType != null && columnType is KeyType && columnType.RawType == typeof(uint)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
columnType != null && [](start = 20, length = 22)
Not a big deal, but I'd expect that if columnType
were null, it would fail the columnType is KeyType
anyway, so why bother testing that it is null? Similar with the condition about 24 lines up. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -183,7 +183,7 @@ private static bool CheckScoreColumnKind(Schema schema, int col) | |||
|
|||
// Get the score column set id from colScore. | |||
var type = schema[colScore].Metadata.Schema.GetColumnOrNull(MetadataUtils.Kinds.ScoreColumnSetId)?.Type; | |||
if (type == null || !(type is KeyType) || type.RawKind != DataKind.U4) | |||
if (type == null || !(type is KeyType) || type.RawType != typeof(uint)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type == null || [](start = 16, length = 15)
Similar note, I think we can do without the null check given that the is
test will catch the null case anyway, I think? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Get the data kind, and the item's column type. | ||
if (rawItemType == typeof(string)) | ||
kind = DataKind.Text; | ||
else if (!rawItemType.TryGetDataKind(out kind)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was actually important -- we do not want string
associated with the type TextType
, because TextType.Instance.RawType
is not string
, it is ReadOnlyMemory<char>
. We need to handle that type mapping "specially" just as we specially map arrays to VBuffer
in other code here, and so on, and so on. It is not the job of ColumnType
to be aware of the details of how in this convenience API we perform mappings from user types into IDataView
land. So if that is why I see those tests against string
that I commented on earlier as being odd, let's not do that. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. I've added this back, by doing the string
=> ROM<char>
type conversion here.
In reply to: 248940771 [](ancestors = 248940771)
Looking for reviews and sign offs on this PR, if anyone is available. #Resolved |
// REVIEW: This should really be float. But, for the | ||
return Int; | ||
} | ||
else if (rawType == typeof(uint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint [](start = 43, length = 4)
not your change, but i wonder if uint belongs here, and not with the clause above, where typeof(int) is included.
AFAIK int and uint have the same length. Are we going by numeric ranges? @weschin #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume if you are using uint
, you have the possibility of having values that are greater than the signed int.MaxValue
. Thus, if we did use Int
, it would overflow to a negative value, which I'm sure wouldn't produce the correct results.
In reply to: 249968334 [](ancestors = 249968334)
} | ||
|
||
return null; | ||
} | ||
|
||
public static JToken DefaultTokenOrNull(PrimitiveType itemType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DefaultTokenOrNull [](start = 33, length = 18)
can most of this be substituted with default(itemType) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. We could do that if this method was a generic method - DefaultTokenOrNull<T>
. Then you could just return default(T)
. But since the System.Type
we are inspecting here isn't known at compile time, you can't use the C# default
keyword.
In reply to: 249968942 [](ancestors = 249968942)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removes the "easy" usages of ColumnType.RawKind.
Part of the work necessary for #1860 and contributes to #1533.
There are still about 38 usages of ColumnType.RawKind left, but they are a little more involved, and needs refactoring. I will fully remove them in the next PR.