Skip to content

Remove ColumnType.RawKind usages Round 2. #2176

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jan 23, 2019

Conversation

eerhardt
Copy link
Member

Removes the "easy" usages of ColumnType.RawKind.

Part of the work necessary for #1860 and contributes to #1533.

There are still about 38 usages of ColumnType.RawKind left, but they are a little more involved, and needs refactoring. I will fully remove them in the next PR.

Removes the "easy" usages of ColumnType.RawKind.

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
case DataKind.U8:
return MakeScalarHashGetter<ulong, HashU8>(input, srcCol, seed, mask);
case DataKind.U16:
if (srcType.RawType == typeof(byte))
Copy link
Contributor

@TomFinley TomFinley Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (srcType.RawType == typeof(byte)) [](start = 12, length = 36)

It used to not matter as much since it was a case statement, but now that it's a chained if statement we probably should give some thought to what is the most common case. The most common case for hashing is text, so we should probably check that first, rather than letting it fall through as the absolute last case when all else fails.

Also probably this would probably be a good time to to checks against the ColumnType instances themselves -- the original author did things in fashion because they considered it to be more concise, but I'm not sure that reasoning holds any longer. #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most common case for hashing is text, so we should probably check that first,

Good call. Fixing.

Also probably this would probably be a good time to to checks against the ColumnType instances themselves

In my mind, there are 2 reasons why I like comparing the types here, and not against the ColumnType instances.

  1. It is nice and easy to match up the if (.... == typeof(byte)) with the next line MakeScalarHashGetter<byte,..... You can match the byte and byte and easily know you are using the same type.
  2. It is reflective of the code above for KeyType, where you can't actually check for NumberType.U4, since it is a KeyType.

I don't feel strongly about these reasons, but they do seem to be nice. If you'd like it the other way, I can change it. Just let me know if you feel strongly for it to be changed.


In reply to: 248925374 [](ancestors = 248925374)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I think your reasoning works.


In reply to: 249119279 [](ancestors = 249119279,248925374)

@@ -767,7 +767,7 @@ public override SchemaShape GetOutputSchema(SchemaShape inputSchema)
{
if (!inputSchema.TryFindColumn(colInfo.Input, out var col))
throw Host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.Input);
if ((col.ItemType.GetItemType().RawKind == default) || !(col.ItemType is VectorType || col.ItemType is PrimitiveType))
if (!col.ItemType.GetItemType().RawType.IsValidDataKindType() || !(col.ItemType is VectorType || col.ItemType is PrimitiveType))
Copy link
Contributor

@TomFinley TomFinley Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

col.ItemType.GetItemType().RawType.IsValidDataKindType() || !(col.ItemType is VectorType || col.ItemType is PrimitiveType)) [](start = 21, length = 123)

Not your code, but this test seems completely wrong. As in, almost nothing about this test looks right, or even sensible. col.ItemType could never be a VectorType in any instance, according to its definition. In fact under the current type scheme, discounting any types a user may create themselves and the image type, I think this will succeed on everything??

Actually wait, the error message below is wrong too... it will just claim the column is missing (you see it is identical to the test below)... Many of the metadata tests seem off... And of course it claims that it always produces a vector of known size, but that's definitely not true. Wow. I was going to suggest you fix this when I thought it was just confined to one test, but it seems like literally half the lines in this are flat out wrong. OK. :) I'll file a separate issue...

Filed as #2185. Don't bother fixing of course. :D :D #WontFix

@@ -352,7 +352,7 @@ internal static void SaveType(ModelSaveContext ctx, ColumnType type)
ctx.Writer.Write(vectorType?.Size ?? 0);

ColumnType itemType = vectorType?.ItemType ?? type;
var itemKind = itemType.RawKind;
itemType.RawType.TryGetDataKind(out DataKind itemKind);
Contracts.Assert(itemKind == DataKind.R4 || itemKind == DataKind.R8);
ctx.Writer.Write((byte)itemKind);
}
Copy link
Contributor

@TomFinley TomFinley Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a very, very strange format... if there are only two choices, why not just write a bool? Nonetheless, the die is cast, we must live with the format, such as it is...

Given that, I think if we want to get rid of datakind, I think the best choice would be to write the bytes as constants... so I guess, ctx.Writer.Write((byte)(itemType == NumberType.Double ? 10 : 9)) or something like that. I guess we'd need a similar test during the deserialization. Not sure if you're up for this in this PR... maybe the scope here is more limited. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the scope of my current work is to remove ColumnType.RawKind, and not necessarily to remove DataKind altogether. I have a hard time believing we will be able to get rid of it completely. We may be able to internalize it, but there are a few other places that have it exposed publicly - TextLoad, TypeConverting, PartitionedPathParser.


In reply to: 248935774 [](ancestors = 248935774)

@@ -60,7 +60,7 @@ public SchemaShape GetOutputSchema(SchemaShape inputSchema)
if (!inputSchema.TryFindColumn(colInfo.Input, out var col))
throw _host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.Input);

if ((col.ItemType.GetItemType().RawKind == default) || !(col.ItemType is VectorType || col.ItemType is PrimitiveType))
if (!col.ItemType.GetItemType().RawType.IsValidDataKindType() || !(col.ItemType is VectorType || col.ItemType is PrimitiveType))
Copy link
Contributor

@TomFinley TomFinley Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (!col.ItemType.GetItemType().RawType.IsValidDataKindType() || !(col.ItemType is VectorType || col.ItemType is PrimitiveType)) [](start = 16, length = 128)

Oh boy. More similar problems... I'll make a note of it #2185. #WontFix

{
VectorType vectorType = type as VectorType;
ColumnType itemType = vectorType?.ItemType ?? type;

if (itemType.RawKind != default && (vectorType != null || type is PrimitiveType))
if (itemType.RawType.IsValidDataKindType() && (vectorType != null || type is PrimitiveType))
Copy link
Contributor

@TomFinley TomFinley Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&& (vectorType != null || type is PrimitiveType) [](start = 54, length = 49)

I could be wrong, but it seems like many of these IsValidDataKindType are actually trying to capture the logic of what you've already written in your IsStandardScalar extension method on ColumnType. That is, "does it have one of the known datakinds," which is functionally the same as "is it one of the known builtin types." I guess you were trying to capture the original method, but I think the two are redundant. And, I'd favor your extension method you introduced two weeks ago to this new method, since this new method seems kind of gnarly.

(Also, incidentally I do not believe, if this first test succeeds, that it is not possible for the statement after the && to be false.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

many of these IsValidDataKindType are actually trying to capture the logic of what you've already written in your IsStandardScalar extension method on ColumnType

That's a good call. I forgot about that method. I've removed IsValidDataKindType and am using IsStandardScalar instead.

I do not believe, if this first test succeeds, that it is not possible for the statement after the && to be false.

Agreed. Removed the 2nd expression.


In reply to: 248936996 [](ancestors = 248936996)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this broke a test. There is a test that passes in a Vec<Key<U4>> into this method. Thus, RawType.IsValidDataKindType and IsStandardScalar are not equivalent because the first allows for KeyTypes.


In reply to: 249152590 [](ancestors = 249152590,248936996)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh interesting. Anyway I like your solution to the problem I see below.


In reply to: 249206559 [](ancestors = 249206559,249152590,248936996)

return MakeScalarHashGetter<ushort, HashKey2>(input, srcCol, seed, mask);
else if (srcType.RawType == typeof(uint))
return MakeScalarHashGetter<uint, HashKey4>(input, srcCol, seed, mask);

Copy link
Contributor

@TomFinley TomFinley Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By a similar token, I might want to check for uint first, since that is going to be the key type 99% of the time. :) #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


In reply to: 248937303 [](ancestors = 248937303)

}
return base.LoadName;
}
}
Copy link
Contributor

@TomFinley TomFinley Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make me feel a little bit better about this? Are we sure this won't break binary format backcompat? It just seems strange that we can just pffft get rid of loadnames like this.

Now granted, this code is absolutely archaic so it's entirely possible, even likely, that some parts have somehow become unnecessary or redundant, but, I'd feel better with an explanation. :) #Resolved

Copy link
Member Author

@eerhardt eerhardt Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll attempt to make you feel better 😄.

UnsafeTypeCodec<T> derives from SimpleCodec<T>. If you look at SimpleCodec<T>.LoadName:

// For these basic types, the class name will do perfectly well.
public virtual string LoadName => typeof(T).Name;

Now, Type.RawKind (which this deleted code was using) comes from:

// Gatekeeper to ensure T is a type that is supported by UnsafeTypeCodec.
// Throws an exception if T is neither a TimeSpan nor a NumberType.
private static ColumnType UnsafeColumnType(Type type)
{
return type == typeof(TimeSpan) ? (ColumnType)TimeSpanType.Instance : NumberType.FromType(type);
}
public UnsafeTypeCodec(CodecFactory factory)
: base(factory, UnsafeColumnType(typeof(T)))

And there is even an assert that ensures Type.RawType == typeof(T):

public SimpleCodec(CodecFactory factory, ColumnType type)
{
Contracts.AssertValue(factory);
Contracts.AssertValue(type);
Contracts.Assert(type.RawType == typeof(T));
Factory = factory;
Type = type;
}

So now, looking at the deleted code, it is just returning typeof(T).Name, which is redundant with what base.LoadName is doing. I started changing this code to check for these 5 types and return Type.RawType.Name, but then discovered this is what the base class is already doing. #Resolved

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @yaeldekel in case it breaks something in TLC


In reply to: 249104896 [](ancestors = 249104896)

|| type == typeof(ulong)
|| type == typeof(float)
|| type == typeof(double)
|| type == typeof(ReadOnlyMemory<char>) || type == typeof(string)
Copy link
Contributor

@TomFinley TomFinley Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

|| type == typeof(string) [](start = 55, length = 26)

So, can we check this? The raw type should never be a string. I see that above we also somehow map string to a data kind in TryGetDataKind, but that would be a bug. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've deleted this method.


In reply to: 248938496 [](ancestors = 248938496)

[BestFriend]
internal static PrimitiveType FromType(Type type)
{
if (type == typeof(ReadOnlyMemory<char>) || type == typeof(string))
Copy link
Contributor

@TomFinley TomFinley Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

|| type == typeof(string)) [](start = 52, length = 27)

This also seems wrong. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed || type == typeof(string) from this line. All the callers are guaranteed to do the string => ROM<char> type conversion before making this call.


In reply to: 248938623 [](ancestors = 248938623)

@@ -263,7 +263,7 @@ internal static IEnumerable<int> GetColumnSet(this Schema schema, string metadat
for (int col = 0; col < schema.Count; col++)
{
var columnType = schema[col].Metadata.Schema.GetColumnOrNull(metadataKind)?.Type;
if (columnType != null && columnType is KeyType && columnType.RawKind == DataKind.U4)
if (columnType != null && columnType is KeyType && columnType.RawType == typeof(uint))
Copy link
Contributor

@TomFinley TomFinley Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

columnType != null && [](start = 20, length = 22)

Not a big deal, but I'd expect that if columnType were null, it would fail the columnType is KeyType anyway, so why bother testing that it is null? Similar with the condition about 24 lines up. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


In reply to: 248938772 [](ancestors = 248938772)

@@ -183,7 +183,7 @@ private static bool CheckScoreColumnKind(Schema schema, int col)

// Get the score column set id from colScore.
var type = schema[colScore].Metadata.Schema.GetColumnOrNull(MetadataUtils.Kinds.ScoreColumnSetId)?.Type;
if (type == null || !(type is KeyType) || type.RawKind != DataKind.U4)
if (type == null || !(type is KeyType) || type.RawType != typeof(uint))
Copy link
Contributor

@TomFinley TomFinley Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type == null || [](start = 16, length = 15)

Similar note, I think we can do without the null check given that the is test will catch the null case anyway, I think? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


In reply to: 248939217 [](ancestors = 248939217)

// Get the data kind, and the item's column type.
if (rawItemType == typeof(string))
kind = DataKind.Text;
else if (!rawItemType.TryGetDataKind(out kind))
Copy link
Contributor

@TomFinley TomFinley Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually important -- we do not want string associated with the type TextType, because TextType.Instance.RawType is not string, it is ReadOnlyMemory<char>. We need to handle that type mapping "specially" just as we specially map arrays to VBuffer in other code here, and so on, and so on. It is not the job of ColumnType to be aware of the details of how in this convenience API we perform mappings from user types into IDataView land. So if that is why I see those tests against string that I commented on earlier as being odd, let's not do that. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. I've added this back, by doing the string => ROM<char> type conversion here.


In reply to: 248940771 [](ancestors = 248940771)

@eerhardt
Copy link
Member Author

eerhardt commented Jan 22, 2019

Looking for reviews and sign offs on this PR, if anyone is available. #Resolved

// REVIEW: This should really be float. But, for the
return Int;
}
else if (rawType == typeof(uint)
Copy link
Member

@sfilipi sfilipi Jan 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint [](start = 43, length = 4)

not your change, but i wonder if uint belongs here, and not with the clause above, where typeof(int) is included.
AFAIK int and uint have the same length. Are we going by numeric ranges? @weschin #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume if you are using uint, you have the possibility of having values that are greater than the signed int.MaxValue. Thus, if we did use Int, it would overflow to a negative value, which I'm sure wouldn't produce the correct results.


In reply to: 249968334 [](ancestors = 249968334)

}

return null;
}

public static JToken DefaultTokenOrNull(PrimitiveType itemType)
Copy link
Member

@sfilipi sfilipi Jan 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DefaultTokenOrNull [](start = 33, length = 18)

can most of this be substituted with default(itemType) #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. We could do that if this method was a generic method - DefaultTokenOrNull<T>. Then you could just return default(T). But since the System.Type we are inspecting here isn't known at compile time, you can't use the C# default keyword.


In reply to: 249968942 [](ancestors = 249968942)

Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@eerhardt eerhardt merged commit 925059d into dotnet:master Jan 23, 2019
@eerhardt eerhardt deleted the ColumnTypeRawKind2 branch January 23, 2019 15:15
@ghost ghost locked as resolved and limited conversation to collaborators Mar 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants