-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Replace DvInt* with .NET standard data types. WIP #683
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
RegisterSimpleCodec(new UnsafeTypeCodec<byte>(this)); | ||
RegisterSimpleCodec(new UnsafeTypeCodec<DvInt2>(this)); | ||
RegisterSimpleCodec(new UnsafeTypeCodec<Int16>(this)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Int16 [](start = 52, length = 5)
short
, int
, long
, not Int16
, Int32
, Int64
. #Closed
@@ -417,9 +417,9 @@ protected override void GetMetadataCore<TValue>(string kind, int iinfo, ref TVal | |||
} | |||
} | |||
|
|||
private void GetCategoricalSlotRanges(int iiinfo, ref VBuffer<DvInt4> dst) | |||
private void GetCategoricalSlotRanges(int iiinfo, ref VBuffer<Int32> dst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Int32 [](start = 74, length = 5)
Please just use the keywords for types (so, in this case int
), here and everywhere. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this change across the code base using find-replace. #Resolved
|
||
public void Conv(ref long? src, ref DvInt8 dst) => dst = src ?? DvInt8.NA; | ||
public void Conv(ref long? src, ref Int64? dst) => dst = src; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see there are no conversions from float
to float
, or double
to double
. So: are these conversions actually necessary? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated this code to not assume there can be nullables but I do see a conversion between byte and byte. @mandyshieh can you please take a look? #Resolved
You will need to be careful here. We want the "default" of the new nullable integer types to be 0, since that's both a more sensible choice and it retains sparsity preservation for conversion from The implications for things like The sparsity question will be quite far reaching, and will be by far the most challenging part of this change. #Resolved Refers to: src/Microsoft.ML.Data/Data/Conversion.cs:703 in a0ceabb. [](commit_id = a0ceabb, deletion_comment = False) |
RegisterSimpleCodec(new UnsafeTypeCodec<byte>(this)); | ||
RegisterSimpleCodec(new UnsafeTypeCodec<DvInt2>(this)); | ||
RegisterSimpleCodec(new UnsafeTypeCodec<Int16>(this)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RegisterSimpleCodec(new UnsafeTypeCodec(this)); [](start = 12, length = 54)
We will want the nullable types to be serializable. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed since there will no nullable types 👍 #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good, thanks @codemzs. Incidentally were you going to update the description of #673 pursuant to our discussions yesterday? It is still the old description.
Edit: Actually I don't think you've linked to that issue from this PR at all. You should, maybe "fixes #673" in your description.
In reply to: 210744044 [](ancestors = 210744044)
@@ -525,7 +525,7 @@ private void UpdatePeg(Peg peg) | |||
|
|||
namespace Microsoft.ML.Runtime.Internal.Utilities | |||
{ | |||
// Reasonable choices are Double and System.Int64. | |||
// Reasonable choices are Double and System.long. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
long [](start = 48, length = 4)
fix this #Resolved
{ | ||
Ch.Assert(colType.ItemType == NumberType.I4); | ||
return CreateConvertingArrayGetterDelegate<int?, DvInt4>(index, x => x ?? DvInt4.NA); | ||
return CreateConvertingArrayGetterDelegate<int, int>(index, x => x); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a "converting" delegate here to return the same value? #Resolved
@@ -185,25 +185,25 @@ public static bool TryGetDataKind(this Type type, out DataKind kind) | |||
Contracts.CheckValueOrNull(type); | |||
|
|||
// REVIEW: Make this more efficient. Should we have a global dictionary? | |||
if (type == typeof(DvInt1) || type == typeof(sbyte) || type == typeof(sbyte?)) | |||
if (type == typeof(sbyte)) | |||
kind = DataKind.I1; | |||
else if (type == typeof(byte) || type == typeof(byte?)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
byte?
should be removed here as well, to follow the rest of the changes. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is "byte?" mapped to a U1? I thought U1 just mapped to a byte and did not support nullables ....this needs to go away anyways
In reply to: 210963225 [](ancestors = 210963225)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someone made a mistake somewhere, didn't realize this was supposed to be inverse to above function. Who knows who, but we ought to correct now.
In reply to: 210965477 [](ancestors = 210965477,210963225)
using RawI2 = Int16; | ||
using RawI4 = Int32; | ||
using RawI8 = Int64; | ||
using I1 = SByte; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to just start using the C# keywords inline in the code? That way readers didn't have to map I2
=> short
in their heads? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is generally the case except in this one file, if we consider its nature I feel like this would actually impede readability, at least for me.
So:
AddStd<I2, I1>(Convert);
AddStd<I2, I2>(Convert);
AddStd<I2, I4>(Convert);
AddStd<I2, I8>(Convert);
AddStd<I2, R4>(Convert);
AddStd<I2, R8>(Convert);
AddAux<I2, SB>(Convert);
becomes
AddStd<short, byte>(Convert);
AddStd<short, short>(Convert);
AddStd<short, int>(Convert);
AddStd<short, long>(Convert);
AddStd<short, float>(Convert);
AddStd<short, double>(Convert);
AddAux<short, SB>(Convert);
Maybe. That's hardly disastrous I suppose, and might be easier for some people. Perhaps we ought to revisit though once we've done all the type changes we want to do...
In reply to: 210964331 [](ancestors = 210964331)
private BoolType() | ||
: base(typeof(DvBool), DataKind.BL) | ||
private readonly string _name; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. What's this for? Is it still relevant? #Closed
Change looks OK for now, let's hold off till after 0.5 is cut then we can perhaps do all the changes to the type system in one go. #Resolved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @codemzs let's hold off till 0.5 though.
@@ -731,6 +731,12 @@ public void GetMetadata<TValue>(string kind, int col, ref TValue value) | |||
/// </summary> | |||
private const ulong ReaderVersion = MissingTextVersion; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReaderVersion = MissingTextVersion [](start = 28, length = 34)
This was not updated, it should have been. We can read StandardDataTypesVersion
so it should be that. #Resolved
@@ -1090,10 +1096,10 @@ private unsafe Header InitHeader() | |||
throw _host.Except("Cannot read version {0} data, earliest that can be handled is {1}", | |||
Header.VersionToString(header.CompatibleVersion), Header.VersionToString(MetadataVersion)); | |||
} | |||
if (header.CompatibleVersion > ReaderVersion) | |||
if (header.CompatibleVersion > StandardDataTypesVersion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CompatibleVersion > StandardDataTypesVersion [](start = 23, length = 44)
Revert these two lines please. #Resolved
/// The first version that removes DvTypes and uses .NET standard | ||
/// data types. | ||
/// </summary> | ||
private const ulong StandardDataTypesVersion = 0x0001000100010006; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StandardDataTypesVersion [](start = 28, length = 24)
Also will need to update the constant in Header.cs
. #Resolved
Sounds good, thanks for reviewing. I was wondering if we should bump the version number for every type conversion PR check-in as this will allow us to check-in PRs without waiting for other PRs to close, thoughts? In reply to: 417051605 [](ancestors = 417051605) |
|
||
Assert.True(error); | ||
|
||
//5. Missing value in text to int. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//5. Missing value in text to int. [](start = 12, length = 34)
Empty string #Resolved
mapper(ref src, ref dst); | ||
Assert.Equal(default, dst); | ||
|
||
//6. Empty string in text to sbyte. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Empty string in text to sbyte. [](start = 14, length = 33)
Missing value #Resolved
…to types # Conflicts: # test/Microsoft.ML.TestFramework/DataPipe/TestDataPipeBase.cs
I think we can avoid in those cases since I feel like we can easily just have separate codecs probably. In this case we could not avoid I think because we were using the same codec for both purposes. In reply to: 417434524 [](ancestors = 417434524,417051605) |
datatypes.idv is still shown as being modified in latest commit #Resolved |
…to types # Conflicts: # test/BaselineOutput/SingleDebug/SavePipe/TestParquetPrimitiveDataTypes-Data.txt # test/BaselineOutput/SingleDebug/SavePipe/TestParquetPrimitiveDataTypes-Schema.txt # test/BaselineOutput/SingleRelease/SavePipe/TestParquetPrimitiveDataTypes-Data.txt # test/BaselineOutput/SingleRelease/SavePipe/TestParquetPrimitiveDataTypes-Schema.txt # test/data/Parquet/alltypes.parquet
Hi @codemzs you may want to merge master again, I think our work on APIs and additional tests for pigsty may interfere. #Resolved |
Done. In reply to: 418545763 [](ancestors = 418545763) |
This change was committed via #863 |
This change also removes missing value handling for sbyte, short, int and long because default of these values is a null and that does not fit well with sparse vector architecture where default for missing values is a zero. fixes #673