-
Notifications
You must be signed in to change notification settings - Fork 234
Reduce allocations when parsing files #715
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great @mattpwhite! I have comment on some of the complexity.
@@ -197,14 +197,40 @@ public string[] ReferencedFiles | |||
/// </summary> | |||
/// <param name="text">Input string to be split up into lines.</param> | |||
/// <returns>The lines in the string.</returns> | |||
public static IEnumerable<string> GetLines(string text) | |||
public static IList<string> GetLines(string text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be able to replace all of this with calling
text.Split(Environment.NewLine)
That will be platform agnostic - or rather leave it up to .NET Core to do the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. You'd either end up with too many empty strings, or not enough. If you're expecting the index into the list of lines to correspond to the line number in the file, it wouldn't be correct...
PS C:\> "foo`r`nbar".Split([environment]::newline).Length
3
PS C:\> "foo`r`n`r`nbar".Split([environment]::newline, [StringSplitOptions]::RemoveEmptyEntries).Length
2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not replicating that behaviour you're seeing there:
> "foo`r`nbar".Split("`r`n").Length
2
> "foo`nbar".Split([environment]::NewLine).Length
2
I'd imagine we want to keep empty lines for fidelity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it's a core vs framework difference. Core has an overload for string
, the framework does not. So on PS 5.1, it picks the (I think?) params char[]
version and coerces the string to that and we get the undesired behavior.
PS C:\> $psversiontable
Name Value
---- -----
PSVersion 5.1.15063.1206
PSEdition Desktop
PSCompatibleVersions {1.0, 2.0, 3.0, 4.0...}
BuildVersion 10.0.15063.1206
CLRVersion 4.0.30319.42000
WSManStackVersion 3.0
PSRemotingProtocolVersion 2.3
SerializationVersion 1.1.0.1
PS C:\> "foo`r`nbar".Split("`r`n").Length
3
PS C:\> 'a'.split
OverloadDefinitions
-------------------
string[] Split(Params char[] separator)
string[] Split(char[] separator, int count)
string[] Split(char[] separator, System.StringSplitOptions options)
string[] Split(char[] separator, int count, System.StringSplitOptions options)
string[] Split(string[] separator, System.StringSplitOptions options)
string[] Split(string[] separator, int count, System.StringSplitOptions options)
PS C:\> "foo`r`nbar".Split(([string[]]"`r`n"), [StringSplitOptions]::None).Length
2
Core has a string
version and the options has a default value (https://github.com/dotnet/coreclr/blob/0fbd855e38bc3ec269479b5f6bf561dcfd67cbb6/src/System.Private.CoreLib/shared/System/String.Manipulation.cs#L1244), so that gets picked.
PS C:\> $PSVersionTable
Name Value
---- -----
PSVersion 6.1.0-preview.4
PSEdition Core
GitCommitId 6.1.0-preview.4
OS Microsoft Windows 10.0.15063
Platform Win32NT
PSCompatibleVersions {1.0, 2.0, 3.0, 4.0...}
PSRemotingProtocolVersion 2.3
SerializationVersion 1.1.0.1
WSManStackVersion 3.0
PS C:\> "foo`r`nbar".Split("`r`n").Length
2
PS C:\> 'a'.split
OverloadDefinitions
-------------------
string[] Split(char separator, System.StringSplitOptions options)
string[] Split(char separator, int count, System.StringSplitOptions options)
string[] Split(Params char[] separator)
string[] Split(char[] separator, int count)
string[] Split(char[] separator, System.StringSplitOptions options)
string[] Split(char[] separator, int count, System.StringSplitOptions options)
string[] Split(string separator, System.StringSplitOptions options)
string[] Split(string separator, int count, System.StringSplitOptions options)
string[] Split(string[] separator, System.StringSplitOptions options)
string[] Split(string[] separator, int count, System.StringSplitOptions options)
PS C:\> "foo`r`nbar".Split("`r`n".ToCharArray()).Length
3
All of that said, I think it's still broken:
PS C:\> "foo`nbar".Split("`r`n").Length
1
There are lots of valid reasons to be dealing with Unix line-broken files on Windows (and to a lesser extent vice-versa), so I don't think relying on Environment.NewLine
works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good point. In that case I'm reasonably satisfied the complexity is unavoidable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about one of these solutions:
https://stackoverflow.com/questions/1547476/easiest-way-to-split-a-string-on-newlines-in-net
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StringReader
appears promising, Split
not so much, especially on the full framework. The complicated code still edges out StringReader on execution time, but not allocations, and it's not by a lot. Let me know if you'd prefer the StringReader method and I'll update the PR. With either implementation, giving the List a hint on capacity with a crude heuristic of lines averaging 50 chars (the actual average across a few hundred thousand lines of real PS I had handy was 43) seems like a marginal win with little downside.
Code: https://gist.github.com/mattpwhite/c2ec7bcded6645d529574969f65e7d4f
Results (using a ~1300 line, ~55000 char source file)
Framework:
// * Summary *
BenchmarkDotNet=v0.11.0, OS=Windows 10.0.15063.1206 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
[Host] : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.3130.0
RyuJitX64 : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3130.0
Job=RyuJitX64 Jit=RyuJit Platform=X64
Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
-------------------------- |----------:|----------:|----------:|---------:|---------:|---------:|----------:|
Original | 192.61 us | 0.6213 us | 0.5811 us | 133.3008 | 66.6504 | 66.6504 | 581.12 KB |
Complicated | 74.19 us | 0.3941 us | 0.3686 us | 33.5693 | 14.6484 | - | 167.45 KB |
ComplicatedGuessCapacity | 70.97 us | 0.3468 us | 0.3244 us | 32.8369 | 11.8408 | - | 161.47 KB |
Split | 349.20 us | 1.2177 us | 1.1391 us | 133.3008 | 133.3008 | 133.3008 | 582.84 KB |
StringReader | 78.63 us | 0.3658 us | 0.3422 us | 34.7900 | 12.3291 | - | 167.49 KB |
StringReaderGuessCapacity | 77.80 us | 0.2394 us | 0.2123 us | 33.3252 | 10.9863 | - | 161.51 KB |
Core:
BenchmarkDotNet=v0.11.0, OS=Windows 10.0.15063.1206 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=2.1.302
[Host] : .NET Core ? (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
RyuJitX64 : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
Job=RyuJitX64 Jit=RyuJit Platform=X64
Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
-------------------------- |----------:|----------:|----------:|--------:|--------:|----------:|
Original | 134.68 us | 0.7214 us | 0.6748 us | 62.2559 | 21.2402 | 299.61 KB |
Complicated | 69.97 us | 0.4752 us | 0.4213 us | 34.9121 | 11.8408 | 167.45 KB |
ComplicatedGuessCapacity | 67.13 us | 0.3257 us | 0.2887 us | 34.9121 | 8.1787 | 161.47 KB |
Split | 207.69 us | 1.1943 us | 1.0587 us | 31.0059 | 9.5215 | 145.49 KB |
StringReader | 80.09 us | 0.3387 us | 0.2828 us | 34.5459 | 11.8408 | 167.48 KB |
StringReaderGuessCapacity | 77.07 us | 0.3251 us | 0.3041 us | 33.4473 | 9.7656 | 161.5 KB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StringReader
seems like the way to go
{ | ||
var unixFile = ScriptFileChangeTests.CreateScriptFile(TestStringUnix); | ||
Assert.Equal(scriptFile.FileLines, unixFile.FileLines); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you added tests 😍
On the CLA, I have to wait for my legal department to let me sign. Sorta kicking myself, I actually got a ways down that road with something in corefx a while back, but the change was small enough that I was told (by the corefx maintainers) that it wasn't necessary, so I stopped nagging legal. |
The previous implementation used LINQ on a relatively hot path, returned a lazy IEnumerable that was always eagerly converted to a List or Array and allocated two strings per line when Windows linebreaks were used. Profiling indicates that PSES spends a huge percentage of its time in GC, so switching to a slightly more complex implementation that allocates less than half as much seems justified.
71aa6ee
to
1f59612
Compare
Changed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for addressing the feedback :)
Let me know when your legal department gives you the ok 👍
I don't think we can do what the .NET team did because this change touches a few files.
My legal team signed off, CLA signed. |
The previous implementation used LINQ on a relatively hot path, returned a lazy IEnumerable that was always eagerly converted to a List or Array and allocated two strings per line when Windows linebreaks were used. Profiling indicates that PSES spends a huge percentage of its time in GC, so switching to a slightly more complex implementation that allocates less than half as much seems justified.
This is some of the "low hanging fruit" I mentioned in PowerShell/vscode-powershell#1342.