Skip to content

Reduce allocations when parsing files #715

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 7, 2018

Conversation

mattpwhite
Copy link
Contributor

The previous implementation used LINQ on a relatively hot path, returned a lazy IEnumerable that was always eagerly converted to a List or Array and allocated two strings per line when Windows linebreaks were used. Profiling indicates that PSES spends a huge percentage of its time in GC, so switching to a slightly more complex implementation that allocates less than half as much seems justified.

This is some of the "low hanging fruit" I mentioned in PowerShell/vscode-powershell#1342.

@msftclas
Copy link

msftclas commented Jul 31, 2018

CLA assistant check
All CLA requirements met.

Copy link
Member

@TylerLeonhardt TylerLeonhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great @mattpwhite! I have comment on some of the complexity.

@@ -197,14 +197,40 @@ public string[] ReferencedFiles
/// </summary>
/// <param name="text">Input string to be split up into lines.</param>
/// <returns>The lines in the string.</returns>
public static IEnumerable<string> GetLines(string text)
public static IList<string> GetLines(string text)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be able to replace all of this with calling

text.Split(Environment.NewLine)

That will be platform agnostic - or rather leave it up to .NET Core to do the right thing.

Link:
https://docs.microsoft.com/en-us/dotnet/api/system.environment.newline?view=netcore-2.0#System_Environment_NewLine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. You'd either end up with too many empty strings, or not enough. If you're expecting the index into the list of lines to correspond to the line number in the file, it wouldn't be correct...

PS C:\> "foo`r`nbar".Split([environment]::newline).Length
3
PS C:\> "foo`r`n`r`nbar".Split([environment]::newline, [StringSplitOptions]::RemoveEmptyEntries).Length
2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not replicating that behaviour you're seeing there:

> "foo`r`nbar".Split("`r`n").Length
2
> "foo`nbar".Split([environment]::NewLine).Length  
2

I'd imagine we want to keep empty lines for fidelity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it's a core vs framework difference. Core has an overload for string, the framework does not. So on PS 5.1, it picks the (I think?) params char[] version and coerces the string to that and we get the undesired behavior.

PS C:\> $psversiontable

Name                           Value
----                           -----
PSVersion                      5.1.15063.1206
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.15063.1206
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1


PS C:\> "foo`r`nbar".Split("`r`n").Length
3
PS C:\> 'a'.split

OverloadDefinitions
-------------------
string[] Split(Params char[] separator)
string[] Split(char[] separator, int count)
string[] Split(char[] separator, System.StringSplitOptions options)
string[] Split(char[] separator, int count, System.StringSplitOptions options)
string[] Split(string[] separator, System.StringSplitOptions options)
string[] Split(string[] separator, int count, System.StringSplitOptions options)



PS C:\> "foo`r`nbar".Split(([string[]]"`r`n"), [StringSplitOptions]::None).Length
2

Core has a string version and the options has a default value (https://github.com/dotnet/coreclr/blob/0fbd855e38bc3ec269479b5f6bf561dcfd67cbb6/src/System.Private.CoreLib/shared/System/String.Manipulation.cs#L1244), so that gets picked.

PS C:\> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      6.1.0-preview.4
PSEdition                      Core
GitCommitId                    6.1.0-preview.4
OS                             Microsoft Windows 10.0.15063
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0


PS C:\>  "foo`r`nbar".Split("`r`n").Length
2
PS C:\> 'a'.split

OverloadDefinitions
-------------------
string[] Split(char separator, System.StringSplitOptions options)
string[] Split(char separator, int count, System.StringSplitOptions options)
string[] Split(Params char[] separator)
string[] Split(char[] separator, int count)
string[] Split(char[] separator, System.StringSplitOptions options)
string[] Split(char[] separator, int count, System.StringSplitOptions options)
string[] Split(string separator, System.StringSplitOptions options)
string[] Split(string separator, int count, System.StringSplitOptions options)
string[] Split(string[] separator, System.StringSplitOptions options)
string[] Split(string[] separator, int count, System.StringSplitOptions options)



PS C:\>  "foo`r`nbar".Split("`r`n".ToCharArray()).Length
3

All of that said, I think it's still broken:

PS C:\>  "foo`nbar".Split("`r`n").Length
1

There are lots of valid reasons to be dealing with Unix line-broken files on Windows (and to a lesser extent vice-versa), so I don't think relying on Environment.NewLine works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point. In that case I'm reasonably satisfied the complexity is unavoidable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringReader appears promising, Split not so much, especially on the full framework. The complicated code still edges out StringReader on execution time, but not allocations, and it's not by a lot. Let me know if you'd prefer the StringReader method and I'll update the PR. With either implementation, giving the List a hint on capacity with a crude heuristic of lines averaging 50 chars (the actual average across a few hundred thousand lines of real PS I had handy was 43) seems like a marginal win with little downside.

Code: https://gist.github.com/mattpwhite/c2ec7bcded6645d529574969f65e7d4f

Results (using a ~1300 line, ~55000 char source file)

Framework:

// * Summary *

BenchmarkDotNet=v0.11.0, OS=Windows 10.0.15063.1206 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
  [Host]    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.3130.0
  RyuJitX64 : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3130.0

Job=RyuJitX64  Jit=RyuJit  Platform=X64

                    Method |      Mean |     Error |    StdDev |    Gen 0 |    Gen 1 |    Gen 2 | Allocated |
-------------------------- |----------:|----------:|----------:|---------:|---------:|---------:|----------:|
                  Original | 192.61 us | 0.6213 us | 0.5811 us | 133.3008 |  66.6504 |  66.6504 | 581.12 KB |
               Complicated |  74.19 us | 0.3941 us | 0.3686 us |  33.5693 |  14.6484 |        - | 167.45 KB |
  ComplicatedGuessCapacity |  70.97 us | 0.3468 us | 0.3244 us |  32.8369 |  11.8408 |        - | 161.47 KB |
                     Split | 349.20 us | 1.2177 us | 1.1391 us | 133.3008 | 133.3008 | 133.3008 | 582.84 KB |
              StringReader |  78.63 us | 0.3658 us | 0.3422 us |  34.7900 |  12.3291 |        - | 167.49 KB |
 StringReaderGuessCapacity |  77.80 us | 0.2394 us | 0.2123 us |  33.3252 |  10.9863 |        - | 161.51 KB |

Core:

BenchmarkDotNet=v0.11.0, OS=Windows 10.0.15063.1206 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=2.1.302
  [Host]    : .NET Core ? (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
  RyuJitX64 : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT

Job=RyuJitX64  Jit=RyuJit  Platform=X64

                    Method |      Mean |     Error |    StdDev |   Gen 0 |   Gen 1 | Allocated |
-------------------------- |----------:|----------:|----------:|--------:|--------:|----------:|
                  Original | 134.68 us | 0.7214 us | 0.6748 us | 62.2559 | 21.2402 | 299.61 KB |
               Complicated |  69.97 us | 0.4752 us | 0.4213 us | 34.9121 | 11.8408 | 167.45 KB |
  ComplicatedGuessCapacity |  67.13 us | 0.3257 us | 0.2887 us | 34.9121 |  8.1787 | 161.47 KB |
                     Split | 207.69 us | 1.1943 us | 1.0587 us | 31.0059 |  9.5215 | 145.49 KB |
              StringReader |  80.09 us | 0.3387 us | 0.2828 us | 34.5459 | 11.8408 | 167.48 KB |
 StringReaderGuessCapacity |  77.07 us | 0.3251 us | 0.3041 us | 33.4473 |  9.7656 |  161.5 KB |

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringReader seems like the way to go

{
var unixFile = ScriptFileChangeTests.CreateScriptFile(TestStringUnix);
Assert.Equal(scriptFile.FileLines, unixFile.FileLines);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you added tests 😍

@mattpwhite
Copy link
Contributor Author

On the CLA, I have to wait for my legal department to let me sign. Sorta kicking myself, I actually got a ways down that road with something in corefx a while back, but the change was small enough that I was told (by the corefx maintainers) that it wasn't necessary, so I stopped nagging legal.

The previous implementation used LINQ on a relatively hot path,
returned a lazy IEnumerable that was always eagerly converted to a
List or Array and allocated two strings per line when Windows
linebreaks were used.  Profiling indicates that PSES spends a huge
percentage of its time in GC, so switching to a slightly more complex
implementation that allocates less than half as much seems justified.
@mattpwhite
Copy link
Contributor Author

Changed to StringReader. Noticed that we got 0 lines back from the empty string, wasn't sure if any assumptions were being made about > 0 lines somewhere so I just special cased that to be sure and return a 1-element list, which is what the previous implementation did.

Copy link
Member

@TylerLeonhardt TylerLeonhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for addressing the feedback :)

Let me know when your legal department gives you the ok 👍

I don't think we can do what the .NET team did because this change touches a few files.

@mattpwhite
Copy link
Contributor Author

My legal team signed off, CLA signed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants