[SR-1280] Unicode conformance readLine #43888
Comments
The Radar contains this additional information: See http://unicode.org/reports/tr14/ BK: Mandatory Break (A) (Non-tailorable) See also the section 5.8, Newline Guidelines, in the Unicode spec. |
Comment by Han Sangjin (JIRA) I just read section 5.8, Newline Guidelines. It has superseded UAX #13 (http://unicode.org/reports/tr13), and I think it has explicit information about the behavior of readLine(). But I'm confused about http://unicode.org/reports/tr14/, the Line Breaking Algorithm. Going by the summary of TR14, I think the line breaking algorithm is only for display, not for readLine(). Could you confirm this? |
Comment by Han Sangjin (JIRA) I have stopped working on this task. When I started, I hoped to write a common, platform-independent readLine() implementation. I wrote a Unicode newline recognizer in C++ that used getc(stdin) and ran on both Windows and Linux. It worked: it stopped at NLF, LS, FF, or PS (where NLF is CR, LF, CRLF, or NEL). But it was about 9x slower than the old code that used getline(), measured by looping readLine() over a 2 MB binary file on Linux. The root cause was not the larger set of newline codes: getline() scans its internal buffer for '\n' using memchr(), while my code called getc() for each byte. To reach similar performance, in my opinion, it would have to be implemented at a low level that accesses the buffer directly, which is platform dependent. Any ideas are welcome. |
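To make the comment above concrete, here is a minimal sketch of a recognizer that scans a UTF-8 byte buffer already in memory, rather than calling getc() once per byte. It is not the code from the comment and not the Swift standard library's implementation; the names `LineBreak` and `findUnicodeLineBreak` are hypothetical. It assumes the input is UTF-8 and recognizes the terminators listed in the comment (CR, LF, CRLF, NEL, LS, FF, PS). How the buffer gets filled from stdin is left out, since that is the platform-dependent part.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type: where the line's content ends and where the
// next line starts. When found is false, both indices equal buf.size().
struct LineBreak {
    std::size_t contentEnd;  // index of the first byte of the terminator
    std::size_t nextStart;   // index just past the terminator
    bool found;
};

// Scan buf (assumed UTF-8) from pos for the first newline function:
// LF, FF, CR, CRLF, NEL (U+0085), LS (U+2028), or PS (U+2029).
LineBreak findUnicodeLineBreak(const std::string& buf, std::size_t pos = 0) {
    assert(pos <= buf.size());
    const std::size_t n = buf.size();
    for (std::size_t i = pos; i < n; ++i) {
        const unsigned char b = static_cast<unsigned char>(buf[i]);
        switch (b) {
        case 0x0A:  // LF
        case 0x0C:  // FF
            return {i, i + 1, true};
        case 0x0D:  // CR, possibly followed by LF
            if (i + 1 < n && buf[i + 1] == '\n') return {i, i + 2, true};
            return {i, i + 1, true};
        case 0xC2:  // NEL is encoded as C2 85
            if (i + 1 < n && static_cast<unsigned char>(buf[i + 1]) == 0x85)
                return {i, i + 2, true};
            break;
        case 0xE2:  // LS is E2 80 A8, PS is E2 80 A9
            if (i + 2 < n &&
                static_cast<unsigned char>(buf[i + 1]) == 0x80) {
                const unsigned char c = static_cast<unsigned char>(buf[i + 2]);
                if (c == 0xA8 || c == 0xA9) return {i, i + 3, true};
            }
            break;
        default:
            break;
        }
    }
    return {n, n, false};
}
```

A caller would refill the buffer and rescan when `found` is false. One caveat of this sketch: a CR that is the last byte of the buffer is reported as a complete terminator, so a CRLF split across two refills would need extra handling.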
I'm not sure if this should be done. On the one hand, it makes sense to change the implementation to be consistent with Character or Unicode.Scalar's notion of newlines. On the other hand, this is reading from stdin (i.e. from the shell) and it would make sense to read until encountering the platform-specific line delimiter, for consistency with the shell. |
@milseman The recommendations in §5.8 Newline Guidelines of the Unicode Standard include:
and
My pull request at apple/swift#21586 follows this recommendation, and also stops at VT to match the Java has |
@compnerd, what approach do you think would make sense for Windows? |
Windows uses text mode by default for Can we require text mode for the |
That sounds great to me; IIUC, the whole point of these facilities is for interacting with the system/command-line. But I can see how others might disagree and I'd like to see some discussion. Could you open a post on the forums under Development/Standard Library? |
Thanks for following up and chasing this down, BTW! |
I've closed my pull request at apple/swift#21586. There were two possible implementations:
|
Additional Detail from JIRA
md5: 001f9dad12b054615ab07d0ec9cf6287
Issue Description:
readLine() is implemented in InputStream.swift, and it contains the following comment.
// FIXME: Unicode conformance. To fix this, we need to reimplement the
// code we call above to get a line, since it will only stop on LF.
//
// <rdar://problem/20013999> Recognize Unicode newlines in readLine()
//
// Recognize only LF and CR+LF combinations for now.
I cannot access rdar://problem/*, so I posted it here.