Explain the relationship between windows-1252, Latin1, and ASCII #345
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,6 +10,18 @@ Markup Shorthands: css off | |
Translate IDs: dictdef-textdecoderoptions textdecoderoptions,dictdef-textdecodeoptions textdecodeoptions,index section-index | ||
</pre> | ||
|
||
<pre class=biblio> | ||
{ | ||
"ISO8859-1": { | ||
"href": "https://www.iso.org/standard/28245.html", | ||
"title": "Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1", | ||
"publisher": "International Organization for Standardization (ISO)", | ||
"status": "Published", | ||
"date": "April 1998" | ||
} | ||
} | ||
</pre> | ||
|
||
<link rel=stylesheet href=visualization-colors.css> | ||
|
||
|
||
|
@@ -568,7 +580,10 @@ prescribes, as that is necessary to be compatible with deployed content. | |
<tr><td>"<code>windows-1251</code>" | ||
<tr><td>"<code>x-cp1251</code>" | ||
<tr> | ||
<td rowspan=17><a>windows-1252</a> | ||
<td rowspan=17> | ||
<a>windows-1252</a> | ||
<p class=note>See <a href="#note-latin1-ascii">below</a> for the relationship to historical | ||
"Latin1" and "ASCII" concepts. | ||
<td>"<code>ansi_x3.4-1968</code>" | ||
<tr><td>"<code>ascii</code>" | ||
<tr><td>"<code>cp1252</code>" | ||
|
@@ -732,6 +747,30 @@ part of the ISO 8859 series. In particular, the necessity of the inclusion of <a | |
and <a>ISO-8859-16</a> is doubtful for the purpose of supporting existing content, but there are no | ||
plans to remove these.</p> | ||
|
||
<div class=note id=note-latin1-ascii> | ||
<p>The <a>windows-1252</a> <a for=/>encoding</a> has various <a for=encoding>labels</a> like | ||
"<code>latin1</code>", "<code>iso-8859-1</code>", "<code>ascii</code>", etc. which have | ||
domenic marked this conversation as resolved.
|
||
historically been confusing for developers. On the web, and in any software that seeks to be | ||
web-compatible by implementing the Encoding Standard, these are synonyms: "<code>latin1</code>" and | ||
domenic marked this conversation as resolved.
|
||
"<code>ascii</code>" are just labels for <a>windows-1252</a>, and any software following this | ||
standard will, for example, decode 0x80 as U+20AC (€) when asked for the Latin1 or ASCII decoding | ||
of that byte. | ||
I think overall this is probably okay, but what gives me pause is that the Encoding standard doesn't define Latin1 or ASCII encodings (it only defines them as labels). So if software exposes those encodings, who knows what they might do. So perhaps we should make that distinction clearer, in that this will likely happen for software that takes a label and some bytes as input.

I tried to phrase this carefully to avoid giving the impression that latin1 or ASCII are encodings, and instead be clear that they are inputs to the common algorithm category that takes (byte sequence, encoding label) parameters. On the web that algorithm category is well-formalized with the concepts of actual encodings vs. labels, but in larger software it's more vague with e.g. functions named

My attempt was "when asked for the Latin1 or ASCII decoding of that byte", but if you have a different suggestion I'd be interested. The main thing is that I don't want to only constrain us to describing the web API case where we have a clear label/encoding divide, but instead the more general category of "please decode some bytes" algorithms across all software.

The problem I have with this is that browsers typically have "Latin1" code paths that are very much aligned with the Unicode view of the world and not windows-1252. So for complicated software it very much depends on how or what you ask. I also don't really have a good rephrasing that would account for that. Maybe put Latin1 and ASCII in quotes like below?

I don't think I fully understand what's making you uneasy, but I think adding the quotes is reasonable, so I'll do that.

Well, browsers typically have Latin1 or ASCII encoding implementations that don't do windows-1252. But obviously they also map "Latin1" and "ascii" labels to windows-1252. So they're on both sides of the divide you're trying to draw. |
||
|
||
<p>Software that does not follow the Encoding Standard does not always give the same answers. The | ||
root of this is that the original document that specified Latin1 (ISO/IEC 8859-1) did not provide | ||
any mappings for bytes in the inclusive ranges 0x00–0x1F or 0x7F–0x9F. Similarly, the original | ||
domenic marked this conversation as resolved.
|
||
documents that specified ASCII (ISO/IEC 646, among others) did not provide any mappings for bytes | ||
in the inclusive range 0x80–0xFF. This means different software has chosen different code point | ||
mappings for those bytes when asked to use Latin1 or ASCII encodings. Web browsers and | ||
browser-compatible software have chosen to map those bytes according to <a>windows-1252</a>, which | ||
is a superset of both, and this was codified in the Encoding Standard. Other software throws | ||
errors, or uses <a>isomorphic decoding</a>, or other mappings. [[ISO8859-1]] [[ISO646]] | ||
|
||
<p>As such, implementers and developers need to be careful whenever they are using libraries which | ||
expose APIs in terms of "Latin1" or "ASCII". It's very possible such libraries will not give | ||
answers in line with the Encoding Standard, if they have chosen other behaviors for the bytes which | ||
were left undefined in the original specifications. | ||
</div> | ||
|
||
<h3 id=output-encodings>Output encodings</h3> | ||
|
||
|
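For anyone following the thread, here is a minimal sketch of the divergence the new note describes. It uses Python's built-in codecs purely as one example of software that does not follow the Encoding Standard; the helper name `decode_or_error` is made up for illustration. The `cp1252` result is included only because it matches what the note says for byte 0x80; Python's `cp1252` codec is not assumed to agree with the Encoding Standard's <a>windows-1252</a> index for every byte.

```python
def decode_or_error(data: bytes, codec: str) -> str:
    """Decode data with the named Python codec; report a failure instead of raising."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


# The single byte discussed in the note.
byte = b"\x80"

# Three codec names a developer might reach for when asked for "Latin1" or "ASCII".
results = {
    codec: decode_or_error(byte, codec)
    for codec in ("cp1252", "latin-1", "ascii")
}

for codec, text in results.items():
    print(f"{codec:8} -> {text!r}")
```

In this sketch `cp1252` yields U+20AC (€), `latin-1` yields U+0080 (the byte value carried over as a code point, the same result as isomorphic decoding), and `ascii` refuses the byte with an error. That covers all three behaviors the note lists for software outside the Encoding Standard.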