
Explain the relationship between windows-1252, Latin1, and ASCII #345

Merged: 5 commits, Apr 23, 2025
41 changes: 40 additions & 1 deletion encoding.bs
@@ -10,6 +10,18 @@ Markup Shorthands: css off
Translate IDs: dictdef-textdecoderoptions textdecoderoptions,dictdef-textdecodeoptions textdecodeoptions,index section-index
</pre>

<pre class=biblio>
{
"ISO8859-1": {
"href": "https://www.iso.org/standard/28245.html",
"title": "Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1",
"publisher": "International Organization for Standardization (ISO)",
"status": "Published",
"date": "April 1998"
}
}
</pre>

<link rel=stylesheet href=visualization-colors.css>


@@ -568,7 +580,10 @@ prescribes, as that is necessary to be compatible with deployed content.
<tr><td>"<code>windows-1251</code>"
<tr><td>"<code>x-cp1251</code>"
<tr>
<td rowspan=17><a>windows-1252</a>
<td rowspan=17>
<a>windows-1252</a>
<p class=note>See <a href="#note-latin1-ascii">below</a> for the relationship to historical
"Latin1" and "ASCII" concepts.
<td>"<code>ansi_x3.4-1968</code>"
<tr><td>"<code>ascii</code>"
<tr><td>"<code>cp1252</code>"
@@ -732,6 +747,30 @@ part of the ISO 8859 series. In particular, the necessity of the inclusion of <a
and <a>ISO-8859-16</a> is doubtful for the purpose of supporting existing content, but there are no
plans to remove these.</p>

<div class=note id=note-latin1-ascii>
<p>The <a>windows-1252</a> <a for=/>encoding</a> has various <a for=encoding>labels</a> like
"<code>latin1</code>", "<code>iso-8859-1</code>", "<code>ascii</code>", etc. which have
historically been confusing for developers. On the web, and in any software that seeks to be
web-compatible by implementing the Encoding Standard, these are synonyms: "<code>latin1</code>" and
"<code>ascii</code>" are just labels for <a>windows-1252</a>, and any software following this
standard will, for example, decode 0x80 as U+20AC (€) when asked for the Latin1 or ASCII decoding
of that byte.
Member

I think overall this is probably okay, but what gives me pause is that the Encoding standard doesn't define Latin1 or ASCII encodings (it only defines them as labels). So if software exposes those encodings, who knows what they might do. So perhaps we should make that distinction clearer, in that this will likely happen for software that takes a label and some bytes as input.

Member Author

I tried to phrase this carefully to avoid giving the impression that latin1 or ASCII are encodings, and instead be clear that they are inputs to the common algorithm category that takes (byte sequence, encoding label) parameters.

On the web that algorithm category is well-formalized with the concepts of actual encodings vs. labels, but in larger software it's more vague with e.g. functions named DecodeLatin1 or similar.

My attempt was "when asked for the Latin1 or ASCII decoding of that byte", but if you have a different suggestion I'd be interested. The main thing is that I don't want to only constrain us to describing the web API case where we have a clear label/encoding divide, but instead the more general category of "please decode some bytes" algorithms across all software.

Member

The problem I have with this is that browsers typically have "Latin1" code paths that are very much aligned with the Unicode view of the world and not windows-1252. So for complicated software it very much depends on how or what you ask.

I also don't really have a good rephrasing that would account for that. Maybe put Latin1 and ASCII in quotes like below?

Member Author

I don't think I fully understand what's making you uneasy, but I think adding the quotes is reasonable, so I'll do that.

Member

Well, browsers typically have Latin1 or ASCII encoding implementations that don't do windows-1252. But obviously they also map the "Latin1" and "ascii" labels to windows-1252. So they're on both sides of the divide you're trying to draw.


<p>Software that does not follow the Encoding Standard does not always give the same answers. The
root of this is that the original document that specified Latin1 (ISO/IEC 8859-1) did not provide
any mappings for bytes in the inclusive ranges 0x00–0x1F or 0x7F–0x9F. Similarly, the original
documents that specified ASCII (ISO/IEC 646, among others) did not provide any mappings for bytes
in the inclusive range 0x80–0xFF. This means different software has chosen different code point
mappings for those bytes when asked to use Latin1 or ASCII encodings. Web browsers and
browser-compatible software have chosen to map those bytes according to <a>windows-1252</a>, which
is a superset of both, and this was codified in the Encoding Standard. Other software throws
errors, or uses <a>isomorphic decoding</a>, or other mappings. [[ISO8859-1]] [[ISO646]]

<p>As such, implementers and developers need to be careful whenever they are using libraries which
expose APIs in terms of "Latin1" or "ASCII". It's very possible such libraries will not give
answers in line with the Encoding Standard, if they have chosen other behaviors for the bytes which
were left undefined in the original specifications.
</div>
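
As a rough illustration of the divergence the note describes, the Python sketch below compares a web-style decode of a few bytes with Python's own built-in codecs. The names `WINDOWS_1252_LABELS`, `web_decode`, and `python_decode` are hypothetical helpers made up for this example, the label set is only a subset of the 17 labels in the table above, and the web-style decoder is approximated with Python's `cp1252` codec plus a fallback for the five bytes that codec leaves undefined. It is a sketch under those assumptions, not an implementation of the Encoding Standard.

```python
# Illustrative subset of the labels that resolve to windows-1252 (assumption:
# only the labels needed for this sketch are listed, not the full table).
WINDOWS_1252_LABELS = {
    "ansi_x3.4-1968", "ascii", "cp1252", "iso-8859-1", "latin1", "windows-1252",
}

# Python's cp1252 codec treats these bytes as undefined; a web-compatible
# windows-1252 decoder maps each one to the C1 control with the same value.
_CP1252_GAPS = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def web_decode(data: bytes, label: str) -> str:
    """Hypothetical helper: decode bytes the way a label-driven web decoder would."""
    # Labels are matched after trimming ASCII whitespace and ignoring case.
    name = label.strip("\t\n\x0c\r ").lower()
    if name not in WINDOWS_1252_LABELS:
        raise LookupError(f"label not covered by this sketch: {label!r}")
    return "".join(
        chr(b) if b in _CP1252_GAPS else bytes([b]).decode("cp1252")
        for b in data
    )


def python_decode(data: bytes, codec: str) -> str:
    """Decode with a built-in Python codec, reporting failures as '<error>'."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return "<error>"


if __name__ == "__main__":
    for byte in (b"\x80", b"\x81"):
        print(byte, "web latin1:", repr(web_decode(byte, "latin1")))
        for codec in ("latin-1", "ascii", "cp1252"):
            print(byte, f"python {codec}:", repr(python_decode(byte, codec)))
```

In this sketch the same byte 0x80 comes back as U+20AC from the web-style path, as U+0080 from Python's `latin-1` codec (an isomorphic decode), and as an error from Python's `ascii` codec, which matches the three kinds of behavior the note lists.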

<h3 id=output-encodings>Output encodings</h3>
