Explain the relationship between windows-1252, Latin1, and ASCII #345

Merged
merged 5 commits into from
Apr 23, 2025
40 changes: 39 additions & 1 deletion encoding.bs
@@ -10,6 +10,18 @@ Markup Shorthands: css off
Translate IDs: dictdef-textdecoderoptions textdecoderoptions,dictdef-textdecodeoptions textdecodeoptions,index section-index
</pre>

+<pre class=biblio>
+{
+"ISO8859-1": {
+"href": "https://www.iso.org/standard/28245.html",
+"title": "Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1",
+"publisher": "International Organization for Standardization (ISO)",
+"status": "Published",
+"date": "April 1998"
+}
+}
+</pre>
+
<link rel=stylesheet href=visualization-colors.css>


@@ -568,7 +580,10 @@ prescribes, as that is necessary to be compatible with deployed content.
<tr><td>"<code>windows-1251</code>"
<tr><td>"<code>x-cp1251</code>"
<tr>
-<td rowspan=17><a>windows-1252</a>
+<td rowspan=17>
+<a>windows-1252</a>
+<p class=note>See <a href="#note-latin1-ascii">below</a> for the relationship to historical
+"Latin1" and "ASCII" concepts.
<td>"<code>ansi_x3.4-1968</code>"
<tr><td>"<code>ascii</code>"
<tr><td>"<code>cp1252</code>"
@@ -732,6 +747,29 @@ part of the ISO 8859 series. In particular, the necessity of the inclusion of <a
and <a>ISO-8859-16</a> is doubtful for the purpose of supporting existing content, but there are no
plans to remove these.</p>

+<div class=note id=note-latin1-ascii>
+<p>The <a>windows-1252</a> <a for=/>encoding</a> has various <a for=encoding>labels</a>, such as
+"<code>latin1</code>", "<code>iso-8859-1</code>", and "<code>ascii</code>", which have historically
+been confusing for developers. On the web, and in any software that seeks to be web-compatible by
+implementing this standard, these are synonyms: "<code>latin1</code>" and "<code>ascii</code>" are
+just labels for <a>windows-1252</a>, and any software following this standard will, for example,
+decode 0x80 as U+20AC (€) when asked for the "Latin1" or "ASCII" decoding of that byte.
+
+<p>Software that does not follow this standard does not always give the same answers. The root of
+this is that the original document that specified Latin1 (ISO/IEC 8859-1) did not provide any
+mappings for bytes in the inclusive ranges 0x00 to 0x1F or 0x7F to 0x9F. Similarly, the original
+documents that specified ASCII (ISO/IEC 646, among others) did not provide any mappings for bytes
+in the inclusive range 0x80 to 0xFF. This means different software has chosen different code point
+mappings for those bytes when asked to use Latin1 or ASCII encodings. Web browsers and
+browser-compatible software have chosen to map those bytes according to <a>windows-1252</a>, which
+is a superset of both, and this choice was codified in this standard. Other software throws errors,
+or uses <a>isomorphic decoding</a>, or other mappings. [[ISO8859-1]] [[ISO646]]
+
+<p>As such, implementers and developers need to be careful whenever they are using libraries which
+expose APIs in terms of "Latin1" or "ASCII". It's very possible such libraries will not give
+answers in line with this standard, if they have chosen other behaviors for the bytes which were
+left undefined in the original specifications.
+</div>

<h3 id=output-encodings>Output encodings</h3>
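
To illustrate the divergence described in the note added by this change, here is a minimal sketch using Python's built-in codecs. Python's codecs are not an implementation of the Encoding Standard; they serve only as an example of a library that exposes "Latin1" and "ASCII" by name. The helper `decode_or_none` is hypothetical and not part of this change.

```python
# One byte that ISO/IEC 8859-1 and ASCII leave without a graphic character.
BYTE = bytes([0x80])


def decode_or_none(data: bytes, codec: str):
    """Return the decoded text, or None if the codec rejects the input."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# Decode the same byte with three of Python's codec names.
results = {
    codec: decode_or_none(BYTE, codec)
    for codec in ("cp1252", "latin-1", "ascii")
}

for codec, text in results.items():
    # ascii() makes non-printable and non-ASCII results visible.
    print(codec, "rejected" if text is None else ascii(text))
```

Here `cp1252` yields U+20AC, `latin-1` yields U+0080 (the isomorphic mapping), and `ascii` raises an error. A decoder following this standard would be expected to return U+20AC for all three labels. Python's `cp1252` codec is not an exact match either: it rejects a few bytes, such as 0x81, that the windows-1252 index in this standard maps to C1 control code points.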

9 changes: 8 additions & 1 deletion tools-label-table.py
@@ -14,7 +14,14 @@ def create_table():
if label_len > 1:
rowspan = " rowspan=" + str(label_len)

-table += " <tr>\n <td" + rowspan + "><a>" + encoding["name"] + "</a>"
+if encoding["name"] != "windows-1252":
+table += " <tr>\n <td" + rowspan + "><a>" + encoding["name"] + "</a>"
+else:
+table += f""" <tr>
+<td{rowspan}>
+<a>{encoding["name"]}</a>
+<p class=note>See <a href="#note-latin1-ascii">below</a> for the relationship to historical
+"Latin1" and "ASCII" concepts."""
Member

I suspect this is missing a newline at the end.

Member Author

I added one, and indeed the Python script output is more consistent now, but the main text doesn't match the Python script output: all the blank lines are omitted. Not sure what to do about that, if anything.

Member

I took a more detailed look and pushed a fixup. Should be okay to land now from my perspective.

i = 0
for label in encoding["labels"]:
if i > 0:
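
For readers following the review thread about the trailing newline, here is a self-contained sketch of the branching this hunk introduces. The function `name_cell`, the constant `NOTE`, and the exact whitespace inside the strings are assumptions made for illustration; the real script builds the table inline inside `create_table()`, and its indentation is not recoverable from this view.

```python
# Note text attached to the windows-1252 name cell.
# The trailing newline is an assumption that follows the review comment.
NOTE = (
    '<p class=note>See <a href="#note-latin1-ascii">below</a> '
    "for the relationship to historical\n"
    '"Latin1" and "ASCII" concepts.\n'
)


def name_cell(name: str, label_count: int) -> str:
    """Build the opening row and name cell for one encoding (hypothetical helper)."""
    # Only span multiple rows when the encoding has more than one label.
    rowspan = f" rowspan={label_count}" if label_count > 1 else ""
    cell = f" <tr>\n <td{rowspan}>"
    if name != "windows-1252":
        # Every other encoding keeps the compact single-line form.
        return cell + f"<a>{name}</a>"
    # windows-1252 gets a multi-line cell carrying the cross-reference note.
    return cell + f"\n <a>{name}</a>\n " + NOTE


print(name_cell("windows-1251", 3))
print(name_cell("windows-1252", 17), end="")
```

Keeping the special case keyed on the encoding name mirrors the diff above. A data-driven alternative, such as a per-encoding note field, would be a larger change than this pull request makes.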