Mac encoding
-
Hi everyone.
A friend sent me a plain text document he created on a Mac. We’re both Italian, so naturally there are a lot of è, é, ì characters involved, which I cannot see correctly in his file. By inspecting the document with a hex editor, I’ve come to the conclusion that he must be using this encoding:
https://en.wikipedia.org/wiki/Mac_OS_Roman
Normally I would tell the editor to interpret the document with that encoding and then, perhaps, convert it to UTF-8 so I can work on it more portably. But I can’t find an entry for Mac OS Roman in the encodings list.
Is it completely unsupported?
Thanks. -
Hello Valerio,
As you’re Italian, you probably use the Windows-1252 encoding as your default ANSI encoding. Refer to the link below :
https://msdn.microsoft.com/en-us/goglobal/cc305145
You can verify that I’m not mistaken by opening the Character Panel ( Menu option Edit - Character Panel ). The list displayed should be the same as the Microsoft table above !
Here is, below, a table of the MAC OS Roman encoding, for characters over \x7F only. Remember that characters with code-point < \x80 are always identical, in any Windows, OEM, ISO or UTF-8 encoding.

•------•-------•-------•---------•--------•--------•---------------------------------------------
|            MAC OS Roman Encoding   ( Windows Code Page 10000 )
•------•-------•-------•---------•--------•--------•---------------------------------------------
| MAC OS Roman | Char. |   Windows-1252   | UNICODE
| Hexa | Deci. | Glyph | Encoded |  Hexa  | Value  | Character Name
•------•-------•-------•---------•--------•--------•---------------------------------------------
|  80  |  128  |   Ä   |         |   C4   |  00C4  | LATIN CAPITAL LETTER A WITH DIAERESIS
|  81  |  129  |   Å   |         |   C5   |  00C5  | LATIN CAPITAL LETTER A WITH RING ABOVE
|  82  |  130  |   Ç   |         |   C7   |  00C7  | LATIN CAPITAL LETTER C WITH CEDILLA
|  83  |  131  |   É   |         |   C9   |  00C9  | LATIN CAPITAL LETTER E WITH ACUTE
|  84  |  132  |   Ñ   |         |   D1   |  00D1  | LATIN CAPITAL LETTER N WITH TILDE
|  85  |  133  |   Ö   |         |   D6   |  00D6  | LATIN CAPITAL LETTER O WITH DIAERESIS
|  86  |  134  |   Ü   |         |   DC   |  00DC  | LATIN CAPITAL LETTER U WITH DIAERESIS
|  87  |  135  |   á   |         |   E1   |  00E1  | LATIN SMALL LETTER A WITH ACUTE
|  88  |  136  |   à   |         |   E0   |  00E0  | LATIN SMALL LETTER A WITH GRAVE
|  89  |  137  |   â   |         |   E2   |  00E2  | LATIN SMALL LETTER A WITH CIRCUMFLEX
|  8A  |  138  |   ä   |         |   E4   |  00E4  | LATIN SMALL LETTER A WITH DIAERESIS
|  8B  |  139  |   ã   |         |   E3   |  00E3  | LATIN SMALL LETTER A WITH TILDE
|  8C  |  140  |   å   |         |   E5   |  00E5  | LATIN SMALL LETTER A WITH RING ABOVE
|  8D  |  141  |   ç   |         |   E7   |  00E7  | LATIN SMALL LETTER C WITH CEDILLA
|  8E  |  142  |   é   |         |   E9   |  00E9  | LATIN SMALL LETTER E WITH ACUTE
|  8F  |  143  |   è   |         |   E8   |  00E8  | LATIN SMALL LETTER E WITH GRAVE
•------•-------•-------•---------•--------•--------•---------------------------------------------
|  90  |  144  |   ê   |         |   EA   |  00EA  | LATIN SMALL LETTER E WITH CIRCUMFLEX
|  91  |  145  |   ë   |         |   EB   |  00EB  | LATIN SMALL LETTER E WITH DIAERESIS
|  92  |  146  |   í   |         |   ED   |  00ED  | LATIN SMALL LETTER I WITH ACUTE
|  93  |  147  |   ì   |         |   EC   |  00EC  | LATIN SMALL LETTER I WITH GRAVE
|  94  |  148  |   î   |         |   EE   |  00EE  | LATIN SMALL LETTER I WITH CIRCUMFLEX
|  95  |  149  |   ï   |         |   EF   |  00EF  | LATIN SMALL LETTER I WITH DIAERESIS
|  96  |  150  |   ñ   |         |   F1   |  00F1  | LATIN SMALL LETTER N WITH TILDE
|  97  |  151  |   ó   |         |   F3   |  00F3  | LATIN SMALL LETTER O WITH ACUTE
|  98  |  152  |   ò   |         |   F2   |  00F2  | LATIN SMALL LETTER O WITH GRAVE
|  99  |  153  |   ô   |         |   F4   |  00F4  | LATIN SMALL LETTER O WITH CIRCUMFLEX
|  9A  |  154  |   ö   |         |   F6   |  00F6  | LATIN SMALL LETTER O WITH DIAERESIS
|  9B  |  155  |   õ   |         |   F5   |  00F5  | LATIN SMALL LETTER O WITH TILDE
|  9C  |  156  |   ú   |         |   FA   |  00FA  | LATIN SMALL LETTER U WITH ACUTE
|  9D  |  157  |   ù   |         |   F9   |  00F9  | LATIN SMALL LETTER U WITH GRAVE
|  9E  |  158  |   û   |         |   FB   |  00FB  | LATIN SMALL LETTER U WITH CIRCUMFLEX
|  9F  |  159  |   ü   |         |   FC   |  00FC  | LATIN SMALL LETTER U WITH DIAERESIS
•------•-------•-------•---------•--------•--------•---------------------------------------------
|  A0  |  160  |   †   |         |   86   |  2020  | DAGGER
|  A1  |  161  |   °   |         |   B0   |  00B0  | DEGREE SIGN
|  A2  |  162  |   ¢   |         |   A2   |  00A2  | CENT SIGN
|  A3  |  163  |   £   |         |   A3   |  00A3  | POUND SIGN
|  A4  |  164  |   §   |         |   A7   |  00A7  | SECTION SIGN
|  A5  |  165  |   •   |         |   95   |  2022  | BULLET
|  A6  |  166  |   ¶   |         |   B6   |  00B6  | PILCROW SIGN
|  A7  |  167  |   ß   |         |   DF   |  00DF  | LATIN SMALL LETTER SHARP S
|  A8  |  168  |   ®   |         |   AE   |  00AE  | REGISTERED SIGN
|  A9  |  169  |   ©   |         |   A9   |  00A9  | COPYRIGHT SIGN
|  AA  |  170  |   ™   |         |   99   |  2122  | TRADE MARK SIGN
|  AB  |  171  |   ´   |         |   B4   |  00B4  | ACUTE ACCENT
|  AC  |  172  |   ¨   |         |   A8   |  00A8  | DIAERESIS
|  AD  |  173  |   ≠   |   NO    |        |  2260  | NOT EQUAL TO
|  AE  |  174  |   Æ   |         |   C6   |  00C6  | LATIN CAPITAL LETTER AE
|  AF  |  175  |   Ø   |         |   D8   |  00D8  | LATIN CAPITAL LETTER O WITH STROKE
•------•-------•-------•---------•--------•--------•---------------------------------------------
|  B0  |  176  |   ∞   |   NO    |        |  221E  | INFINITY
|  B1  |  177  |   ±   |         |   B1   |  00B1  | PLUS-MINUS SIGN
|  B2  |  178  |   ≤   |   NO    |        |  2264  | LESS-THAN OR EQUAL TO
|  B3  |  179  |   ≥   |   NO    |        |  2265  | GREATER-THAN OR EQUAL TO
|  B4  |  180  |   ¥   |         |   A5   |  00A5  | YEN SIGN
|  B5  |  181  |   µ   |         |   B5   |  00B5  | MICRO SIGN
|  B6  |  182  |   ∂   |   NO    |        |  2202  | PARTIAL DIFFERENTIAL
|  B7  |  183  |   ∑   |   NO    |        |  2211  | N-ARY SUMMATION
|  B8  |  184  |   ∏   |   NO    |        |  220F  | N-ARY PRODUCT
|  B9  |  185  |   π   |   NO    |        |  03C0  | GREEK SMALL LETTER PI
|  BA  |  186  |   ∫   |   NO    |        |  222B  | INTEGRAL
|  BB  |  187  |   ª   |         |   AA   |  00AA  | FEMININE ORDINAL INDICATOR
|  BC  |  188  |   º   |         |   BA   |  00BA  | MASCULINE ORDINAL INDICATOR
|  BD  |  189  |   Ω   |   NO    |        |  03A9  | GREEK CAPITAL LETTER OMEGA
|  BE  |  190  |   æ   |         |   E6   |  00E6  | LATIN SMALL LETTER AE
|  BF  |  191  |   ø   |         |   F8   |  00F8  | LATIN SMALL LETTER O WITH STROKE
•------•-------•-------•---------•--------•--------•---------------------------------------------
|  C0  |  192  |   ¿   |         |   BF   |  00BF  | INVERTED QUESTION MARK
|  C1  |  193  |   ¡   |         |   A1   |  00A1  | INVERTED EXCLAMATION MARK
|  C2  |  194  |   ¬   |         |   AC   |  00AC  | NOT SIGN
|  C3  |  195  |   √   |   NO    |        |  221A  | SQUARE ROOT
|  C4  |  196  |   ƒ   |         |   83   |  0192  | LATIN SMALL LETTER F WITH HOOK
|  C5  |  197  |   ≈   |   NO    |        |  2248  | ALMOST EQUAL TO
|  C6  |  198  |   ∆   |   NO    |        |  2206  | INCREMENT
|  C7  |  199  |   «   |         |   AB   |  00AB  | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
|  C8  |  200  |   »   |         |   BB   |  00BB  | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
|  C9  |  201  |   …   |         |   85   |  2026  | HORIZONTAL ELLIPSIS
|  CA  |  202  |       |         |   A0   |  00A0  | NO-BREAK SPACE
|  CB  |  203  |   À   |         |   C0   |  00C0  | LATIN CAPITAL LETTER A WITH GRAVE
|  CC  |  204  |   Ã   |         |   C3   |  00C3  | LATIN CAPITAL LETTER A WITH TILDE
|  CD  |  205  |   Õ   |         |   D5   |  00D5  | LATIN CAPITAL LETTER O WITH TILDE
|  CE  |  206  |   Œ   |         |   8C   |  0152  | LATIN CAPITAL LIGATURE OE
|  CF  |  207  |   œ   |         |   9C   |  0153  | LATIN SMALL LIGATURE OE
•------•-------•-------•---------•--------•--------•---------------------------------------------
|  D0  |  208  |   –   |         |   96   |  2013  | EN DASH
|  D1  |  209  |   —   |         |   97   |  2014  | EM DASH
|  D2  |  210  |   “   |         |   93   |  201C  | LEFT DOUBLE QUOTATION MARK
|  D3  |  211  |   ”   |         |   94   |  201D  | RIGHT DOUBLE QUOTATION MARK
|  D4  |  212  |   ‘   |         |   91   |  2018  | LEFT SINGLE QUOTATION MARK
|  D5  |  213  |   ’   |         |   92   |  2019  | RIGHT SINGLE QUOTATION MARK
|  D6  |  214  |   ÷   |         |   F7   |  00F7  | DIVISION SIGN
|  D7  |  215  |   ◊   |   NO    |        |  25CA  | LOZENGE
|  D8  |  216  |   ÿ   |         |   FF   |  00FF  | LATIN SMALL LETTER Y WITH DIAERESIS
|  D9  |  217  |   Ÿ   |         |   9F   |  0178  | LATIN CAPITAL LETTER Y WITH DIAERESIS
|  DA  |  218  |   ⁄   |   NO    |        |  2044  | FRACTION SLASH
|  DB  |  219  |   €   |         |   80   |  20AC  | EURO SIGN
|  DC  |  220  |   ‹   |         |   8B   |  2039  | SINGLE LEFT-POINTING ANGLE QUOTATION MARK
|  DD  |  221  |   ›   |         |   9B   |  203A  | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
|  DE  |  222  |   ﬁ   |   NO    |        |  FB01  | LATIN SMALL LIGATURE FI
|  DF  |  223  |   ﬂ   |   NO    |        |  FB02  | LATIN SMALL LIGATURE FL
•------•-------•-------•---------•--------•--------•---------------------------------------------
|  E0  |  224  |   ‡   |         |   87   |  2021  | DOUBLE DAGGER
|  E1  |  225  |   ·   |         |   B7   |  00B7  | MIDDLE DOT
|  E2  |  226  |   ‚   |         |   82   |  201A  | SINGLE LOW-9 QUOTATION MARK
|  E3  |  227  |   „   |         |   84   |  201E  | DOUBLE LOW-9 QUOTATION MARK
|  E4  |  228  |   ‰   |         |   89   |  2030  | PER MILLE SIGN
|  E5  |  229  |   Â   |         |   C2   |  00C2  | LATIN CAPITAL LETTER A WITH CIRCUMFLEX
|  E6  |  230  |   Ê   |         |   CA   |  00CA  | LATIN CAPITAL LETTER E WITH CIRCUMFLEX
|  E7  |  231  |   Á   |         |   C1   |  00C1  | LATIN CAPITAL LETTER A WITH ACUTE
|  E8  |  232  |   Ë   |         |   CB   |  00CB  | LATIN CAPITAL LETTER E WITH DIAERESIS
|  E9  |  233  |   È   |         |   C8   |  00C8  | LATIN CAPITAL LETTER E WITH GRAVE
|  EA  |  234  |   Í   |         |   CD   |  00CD  | LATIN CAPITAL LETTER I WITH ACUTE
|  EB  |  235  |   Î   |         |   CE   |  00CE  | LATIN CAPITAL LETTER I WITH CIRCUMFLEX
|  EC  |  236  |   Ï   |         |   CF   |  00CF  | LATIN CAPITAL LETTER I WITH DIAERESIS
|  ED  |  237  |   Ì   |         |   CC   |  00CC  | LATIN CAPITAL LETTER I WITH GRAVE
|  EE  |  238  |   Ó   |         |   D3   |  00D3  | LATIN CAPITAL LETTER O WITH ACUTE
|  EF  |  239  |   Ô   |         |   D4   |  00D4  | LATIN CAPITAL LETTER O WITH CIRCUMFLEX
•------•-------•-------•---------•--------•--------•---------------------------------------------
|  F0  |  240  |       |   NO    |        |  F8FF  | APPLE LOGO
|  F1  |  241  |   Ò   |         |   D2   |  00D2  | LATIN CAPITAL LETTER O WITH GRAVE
|  F2  |  242  |   Ú   |         |   DA   |  00DA  | LATIN CAPITAL LETTER U WITH ACUTE
|  F3  |  243  |   Û   |         |   DB   |  00DB  | LATIN CAPITAL LETTER U WITH CIRCUMFLEX
|  F4  |  244  |   Ù   |         |   D9   |  00D9  | LATIN CAPITAL LETTER U WITH GRAVE
|  F5  |  245  |   ı   |   NO    |        |  0131  | LATIN SMALL LETTER DOTLESS I
|  F6  |  246  |   ˆ   |         |   88   |  02C6  | MODIFIER LETTER CIRCUMFLEX ACCENT
|  F7  |  247  |   ˜   |         |   98   |  02DC  | SMALL TILDE
|  F8  |  248  |   ¯   |         |   AF   |  00AF  | MACRON
|  F9  |  249  |   ˘   |   NO    |        |  02D8  | BREVE
|  FA  |  250  |   ˙   |   NO    |        |  02D9  | DOT ABOVE
|  FB  |  251  |   ˚   |   NO    |        |  02DA  | RING ABOVE
|  FC  |  252  |   ¸   |         |   B8   |  00B8  | CEDILLA
|  FD  |  253  |   ˝   |   NO    |        |  02DD  | DOUBLE ACUTE ACCENT
|  FE  |  254  |   ˛   |   NO    |        |  02DB  | OGONEK
|  FF  |  255  |   ˇ   |   NO    |        |  02C7  | CARON
•------•-------•-------•---------•--------•--------•---------------------------------------------
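For anyone who wants to cross-check a row of this table, here is a minimal Python 3 sketch. It assumes that Python's built-in mac_roman and cp1252 codecs match the table above; the function name table_row is just an illustrative choice, not something from the thread.

```python
import unicodedata

def table_row(mac_byte):
    """Build one table row for a Mac OS Roman byte value (0x80-0xFF)."""
    # Decode the single byte to find the character it stands for
    glyph = bytes([mac_byte]).decode("mac_roman")
    try:
        # Windows-1252 code of the same character, if it has one
        cp1252_hex = "%02X" % glyph.encode("cp1252")[0]
    except UnicodeEncodeError:
        cp1252_hex = None  # corresponds to the "NO" rows of the table
    # Private-use characters (e.g. the Apple logo) have no Unicode name
    name = unicodedata.name(glyph, "(no Unicode name)")
    return ("%02X" % mac_byte, mac_byte, glyph, cp1252_hex,
            "%04X" % ord(glyph), name)

if __name__ == "__main__":
    for byte in range(0x80, 0x90):
        print(table_row(byte))
```

A value of None in the fourth position means the character cannot be stored in a Windows-1252 document.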
IMPORTANT : I continue in another post, below, because a post can’t hold more than 16384 characters !!
guy038
-
So, Valerio,
I slightly improved the above table by adding the corresponding Windows-1252 hex code of each character, so that it can be correctly displayed in a document with an ANSI or Windows-1252 encoding. For instance, the Mac OS Roman hex value 80 represents the Ä character, which must be replaced with the hex code \xC4.
Note that some characters of the MAC OS Roman encoding DON’T exist in the Windows-1252 encoding. These are the characters :
[\xAD\xB0\xB2\xB3\xB6\xB7\xB8\xB9\xBA\xBD\xC3\xC5\xC6\xD7\xDA\xDE\xDF\xF0\xF5\xF9\xFA\xFB\xFD\xFE\xFF]
For these characters, the mention NO has been added in the fourth column, and NO corresponding Windows-1252 hex value is given in the fifth column.
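The list of characters missing from Windows-1252 can be recomputed rather than typed by hand. The following is a small Python 3 sketch, assuming Python's own mac_roman and cp1252 codec tables; missing_in_cp1252 is a hypothetical helper name.

```python
def missing_in_cp1252():
    """Return the Mac OS Roman byte values whose character has no Windows-1252 code."""
    missing = []
    for byte in range(0x80, 0x100):
        char = bytes([byte]).decode("mac_roman")
        try:
            char.encode("cp1252")
        except UnicodeEncodeError:
            missing.append(byte)
    return missing

if __name__ == "__main__":
    # Print the result in the same \xNN notation as the character class above
    print("[" + "".join("\\x%02X" % b for b in missing_in_cp1252()) + "]")
```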
I found an awful, but correct, regex, which converts a MAC OS Roman text into a Windows-1252 text. Basically, this regex finds two types of characters :
- Any character code of the form \xnn, which is changed into its corresponding Windows-1252 code, in order to get the same character glyph. For instance, the hex code (\x80) ( group 1 ) is replaced with \xC4, thanks to the replacement form (?1\xC4)
- Any character code from the list ([\xAD\xB0\xB2\xB3\xB6\xB7\xB8\xB9\xBA\xBD\xC3\xC5\xC6\xD7\xDA\xDE\xDF\xF0\xF5\xF9\xFA\xFB\xFD\xFE\xFF]) ( last group 104 ), which DOESN’T have any corresponding code in the Windows-1252 encoding, and is replaced with the usual question mark character ?, of hex code \x3F
Of course, any character of code < \x80 is NOT changed at all !
Note that the (?-i) form, at the beginning of the search regex, forces the regex engine to take case into account ( NON insensitive ), even if you didn’t check the Match case option.
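The same two rules (remap every character that exists in Windows-1252, turn the others into a question mark) can be sketched in a few lines of Python 3. This is only an illustration under the assumption that Python's mac_roman and cp1252 codecs agree with the table; it is not the regex itself, and mac_roman_to_cp1252 is a made-up name.

```python
def mac_roman_to_cp1252(data):
    """Re-encode Mac OS Roman bytes as Windows-1252 bytes.

    Characters with no Windows-1252 equivalent become b'?' (\\x3F),
    and bytes below \\x80 pass through unchanged.
    """
    text = data.decode("mac_roman")
    return text.encode("cp1252", errors="replace")

if __name__ == "__main__":
    # \x8f is 'è' in Mac OS Roman, \xad is the NOT EQUAL TO sign
    print(mac_roman_to_cp1252(b"caff\x8f \xad"))
```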
So, Valerio, follow the few steps below :
- Open Notepad++
- Open a new document ( CTRL + N )
- If necessary, choose the ANSI encoding ( Menu option Encoding - Convert to ANSI )
- Copy your MAC OS Roman text into this new document
--> At this point, some accented characters should still be displayed wrongly !
- Move back to the very beginning of the file ( CTRL + Home )
- Open the Replace dialog ( CTRL + H )
- Select the Regular expression search mode
- In the Find what field, type the regex, below :
(?-i)(\x80)|(\x81)|(\x82)|(\x83)|(\x84)|(\x85)|(\x86)|(\x87)|(\x88)|(\x89)|(\x8A)|(\x8B)|(\x8C)|(\x8D)|(\x8E)|(\x8F)|(\x90)|(\x91)|(\x92)|(\x93)|(\x94)|(\x95)|(\x96)|(\x97)|(\x98)|(\x99)|(\x9A)|(\x9B)|(\x9C)|(\x9D)|(\x9E)|(\x9F)|(\xA0)|(\xA1)|(\xA2)|(\xA3)|(\xA4)|(\xA5)|(\xA6)|(\xA7)|(\xA8)|(\xA9)|(\xAA)|(\xAB)|(\xAC)|(\xAE)|(\xAF)|(\xB1)|(\xB4)|(\xB5)|(\xBB)|(\xBC)|(\xBE)|(\xBF)|(\xC0)|(\xC1)|(\xC2)|(\xC4)|(\xC7)|(\xC8)|(\xC9)|(\xCA)|(\xCB)|(\xCC)|(\xCD)|(\xCE)|(\xCF)|(\xD0)|(\xD1)|(\xD2)|(\xD3)|(\xD4)|(\xD5)|(\xD6)|(\xD8)|(\xD9)|(\xDB)|(\xDC)|(\xDD)|(\xE0)|(\xE1)|(\xE2)|(\xE3)|(\xE4)|(\xE5)|(\xE6)|(\xE7)|(\xE8)|(\xE9)|(\xEA)|(\xEB)|(\xEC)|(\xED)|(\xEE)|(\xEF)|(\xF1)|(\xF2)|(\xF3)|(\xF4)|(\xF6)|(\xF7)|(\xF8)|(\xFC)|([\xAD\xB0\xB2\xB3\xB6\xB7\xB8\xB9\xBA\xBD\xC3\xC5\xC6\xD7\xDA\xDE\xDF\xF0\xF5\xF9\xFA\xFB\xFD\xFE\xFF])
- In the Replace with field, type the regex, below :
(?1\xC4)(?2\xC5)(?3\xC7)(?4\xC9)(?5\xD1)(?6\xD6)(?7\xDC)(?8\xE1)(?9\xE0)(?10\xE2)(?11\xE4)(?12\xE3)(?13\xE5)(?14\xE7)(?15\xE9)(?16\xE8)(?17\xEA)(?18\xEB)(?19\xED)(?20\xEC)(?21\xEE)(?22\xEF)(?23\xF1)(?24\xF3)(?25\xF2)(?26\xF4)(?27\xF6)(?28\xF5)(?29\xFA)(?30\xF9)(?31\xFB)(?32\xFC)(?33\x86)(?34\xB0)(?35\xA2)(?36\xA3)(?37\xA7)(?38\x95)(?39\xB6)(?40\xDF)(?41\xAE)(?42\xA9)(?43\x99)(?44\xB4)(?45\xA8)(?46\xC6)(?47\xD8)(?48\xB1)(?49\xA5)(?50\xB5)(?51\xAA)(?52\xBA)(?53\xE6)(?54\xF8)(?55\xBF)(?56\xA1)(?57\xAC)(?58\x83)(?59\xAB)(?60\xBB)(?61\x85)(?62\xA0)(?63\xC0)(?64\xC3)(?65\xD5)(?66\x8C)(?67\x9C)(?68\x96)(?69\x97)(?70\x93)(?71\x94)(?72\x91)(?73\x92)(?74\xF7)(?75\xFF)(?76\x9F)(?77\x80)(?78\x8B)(?79\x9B)(?80\x87)(?81\xB7)(?82\x82)(?83\x84)(?84\x89)(?85\xC2)(?86\xCA)(?87\xC1)(?88\xCB)(?89\xC8)(?90\xCD)(?91\xCE)(?92\xCF)(?93\xCC)(?94\xD3)(?95\xD4)(?96\xD2)(?97\xDA)(?98\xDB)(?99\xD9)(?{100}\x88)(?{101}\x98)(?{102}\xAF)(?{103}\xB8)(?{104}\x3F)
- Click on the Replace All button
Et voilà ! This time, after that S/R, the text should be correctly displayed :-)) Then :
- Select the Menu option Encoding - Convert to UTF-8 OR Encoding - Convert to UTF-8 BOM
- Finally, save this changed file !
Best Regards,
guy038
P.S. :
BTW, Claudia, if you see this post : I saw a Python script called mac_roman.py in the folder …\Plugins\PythonScript\lib\encodings. Unfortunately, I couldn’t make it work. I suppose that it changes a MAC Roman text into a standard UTF-8 text !
-
Hello guy038,
this file isn’t supposed to be used directly. It’s part of the codecs module, which uses it internally when you specify its codec. E.g.

    import codecs

    # Read the file, decoding its bytes as Mac OS Roman
    with codecs.open(r'd:\macroman.txt', 'r', encoding='macroman') as fin:
        file_content = fin.read()

    # Write the same text to a new file, encoded as UTF-8
    with codecs.open(r'd:\macroman_utf8.txt', 'w', encoding='utf-8') as fout:
        fout.write(file_content)

The first with block opens a file which is assumed to be macroman encoded, reads it, saves its content in the variable file_content and closes the file automatically.
The next with block writes the file with utf-8 encoding and, again, closes it automatically.
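To see the codec at work without touching any file, here is a small in-memory sketch in Python 3 (an assumption, since the PythonScript plugin may run a different Python version). The sample bytes are invented for the illustration.

```python
import codecs

# 'macroman' is an alias; the codecs registry resolves it to the mac_roman module
info = codecs.lookup("macroman")

raw = b"perch\x8e"              # the word 'perché' as Mac OS Roman bytes
text = raw.decode("macroman")   # bytes -> text, using the Mac OS Roman table
utf8 = text.encode("utf-8")     # text -> UTF-8 bytes

if __name__ == "__main__":
    print(info.name, repr(text), repr(utf8))
```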
Cheers
Claudia -
Thank you very much, guy038.
I’m sure that’d do the trick. I’m just surprised we don’t have an option for Mac Roman right in the menu. Do you think there’s a reason for that? It’s just another encoding method, right? Is it fundamentally different from any of the iso-8859-x? -
Hi, Valerio,
“Do you think there’s a reason for that?”
Maybe it’s just because Notepad++ is rather “Windows oriented” !
“Is it fundamentally different from any of the iso-8859-x?”
Not at all. It’s just another encoding, like all the others !
You may also ask for the MAC Roman encoding to be added to N++, at the address below :
https://github.com/notepad-plus-plus/notepad-plus-plus/pulls
However, as you can see, you have to be ( very ) patient :-(( There are plenty of requests, in that place !!
Cheers,
guy038
-
At least there is a code page for it on Windows, Code Page 10000 Macintosh Roman:
https://msdn.microsoft.com/en-us/library/cc195076.aspx
so adding support should not be too complicated. -
I’m trying to find a way to convert to macRoman. (needed for embedding subtitles into quicktime videos)
I am trying to use the python scripting option.
using:

    import codecs

    with codecs.open(r'd:\utf8.txt', 'r', encoding='utf-8') as fin:
        file_content = fin.read()
    with codecs.open(r'd:\macroman.txt', 'w', encoding='macroman') as fout:
        fout.write(file_content)

I end up with macroman.txt encoded as ANSI, and empty.
Any help here would be much appreciated.
-
What about printing file_content to the PythonScript console, using
console.write(file_content)
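Along the same lines, it may help to check whether the text contains anything that Mac OS Roman cannot represent. One possible cause of an empty output file (an assumption, nothing in the thread confirms it) is a UnicodeEncodeError raised during the write, for instance because of a leading UTF-8 BOM. The sketch below is plain Python 3, and both helper names are hypothetical.

```python
def unencodable_chars(text, encoding="mac_roman"):
    """List the characters of text that the target encoding cannot represent."""
    bad = []
    for char in text:
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            bad.append(char)
    return bad

def utf8_bytes_to_mac_roman(data):
    """Convert UTF-8 bytes to Mac OS Roman bytes.

    'utf-8-sig' drops a leading BOM if there is one; characters that
    Mac OS Roman lacks are replaced with '?' instead of raising an error.
    """
    text = data.decode("utf-8-sig")
    return text.encode("mac_roman", errors="replace")

if __name__ == "__main__":
    sample = b"\xef\xbb\xbfcaff\xc3\xa8"   # BOM + 'caffè' in UTF-8
    print(unencodable_chars(sample.decode("utf-8")))
    print(utf8_bytes_to_mac_roman(sample))
```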
Cheers
Claudia -
I did find an Encoding option for it, but it’s buried where you wouldn’t expect!
Encoding > Character Sets > Cyrillic > Macintosh
Presto, all the Õ symbols become ’ symbols the way they were typed on the Mac!
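The Õ / ’ swap can be reproduced with a short Python 3 sketch, assuming the file was written as Mac OS Roman and then read as Windows-1252. The sample string is invented for the illustration.

```python
typed_on_mac = "l’età"

# Bytes as a Mac would store them: ’ becomes \xd5 and à becomes \x88
raw = typed_on_mac.encode("mac_roman")

# The same bytes interpreted as Windows-1252 show different characters
misread = raw.decode("cp1252")

if __name__ == "__main__":
    print(repr(raw), misread)
```

Decoding raw with mac_roman instead of cp1252 gives back the text as it was typed.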