Convert subrip file

Bernard Danino

Hello,
How can I convert a SubRip (text-based .srt) file that is ANSI-encoded and in French, and therefore contains accented characters, to a UTF-8 file without accents?

PeterJones

@Bernard-Danino ,

The first step is easy: if you have an ANSI-encoded file (I am assuming Windows-1252), Notepad++ can open it directly. Then use Encoding > Convert to UTF-8 and save the result (I would recommend under a new name or in a new directory, so that you don’t lose your original in case something goes wrong).
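If you ever need to do the same conversion outside Notepad++, here is a minimal Python sketch. It assumes, as above, that the file really is Windows-1252; the function name and file names are just placeholders I made up for illustration.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode a text file as UTF-8, writing to a new path so the original is kept.

    source_encoding defaults to Windows-1252, which is an assumption about the input.
    """
    # Work on raw bytes so that the CRLF line endings of the .srt file are left untouched.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
    return text
```

For example, `convert_to_utf8("movie.srt", "movie-utf8.srt")` leaves `movie.srt` alone and writes the UTF-8 copy next to it.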

Notepad++'s regular expression syntax supports equivalence classes, each of which matches a letter and all of its accented variants. So [[=a=]] would match any of àáâãäå, that is, any accented a. You could therefore do a series of search/replaces, one equivalence class at a time. Unfortunately, equivalence classes are also case-insensitive, so [[=a=]] also matches all the uppercase versions ÀÁÂÃÄÅ. If you ran that match and replaced with a, it would take all the uppercase variants (including a plain A) and make them lowercase. My guess is this would be a deal breaker for you. (Clicking “Match Case” will not prevent that, nor will using (?-i) to make the regex case-sensitive.)

As an alternative, you could just make your own set-based character class, [àáâãäå], checkmark “Match Case”, and replace with a, and that would un-accent all the accented a variants. (If French doesn’t use all those, you could make a shorter list in the set.)
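To show what that set-based replacement does, here is a small Python sketch of the same idea. Python's `re` module is case-sensitive by default, which plays the role of “Match Case”; the helper name is hypothetical, and the sketch assumes the text uses precomposed accented characters (which is what you get from a converted Windows-1252 file).

```python
import re


def unaccent_lower_a(text):
    """Replace each lowercase accented a with a plain 'a', leaving uppercase letters alone."""
    # Case-sensitive by default, so ÀÁÂÃÄÅ and a plain A are not touched.
    return re.sub("[àáâãäå]", "a", text)
```

Calling `unaccent_lower_a("Voilà, À bientôt")` only changes the lowercase à; the À and the ô stay as they are, because they are not in the set.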

You could also use alternation and capture groups in your search expression, together with conditional substitution in your replacement expression, to build up something that does the whole de-accenting in one fell swoop. (You could even record it as a macro, so that you can assign a keystroke to your “deaccent macro”.)

For example,

- FIND = (?-i:([àáâãäå])|([ÀÁÂÃÄÅ])|([èéêë]))
  - I included the (?-i:...) wrapper to make it case-sensitive, even if you forget to checkmark Match Case
- REPLACE = (?{1}a)(?{2}A)(?{3}e)
  - this says: if group 1 (the lowercase accented a’s) matched, replace with a; if the uppercase accented A’s matched in group 2, replace with A; etc.
- SEARCH MODE = Regular Expression
- REPLACE multiple times, or REPLACE ALL

Hopefully, you can see how to expand my example to include other accented characters – put each list of accents in a ([...]) separated by |, and add a new (?{###}x) replacement for each.
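If it helps to see the "one group per replacement letter" logic spelled out, here is a rough Python sketch of the same one-pass idea. Python has no (?{1}a) conditional replacement syntax, so this uses a replacement function instead; the names `ACCENT_GROUPS`, `PATTERN`, and `deaccent` are mine, and the table only covers the three groups from the example above.

```python
import re

# Each pair is (accented characters, plain replacement).
# Illustrative only: extend the table with ç, î, ô, ù, and so on as needed.
ACCENT_GROUPS = [
    ("àáâãäå", "a"),
    ("ÀÁÂÃÄÅ", "A"),
    ("èéêë", "e"),
]

# Builds ([àáâãäå])|([ÀÁÂÃÄÅ])|([èéêë]), one capture group per table row.
PATTERN = re.compile("|".join(f"([{chars}])" for chars, _ in ACCENT_GROUPS))


def deaccent(text):
    """Replace every accented character listed in ACCENT_GROUPS in a single pass."""
    # Exactly one group takes part in each match; match.lastindex is its number.
    return PATTERN.sub(lambda m: ACCENT_GROUPS[m.lastindex - 1][1], text)
```

Adding a new row to the table is the equivalent of adding another ([...]) to FIND and another (?{###}x) to REPLACE. Anything not in the table (É, for instance) is left unchanged.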

All of these assume you are using Search Mode = Regular Expression.