    Convert subrip file

    • Bernard Danino

      Hello,
      How can I convert a SubRip (text-based .srt) file that is ANSI-encoded and written in French, and therefore contains accented characters, to a UTF-8 file without accents?

      • PeterJones (in reply to @Bernard Danino)

        @Bernard-Danino,

        The first step is easy: if you have an ANSI-encoded file (I am assuming Windows-1252), Notepad++ can open it directly. Then use Encoding > Convert to UTF-8 to convert it to UTF-8. Save it (I would recommend under a new name or in a new directory, so that you don’t lose your original in case something goes wrong).
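For anyone who wants to script the same conversion outside Notepad++, here is a minimal Python sketch. It makes the same assumption as above, namely that "ANSI" means Windows-1252 (`cp1252` in Python). The function names `recode_cp1252_to_utf8` and `convert_file` are made up for illustration.

```python
from pathlib import Path


def recode_cp1252_to_utf8(data: bytes) -> bytes:
    """Decode bytes as Windows-1252 (assumed source encoding) and re-encode as UTF-8."""
    # Raises UnicodeDecodeError if the data contains bytes that are
    # undefined in Windows-1252, which suggests the assumption was wrong.
    return data.decode("cp1252").encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Write a UTF-8 copy of src to dst, leaving the original file untouched."""
    Path(dst).write_bytes(recode_cp1252_to_utf8(Path(src).read_bytes()))
```

As with the manual steps, writing to a separate destination keeps the original file safe if the encoding guess turns out to be wrong.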

        Notepad++'s regular expression syntax supports equivalence classes, each of which matches a letter and all its accented variants. So [[=a=]] would match a plain a as well as any of àáâãäå. You could therefore do a series of search/replace operations, one equivalence class at a time. Unfortunately, equivalence classes are also case-insensitive, so [[=a=]] also matches the uppercase versions ÀÁÂÃÄÅ. If you searched with it and replaced with a, every uppercase variant (including a plain A) would become lowercase. My guess is this would be a deal breaker for you. (Checking “Match Case” will not prevent that, nor will using (?-i) to make the regex case-sensitive.)

        As an alternative, you could make your own set-based character class, [àáâãäå], check “Match Case”, and replace with a; that would un-accent all the lowercase accented a variants. (If French doesn’t use all of those, you could make a shorter list in the set.) You would then repeat this with a separate set for each other letter and for each case.
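The set-based approach can be sketched in Python as well, purely as an illustration of the idea and not as a Notepad++ feature. Python's `re` module is case-sensitive by default, so each set touches only the case it lists. The helper name `deaccent_a` is hypothetical.

```python
import re


def deaccent_a(text: str) -> str:
    """Replace accented a/A variants with the plain letter, one case at a time."""
    # One character set per case, mirroring one search/replace pass each.
    text = re.sub(r"[àáâãäå]", "a", text)
    return re.sub(r"[ÀÁÂÃÄÅ]", "A", text)
```

Other accented letters (ô, é, and so on) are left alone here; each would need its own set, just like in the search/replace dialog.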

        You could also use alternation and capture groups in your search expression, and conditional substitution in your replacement expression, and build up something that can do the de-accent in one fell swoop. (You could even record it as a macro, so that you can assign a keystroke to your “deaccent macro”)

        For example,

        • FIND = (?-i:([àáâãäå])|([ÀÁÂÃÄÅ])|([èéêë]))
          • I included the (?-i:...) wrapper to make it case-sensitive, even if you forget to check Match Case
        • REPLACE = (?{1}a)(?{2}A)(?{3}e)
          • this says: if group 1 (the lowercase accented a’s) matched, replace with a; if group 2 (the uppercase accented A’s) matched, replace with A; if group 3 (the lowercase accented e’s) matched, replace with e
        • SEARCH MODE = Regular Expression
        • REPLACE multiple times, or REPLACE ALL

        Hopefully, you can see how to expand my example to include other accented characters – put each list of accents in a ([...]) separated by |, and add a new (?{###}x) replacement for each.
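To show the same "one group per letter, one replacement per group" logic outside the Replace dialog, here is a rough Python equivalent. It is only a sketch: Python has no (?{1}a) conditional replacement syntax, so a replacement function that looks up the matched group number stands in for it. The names `PATTERN`, `REPLACEMENTS` and `deaccent` are invented for this example.

```python
import re

# Same three groups, in the same order, as the FIND expression above.
PATTERN = re.compile(r"([àáâãäå])|([ÀÁÂÃÄÅ])|([èéêë])")

# Group number -> plain letter, playing the role of (?{1}a)(?{2}A)(?{3}e).
REPLACEMENTS = {1: "a", 2: "A", 3: "e"}


def deaccent(text: str) -> str:
    """Replace each accented character with the plain letter of its group."""
    # Exactly one top-level group participates in any match, so
    # match.lastindex is the number of the group that matched.
    return PATTERN.sub(lambda m: REPLACEMENTS[m.lastindex], text)
```

Extending it works the same way as in Notepad++: add another ([...]) alternative to the pattern and a matching entry in the mapping. Characters not listed in any group, such as ô, pass through unchanged.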

        All of these assume you are using Search Mode = Regular Expression.

        See also:

        • https://community.notepad-plus-plus.org/topic/18870/search-accent-insensitive
        • https://community.notepad-plus-plus.org/topic/22938/search-for-accented-words