Cyrillic UCase to Lcase in the middle and end of words



  • Hi there,

    I have a text file with a bunch of words in Cyrillic with capital letters in the middle or end of the words like this:

    абигЭль
    зИппо
    клубЫ
    бегОм
    надЕяться…

    I’m trying the following Regex to lowercase ONLY the capitalized characters:

    Find: ( \w+)([\x{0410}-\x{042F}])(\w+)
    Replace with: \1\L\2\E\3

    Apparently, the Find expression works but the same is not true for Replace with.

    Can you help with this one? Thanks!!!



  • Hello, @Mrsimurq,

    Unfortunately, there is no regex solution :-(( The N++ build of the Boost C++ Regex library still contains some bugs, and this is one of them!

    The case modifiers ( \l, \u, \L and \U ), used in the replacement part, work only on characters with a Unicode code point ≤ \x{007F}, that is to say, only on the unaccented letters [A-Za-z] :-(( Really bad!

    For instance, consider the French text below, all in upper-case, pasted in a new tab:

    C'EST LÀ, PRÈS DE LA FORÊT, DANS UN GÎTE, OÙ RÉGNAIT UN GRAND CAPHARNAÜM, QUE L'AÏEUL ÔTA SA FLÛTE ET SON BÂTON DE SON CANOË
    

    The regex S/R with SEARCH (?s).+ and REPLACE \L$0 would give the text:

    c'est lÀ, prÈs de la forÊt, dans un gÎte, oÙ rÉgnait un grand capharnaÜm, que l'aÏeul Ôta sa flÛte et son bÂton de son canoË
    

    Note that all the accented characters are still in upper-case!
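
    To illustrate the difference outside Notepad++, here is a small Python sketch. It does not reproduce the Boost code itself; the helper name `ascii_only_lower` is hypothetical and simply mimics a lowercase conversion limited to the letters A-Z, next to Python's Unicode-aware `str.lower()`.

```python
def ascii_only_lower(text: str) -> str:
    """Lowercase only A-Z; every non-ASCII character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


sample = "PRÈS DE LA FORÊT"

# ASCII-limited conversion: the accented capitals survive, as in the S/R above.
print(ascii_only_lower(sample))  # prÈs de la forÊt

# Unicode-aware conversion: every letter is lowercased.
print(sample.lower())            # près de la forêt
```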


    Now, assuming the Cyrillic text :

        Upper-Case         Lower-case           Your example
                                             
        АБИГЭЛЬ            абигэль              абигЭль
        ЗИППО              зиппо                зИппо
        КЛУБЫ              клубы                клубЫ
        БЕГОМ              бегом                бегОм
        НАДЕЯТЬСЯ…         надеяться…           надЕяться…
    

    The above S/R would get :

        upper-case         lower-case           your example
                                             
        АБИГЭЛЬ            абигэль              абигЭль
        ЗИППО              зиппо                зИппо
        КЛУБЫ              клубы                клубЫ
        БЕГОМ              бегом                бегОм
        НАДЕЯТЬСЯ…         надеяться…           надЕяться…
    

    Ironically, only the title line is lower-cased! None of the Cyrillic characters, whose Unicode values lie between \x{0400} and \x{04FF}, are converted.


    However, if you manually select any amount of text, with either a normal or a rectangular selection, you can convert it:

    • in UPPER-case, with the command menu Edit > Convert Case to > UPPERCASE or Ctrl + Shift + U

    • in lower-case, with the command menu Edit > Convert Case to > lowercase or Ctrl + U
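
    If the file can be processed outside Notepad++, a short script is another possible route. The sketch below is only an assumption about how that could look in Python: the function name `lower_inner_capitals` is hypothetical, and the character class covers А-Я plus Ё. It lowercases a Cyrillic capital only when it directly follows another word character, so a capital at the start of a word is kept.

```python
import re

# An uppercase Cyrillic letter (А-Я or Ё) that directly follows a word
# character, i.e. a capital in the middle or at the end of a word.
INNER_CAPITAL = re.compile(r"(?<=\w)[\u0410-\u042F\u0401]")


def lower_inner_capitals(text: str) -> str:
    """Lowercase Cyrillic capitals that are not the first letter of a word."""
    return INNER_CAPITAL.sub(lambda m: m.group(0).lower(), text)


words = ["абигЭль", "зИппо", "клубЫ", "бегОм", "надЕяться…"]
for word in words:
    print(lower_inner_capitals(word))
```

    Note that a fully upper-case word such as АБИГЭЛЬ would come out as Абигэль with this pattern, since only its first letter is not preceded by a word character.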

    Best Regards,

    guy038



  • guy038, thanks for your prompt and plain reply! :))

    I can only hope that this issue will be solved asap…

    As a temporary workaround, I simply S/R the above capitals, which are in fact all vowels, one by one. Eight in total, not that difficult… So, my Replace with expression for Э -> э looks like: \1э\3
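
    For anyone repeating this one-by-one approach, the list of pairs could be generated rather than typed. The Python sketch below is only an illustration: it collects the capitals from the five sample words in the first post and prints one literal Find/Replace pair per letter. The simplified pattern `(\w)X` is an assumption, not the exact expression used above, and it has not been tested in Notepad++.

```python
import re

samples = ["абигЭль", "зИппо", "клубЫ", "бегОм", "надЕяться…"]

# Collect every Cyrillic capital that appears right after a word character.
capitals = sorted({
    m.group(0)
    for word in samples
    for m in re.finditer(r"(?<=\w)[\u0410-\u042F\u0401]", word)
})

# One literal Find/Replace pair per capital, so no case modifier is needed.
pairs = [(rf"(\w){c}", rf"\1{c.lower()}") for c in capitals]

for find, replace in pairs:
    print(f"Find: {find}    Replace with: {replace}")
```

    Because each match consumes the preceding character, two capitals in a row may need a second pass.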

    Thanks!

