    Problems with bidirectional text & find/replace

    • House Emptier

      I can’t tell if this is a bug, or if I’m using the tools incorrectly. I have an XML document with great big blobs of JavaScript inside it, and I’ve used a professional translator with a TM tool to get the vast majority of the text in the XML into Pashto. However, the tool didn’t pick up all of the bits in JavaScript, so I usually go through and do a little bit of find/change work. Usually, there are three or four terms that I have to change manually, but this time, there are 108. So I attempted to use the multi find-change syntax:

      find: ("EngPhraseOne")|("EngPhraseTwo")|("EngPhraseThree \(hasParenthedicalNote\)")
      change: (?1"PashtoPhraseOne")(?2"PashtoPhraseTwo")(?3"PashtoPhraseThree \(hasParenthedicalNote\)")
      

      So, that works fine in any of the languages that I usually handle, so long as I keep the change field under roughly two thousand characters. But I just can’t get the Pashto to work. I’d be done by now, if I’d just copied and pasted all 108 strings, but I have the Urdu and the Dari and Arabic yet to do.

      Here’s the “before”:

      //+ Condition1: ListField("$Node1","textselected","JJIS number:")
      //+ Condition1: ListField("$Node1","textselected","Choose one")
      //+ Condition1: ListField("$Node1","textselected","Social Security number:")
      //+ Condition1: ListField("$Node1","textselected","State ID (SID):")
      

      Here’s some intended “after”:

      //+ Condition1: ListField("$Node1","textselected","د JJIS شمیره:")
      //+ Condition1: ListField("$Node1","textselected","یو يې غوره کړئ")
      //+ Condition1: ListField("$Node1","textselected","د ټولنیز مصؤنیت لمبر:")
      //+ Condition1: ListField("$Node1","textselected","د ایالت ID (SID):")
      

      Here are my queries:

      find: ("State ID \(SID\):")|("JJIS number:")|("Choose one")|("Social Security number:")|("Case number:")|("Prime ID:")|("Other:")|("Case consultation")|("Other \(please list specific information below\)")|("Court reports")
      
      replace: (?1"د ایالت ID \(SID\):")(?2"د JJIS شمیره:")(?3"یو يې غوره کړئ")(?4"د ټولنیز مصؤنیت لمبر:")(?5"د قضیې شمېره:")(?6"د پرایم ID:")(?7"نور:")(?8"د قضیې مشاوره")(?9"\نور (مهرباني وکړئ مشخص معلومات لاندې لست\ کړئ)")(?10"د محکمې راپورونه")
      

      I suspect that this is a bidi text bug, because every single one of those Pashto phrases looks fine in Notepad when they’re each on a line by themselves, but when I concatenate 'em for the purpose of putting 'em in the “Replace with” field, stuff starts… happening. Text jumps around, stuff that should be rendered right next to an opening parenthesis (like ?7") gets rendered next to the closing parenthesis, and so on. But I don’t know! However, when I run the query, here’s what I get instead of my desired output:

      //+ Condition1: ListField("$Node1","textselected",""د JJIS شمیره"""")
      //+ Condition1: ListField("$Node1","textselected","""یو يې غوره کړئ""""")
      //+ Condition1: ListField("$Node1","textselected","""د ټولنیز مصؤنیت لمبر""")
      //+ Condition1: ListField("$Node1","textselected","د ایالت ID (SID)""""")
      

      So, am I somehow building my query incorrectly? Have I found a bug? That’s a lot of quote marks.

      Um… and if I have found a bug, can anyone out there suggest a text editor with some kind of multi find-change functionality that I can use? Today? I haven’t used sed in, uh, more than fifteen years. And I only have Windows installed here, although I might have to install another OS if that’d be faster than copy/pasting 600-odd phrases. Maybe I can pull this off with Powershell?

      • PeterJones @House Emptier

        @house-emptier ,

        Sorry for the delayed response. Apparently no one reading the Forum last week felt they understood enough about bidirectional text to be able to sufficiently answer you. Unfortunately, I don’t know enough about it to help you completely, either – and that’s why I didn’t chime in last week…

        Normally, I don’t like letting questions go unanswered for more than a day or two before I stir the pot with a post like this one, but I hadn’t gone back through the recent posts to look for unanswered questions in a few weeks.

        so long as I keep the change field under roughly two thousand characters.

        Yes, there is a limit of about 2000 characters for search/replace expressions.

        find: ("State ID \(SID\):")|("JJIS number:") …
        replace: (?1"د ایالت ID \(SID\):")(?2"د JJIS شمیره:") …

        Based on the fact that when I took even that smaller two-choice alternation and tried it, it replaced the JJIS differently than if I did just "JJIS number:" => "د JJIS شمیره:", I think you are right that there is some bidi oddity occurring.
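        For comparison, here is a rough sketch of what the two-choice alternation is meant to produce, written in plain Python with the standard `re` module. This runs outside Notepad++ and does not use its Boost regex engine, so it says nothing about the suspected bug; it only shows the intended result. The names `pick` and `translate` are made up for this illustration.

```python
# -*- coding: utf-8 -*-
import re

# One capture group per English phrase, joined with "|" as in the Find field.
pattern = re.compile(r'("State ID \(SID\):")|("JJIS number:")')

# Replacement for group 1, then group 2 (same order as the groups above).
replacements = [
    '"د ایالت ID (SID):"',
    '"د JJIS شمیره:"',
]

def pick(match):
    # match.lastindex is the 1-based number of the group that matched,
    # which plays the role of (?1...) / (?2...) in the Replace field.
    return replacements[match.lastindex - 1]

def translate(text):
    return pattern.sub(pick, text)

sample = 'ListField("$Node1","textselected","JJIS number:")'
print(translate(sample))
```

        Each match is replaced by exactly one translated phrase, with one pair of quote marks and nothing else touched, which is the behavior the Replace All run should have shown.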

        If you want to present it as a bug (see the FAQ), do the most limited case possible – show that if you do the two replacements "State ID \(SID\):" => "د ایالت ID \(SID\):" and "JJIS number:" => "د JJIS شمیره:" separately, they work as expected, but that if you try the two-token replacement that I tried, it replaces things differently.

        But while waiting for a resolution (if it ever comes… bidir is hard to get right), I would think that it would be just as fast to do the replacements one-at-a-time rather than trying the big alternation-regex, even if there weren’t a bug. The amount of time it takes to wrap the replacements seems like it would outweigh the extra time it takes to reset the GUI dialog between each individual replacement.

        I vaguely remember that one of the regulars – @Alan-Kilborn probably – had posted a script for the PythonScript plugin that made doing a list of SEARCH => REPLACE pairs easier, especially if they don’t fit within the character limits for the search/replace fields. But I haven’t been able to find it, and might be mis-remembering who it was (or confusing it with some other similar script that I vaguely remember seeing in my half-decade in the forum).

        Given the sense of urgency portrayed in your post last week, I am guessing there’s a good chance you abandoned Notepad++ in the interim… but if you’re still looking for a solution, and you’re willing to try a script in PythonScript, let us know, and if Alan cannot dig one up (or code one up) for you, I would probably find the time in the next few days to see what I could hack together. If you have abandoned Notepad++, I’m sorry if our lack of response led to that.

        • PeterJones @PeterJones

          @peterjones said in Problems with bidirectional text & find/replace:

          But I haven’t been able to find it,

          I still couldn’t find it, and I wanted to see if it really was as easy to implement as I thought it might be, and whether it would give the correct results (as compared to the bulk regex with alternation, which does not).

          # encoding=utf-8
          """in response to https://community.notepad-plus-plus.org/topic/23007/
          
          This will set up a translation dictionary, where the key is the "from" and the value is the "to".
          The r'' notation will be used so that you can use regex metacharacters in either the key or value
          
          Define your translation regular expressions here:
          """
          translation = {
              r'"JJIS number:"':              r'"د JJIS شمیره:"',
              r'"Choose one"':                r'"یو يې غوره کړئ"',
              r'"Social Security number:"':   r'"د ټولنیز مصؤنیت لمبر:"',
              r'"State ID \(SID\):"':         r'"د ایالت ID \(SID\):"'
          }
          
          from Npp import editor,notepad,console
          
          class TranslationBot(object):
              def go(self):
                  global translation
                  editor.beginUndoAction()
                  for srch, repl in translation.items():
                      editor.rereplace( srch, repl )
                  editor.endUndoAction()
          
          TranslationBot().go()
          

          With the script as shown, I was able to go from

          //+ Condition1: ListField("$Node1","textselected","JJIS number:")
          //+ Condition1: ListField("$Node1","textselected","Choose one")
          //+ Condition1: ListField("$Node1","textselected","Social Security number:")
          //+ Condition1: ListField("$Node1","textselected","State ID (SID):")
          

          to

          //+ Condition1: ListField("$Node1","textselected","د JJIS شمیره:")
          //+ Condition1: ListField("$Node1","textselected","یو يې غوره کړئ")
          //+ Condition1: ListField("$Node1","textselected","د ټولنیز مصؤنیت لمبر:")
          //+ Condition1: ListField("$Node1","textselected","د ایالت ID (SID):")
          

          … which I believe matches your desired results.
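          If typing 108 entries into the dictionary by hand is a chore, something like the following might help. It is only a sketch in plain Python (not the PythonScript `editor` API), and it assumes the phrase pairs live in a tab-separated list, one `English<TAB>Translation` per line; that file layout and the helper names `load_pairs` and `apply_pairs` are my own invention for this example.

```python
# -*- coding: utf-8 -*-
import re

def load_pairs(lines):
    """Turn 'English<TAB>Translation' lines into (compiled regex, replacement) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        source, target = line.split("\t", 1)
        # re.escape makes the English phrase match literally,
        # so "(SID)" needs no hand-written backslashes.
        pairs.append((re.compile(re.escape('"' + source + '"')), '"' + target + '"'))
    return pairs

def apply_pairs(text, pairs):
    for regex, target in pairs:
        # A function replacement keeps the target text literal
        # (no backslash or group-reference processing).
        text = regex.sub(lambda m, t=target: t, text)
    return text
```

          Because the search side is escaped and the replacement side is inserted literally, this version treats every phrase as plain text rather than as a regex, which differs from the script above, where both sides are regex syntax.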

          Instructions

          Installation

          1. Install PythonScript plugin
            • Plugins > Plugins Admin
            • checkmark PythonScript
            • click Install
          2. Create a new script
            • Plugins > Python Script > New script
            • Give it the name TranslationBot.py
            • It should be going in your “user script” directory, which is usually %AppData%\Notepad++\Plugins\Config\PythonScript\scripts\ … If it doesn’t default to that location, then change to the correct location
            • Save
          3. Populate the script
            • Copy the text from the script above, verbatim
            • Paste in the TranslationBot.py file
            • Save
          4. If you want to give it a keyboard shortcut:
            • Plugins > Python Script > Configuration…
            • Select User Scripts
            • Select TranslationBot.py
            • Click the left Add to add the script to the Menu items table
            • OK
            • Exit Notepad++ completely and restart the application
            • Plugins > Python Script will now list TranslationBot
            • Settings > Shortcut Mapper
              • select the Plugin commands tab
              • Filter = TranslationBot
              • Click on TranslationBot in the list, Modify, and set the shortcut as desired, OK
              • Click Close
            • Now that shortcut will be assigned to the TranslationBot script

          Usage

          1. Edit the script to customize the translation dictionary
            • Plugins > Python Script > Scripts
              (or Plugins > Python Script if you followed installation step 4 above)
            • Ctrl+Click on TranslationBot to edit the script
            • Edit the translation = { ... } section to give your mapping of English => New Language
            • Save
          2. Open the file you want to translate
          3. Run the script to perform the translation, using one of the three choices below:
            • Type the shortcut sequence (if you followed installation step 4)
            • or Plugins > Python Script > TranslationBot (if you followed installation step 4)
            • or Plugins > Python Script > Scripts > TranslationBot (whether or not you followed installation step 4)

          • House Emptier @PeterJones

            @peterjones I had found a similar script in my searching on that day, but I cannot recall exactly why I couldn’t get it to work. I will install yours and try it out.

            At any rate, I did not actually abandon Notepad++ over this; I just moved the batching process out of npp and into AutoHotkey, where I mostly automated the one-string-at-a-time pasting process into npp. If I did it one term at a time, there weren’t any problems.

            The thing that brought me back to this thread, actually, was a more thorough bug report, with sample files that can be used (at least on my install) to reproduce the issue. It has something to do with the “Replace All” function. I nailed it down with these Lao files I’m currently working on. If I watch every single “Replace” operation, every one works without issue. If I do a “Replace All” with dozens of terms in Spanish or Italian or whatever, it goes off without a hitch. But if I do a Replace All in Lao or Arabic, Very Weird Things happen. Large chunks of non-matching text vanish during a Replace All, in my Lao file right here.

            Okay! I’ve installed the plugin, I’ve installed your script, and massaged my data into your data format (I really don’t get Python, what are all of those "r"s for?) and ran the script. I get this on the console:

            SyntaxError: Non-ASCII character '\xe0' in file C:\Users\Yah Shoor\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\TranslationBot.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
            

            So I dutifully followed the link, where I read:
            "In Python 2.1, Unicode literals can only be written using the Latin-1 based encoding “unicode-escape”. This makes the programming environment rather unfriendly to Python users who live and work in non-Latin-1 locales such as many of the Asian countries. "

            I then modified your script to declare UTF-8 in the header:

            # -*- coding: utf-8 -*-
            

            which seems to do the trick.

            • PeterJones @House Emptier

              @house-emptier said in Problems with bidirectional text & find/replace:

              what are all of those "r"s for?

              r'...' marks the string as a “raw string”, which means that backslashes and the like will be left as backslashes, rather than being treated as Python escape sequences. (Needed if your translation regex has a \ in it somewhere, so that the regex engine gets the backslashed text, rather than whatever Python might turn it into.)
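              A small illustration of the difference, in plain Python (the variable names are only for this example):

```python
import re

plain = "\t"   # Python turns this into a single tab character
raw = r"\t"    # raw string: a backslash followed by the letter t
print(len(plain), len(raw))

# In a raw string the backslashes reach the regex engine untouched,
# so \( and \) are read there as literal parentheses.
pattern = r'"State ID \(SID\):"'
print(pattern)
print(re.search(pattern, 'x "State ID (SID):" y') is not None)
```

              The first string has length 1 and the raw one has length 2, and the raw pattern matches the literal text with parentheses.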

              I then modified your script to declare UTF-8 in the header … which seems to do the trick.

              Weird, PythonScript’s Python 2.7 has always recognized my # encoding=utf-8 variant of the header just fine for me (I am not a Python language expert, and I couldn’t immediately find the reference to link you to, but I know that my form of the encoding header has worked correctly for me – even with the exact script I pasted above). I’m not sure why the error shows up on “line 2” in your error message, because line 2 is part of the docstring – I wonder if you just didn’t see that first line when you scrolled the black box. But I’m glad changing the header worked for you.

              which seems to do the trick.

              … which I hope also means that the script does what you need it to. If so, Great!

              • Alan Kilborn @PeterJones

                @peterjones

                According to http://python.org/dev/peps/pep-0263/, your encoding line matches the given regex’s group 1 with utf-8 which certainly seems fine to me.
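                A quick way to check this is to run both header styles through that regex. The pattern below is copied from PEP 263 as best I recall it, so treat it as an assumption and compare it against the PEP itself; `declared_encoding` is just a helper name for this sketch.

```python
import re

# Encoding-declaration regex as given in PEP 263 (quoted from memory;
# verify against the PEP before relying on it).
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(line):
    """Return the encoding named on this line, or None if there is none."""
    m = CODING_RE.match(line)
    return m.group(1) if m else None

for header in ("# encoding=utf-8",
               "# -*- coding: utf-8 -*-",
               '"""in response to ...'):
    print(repr(header), "->", declared_encoding(header))
```

                Both comment forms yield utf-8, while the docstring line yields nothing. That fits with a script whose first line was left out: the file then starts with the docstring and has no declaration at all.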

                @House-Emptier

                I think you are saying that eventually you got the script to work, but that Notepad++'s Replace All still doesn’t work correctly in the circumstances you describe, correct?

                • House Emptier @Alan Kilborn

                  @alan-kilborn That is correct. I can successfully use Notepad++ to do my find/replace task… IFF I watch every term get replaced. If I use Replace All, with a long “Replace with…” string that has complex/SE Asian script in it, then there seems to be some kind of greediness problem that leads to non-matching text being deleted.

                  @peterjones Yes, thanks for posting that Python script, it was very useful, and I was able to use it to skip a whole bunch of manual copy/pasting. Anyhow, w/r/t the Unicode declaration in the header: I see it now, the problem was between my chair & my keyboard. I think I deliberately excluded your discussion that was surrounded by triple quotes, which I’m guessing is the Python code-comment convention? … Yep, it is! So when I copied your code, I purposefully skipped all of your discussion-in-comments, so I missed the header entirely, due to my Python ignorance.
