Problems with bidirectional text & find/replace
-
I can’t tell if this is a bug, or if I’m using the tools incorrectly. I have an XML document with great big blobs of JavaScript inside it, and I’ve used a professional translator with a TM tool to get the vast majority of the text in the XML into Pashto. However, the tool didn’t pick up all of the bits in JavaScript, so I usually go through and do a little bit of find/change work. Usually, there are three or four terms that I have to change manually, but this time, there are 108. So I attempted to use the multi find-change syntax:
find: ("EngPhraseOne")|("EngPhraseTwo")|("EngPhraseThree \(hasParentheticalNote\)")
change: (?1"PashtoPhraseOne")(?2"PashtoPhraseTwo")(?3"PashtoPhraseThree \(hasParentheticalNote\)")
So, that works fine in any of the languages that I usually handle, so long as I keep the change field under roughly two thousand characters. But I just can’t get the Pashto to work. I’d be done by now, if I’d just copied and pasted all 108 strings, but I have the Urdu and the Dari and Arabic yet to do.
Here’s the “before”:
//+ Condition1: ListField("$Node1","textselected","JJIS number:")
//+ Condition1: ListField("$Node1","textselected","Choose one")
//+ Condition1: ListField("$Node1","textselected","Social Security number:")
//+ Condition1: ListField("$Node1","textselected","State ID (SID):")
Here’s some intended “after”:
//+ Condition1: ListField("$Node1","textselected","د JJIS شمیره:")
//+ Condition1: ListField("$Node1","textselected","یو يې غوره کړئ")
//+ Condition1: ListField("$Node1","textselected","د ټولنیز مصؤنیت لمبر:")
//+ Condition1: ListField("$Node1","textselected","د ایالت ID (SID):")
Here are my queries:
find: ("State ID \(SID\):")|("JJIS number:")|("Choose one")|("Social Security number:")|("Case number:")|("Prime ID:")|("Other:")|("Case consultation")|("Other \(please list specific information below\)")|("Court reports")
replace: (?1"د ایالت ID \(SID\):")(?2"د JJIS شمیره:")(?3"یو يې غوره کړئ")(?4"د ټولنیز مصؤنیت لمبر:")(?5"د قضیې شمېره:")(?6"د پرایم ID:")(?7"نور:")(?8"د قضیې مشاوره")(?9"\نور (مهرباني وکړئ مشخص معلومات لاندې لست\ کړئ)")(?10"د محکمې راپورونه")
I suspect that this is a bidi text bug, because every single one of those Pashto phrases looks fine in Notepad when they’re each on a line by themselves, but when I concatenate 'em for the purpose of putting 'em in the “Replace with” field, stuff starts… happening. Text jumps around, stuff that should be rendered right next to an opening parenthesis (like ?7") gets rendered next to the closing parenthesis, and so on. But I don’t know! However, when I run the query, here’s what I get instead of my desired output:
//+ Condition1: ListField("$Node1","textselected",""د JJIS شمیره"""")
//+ Condition1: ListField("$Node1","textselected","""یو يې غوره کړئ""""")
//+ Condition1: ListField("$Node1","textselected","""د ټولنیز مصؤنیت لمبر""")
//+ Condition1: ListField("$Node1","textselected","د ایالت ID (SID)""""")
So, am I somehow building my query incorrectly? Have I found a bug? That’s a lot of quote marks.
Um… and if I have found a bug, can anyone out there suggest a text editor with some kind of multi find-change functionality that I can use? Today? I haven’t used sed in, uh, more than fifteen years. And I only have Windows installed here, although I might have to install another OS if that’d be faster than copy/pasting 600-odd phrases. Maybe I can pull this off with PowerShell?
-
Sorry for the delayed response. Apparently no one reading the Forum last week felt they understood enough about bidirectional text to be able to sufficiently answer you. Unfortunately, I don’t know enough about it to help you completely, either, which is why I didn’t chime in last week…
Normally, I don’t like letting questions go unanswered for more than a day or two before I stir the pot with a post like this one, but I hadn’t gone back through the recent posts to look for unanswered questions in a few weeks.
so long as I keep the change field under roughly two thousand characters.
Yes, there is a limit of about 2000 characters for search/replace expressions.
find: ("State ID \(SID\):")|("JJIS number:")…
replace: (?1"د ایالت ID \(SID\):")(?2"د JJIS شمیره:")…
Based on the fact that when I took even that smaller two-choice alternation and tried it, it replaced the JJIS differently than if I did just
"JJIS number:" => "د JJIS شمیره:"
I think you are right that there is some bidi oddity occurring.
If you want to present it as a bug (see the FAQ), do the most limited case possible: show that if you do the two replacements
"State ID \(SID\):" => "د ایالت ID \(SID\):"
"JJIS number:" => "د JJIS شمیره:"
separately, they work as expected, but that if you try the two-token replacement that I tried, it replaces things differently.
But while waiting for a resolution (if it ever comes… bidir is hard to get right), I would think that it would be just as fast to do the replacements one-at-a-time rather than trying the big alternation-regex, even if there weren’t a bug. The amount of time it takes to wrap the replacements seems like it would outweigh the extra time it takes to reset the GUI dialog between each individual replacement.
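For reference, here is what the two-token alternation is meant to do, sketched in plain Python rather than in Notepad++'s own regex engine: whichever numbered group matched in the find expression selects the corresponding replacement. The names `FIND`, `REPLACEMENTS`, and `pick` are made up for this illustration, and Python's `re` module is only standing in for the Find/Replace dialog, so this shows the intended result and does not reproduce the bug.

```python
# -*- coding: utf-8 -*-
import re

# Same shape as the two-choice find expression: either group 1 or group 2 matches.
FIND = r'("State ID \(SID\):")|("JJIS number:")'

# Hypothetical lookup standing in for the (?1...)(?2...) replace expression.
REPLACEMENTS = {
    1: u'"د ایالت ID (SID):"',
    2: u'"د JJIS شمیره:"',
}

def pick(match):
    # lastindex is the number of the top-level group that took part in the match.
    return REPLACEMENTS[match.lastindex]

before = u'ListField("$Node1","textselected","JJIS number:")'
after = re.sub(FIND, pick, before)
```

If the dialog's output for the same input differs from `after`, that difference is the minimal case worth putting in a bug report.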
I vaguely remember that one of the regulars – @Alan-Kilborn probably – had posted a script for the PythonScript plugin that made doing a list of SEARCH => REPLACE pairs easier, especially if they don’t fit within the character limits for the search/replace fields. But I haven’t been able to find it, and might be mis-remembering who it was (or confusing it with some other similar script that I vaguely remember seeing in my half-decade in the forum).
Given the sense of urgency portrayed in your post last week, I am guessing there’s a good chance you abandoned Notepad++ in the interim… but if you’re still looking for a solution, and you’re willing to try a script in PythonScript, let us know, and if Alan cannot dig one up (or code one up) for you, I would probably find the time in the next few days to see what I could hack together. If you have abandoned Notepad++, I’m sorry if our lack of response led to that.
-
@peterjones said in Problems with bidirectional text & find/replace:
But I haven’t been able to find it,
I still couldn’t find it, and I wanted to see if it really was as easy to implement as I thought it might be, and whether it would give the correct results (as compared to the bulk regex with alternation, which does not).
# encoding=utf-8
"""in response to https://community.notepad-plus-plus.org/topic/23007/

This will set up a translation dictionary, where the key is the "from" and the value is the "to".
The r'' notation will be used so that you can use regex metacharacters in either the key or value

Define your translation regular expressions here:
"""
translation = {
    r'"JJIS number:"': r'"د JJIS شمیره:"',
    r'"Choose one"': r'"یو يې غوره کړئ"',
    r'"Social Security number:"': r'"د ټولنیز مصؤنیت لمبر:"',
    r'"State ID \(SID\):"': r'"د ایالت ID \(SID\):"'
}

from Npp import editor,notepad,console

class TranslationBot(object):
    def go(self):
        global translation
        editor.beginUndoAction()
        for srch, repl in translation.items():
            editor.rereplace( srch, repl )
        editor.endUndoAction()

TranslationBot().go()
With the script as shown, I was able to go from
//+ Condition1: ListField("$Node1","textselected","JJIS number:")
//+ Condition1: ListField("$Node1","textselected","Choose one")
//+ Condition1: ListField("$Node1","textselected","Social Security number:")
//+ Condition1: ListField("$Node1","textselected","State ID (SID):")
to
//+ Condition1: ListField("$Node1","textselected","د JJIS شمیره:")
//+ Condition1: ListField("$Node1","textselected","یو يې غوره کړئ")
//+ Condition1: ListField("$Node1","textselected","د ټولنیز مصؤنیت لمبر:")
//+ Condition1: ListField("$Node1","textselected","د ایالت ID (SID):")
… which I believe matches your desired results.
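To sanity-check a translation table outside Notepad++, the same one-pair-at-a-time idea can be sketched with Python's standard `re` module on an in-memory string. This is not the PythonScript script itself: `PAIRS`, `apply_translations`, and `sample` are names invented for the illustration, and Python's `re` is a different engine from the one Notepad++ uses, so replacement-string escapes may not behave identically. To sidestep that, the sketch returns each replacement from a function, which inserts it literally.

```python
# -*- coding: utf-8 -*-
import re

# Ordered (search regex, literal replacement) pairs; a list keeps the order fixed.
PAIRS = [
    (r'"JJIS number:"', u'"د JJIS شمیره:"'),
    (r'"State ID \(SID\):"', u'"د ایالت ID (SID):"'),
]

def apply_translations(text, pairs):
    """Run each search/replace pair over the whole text, one pair at a time."""
    for pattern, replacement in pairs:
        # A function replacement is inserted as-is, with no escape processing.
        text = re.sub(pattern, lambda m, r=replacement: r, text)
    return text

sample = (
    u'//+ Condition1: ListField("$Node1","textselected","JJIS number:")\n'
    u'//+ Condition1: ListField("$Node1","textselected","State ID (SID):")\n'
)
result = apply_translations(sample, PAIRS)
```

Because every pair is applied independently, there is no long combined replace expression, which is the part that appeared to misbehave with right-to-left text.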
Instructions
Installation
- Install PythonScript plugin
  - Plugins > Plugins Admin
  - checkmark PythonScript
  - click Install
- Create a new script
  - Plugins > Python Script > New script
  - Give it the name TranslationBot.py
  - It should be going in your “user script” directory, which is usually %AppData%\Notepad++\Plugins\Config\PythonScript\scripts\ … If it doesn’t default to that location, then change to the correct location
  - Save
- Populate the script
  - Copy the text from the script above, verbatim
  - Paste in the TranslationBot.py file
  - Save
- If you want to give it a keyboard shortcut:
  - Plugins > Python Script > Configuration…
  - Select User Scripts
  - Select TranslationBot.py
  - Click the left Add to add the script to the Menu items table
  - OK
  - Exit Notepad++ completely and restart the application
  - Plugins > Python Script will now list TranslationBot
  - Settings > Shortcut Mapper
  - select the Plugin commands tab
  - Filter = TranslationBot
  - Click on TranslationBot in the list, Modify, and set the shortcut as desired, OK
  - Click Close
  - Now that shortcut will be assigned to the TranslationBot script
Usage
- Edit the script to customize the translation dictionary
  - Plugins > Python Script > Scripts (or Plugins > Python Script, if you did the keyboard-shortcut step of the installation), then Ctrl+Click on TranslationBot to edit the script
  - Edit the translation = { ... } section to give your mapping of English => New Language
  - Save
- Open the file you want to translate
- Run the script to perform the translation, using one of the three choices below:
  - Type the shortcut sequence (if you did the keyboard-shortcut step of the installation)
  - or Plugins > Python Script > TranslationBot (if you did the keyboard-shortcut step of the installation)
  - or Plugins > Python Script > Scripts > TranslationBot (whether or not you did the keyboard-shortcut step)
-
@peterjones I had found a similar script in my searching on that day, but I cannot recall exactly why I couldn’t get it to work. I will install yours and try it out.
At any rate, I did not actually abandon Notepad++ over this; I just moved the batching process out of npp and into AutoHotkey, where I mostly automated the one-string-at-a-time pasting process into npp. If I did it one term at a time, there weren’t any problems.
The thing that brought me back to this thread, actually, was a more thorough bug report, with sample files that can be used (at least on my install) to reproduce the issue. It has something to do with the “Replace All” function. I nailed it down with these Lao files I’m currently working on. If I watch every single “Replace” operation, every one works without issue. If I do a “Replace All” with dozens of terms in Spanish or Italian or whatever, it goes off without a hitch. But if I do a Replace All in Lao or Arabic, Very Weird Things happen. Large chunks of non-matching text vanish during a Replace All, in my Lao file right here.
Okay! I’ve installed the plugin, I’ve installed your script, and massaged my data into your data format (I really don’t get Python, what are all of those "r"s for?) and ran the script. I get this on the console:
SyntaxError: Non-ASCII character '\xe0' in file C:\Users\Yah Shoor\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\TranslationBot.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
So I dutifully followed the link, where I read:
"In Python 2.1, Unicode literals can only be written using the Latin-1 based encoding “unicode-escape”. This makes the programming environment rather unfriendly to Python users who live and work in non-Latin-1 locales such as many of the Asian countries."
I then modified your script to declare UTF-8 in the header:
# -*- coding: utf-8 -*-
which seems to do the trick.
-
@house-emptier said in Problems with bidirectional text & find/replace:
what are all of those "r"s for?
r'...' marks the string as a “raw string”, which means that backslashes and the like will be left as backslashes, rather than as Python escape sequences. (Needed if your translation regex has a \ in it somewhere, so that the regex engine gets the backslashed text, rather than whatever Python might turn it into.)
I then modified your script to declare UTF-8 in the header … which seems to do the trick.
Weird, PythonScript’s Python 2.7 has always recognized my # encoding=utf-8 variant of the header just fine for me (I am not a Python language expert, and I couldn’t immediately find the reference to link you to, but I know that my form of the encoding header has worked correctly for me, even with the exact script I pasted above). I’m not sure why the error shows up on “line 2” in your error message, because line 2 is part of the docstring. I wonder if you just didn’t see that first line when you scrolled the black box. But I’m glad changing the header worked for you.
which seems to do the trick.
… which I hope also means that the script does what you need it to. If so, Great!
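To make the raw-string point concrete, here is a small sketch in plain Python (nothing Notepad++-specific; the variable names are just for illustration). It uses `\b`, which Python reads as a backspace character in an ordinary string but which a regex engine reads as a word boundary when the backslash survives.

```python
import re

plain = '\bword'   # Python turns '\b' into one backspace character
raw = r'\bword'    # raw string: the backslash and the 'b' are both kept

plain_length = len(plain)   # 5 characters: backspace + 'word'
raw_length = len(raw)       # 6 characters: '\', 'b', 'w', 'o', 'r', 'd'

# Only the raw version reaches the regex engine as "word boundary, then word".
raw_matches = re.search(raw, 'a word') is not None
plain_matches = re.search(plain, 'a word') is not None
```

The same reasoning applies to the keys and values of the translation dictionary: the `r` prefix means what you type is what the regex engine receives.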
-
According to http://python.org/dev/peps/pep-0263/, your encoding line matches the given regex’s group 1 with utf-8, which certainly seems fine to me.
I think you are saying that eventually you got the script to work, but that Notepad++'s Replace All still doesn’t work correctly in the circumstances you describe, correct?
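A quick way to check that claim is to run both header variants from this thread through the pattern. The sketch below uses the regular expression as I recall it from PEP 263, so treat the exact pattern as an assumption and compare it against the PEP; `CODING_RE`, `headers`, and `found` are illustrative names.

```python
import re

# Pattern as given in PEP 263 (assumed from memory; verify against the PEP).
# It is applied to the first or second line of a source file.
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

headers = [
    '# encoding=utf-8',           # the variant used in the posted script
    '# -*- coding: utf-8 -*-',    # the Emacs-style variant
]

# Group 1 is the encoding name that Python would pick up from each header.
found = [CODING_RE.match(line).group(1) for line in headers]
```

Both variants yield the same encoding name, which fits with the original error having come from the header line being left out when the script was copied.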
-
@alan-kilborn That is correct. I can successfully use Notepad++ to do my find/replace task… IFF I watch every term get replaced. If I use Replace All, with a long “Replace with…” string that has complex/SE Asian script in it, then there seems to be some kind of greediness problem that leads to non-matching text being deleted.
@peterjones Yes, thanks for posting that Python script, it was very useful, and I was able to use it to skip a whole bunch of manual copy/pasting. Anyhow, w/r/t the Unicode declaration in the header: I see it now, the problem was between my chair & my keyboard. I think I deliberately excluded your discussion that was surrounded by triple quotes, which I’m guessing is the Python code-comment convention? … Yep, it is! So when I copied your code, I purposefully skipped all of your discussion-in-comments, so I missed the header entirely, due to my Python ignorance.
-