Regex - How to keep only some parts of huge text

Tonda Ptáčník

Hello,
I have a huge amount of text and I’m trying to keep just a few parts of each line.
I tried to use [^(?=<![A-Za-z0-9 ?,?:?/?<?>?])]
to keep only the text between <! and >, but without success. The problem is that it ignores all letters, etc. everywhere.
Here is the problem. Image I need to keep only the yellow text on all lines.

Do you guys know how to do that?

guy038

Hello, @tonda-ptáčník,

Judging by your picture, I think that you’re speaking of the part of the text shown in a kind of orange color, <![CDATA[...........]]>, aren’t you ?

If so :

Open the Replace dialog
Type in ^.*(<!\[CDATA\[.*\]>) in the Find what: zone
Type in \1 in the Replace with: zone
Tick the Wrap around option
Select the Regular expression search mode
Click on the Replace All button
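
For anyone who would rather test the idea outside Notepad++, the same search/replace can be sketched with Python's `re` module. This is only an illustration and not part of the answer above: the sample line is a shortened version of a line from the XML shown later in this thread, and `keep_cdata` is a made-up helper name. Note that whatever follows `]]>` on the line (here `</phrase>`) is not part of the match, so it stays.

```python
import re

# Same pattern as in the "Find what:" zone.
# re.MULTILINE makes ^ match at the start of every line.
CDATA_RE = re.compile(r'^.*(<!\[CDATA\[.*\]>)', re.MULTILINE)

def keep_cdata(text):
    """Drop everything that precedes the <![CDATA[...]]> part on each line."""
    return CDATA_RE.sub(r'\1', text)

# Hypothetical, shortened sample line
sample = '  <phrase title="1_day_ago" addon_id="XF"><![CDATA[One day ago]]></phrase>'
print(keep_cdata(sample))  # <![CDATA[One day ago]]></phrase>
```

Lines without any `<![CDATA[` are left untouched, just as in the Replace dialog.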

Best Regards,

guy038

Tonda Ptáčník

@guy038 Thank you very much it helped me a lot.

But now I have another problem. Do you know how to replace a huge amount of text which is different on each line?
Here is what I need:
I need to edit this Image to look like this Image2 (I copied and pasted this by hand; the problem is that the file has over 6000 lines). The other problem is that I only have a text with just the translated sentence for each line. So the file which I want to copy from looks like this: Image3

Scott Sumner

@Tonda-Ptáčník

Suggestion: Why don’t you post text here instead of images?

If someone is willing to help, having some text to experiment with is something that might make them dive in and actually help. Nobody is going to retype text into Notepad++ after looking at some images with text.

The best way to post text here is to indent it with 4 spaces in Notepad++, then copy and paste here. Don’t make the amount of text too large or it won’t let you post it.

guy038

Hi, @tonda-ptáčník,

I understood what you want but I need some additional information :

Are the lines, which are to be processed, only of these two forms, below ?

<phrase title=......NOT changed.........<![CDATA[......To be TRANSLATED.......]]></phrase>

OR

<phrase title=........NOT changed......<![CDATA[.......To be TRANSLATED till the END of line ...............
<br />
<strong>......To be TRANSLATED.......</strong>]]></phrase>

Remarks :

To just insert normal text without any markdown syntax

Type, for instance :

~~~z
this text will NOT be interpreted
~~~

and it will be displayed as :

this text will NOT be interpreted

If you want to insert text with XML highlighting, simply use :

~~~xml
<!--
Use MenuEntryName and MenuItemName to localize your commands to add.
The values should be in English but not in translated language.
(You can set Notepad++ language back to English from Preferences dialog via menu "Settings->Preferences...")
-->
<Item MenuEntryName="Edit" MenuItemName="Cut"/>
<Item MenuEntryName="Edit" MenuItemName="Copy"/>
<Item MenuEntryName="Edit" MenuItemName="Paste"/>
<Item MenuEntryName="Edit" MenuItemName="Delete"/>
<Item MenuEntryName="Edit" MenuItemName="Select all"/>
<Item MenuEntryName="Edit" MenuItemName="Begin/End Select"/>
~~~

and you get :

		<!-- 
		Use MenuEntryName and MenuItemName to localize your commands to add. 
		The values should be in English but not in translated language.
		(You can set Notepad++ language back to English from Preferences dialog via menu "Settings->Preferences...")
		-->
        <Item MenuEntryName="Edit" MenuItemName="Cut"/>
        <Item MenuEntryName="Edit" MenuItemName="Copy"/>
        <Item MenuEntryName="Edit" MenuItemName="Paste"/>
        <Item MenuEntryName="Edit" MenuItemName="Delete"/>
        <Item MenuEntryName="Edit" MenuItemName="Select all"/>
        <Item MenuEntryName="Edit" MenuItemName="Begin/End Select"/>

Best Regards,

guy038

Tonda Ptáčník

@guy038 As I looked into the xml, the lines are in more forms.
Here is the xml which I need to edit https://url.cmgportal.cz/language-Cestina.xml

guy038

Hi, @tonda-ptáčník, and All,

I’m terribly sorry, @tonda-ptáčník, for answering so late, but I was quite busy these last few days ! In addition to some family events, I had to fix the tower desktop computer of a dance friend of my wife’s ! The operating system couldn’t start anymore.

Luckily, I could, first, save all the data ( about 50 GB ! ) on her husband’s laptop and, secondly, restore all the data on a new computer, via the network box !

However, just notice that, on this new desktop computer, bought in a very well-known French supermarket ( HP Pavilion Desktop PC 570-p015nf ) :

We, first, had to install Windows 10 by ourselves, after calling a hotline ! ( F11 key, at start-up )
I was very surprised to realize that the majority of the hard disk space ( 1 TB ) was NOT allocated !!?? ( C: ~ 50 GB, D: = Recovery ~ 10 GB, another small partition, and 826 GB not allocated yet ! ). So, with the disk manager, I created and formatted a partition in this free space and moved all the user’s folders ( documents, pictures, downloads … ) to this new partition

Really amazing how the final customer is treated when he doesn’t, necessarily, have all the technical skills to solve such things :-(( It’s a shame, really!

But, let’s get back to our subject. So, I began to study your XML file …

I assume that the title="......" attribute, right after <phrase, must not be translated. Only the zones <![CDATA[......]]> need to be translated into Czech. Am I right about it ?

Except for the line <?xml ....> and the two lines <language ......> and </language>, your file contains 6,424 lines, divided into :

90 single-line ranges <phrase> ......</phrase> without any text, so <![CDATA[]]> ( Case A )
5,750 single-line ranges <phrase> ......</phrase>, with raw text only inside <![CDATA[......]]> ( as, for instance, <![CDATA[One day ago]]> ) ( Case B )
161 single-line ranges <phrase> ......</phrase>, with more than raw text inside <![CDATA[........]]> ( as, for instance, <![CDATA[<a href="{board_url}">Visit {board_title}</a>]]> ) ( Case C )
68 multi-line ranges <phrase> ......</phrase>, for 423 lines ( Case D )
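
A breakdown like this one can be double-checked with a few lines of Python. The sketch below is only an assumption about how such a count could be made, not the way the figures above were obtained: `classify` is a made-up helper, the sample lines and their `title` values are hypothetical, and every line that is not a complete single-line range (the XML header, the language lines, every line of a multi-line range) simply falls into `other`.

```python
import re
from collections import Counter

# A complete single-line <phrase ...><![CDATA[...]]></phrase> range;
# group 1 is the CDATA body.
SINGLE_RE = re.compile(r'^  <phrase .+?<!\[CDATA\[(.*)\]\]></phrase>$')

def classify(line):
    """Return 'A', 'B', 'C' or 'other' for one line of the file."""
    m = SINGLE_RE.match(line)
    if m is None:
        return 'other'  # part of a multi-line range, or not a phrase line at all
    body = m.group(1)
    if body == '':
        return 'A'      # empty CDATA
    if re.search(r'[\[\]<]', body):
        return 'C'      # contains <, [ or ], so more than raw text
    return 'B'          # raw text only

# Hypothetical sample covering the four cases
sample = '''<?xml version="1.0" encoding="utf-8"?>
  <phrase title="empty"><![CDATA[]]></phrase>
  <phrase title="1_day_ago"><![CDATA[One day ago]]></phrase>
  <phrase title="visit"><![CDATA[<a href="{board_url}">Visit {board_title}</a>]]></phrase>
  <phrase title="multi"><![CDATA[First line<br />
<strong>Second line</strong>]]></phrase>'''

counts = Counter(classify(line) for line in sample.splitlines())
print(counts)
```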

I’m afraid that the 161 lines ( Case C ) and the 68 multi-line ranges ( Case D ) will have to be translated manually. Luckily, regarding the 5,750 lines of Case B, an automatic search/replacement can be considered !

Here is my method for processing these 5,750 translations automatically :

Firstly, we extract all the text which should be translated :

Copy your file, in a new tab
Open the Mark dialog ( Search > Mark... )
Type the search regex (?-s)^\x20\x20<phrase .+?<!\[CDATA\[\K[^]\r\n<[]+(?=\]\]></phrase>)
Tick the Bookmark line and the Purge for each search options
Click on the Mark all button
Remove All the unmarked lines with the menu command Search > Bookmark > Remove Unmarked Lines
Click on the Clear all marks button of the Mark dialog
Now, open the Replace dialog
Type in the SEARCH regex (?-s)^\x20\x20<phrase .+?<!\[CDATA\[|\]\]></phrase>
Leave the REPLACE zone EMPTY
Click on the Replace all button

=> You get exactly the 5,750 text lines which need translating
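
The extraction step can also be sketched in Python, for readers who want to try it on a copy of the file. Two things differ from the Notepad++ regexes and are assumptions of this sketch: Python's `re` has no `\K`, so the text to keep is captured in a group, and the character class is written with escaped brackets. `extract_case_b` and the sample lines are made up for illustration.

```python
import re

# Group 1 is the raw text inside <![CDATA[...]]>.
# The class excludes ], [, <, CR and LF, like [^]\r\n<[] in the Mark regex.
CASE_B_RE = re.compile(
    r'^  <phrase .+?<!\[CDATA\[([^\]\[\r\n<]+)\]\]></phrase>',
    re.MULTILINE,
)

def extract_case_b(xml_text):
    """Return the CDATA text of every single-line, raw-text phrase."""
    return CASE_B_RE.findall(xml_text)

# Hypothetical sample: one empty, one raw-text, one with markup, one multi-line
sample = '''  <phrase title="empty"><![CDATA[]]></phrase>
  <phrase title="1_day_ago"><![CDATA[One day ago]]></phrase>
  <phrase title="visit"><![CDATA[<a href="{board_url}">Visit</a>]]></phrase>
  <phrase title="multi"><![CDATA[First line<br />
<strong>Second line</strong>]]></phrase>'''

print(extract_case_b(sample))  # ['One day ago']
```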

Secondly, we replace each line with the same line written twice, the two copies being separated by the Black Square character ( \x{25A0} = ■ ), which allows you to easily see the part which needs translating !

SEARCH (?-s).+
REPLACE $0\x{25A0}$0
Tick the Wrap around option and the Regular expression search mode
Click on the Replace All button
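
The doubling step is simple enough to show as a tiny Python sketch. `duplicate_lines` is a hypothetical helper, and the input is assumed to be the list of extracted lines; empty lines are left as they are, since `.+` does not match them either.

```python
SEPARATOR = '\u25A0'  # the Black Square character

def duplicate_lines(lines):
    """Return each non-empty line twice, joined by the black square."""
    # Mirrors SEARCH (?-s).+ with REPLACE $0\x{25A0}$0
    return [f'{line}{SEPARATOR}{line}' if line else line for line in lines]

pairs = duplicate_lines(['One day ago', '', '1 time'])
print('\n'.join(pairs))
```

The right-hand copy of each line is the one to overwrite with the translation.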

Thirdly, and it’s the very big task : either manually or with the help of an online translator, build the corresponding list by changing the 5,750 sentences located after the ■ sign into Czech !

Fourthly, we add the original XML contents and replace every zone needing translation with its appropriate translation :

Move back to the very beginning of these sentences
Paste your original XML file contents, before the present first line
Right after, add a separator line, with, at least, 3 equal signs, ===
Open the Replace dialog
SEARCH (?-s)^(\x20\x20<phrase .+?<!\[CDATA\[)([^]\r\n<[]+)(\]\]></phrase>)(?=(?s).+^=.*?\R(?-s)\2\x{25A0}(.+))|(?s)^===+.+
REPLACE \1\4\3
Tick the Wrap around option and the Regular expression search mode
Click on the Replace All button

=> After some time…, any single-line <phrase.........</phrase> should contain the translated sentences, inside the <![CDATA[......]]> areas.

Important : When doing the test with your real data, it took about 9 minutes, on my old XP laptop, to process the 5,750 lines :-((. But I suppose that this S/R can be executed in less than 3 minutes on modern laptops !!
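
If the run time is a concern, the same merge can be sketched outside Notepad++ with a dictionary lookup instead of a look-ahead over the rest of the file, which avoids rescanning the list of pairs for every phrase. This is a different technique from the regex above, shown under assumptions: the pairs are in the `English■Czech` form built in the previous steps, and `load_pairs` and `apply_translations` are made-up names.

```python
import re

SEPARATOR = '\u25A0'  # the Black Square character

# Groups: 1 = everything up to <![CDATA[, 2 = raw text, 3 = closing part
PHRASE_RE = re.compile(
    r'^(  <phrase .+?<!\[CDATA\[)([^\]\[\r\n<]+)(\]\]></phrase>)',
    re.MULTILINE,
)

def load_pairs(pairs_text):
    """Build {english: czech} from lines of the form english + black square + czech."""
    table = {}
    for line in pairs_text.splitlines():
        english, sep, czech = line.partition(SEPARATOR)
        if sep:  # ignore lines without the separator
            table.setdefault(english, czech)  # the first pair wins
    return table

def apply_translations(xml_text, table):
    """Swap the CDATA text of each raw-text phrase for its translation."""
    def swap(m):
        # Text without a known translation is kept as it is
        return m.group(1) + table.get(m.group(2), m.group(2)) + m.group(3)
    return PHRASE_RE.sub(swap, xml_text)

# Hypothetical, shortened sample
xml = ('  <phrase title="1_day_ago"><![CDATA[One day ago]]></phrase>\n'
       '  <phrase title="1_time"><![CDATA[1 time]]></phrase>\n')
pairs_text = 'One day ago■Před jedním dnem\n1 time■1 krát\n'
result = apply_translations(xml, load_pairs(pairs_text))
print(result)
```

As with the regex, multi-line ranges and lines whose CDATA contains `<`, `[` or `]` are not touched.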

To give you a general idea, just consider the beginning of your XML file, with 12 single-line ranges <phrase.......</phrase> and 1 multi-line block <phrase.......</phrase>, followed by the 12 English/Czech pairs of sentences, after the separator line of equal signs :

<?xml version="1.0" encoding="utf-8"?>
<language title="Čeština" date_format="d.m.Y" time_format="H:i" currency_format="{value} {symbol}" week_start="0" decimal_point="." thousands_separator="," label_separator=":" comma_separator=", " ellipsis="..." parenthesis_open="(" parenthesis_close=")" language_code="cs-CZ" text_direction="LTR" export_version="2">
  <phrase title="1_day_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One day ago]]></phrase>
  <phrase title="1_month_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One month ago]]></phrase>
  <phrase title="1_time" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 time]]></phrase>
  <phrase title="1_week_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One week ago]]></phrase>
  <phrase title="1_year" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 year]]></phrase>
  <phrase title="1_year_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One year ago]]></phrase>
  <phrase title="2_weeks_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Two weeks ago]]></phrase>
  <phrase title="2_years_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Two years ago]]></phrase>
  <phrase title="2x_image_replacement_url" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[2x image replacement URL]]></phrase>
  <phrase title="2x_image_replacement_url_explain" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[If provided, the 2x image will be automatically displayed instead of the image URL above on devices capable of displaying a higher pixel resolution.<br />
<br />
<strong>Note: This option has no effect with sprite mode enabled.</strong>]]></phrase>
  <phrase title="3_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Three months ago]]></phrase>
  <phrase title="6_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Six months ago]]></phrase>
  <phrase title="9_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Nine months ago]]></phrase>
=============
One day ago■Před jedním dnem
One month ago■Před měcísem
1 time■1 krát
One week ago■Před týdnem
1 year■1 rok
One year ago■Před rokem
Two weeks ago■Před dvěma týdny
Two years ago■Před dvěma lety
2x image replacement URL■2x adresa URL pro výměnu obrázku
Three months ago■Před trěmi měcísi
Six months ago■Před šesti měcísi
Nine months ago■Před devíti měcísi

Open the Replace dialog
SEARCH (?-s)^(\x20\x20<phrase .+?<!\[CDATA\[)([^]\r\n<[]+)(\]\]></phrase>)(?=(?s).+^=.*?\R(?-s)\2\x{25A0}(.+))|(?s)^===+.+
REPLACE \1\4\3
Tick the Wrap around option and the Regular expression search mode
Click on the Replace All button

You should be left with the expected text :

<?xml version="1.0" encoding="utf-8"?>
<language title="Čeština" date_format="d.m.Y" time_format="H:i" currency_format="{value} {symbol}" week_start="0" decimal_point="." thousands_separator="," label_separator=":" comma_separator=", " ellipsis="..." parenthesis_open="(" parenthesis_close=")" language_code="cs-CZ" text_direction="LTR" export_version="2">
  <phrase title="1_day_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před jedním dnem]]></phrase>
  <phrase title="1_month_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před měcísem]]></phrase>
  <phrase title="1_time" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 krát]]></phrase>
  <phrase title="1_week_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před týdnem]]></phrase>
  <phrase title="1_year" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 rok]]></phrase>
  <phrase title="1_year_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před rokem]]></phrase>
  <phrase title="2_weeks_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před dvěma týdny]]></phrase>
  <phrase title="2_years_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před dvěma lety]]></phrase>
  <phrase title="2x_image_replacement_url" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[2x adresa URL pro výměnu obrázku]]></phrase>
  <phrase title="2x_image_replacement_url_explain" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[If provided, the 2x image will be automatically displayed instead of the image URL above on devices capable of displaying a higher pixel resolution.<br />
<br />
<strong>Note: This option has no effect with sprite mode enabled.</strong>]]></phrase>
  <phrase title="3_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před trěmi měcísi]]></phrase>
  <phrase title="6_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před šesti měcísi]]></phrase>
  <phrase title="9_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před devíti měcísi]]></phrase>

Just note, in this small example, that the multi-line block <phrase.......</phrase>, containing the text " If provided, the 2x image…sprite mode enabled. ", remains unchanged ( only single-line ranges are changed ! )

See you later,

Best Regards,

guy038

P.S. :

If necessary, you can send me some files by e-mail at :

Meta Chuh

@guy038

off topic:
welcome back. 👍
almost a week without you is very close to the border of worrying enough to start a search party ;-) 😉

sincere and best regards
metachuh