Regex - How to keep only some parts of huge text
-
Hello,
I have a huge amount of text and I’m trying to keep just a few parts of each line.
I tried to use [^(?=<![A-Za-z0-9 ?,?:?/?<?>?])]
to keep only the text between <! and >, but without success. The problem is that it ignores all letters, etc. everywhere.
Here is the problem: Image. I need to keep only the yellow text on all lines. Do you guys know how to do that?
-
Hello, @tonda-ptáčník,
Judging from your picture, I think that you’re speaking of the part of the text shown in a kind of orange color,
<![CDATA[...........]]>
, aren’t you? If so:
- Open the Replace dialog
- Type in `^.*(<!\[CDATA\[.*\]>)` in the Find what: zone
- Type in `\1` in the Replace with: zone
- Tick the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button
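For readers who prefer to test the idea outside Notepad++, here is a minimal Python sketch of the same search/replace. It assumes that Python's `re` module treats this particular pattern the same way as the Notepad++ regex engine; the function name `keep_from_cdata` and the sample lines are made up for illustration. Note that only the text before the CDATA section is removed; whatever follows `]]>` on the line (such as `</phrase>`) stays.

```python
import re

# Same pattern as in the "Find what:" zone. re.MULTILINE makes ^ match at the
# start of every line, as Notepad++ does.
PATTERN = re.compile(r"^.*(<!\[CDATA\[.*\]>)", flags=re.MULTILINE)


def keep_from_cdata(text: str) -> str:
    """Drop everything that precedes the CDATA section on each line."""
    return PATTERN.sub(r"\1", text)


# Hypothetical sample lines, shaped like the ones discussed in this thread.
sample = (
    '<phrase title="1_day_ago"><![CDATA[One day ago]]></phrase>\n'
    '<phrase title="1_year"><![CDATA[1 year]]></phrase>\n'
)
print(keep_from_cdata(sample))
```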
Best Regards,
guy038
-
-
@guy038 Thank you very much, it helped me a lot.
But now I have another problem. Do you know how to replace a huge amount of text that is different on each line?
Here is what I need:
I need to edit this (Image) to look like this (Image2). I copy-pasted this by hand, but the file has over 6000 lines. The problem is that all I have is a text file containing only the translated sentences, one per line. The file I want to copy from looks like this: Image3
-
Suggestion: Why don’t you post text here instead of images?
If someone is willing to help, having some text to experiment with is something that might make them dive in and actually help. Nobody is going to retype text into Notepad++ after looking at some images with text.
The best way to post text here is to indent it with 4 spaces in Notepad++, then copy and paste here. Don’t make the amount of text too large or it won’t let you post it.
-
Hi, @tonda-ptáčník,
I understand what you want, but I need some additional information:
Are the lines to be processed only of the two forms below?
<phrase title=......NOT changed.........<![CDATA[......To be TRANSLATED.......]]></phrase>
OR
<phrase title=........NOT changed......<![CDATA[.......To be TRANSLATED till the END of line ............... <br /> <strong>......To be TRANSLATED.......</strong>]]></phrase>
Remarks:
- To just insert normal text without any markdown syntax, type, for instance:
~~~z
this text will NOT be interpreted
~~~
and it will be displayed as:
this text will NOT be interpreted
- If you want to insert text with XML highlighting, simply use:
~~~xml
<!--
Use MenuEntryName and MenuItemName to localize your commands to add.
The values should be in English but not in translated language.
(You can set Notepad++ language back to English from Preferences dialog via menu "Settings->Preferences...")
-->
<Item MenuEntryName="Edit" MenuItemName="Cut"/>
<Item MenuEntryName="Edit" MenuItemName="Copy"/>
<Item MenuEntryName="Edit" MenuItemName="Paste"/>
<Item MenuEntryName="Edit" MenuItemName="Delete"/>
<Item MenuEntryName="Edit" MenuItemName="Select all"/>
<Item MenuEntryName="Edit" MenuItemName="Begin/End Select"/>
~~~
and you get:
<!--
Use MenuEntryName and MenuItemName to localize your commands to add.
The values should be in English but not in translated language.
(You can set Notepad++ language back to English from Preferences dialog via menu "Settings->Preferences...")
-->
<Item MenuEntryName="Edit" MenuItemName="Cut"/>
<Item MenuEntryName="Edit" MenuItemName="Copy"/>
<Item MenuEntryName="Edit" MenuItemName="Paste"/>
<Item MenuEntryName="Edit" MenuItemName="Delete"/>
<Item MenuEntryName="Edit" MenuItemName="Select all"/>
<Item MenuEntryName="Edit" MenuItemName="Begin/End Select"/>
Best Regards,
guy038
-
@guy038 Looking into the XML, I see that the lines come in more forms.
Here is the XML which I need to edit: https://url.cmgportal.cz/language-Cestina.xml
-
Hi, @tonda-ptáčník, and All,
I’m terribly sorry, @tonda-ptáčník, for answering so late, but I was quite busy these last days! In addition to some family events, I had to fix the tower desktop computer of a dance friend of my wife’s! The operating system couldn’t start anymore.
Luckily, I could, first, save all the data (about 50 GB!) on her husband’s laptop and, secondly, restore all the data on a new computer, via the network box!
However, just notice that, on this new desktop computer (HP Pavilion Desktop PC 570-p015nf), bought in a very well-known French supermarket:
- We first had to install Windows 10 by ourselves, after calling a hotline! (F11 key, at start-up)
- I was very surprised to realize that the majority of the hard disk space (1 TB) was NOT allocated!? (C: ~50 GB, D: = Recovery ~10 GB, another small partition, and 826 GB not allocated yet!). So, with the disk manager, I created and formatted this free space and moved all the user’s folders (documents, pictures, downloads…) to this new partition
Really amazing how the final customer is treated when he doesn’t necessarily have all the technical skills to solve such things :-(( It’s a shame, really!
But let’s get back to our subject. So, I began to study your XML file…
- I assume that the `title="......"` attribute, right after `<phrase`, must not be translated. Only the zones `<![CDATA[......]]>` need to be translated into Czech. Am I right about it?
Except for the line `<?xml ....>` and the two lines `<language ......>` and `</language>`, your file contains 6,424 lines, divided into:
- 90 single-line ranges `<phrase> ......</phrase>` without any text, so `<![CDATA[]]>` (Case A)
- 5,750 single-line ranges `<phrase> ......</phrase>` with raw text only inside `<![CDATA[......]]>` (as, for instance, `<![CDATA[One day ago]]>`) (Case B)
- 161 single-line ranges `<phrase> ......</phrase>` with other markup inside `<![CDATA[........]]>` (as, for instance, `<![CDATA[<a href="{board_url}">Visit {board_title}</a>]]>`) (Case C)
- 68 multi-line ranges `<phrase ...</phrase>`, for 423 lines (Case D)
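As a rough cross-check of such a breakdown, the single-line cases can be counted with a short Python sketch. The three patterns below are my own approximations of Cases A, B and C (two leading spaces, one `<phrase ...>` per line); every line that matches none of them, including each line of a multi-line block as well as the `<?xml` and `<language>` lines, is simply counted as "other". The names `classify`, `EMPTY`, `RAW_TEXT` and `SINGLE_LINE` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical approximations of the cases described above.
EMPTY = re.compile(r"\x20\x20<phrase .+?<!\[CDATA\[\]\]></phrase>")
RAW_TEXT = re.compile(r"\x20\x20<phrase .+?<!\[CDATA\[[^\]\r\n<\[]+\]\]></phrase>")
SINGLE_LINE = re.compile(r"\x20\x20<phrase .+\]\]></phrase>")


def classify(line: str) -> str:
    """Return 'A', 'B' or 'C' for single-line phrases, 'other' otherwise."""
    if EMPTY.fullmatch(line):
        return "A"
    if RAW_TEXT.fullmatch(line):
        return "B"
    if SINGLE_LINE.fullmatch(line):
        return "C"
    return "other"


# Made-up sample lines, one per situation.
sample_lines = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '  <phrase title="empty"><![CDATA[]]></phrase>',
    '  <phrase title="1_day_ago"><![CDATA[One day ago]]></phrase>',
    '  <phrase title="visit"><![CDATA[<a href="{board_url}">Visit {board_title}</a>]]></phrase>',
    '  <phrase title="explain"><![CDATA[If provided.<br />',
    '<strong>Note</strong>]]></phrase>',
]
print(Counter(classify(line) for line in sample_lines))
```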
I’m afraid that the 161 lines (Case C) and the 68 multi-line blocks (Case D) should be translated manually. Luckily, regarding the 5,750 lines of Case B, an automatic search/replacement could be considered!
Here is my method for processing these 5,750 translations automatically:
Firstly, we extract all the text which should be translated:
- Copy your file into a new tab
- Open the Mark dialog (Search > Mark...)
- Type the search regex `(?-s)^\x20\x20<phrase .+?<!\[CDATA\[\K[^]\r\n<[]+(?=\]\]></phrase>)`
- Tick the Bookmark line and the Purge for each search options
- Click on the Mark all button
- Remove all the unmarked lines with the menu command Search > Bookmark > Remove Unmarked Lines
- Click on the Clear all marks button of the Mark dialog
- Now, open the Replace dialog
- Type in the SEARCH regex `(?-s)^\x20\x20<phrase .+?<!\[CDATA\[|\]\]></phrase>`
- Leave the REPLACE zone EMPTY
- Click on the Replace all button
=> You get your exact 5,750 text lines, which can be translated.
Secondly, we replace each line with the same line written twice, separated by the Black Square character (`\x{25A0}` = ■), which allows you to easily see the part which needs translating!
- SEARCH `(?-s).+`
- REPLACE `$0\x{25A0}$0`
- Tick the Wrap around option and the Regular expression search mode
- Click on the Replace All button
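The first two stages (extract the Case B texts, then write each one twice around a ■) can also be sketched in a few lines of Python, which may help to check the result of the Notepad++ steps. This is only an illustration under my own assumptions: the pattern is a Python transcription of the character class `[^]\r\n<[]` used above, and the names `extract_texts` and `build_worklist` are hypothetical.

```python
import re

# Two leading spaces, a <phrase ...> tag, then raw text (no '<', '[' or ']')
# inside the CDATA section, closed by ]]></phrase> at the end of the line.
CASE_B = re.compile(r"\x20\x20<phrase .+?<!\[CDATA\[([^\]\r\n<\[]+)\]\]></phrase>")
SEPARATOR = "\u25A0"  # the Black Square character


def extract_texts(xml_text: str) -> list[str]:
    """Return the CDATA text of every single-line, raw-text phrase."""
    texts = []
    for line in xml_text.splitlines():
        match = CASE_B.fullmatch(line)
        if match:
            texts.append(match.group(1))
    return texts


def build_worklist(xml_text: str) -> str:
    """One 'English■English' line per extracted text, ready to be translated."""
    return "\n".join(f"{t}{SEPARATOR}{t}" for t in extract_texts(xml_text))


# Made-up sample: the second phrase contains markup, so it is skipped.
sample = (
    '  <phrase title="1_day_ago"><![CDATA[One day ago]]></phrase>\n'
    '  <phrase title="visit"><![CDATA[<a href="{board_url}">Visit</a>]]></phrase>\n'
    '  <phrase title="1_year"><![CDATA[1 year]]></phrase>\n'
)
print(build_worklist(sample))
```

The part after each ■ is then the one to overwrite with the translation.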
Thirdly, and this is the really big task: either manually or with the help of an online translator, build the equivalent list, changing the 5,750 sentences located after the ■ sign into Czech!
Fourthly, we add the original XML contents and replace each zone needing translation with its appropriate translation:
- Move back to the very beginning of these sentences
- Paste your original XML file contents before the present first line
- Right after, add a separator line with at least 3 equal signs, `===`
- Open the Replace dialog
- SEARCH `(?-s)^(\x20\x20<phrase .+?<!\[CDATA\[)([^]\r\n<[]+)(\]\]></phrase>)(?=(?s).+^=.*?\R(?-s)\2\x{25A0}(.+))|(?s)^===+.+`
- REPLACE `\1\4\3`
- Tick the Wrap around option and the Regular expression search mode
- Click on the Replace All button
=> After some time…, every single-line `<phrase.........</phrase>` should contain the translated sentence inside the `<![CDATA[......]]>` area.
Important: When doing the test with your real data, it took about 9 minutes, on my old XP laptop, to process the 5,750 lines :-((. But I suppose that this S/R can be executed in less than 3 min on modern laptops!!
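If the run time is a concern, the merge stage can be approximated outside Notepad++ with a lookup table. The Python sketch below is not the S/R above but a substitute technique: it reads the `English■Czech` pairs into a dictionary and then rewrites each single-line, raw-text phrase in one pass, so it does not have to scan the rest of the file for every line. The names `load_translations` and `apply_translations` are hypothetical, and I assume the first pair found for a given English text is the one to use.

```python
import re

# Groups: (1) everything up to the CDATA opening, (2) the raw text,
# (3) the closing ]]></phrase>.
CASE_B = re.compile(r"(\x20\x20<phrase .+?<!\[CDATA\[)([^\]\r\n<\[]+)(\]\]></phrase>)")
SEPARATOR = "\u25A0"


def load_translations(worklist: str) -> dict[str, str]:
    """Map each English text to its translation; the first pair wins."""
    table: dict[str, str] = {}
    for line in worklist.splitlines():
        english, sep, translated = line.partition(SEPARATOR)
        if sep:
            table.setdefault(english, translated)
    return table


def apply_translations(xml_text: str, table: dict[str, str]) -> str:
    """Replace the CDATA text of matching lines; leave all other lines alone.

    Line endings are normalised to '\\n' and a trailing newline is dropped.
    """
    out = []
    for line in xml_text.splitlines():
        match = CASE_B.fullmatch(line)
        if match and match.group(2) in table:
            line = match.group(1) + table[match.group(2)] + match.group(3)
        out.append(line)
    return "\n".join(out)


xml_sample = (
    '  <phrase title="1_day_ago"><![CDATA[One day ago]]></phrase>\n'
    '  <phrase title="bold"><![CDATA[<b>Hi</b>]]></phrase>\n'
    '  <phrase title="1_year"><![CDATA[1 year]]></phrase>'
)
pairs = "One day ago\u25A0Před jedním dnem\n1 year\u25A01 rok"
print(apply_translations(xml_sample, load_translations(pairs)))
```

As with the regex S/R, phrases containing markup and multi-line blocks are left untouched and still need manual translation.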
To give you a general idea, just consider the beginning of your XML file, with 12 single-line `<phrase.......</phrase>` ranges and 1 multi-line `<phrase.......</phrase>` block, followed by the 12 English/Czech sentence pairs after the separator line of equal signs:
<?xml version="1.0" encoding="utf-8"?>
<language title="Čeština" date_format="d.m.Y" time_format="H:i" currency_format="{value} {symbol}" week_start="0" decimal_point="." thousands_separator="," label_separator=":" comma_separator=", " ellipsis="..." parenthesis_open="(" parenthesis_close=")" language_code="cs-CZ" text_direction="LTR" export_version="2">
  <phrase title="1_day_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One day ago]]></phrase>
  <phrase title="1_month_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One month ago]]></phrase>
  <phrase title="1_time" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 time]]></phrase>
  <phrase title="1_week_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One week ago]]></phrase>
  <phrase title="1_year" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 year]]></phrase>
  <phrase title="1_year_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One year ago]]></phrase>
  <phrase title="2_weeks_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Two weeks ago]]></phrase>
  <phrase title="2_years_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Two years ago]]></phrase>
  <phrase title="2x_image_replacement_url" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[2x image replacement URL]]></phrase>
  <phrase title="2x_image_replacement_url_explain" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[If provided, the 2x image will be automatically displayed instead of the image URL above on devices capable of displaying a higher pixel resolution.<br />
<br />
<strong>Note: This option has no effect with sprite mode enabled.</strong>]]></phrase>
  <phrase title="3_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Three months ago]]></phrase>
  <phrase title="6_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Six months ago]]></phrase>
  <phrase title="9_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Nine months ago]]></phrase>
=============
One day ago■Před jedním dnem
One month ago■Před měcísem
1 time■1 krát
One week ago■Před týdnem
1 year■1 rok
One year ago■Před rokem
Two weeks ago■Před dvěma týdny
Two years ago■Před dvěma lety
2x image replacement URL■2x adresa URL pro výměnu obrázku
Three months ago■Před trěmi měcísi
Six months ago■Před šesti měcísi
Nine months ago■Před devíti měcísi
- Open the Replace dialog
- SEARCH `(?-s)^(\x20\x20<phrase .+?<!\[CDATA\[)([^]\r\n<[]+)(\]\]></phrase>)(?=(?s).+^=.*?\R(?-s)\2\x{25A0}(.+))|(?s)^===+.+`
- REPLACE `\1\4\3`
- Tick the Wrap around option and the Regular expression search mode
- Click on the Replace All button
You should be left with the expected text:
<?xml version="1.0" encoding="utf-8"?>
<language title="Čeština" date_format="d.m.Y" time_format="H:i" currency_format="{value} {symbol}" week_start="0" decimal_point="." thousands_separator="," label_separator=":" comma_separator=", " ellipsis="..." parenthesis_open="(" parenthesis_close=")" language_code="cs-CZ" text_direction="LTR" export_version="2">
  <phrase title="1_day_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před jedním dnem]]></phrase>
  <phrase title="1_month_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před měcísem]]></phrase>
  <phrase title="1_time" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 krát]]></phrase>
  <phrase title="1_week_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před týdnem]]></phrase>
  <phrase title="1_year" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 rok]]></phrase>
  <phrase title="1_year_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před rokem]]></phrase>
  <phrase title="2_weeks_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před dvěma týdny]]></phrase>
  <phrase title="2_years_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před dvěma lety]]></phrase>
  <phrase title="2x_image_replacement_url" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[2x adresa URL pro výměnu obrázku]]></phrase>
  <phrase title="2x_image_replacement_url_explain" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[If provided, the 2x image will be automatically displayed instead of the image URL above on devices capable of displaying a higher pixel resolution.<br />
<br />
<strong>Note: This option has no effect with sprite mode enabled.</strong>]]></phrase>
  <phrase title="3_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před trěmi měcísi]]></phrase>
  <phrase title="6_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před šesti měcísi]]></phrase>
  <phrase title="9_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před devíti měcísi]]></phrase>
Just note, in this small example, that the multi-line `<phrase.......</phrase>` block, containing the text "If provided, the 2x image…sprite mode enabled.", remains unchanged (only single lines are changed!)
See you later,
Best Regards,
guy038
P.S. :
If necessary, you can send me some files by e-mail at :
-
-
off topic:
welcome back. 👍
almost a week without you is very close to the border of worrying enough to start a search party ;-) 😉
sincere and best regards
metachuh