Regex - How to keep only some parts of huge text
-
Hello,
I have a huge amount of text and I'm trying to keep just a few parts of each line.
I tried to use [^(?=<![A-Za-z0-9 ?,?:?/?<?>?])]
to keep only the text between <! and >, but without success. The problem is that it ignores all letters, etc., everywhere.
Here is the problem: Image. I need to keep only the yellow text on all lines. Do you guys know how to do that?
-
Hello, @tonda-ptáčník,
Personally, I think that, in your picture, you're speaking of the part of the text shown in a kind of orange color, `<![CDATA[...........]]>`, aren't you? If so:
- Open the Replace dialog
- Type in `^.*(<!\[CDATA\[.*\]>)` in the Find what: zone
- Type in `\1` in the Replace with: zone
- Tick the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button
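For readers who prefer to try the idea outside Notepad++, here is a minimal Python sketch of the same search/replace. It assumes Python's `re` module behaves like the Notepad++ engine for this particular pattern; the `keep_cdata` function name and the sample text are only illustrative.

```python
import re

# Same pattern as in the steps above. re.MULTILINE makes ^ match at every line start.
PATTERN = re.compile(r"^.*(<!\[CDATA\[.*\]>)", re.MULTILINE)

def keep_cdata(text: str) -> str:
    """Drop everything on a line that precedes its <![CDATA[...]]> section."""
    return PATTERN.sub(r"\1", text)

# Illustrative sample, not taken from the real file.
sample = (
    '<phrase title="1_day_ago"><![CDATA[One day ago]]></phrase>\n'
    'a line without any CDATA section\n'
)
result = keep_cdata(sample)
print(result)
```

Note that the pattern only consumes what precedes the CDATA section, so any text after `]]>` on the same line stays in place, and lines without a CDATA section are left untouched.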
Best Regards,
guy038
-
-
@guy038 Thank you very much, it helped me a lot.
But now I have another problem. Do you know how to replace a huge amount of text which is different for each line?
Here is what I need:
I need to edit this (Image) to look like this (Image2). I copy-pasted this by hand, but the file has over 6000 lines. The problem is that I only have a text file containing just the translated sentences, one per line. So the file I want to copy from looks like this: Image3
Suggestion: Why don't you post text here instead of images?
If someone is willing to help, having some text to experiment with is something that might make them dive in and actually help. Nobody is going to retype text into Notepad++ after looking at some images with text.
The best way to post text here is to indent it with 4 spaces in Notepad++, then copy and paste here. Don't make the amount of text too large or it won't let you post it.
-
Hi, @tonda-ptáčník,
I understood what you want, but I need some additional information:
Are the lines to be processed only of the two forms below?
<phrase title=......NOT changed.........<![CDATA[......To be TRANSLATED.......]]></phrase>
OR
<phrase title=........NOT changed......<![CDATA[.......To be TRANSLATED till the END of line ............... <br /> <strong>......To be TRANSLATED.......</strong>]]></phrase>
Remarks:
- To just insert normal text, without any markdown syntax, type, for instance:
~~~z
this text will NOT be interpreted
~~~
and it will be displayed as:
this text will NOT be interpreted
- If you want to insert text with XML highlighting, simply use:
~~~xml
<!--
Use MenuEntryName and MenuItemName to localize your commands to add.
The values should be in English but not in translated language.
(You can set Notepad++ language back to English from Preferences dialog via menu "Settings->Preferences...")
-->
<Item MenuEntryName="Edit" MenuItemName="Cut"/>
<Item MenuEntryName="Edit" MenuItemName="Copy"/>
<Item MenuEntryName="Edit" MenuItemName="Paste"/>
<Item MenuEntryName="Edit" MenuItemName="Delete"/>
<Item MenuEntryName="Edit" MenuItemName="Select all"/>
<Item MenuEntryName="Edit" MenuItemName="Begin/End Select"/>
~~~
and you get:
<!--
Use MenuEntryName and MenuItemName to localize your commands to add.
The values should be in English but not in translated language.
(You can set Notepad++ language back to English from Preferences dialog via menu "Settings->Preferences...")
-->
<Item MenuEntryName="Edit" MenuItemName="Cut"/>
<Item MenuEntryName="Edit" MenuItemName="Copy"/>
<Item MenuEntryName="Edit" MenuItemName="Paste"/>
<Item MenuEntryName="Edit" MenuItemName="Delete"/>
<Item MenuEntryName="Edit" MenuItemName="Select all"/>
<Item MenuEntryName="Edit" MenuItemName="Begin/End Select"/>
Best Regards,
guy038
-
@guy038 As I looked into the XML, I found that the lines come in more forms.
Here is the XML file which I need to edit: https://url.cmgportal.cz/language-Cestina.xml
Hi, @tonda-ptáčník, and All,
I'm terribly sorry, @tonda-ptáčník, for answering so late, but I was quite busy these last days! In addition to some family events, I had to fix the tower desktop computer of a dance friend of my wife's! The operating system couldn't start anymore.
Luckily, I could, first, save all the data (about 50 GB!) on her husband's laptop and, secondly, restore all the data on a new computer, via the network box!
However, just notice that, on this new desktop computer, bought in a very well-known French supermarket (HP Pavilion Desktop PC 570-p015nf):
- We first had to install Windows 10 by ourselves, after calling a hotline! (F11 key, at start-up)
- I was very surprised to realize that the majority of the hard disk space (1 TB) was NOT allocated!!?? (C: ~50 GB, D: = Recovery ~10 GB, another small partition, and 826 GB not allocated yet!). So, with the disk manager, I created and formatted a partition in this free space and moved all the user's folders (documents, pictures, downloads…) to this new partition
Really amazing how the final customer is treated when he doesn't necessarily have all the technical skills to solve such things :-(( It's a shame, really!
But let's get back to our subject. So, I began to study your XML file…
- I assume that the `title="......"` attribute, right after `<phrase`, must not be translated. Only the zones `<![CDATA[......]]>` need to be translated into Czech. Am I right about it?
Except for the line `<?xml ....>` and the two lines `<language ......>` and `</language>`, your file contains 6,424 lines, divided into:
- 90 single-line ranges `<phrase> ......</phrase>` without any text, so `<![CDATA[]]>` (Case A)
- 5,750 single-line ranges `<phrase> ......</phrase>` with raw text only inside `<![CDATA[......]]>` (as, for instance, `<![CDATA[One day ago]]>`) (Case B)
- 161 single-line ranges `<phrase> ......</phrase>` with other content, such as tags and attributes, inside `<![CDATA[........]]>` (as, for instance, `<![CDATA[<a href="{board_url}">Visit {board_title}</a>]]>`) (Case C)
- 68 multi-line ranges `<phrase ...</phrase>`, for 423 lines (Case D)
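The four cases above can be approximated with a short Python sketch. The regexes are assumptions inferred from the case descriptions (two leading spaces before `<phrase `, a closing `]]></phrase>` at the end of a single-line range); `classify` and the sample lines are hypothetical, and the `<?xml`, `<language>` and `</language>` lines are expected to be filtered out beforehand.

```python
import re
from collections import Counter

# A single-line range: two spaces, "<phrase ", then one CDATA section closed on the same line.
SINGLE = re.compile(r"^\x20\x20<phrase .+?<!\[CDATA\[(.*)\]\]></phrase>$")
# "Raw text only": no '<', '[' or ']' inside the CDATA section.
RAW_TEXT = re.compile(r"^[^\]<\[]+$")

def classify(line: str) -> str:
    """Return 'A', 'B', 'C' or 'D' for one line of the file."""
    m = SINGLE.match(line)
    if m is None:
        return "D"  # any line belonging to a multi-line range
    body = m.group(1)
    if body == "":
        return "A"  # empty CDATA section
    if RAW_TEXT.match(body):
        return "B"  # plain sentence, safe for automatic replacement
    return "C"      # tags, attributes or brackets inside the CDATA section

# Shortened, partly hypothetical sample lines.
sample = [
    '  <phrase title="empty_example" addon_id="XF"><![CDATA[]]></phrase>',
    '  <phrase title="1_day_ago" addon_id="XF"><![CDATA[One day ago]]></phrase>',
    '  <phrase title="visit_example" addon_id="XF"><![CDATA[<a href="{board_url}">Visit {board_title}</a>]]></phrase>',
    '  <phrase title="2x_image_replacement_url_explain" addon_id="XF"><![CDATA[If provided, the 2x image is used.<br />',
    '<br />',
    '<strong>Note: This option has no effect with sprite mode enabled.</strong>]]></phrase>',
]
counts = Counter(classify(line) for line in sample)
print(dict(counts))
```

Note that Case D is counted per line here, not per range, which matches the 423-line figure rather than the 68 ranges.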
I'm afraid that the 161 lines (Case C) and the 68 multi-line ranges (Case D) should be translated manually. Luckily, regarding the 5,750 lines of Case B, an automatic search/replacement can be considered!
Here is my method for processing these 5,750 translations automatically:
Firstly, we extract all the text which should be translated:
- Copy your file into a new tab
- Open the Mark dialog (Search > Mark...)
- Type the search regex `(?-s)^\x20\x20<phrase .+?<!\[CDATA\[\K[^]\r\n<[]+(?=\]\]></phrase>)`
- Tick the Bookmark line and the Purge for each search options
- Click on the Mark all button
- Remove all the unmarked lines with the menu command Search > Bookmark > Remove Unmarked Lines
- Click on the Clear all marks button of the Mark dialog
- Now, open the Replace dialog
- Type in the SEARCH regex `(?-s)^\x20\x20<phrase .+?<!\[CDATA\[|\]\]></phrase>`
- Leave the REPLACE zone EMPTY
- Click on the Replace all button
=> You get your exact 5,750 text lines, which can be translated
Secondly, we change each line into the same line, written twice, separated by the Black Square character (`\x{25A0}` = ■), which allows you to easily see the part which needs translating!
- SEARCH `(?-s).+`
- REPLACE `$0\x{25A0}$0`
- Tick the Wrap around option and the Regular expression search mode
- Click on the Replace All button
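The first two steps (extracting the Case B sentences, then doubling each one around the Black Square) can be sketched in Python as follows. This is an adaptation, not the exact Notepad++ regex: Python's `re` module has no `\K`, so a capture group takes its place. The `build_worklist` name and the `visit_example` title are hypothetical.

```python
import re

SQUARE = "\u25A0"  # Black Square separator, \x{25A0} in the Notepad++ regexes

# Capture group 1 replaces the \K ... (?=...) construct of the Mark regex.
CASE_B = re.compile(
    r"^\x20\x20<phrase .+?<!\[CDATA\[([^\]\r\n<\[]+)\]\]></phrase>",
    re.MULTILINE,
)

def build_worklist(xml_text: str) -> str:
    """Return one 'English■English' line per single-line raw-text phrase."""
    sentences = CASE_B.findall(xml_text)
    return "\n".join(f"{s}{SQUARE}{s}" for s in sentences)

sample = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '  <phrase title="1_day_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One day ago]]></phrase>\n'
    '  <phrase title="1_time" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 time]]></phrase>\n'
    '  <phrase title="visit_example" addon_id="XF"><![CDATA[<a href="{board_url}">Visit {board_title}</a>]]></phrase>\n'
)
worklist = build_worklist(sample)
print(worklist)
```

The third sample line is skipped because its CDATA section starts with `<`, which the character class excludes, just as in the Mark regex.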
Thirdly, and it's the really big task: either manually or with the help of an online translator, build the analogous list, changing the 5,750 sentences located after the ■ sign into Czech!
Fourthly, we add the original XML contents and replace every zone needing translation with its appropriate translation:
- Move back to the very beginning of these sentences
- Paste your original XML file contents before the present first line
- Right after, add a separator line with at least 3 equal signs, `===`
- Open the Replace dialog
- SEARCH `(?-s)^(\x20\x20<phrase .+?<!\[CDATA\[)([^]\r\n<[]+)(\]\]></phrase>)(?=(?s).+^=.*?\R(?-s)\2\x{25A0}(.+))|(?s)^===+.+`
- REPLACE `\1\4\3`
- Tick the Wrap around option and the Regular expression search mode
- Click on the Replace All button
=> After some time…, any single-line `<phrase.........</phrase>` should contain the translated sentence inside its `<![CDATA[......]]>` area.
Important: When doing the test with your real data, it took about 9 minutes, on my old XP laptop, to process the 5,750 lines :-((. But I suppose that this S/R can be executed in less than 3 minutes on modern laptops!!
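If the running time is a concern, the same merge can be sketched outside Notepad++ with a dictionary lookup in place of the lookahead regex. This is a different technique from the S/R above, shown only as a rough alternative: it reads the `English■Czech` list once, so each phrase line should not need to rescan the whole list. The `apply_translations` name and the shortened sample are hypothetical.

```python
import re

SQUARE = "\u25A0"  # Black Square separator between the English and Czech sentences

CASE_B = re.compile(
    r"^(\x20\x20<phrase .+?<!\[CDATA\[)([^\]\r\n<\[]+)(\]\]></phrase>)",
    re.MULTILINE,
)

def apply_translations(xml_text: str, worklist: str) -> str:
    """Replace each raw-text CDATA sentence that has a pair in the worklist."""
    table = {}
    for line in worklist.splitlines():
        english, sep, czech = line.partition(SQUARE)
        if sep and czech:
            # Assumption of this sketch: if a sentence occurs twice, the first pair is kept.
            table.setdefault(english, czech)

    def swap(match: re.Match) -> str:
        czech = table.get(match.group(2))
        if czech is None:
            return match.group(0)  # no pair found: leave the line unchanged
        return match.group(1) + czech + match.group(3)

    return CASE_B.sub(swap, xml_text)

xml = (
    '  <phrase title="1_day_ago" addon_id="XF"><![CDATA[One day ago]]></phrase>\n'
    '  <phrase title="1_year" addon_id="XF"><![CDATA[1 year]]></phrase>\n'
)
pairs = "1 year" + SQUARE + "1 rok"
out = apply_translations(xml, pairs)
print(out)
```

Here only the `1 year` line changes; `One day ago` has no pair in the list and is kept as it is, which mirrors how the regex leaves unmatched lines alone.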
To give you a general idea, just consider the beginning of your XML file, with 12 single lines `<phrase.......</phrase>` and 1 multi-line block `<phrase.......</phrase>`, followed by the 12 pairs of English/Czech sentences, after the separator line of equal signs:
<?xml version="1.0" encoding="utf-8"?>
<language title="Čeština" date_format="d.m.Y" time_format="H:i" currency_format="{value} {symbol}" week_start="0" decimal_point="." thousands_separator="," label_separator=":" comma_separator=", " ellipsis="..." parenthesis_open="(" parenthesis_close=")" language_code="cs-CZ" text_direction="LTR" export_version="2">
  <phrase title="1_day_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One day ago]]></phrase>
  <phrase title="1_month_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One month ago]]></phrase>
  <phrase title="1_time" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 time]]></phrase>
  <phrase title="1_week_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One week ago]]></phrase>
  <phrase title="1_year" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 year]]></phrase>
  <phrase title="1_year_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One year ago]]></phrase>
  <phrase title="2_weeks_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Two weeks ago]]></phrase>
  <phrase title="2_years_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Two years ago]]></phrase>
  <phrase title="2x_image_replacement_url" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[2x image replacement URL]]></phrase>
  <phrase title="2x_image_replacement_url_explain" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[If provided, the 2x image will be automatically displayed instead of the image URL above on devices capable of displaying a higher pixel resolution.<br />
<br />
<strong>Note: This option has no effect with sprite mode enabled.</strong>]]></phrase>
  <phrase title="3_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Three months ago]]></phrase>
  <phrase title="6_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Six months ago]]></phrase>
  <phrase title="9_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Nine months ago]]></phrase>
=============
One day ago■Před jedním dnem
One month ago■Před měcísem
1 time■1 krát
One week ago■Před týdnem
1 year■1 rok
One year ago■Před rokem
Two weeks ago■Před dvěma týdny
Two years ago■Před dvěma lety
2x image replacement URL■2x adresa URL pro výměnu obrázku
Three months ago■Před trěmi měcísi
Six months ago■Před šesti měcísi
Nine months ago■Před devíti měcísi
- Open the Replace dialog
- SEARCH `(?-s)^(\x20\x20<phrase .+?<!\[CDATA\[)([^]\r\n<[]+)(\]\]></phrase>)(?=(?s).+^=.*?\R(?-s)\2\x{25A0}(.+))|(?s)^===+.+`
- REPLACE `\1\4\3`
- Tick the Wrap around option and the Regular expression search mode
- Click on the Replace All button
You should be left with the expected text:
<?xml version="1.0" encoding="utf-8"?>
<language title="Čeština" date_format="d.m.Y" time_format="H:i" currency_format="{value} {symbol}" week_start="0" decimal_point="." thousands_separator="," label_separator=":" comma_separator=", " ellipsis="..." parenthesis_open="(" parenthesis_close=")" language_code="cs-CZ" text_direction="LTR" export_version="2">
  <phrase title="1_day_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před jedním dnem]]></phrase>
  <phrase title="1_month_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před měcísem]]></phrase>
  <phrase title="1_time" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 krát]]></phrase>
  <phrase title="1_week_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před týdnem]]></phrase>
  <phrase title="1_year" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 rok]]></phrase>
  <phrase title="1_year_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před rokem]]></phrase>
  <phrase title="2_weeks_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před dvěma týdny]]></phrase>
  <phrase title="2_years_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před dvěma lety]]></phrase>
  <phrase title="2x_image_replacement_url" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[2x adresa URL pro výměnu obrázku]]></phrase>
  <phrase title="2x_image_replacement_url_explain" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[If provided, the 2x image will be automatically displayed instead of the image URL above on devices capable of displaying a higher pixel resolution.<br />
<br />
<strong>Note: This option has no effect with sprite mode enabled.</strong>]]></phrase>
  <phrase title="3_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před trěmi měcísi]]></phrase>
  <phrase title="6_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před šesti měcísi]]></phrase>
  <phrase title="9_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před devíti měcísi]]></phrase>
Just note, in this small example, that the multi-line block `<phrase.......</phrase>`, containing the text "If provided, the 2x image…sprite mode enabled.", remains unchanged (only single lines are changed!)
See you later,
Best Regards,
guy038
P.S. :
If necessary, you can send me some files by e-mail at :
-
-
off topic:
welcome back.
almost a week without you is very close to the border of worrying enough to start a search party ;-)
sincere and best regards
metachuh