Regex - How to keep only some parts of huge text



  • Hello,
    I have a huge amount of text and I’m trying to keep just a few parts of each line.
    I tried to use [^(?=<![A-Za-z0-9 ?,?:?/?<?>?])]
    to keep only the text between <! and >, but without success. The problem is that it ignores all letters, etc., everywhere.
    Here is the problem: Image. I need to keep only the yellow text on all lines.

    Do you guys know how to do that?



  • Hello, @tonda-ptáčník,

    From your picture, I think that you mean the part of the text shown in a kind of orange color, <![CDATA[...........]]>, don’t you?

    If so :

    • Open the Replace dialog

    • Type in ^.*(<!\[CDATA\[.*\]>) in the Find what: zone

    • Type in \1 in the Replace with: zone

    • Tick the Wrap around option

    • Select the Regular expression search mode

    • Click on the Replace All button
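
    For anyone who would rather script this than use the Replace dialog, here is a minimal Python sketch of the same idea. It assumes the square brackets in the regex are escaped ( \[ and \] ), and the name keep_cdata is only illustrative. Like the Replace above, it only drops what precedes the CDATA section, so any text after ]]> stays on the line.

```python
import re

# Everything up to the CDATA section is matched, only the section is captured.
CDATA = re.compile(r"^.*(<!\[CDATA\[.*\]>)")

def keep_cdata(line: str) -> str:
    """Drop the text before the CDATA section; leave other lines untouched."""
    return CDATA.sub(r"\1", line, count=1)

sample = '<phrase title="1_day_ago"><![CDATA[One day ago]]></phrase>'
print(keep_cdata(sample))
```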

    Best Regards,

    guy038



  • @guy038 Thank you very much, it helped me a lot.

    But now I have another problem. Do you know how to replace a huge amount of text that is different on each line?
    Here is what I need:
    I need to edit this (Image) to look like this (Image2). I copy-pasted this by hand, but the file has over 6000 lines. The problem is that I only have a text file containing just the translated sentences, one per line. The file I want to copy from looks like this: Image3



  • @Tonda-Ptáčník

    Suggestion: Why don’t you post text here instead of images?

    If someone is willing to help, having some text to experiment with is something that might make them dive in and actually help. Nobody is going to retype text into Notepad++ after looking at some images with text.

    The best way to post text here is to indent it with 4 spaces in Notepad++, then copy and paste here. Don’t make the amount of text too large or it won’t let you post it.



  • Hi, @tonda-ptáčník,

    I understood what you want but I need some additional information :

    Are the lines, which are to be processed, only of these two forms, below ?

    <phrase title=......NOT changed.........<![CDATA[......To be TRANSLATED.......]]></phrase>
    

    OR

    <phrase title=........NOT changed......<![CDATA[.......To be TRANSLATED till the END of line ...............
    <br />
    <strong>......To be TRANSLATED.......</strong>]]></phrase>
    

    Remarks :

    • To just insert normal text without any markdown syntax

    Type, for instance :

    ~~~z
    this text will NOT be interpreted
    ~~~

    and it will be displayed as :

    this text will NOT be interpreted
    
    • If you want to insert text with XML highlighting, simply use :

    ~~~xml
    <!--
    Use MenuEntryName and MenuItemName to localize your commands to add.
    The values should be in English but not in translated language.
    (You can set Notepad++ language back to English from Preferences dialog via menu "Settings->Preferences...")
    -->
    <Item MenuEntryName="Edit" MenuItemName="Cut"/>
    <Item MenuEntryName="Edit" MenuItemName="Copy"/>
    <Item MenuEntryName="Edit" MenuItemName="Paste"/>
    <Item MenuEntryName="Edit" MenuItemName="Delete"/>
    <Item MenuEntryName="Edit" MenuItemName="Select all"/>
    <Item MenuEntryName="Edit" MenuItemName="Begin/End Select"/>
    ~~~

    and you get :

    		<!-- 
    		Use MenuEntryName and MenuItemName to localize your commands to add. 
    		The values should be in English but not in translated language.
    		(You can set Notepad++ language back to English from Preferences dialog via menu "Settings->Preferences...")
    		-->
            <Item MenuEntryName="Edit" MenuItemName="Cut"/>
            <Item MenuEntryName="Edit" MenuItemName="Copy"/>
            <Item MenuEntryName="Edit" MenuItemName="Paste"/>
            <Item MenuEntryName="Edit" MenuItemName="Delete"/>
            <Item MenuEntryName="Edit" MenuItemName="Select all"/>
            <Item MenuEntryName="Edit" MenuItemName="Begin/End Select"/>
    

    Best Regards,

    guy038



  • @guy038 Looking into the XML, I see that the lines come in more forms than those two.
    Here is the XML which I need to edit: https://url.cmgportal.cz/language-Cestina.xml



  • Hi, @tonda-ptáčník, and All,

    I’m terribly sorry, @tonda-ptáčník, for answering so late, but I was quite busy these last few days! In addition to some family events, I had to fix the tower desktop computer of a dance friend of my wife’s: the operating system wouldn’t start anymore.

    Luckily, I could, first, save all the data (about 50 GB) on her husband’s laptop and, secondly, restore all the data on a new computer, via the network box!

    However, just note that, on this new desktop computer, bought in a very well-known French supermarket (HP Pavilion Desktop PC 570-p015nf):

    • We first had to install Windows 10 by ourselves, after calling a hotline! (F11 key, at start-up)

    • I was very surprised to realize that the majority of the hard disk space (1 TB) was NOT allocated!? (C: ~ 50 GB, D: = Recovery ~ 10 GB, another small partition, and 826 GB not allocated yet!) So, with the disk manager, I created and formatted this free space and moved all the user’s folders (documents, pictures, downloads…) to this new partition

    It is really amazing how the end customer is treated when they don’t necessarily have the technical skills to solve such things :-(( It’s a shame, really!


    But, let’s get back to our subject. So, I began to study your XML file …

    • I assume that the title="......" attribute, right after <phrase, must not be translated. Only the <![CDATA[......]]> zones need to be translated into Czech. Am I right about that?

    Except for the <?xml ....> line and the two lines <language ......> and </language>, your file contains 6,424 lines, divided into:

    • 90 single-line ranges <phrase> ......</phrase> without any text, so <![CDATA[]]> ( Case A )

    • 5,750 single-line ranges <phrase> ......</phrase>, with raw text only inside <![CDATA[......]]> ( as, for instance, <![CDATA[One day ago]]> ) ( Case B )

    • 161 single-line ranges <phrase> ......</phrase>, with tags or attributes inside <![CDATA[........]]> ( as, for instance, <![CDATA[<a href="{board_url}">Visit {board_title}</a>]]> ) ( Case C )

    • 68 multi-line ranges <phrase ...</phrase>, for 423 lines ( Case D )
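
    As a rough cross-check of this breakdown, the single-line cases could be told apart with a short Python sketch like the one below. The rules are my reading of the four cases (empty CDATA, plain text, or text containing <, [ or ]), and classify is a made-up name. Lines that are part of a multi-line block, as well as the header lines, simply fall into "other".

```python
import re
from collections import Counter

# A complete single-line <phrase ...> ... </phrase> range, CDATA body captured.
SINGLE = re.compile(r"^  <phrase .+?<!\[CDATA\[(.*)\]\]></phrase>$")

def classify(line: str) -> str:
    m = SINGLE.match(line)
    if not m:
        return "other"          # header line or piece of a multi-line block
    body = m.group(1)
    if body == "":
        return "A"              # empty CDATA
    if re.search(r"[<\[\]]", body):
        return "C"              # markup or brackets inside the CDATA
    return "B"                  # raw text only

def count_cases(xml_text: str) -> Counter:
    """Count how many lines fall into each case."""
    return Counter(classify(line) for line in xml_text.splitlines())
```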


    I’m afraid that the 161 lines ( Case C ) and the 68 multi-line blocks ( Case D ) will have to be translated manually. Luckily, for the 5,750 lines of Case B, an automatic search/replacement can be considered!

    Here is my method for processing these 5,750 translations automatically:

    Firstly, we extract all the text which should be translated:

    • Copy your file, in a new tab

    • Open the Mark dialog ( Search > Mark... )

    • Type the search regex (?-s)^\x20\x20<phrase .+?<!\[CDATA\[\K[^]\r\n<[]+(?=\]\]></phrase>)

    • Tick the Bookmark line and the Purge for each search options

    • Click on the Mark all button

    • Remove All the unmarked lines with the menu command Search > Bookmark > Remove Unmarked Lines

    • Click on the Clear all marks button of the Mark dialog

    • Now, open the Replace dialog

    • Type in the SEARCH regex (?-s)^\x20\x20<phrase .+?<!\[CDATA\[|\]\]></phrase>

    • Leave the REPLACE zone EMPTY

    • Click on the Replace all button

    => You get your exact 5750 text lines, which can be translated
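
    The same extraction can be sketched in Python, if scripting is easier for you than the Mark and Replace dialogs. This is only an approximation of the steps above: Python’s re module has no \K, so a capture group is used instead, and the function name is hypothetical.

```python
import re

# Case B only: two leading spaces, a <phrase ...> tag, and CDATA text that
# contains none of the characters '<', '[' or ']'.
CASE_B = re.compile(r"^  <phrase .+?<!\[CDATA\[([^\]\r\n<\[]+)\]\]></phrase>$")

def extract_plain_phrases(xml_text: str) -> list[str]:
    """Return the CDATA text of every Case B line, in file order."""
    phrases = []
    for line in xml_text.splitlines():
        m = CASE_B.match(line)
        if m:
            phrases.append(m.group(1))
    return phrases

sample = "\n".join([
    '  <phrase title="empty" addon_id="XF"><![CDATA[]]></phrase>',
    '  <phrase title="1_day_ago" addon_id="XF"><![CDATA[One day ago]]></phrase>',
    '  <phrase title="link"><![CDATA[<a href="{board_url}">Visit</a>]]></phrase>',
])
print(extract_plain_phrases(sample))
```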

    Secondly, we replace each line with the same line written twice, separated by the Black Square character ( \x{25A0} = ■ ), which allows you to easily see the part which needs translating!

    • SEARCH (?-s).+

    • REPLACE $0\x{25A0}$0

    • Tick the Wrap around option and the Regular expression search mode

    • Click on the Replace All button
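
    In script form, this doubling step is a one-liner. A small Python sketch, with pair_up as an illustrative name; like the regex (?-s).+ it leaves empty lines untouched:

```python
SEP = "\u25a0"  # BLACK SQUARE, the same character as \x{25A0}

def pair_up(lines: list[str]) -> list[str]:
    """Turn 'text' into 'text■text' for every non-empty line."""
    return [f"{line}{SEP}{line}" if line else line for line in lines]

print(pair_up(["One day ago", "1 time"]))
```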

    Thirdly, and this is the very big task: either manually or with the help of an online translator, build the equivalent list by changing the 5,750 sentences located after the ■ sign into Czech!

    Fourthly, we add the original XML contents and replace each zone needing translation with its appropriate translation:

    • Move back to the very beginning of these sentences

    • Paste your original XML file contents, before the present first line

    • Right after, add a separator line, with, at least, 3 equal signs, ===

    • Open the Replace dialog

    • SEARCH (?-s)^(\x20\x20<phrase .+?<!\[CDATA\[)([^]\r\n<[]+)(\]\]></phrase>)(?=(?s).+^=.*?\R(?-s)\2\x{25A0}(.+))|(?s)^===+.+

    • REPLACE \1\4\3

    • Tick the Wrap around option and the Regular expression search mode

    • Click on the Replace All button

    => After some time…, any single-line <phrase.........</phrase> should contain the translated sentences, inside the <![CDATA[......]]> areas.

    Important: when I tested with your real data, it took about 9 minutes, on my old XP laptop, to process the 5,750 lines :-(( But I suppose that this S/R can be executed in less than 3 minutes on modern laptops!
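
    If the run time is a concern, the merge can also be sketched in Python with a dictionary lookup. The regex has to search the text after each phrase line for its pair, which is probably where most of the time goes, whereas a dictionary is built once. This is a sketch under assumptions, not a drop-in replacement: apply_translations is a made-up name, the first pair found for a given English text wins, and lines without a matching pair are left unchanged.

```python
import re

SEP = "\u25a0"  # BLACK SQUARE separator between English and Czech text
CASE_B = re.compile(
    r"^(  <phrase .+?<!\[CDATA\[)([^\]\r\n<\[]+)(\]\]></phrase>)$"
)

def apply_translations(xml_lines: list[str], pair_lines: list[str]) -> list[str]:
    # Build the English -> Czech table from the 'English■Czech' lines.
    table = {}
    for pair in pair_lines:
        english, sep, czech = pair.partition(SEP)
        if sep and english not in table:
            table[english] = czech
    # Swap the CDATA text of every Case B line that has a translation.
    result = []
    for line in xml_lines:
        m = CASE_B.match(line)
        if m and m.group(2) in table:
            line = m.group(1) + table[m.group(2)] + m.group(3)
        result.append(line)
    return result
```

    With this approach, the separator line of equal signs is not needed: the XML lines and the pair lines are simply read from two separate files.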


    To give you a general idea, just consider the beginning of your XML file, with 12 single-line <phrase.......</phrase> ranges and 1 multi-line <phrase.......</phrase> block, followed by the 12 English/Czech sentence pairs, after the separator line of equal signs

    <?xml version="1.0" encoding="utf-8"?>
    <language title="Čeština" date_format="d.m.Y" time_format="H:i" currency_format="{value} {symbol}" week_start="0" decimal_point="." thousands_separator="," label_separator=":" comma_separator=", " ellipsis="..." parenthesis_open="(" parenthesis_close=")" language_code="cs-CZ" text_direction="LTR" export_version="2">
      <phrase title="1_day_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One day ago]]></phrase>
      <phrase title="1_month_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One month ago]]></phrase>
      <phrase title="1_time" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 time]]></phrase>
      <phrase title="1_week_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One week ago]]></phrase>
      <phrase title="1_year" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 year]]></phrase>
      <phrase title="1_year_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[One year ago]]></phrase>
      <phrase title="2_weeks_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Two weeks ago]]></phrase>
      <phrase title="2_years_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Two years ago]]></phrase>
      <phrase title="2x_image_replacement_url" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[2x image replacement URL]]></phrase>
      <phrase title="2x_image_replacement_url_explain" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[If provided, the 2x image will be automatically displayed instead of the image URL above on devices capable of displaying a higher pixel resolution.<br />
    <br />
    <strong>Note: This option has no effect with sprite mode enabled.</strong>]]></phrase>
      <phrase title="3_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Three months ago]]></phrase>
      <phrase title="6_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Six months ago]]></phrase>
      <phrase title="9_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Nine months ago]]></phrase>
    =============
    One day ago■Před jedním dnem
    One month ago■Před měcísem
    1 time■1 krát
    One week ago■Před týdnem
    1 year■1 rok
    One year ago■Před rokem
    Two weeks ago■Před dvěma týdny
    Two years ago■Před dvěma lety
    2x image replacement URL■2x adresa URL pro výměnu obrázku
    Three months ago■Před trěmi měcísi
    Six months ago■Před šesti měcísi
    Nine months ago■Před devíti měcísi
    
    • Open the Replace dialog

    • SEARCH (?-s)^(\x20\x20<phrase .+?<!\[CDATA\[)([^]\r\n<[]+)(\]\]></phrase>)(?=(?s).+^=.*?\R(?-s)\2\x{25A0}(.+))|(?s)^===+.+

    • REPLACE \1\4\3

    • Tick the Wrap around option and the Regular expression search mode

    • Click on the Replace All button

    You should be left with the expected text :

    <?xml version="1.0" encoding="utf-8"?>
    <language title="Čeština" date_format="d.m.Y" time_format="H:i" currency_format="{value} {symbol}" week_start="0" decimal_point="." thousands_separator="," label_separator=":" comma_separator=", " ellipsis="..." parenthesis_open="(" parenthesis_close=")" language_code="cs-CZ" text_direction="LTR" export_version="2">
      <phrase title="1_day_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před jedním dnem]]></phrase>
      <phrase title="1_month_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před měcísem]]></phrase>
      <phrase title="1_time" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 krát]]></phrase>
      <phrase title="1_week_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před týdnem]]></phrase>
      <phrase title="1_year" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[1 rok]]></phrase>
      <phrase title="1_year_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před rokem]]></phrase>
      <phrase title="2_weeks_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před dvěma týdny]]></phrase>
      <phrase title="2_years_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před dvěma lety]]></phrase>
      <phrase title="2x_image_replacement_url" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[2x adresa URL pro výměnu obrázku]]></phrase>
      <phrase title="2x_image_replacement_url_explain" addon_id="XF" version_id="2000010" version_string="2.0.0 Alpha"><![CDATA[If provided, the 2x image will be automatically displayed instead of the image URL above on devices capable of displaying a higher pixel resolution.<br />
    <br />
    <strong>Note: This option has no effect with sprite mode enabled.</strong>]]></phrase>
      <phrase title="3_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před trěmi měcísi]]></phrase>
      <phrase title="6_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před šesti měcísi]]></phrase>
      <phrase title="9_months_ago" addon_id="XF" version_id="1010010" version_string="1.1.0 Alpha"><![CDATA[Před devíti měcísi]]></phrase>
    

    Just note, in this small example, that the multi-line block <phrase.......</phrase>, containing the text " If provided, the 2x image…sprite mode enabled. ", remains unchanged ( only single lines are changed ! )

    See you later,

    Best Regards,

    guy038

    P.S. :

    If necessary, you can send me some files by e-mail at tguy.038@gmail.com



  • @guy038

    off topic:
    welcome back. 👍
    almost a week without you is very close to the border of worrying enough to start a search party ;-) 😉

    sincere and best regards
    metachuh

