How to extract ....



  • Hello, I'm new to IT and regex; maybe somebody can easily help me. I'm looking for the regex expression to filter useless lines/words out of a large .txt file with every kind of number, sign, word, whatever in it.
    The expression should filter out the “XXXX”
    in lines shown like " WORD:XXX "
    or lines shown like “OTHERWORD:XXX”

    Thank you very much



  • Hello @martin-huh,

    Well, your goal is not very clear! You must be as accurate as possible, because regular expressions are a school of precision ;-))

    • Do you want to FIND words immediately followed by a :XXX string, with this exact case?

    • Do you want to MARK words immediately followed by a :XXX string, with this exact case?

    • Do you want to EXTRACT words immediately followed by a :XXX string, with this exact case?

    • Do you want to DELETE words immediately followed by a :XXX string, with this exact case?

    And likewise for lines:

    • Do you want to FIND any line containing at least one word immediately followed by a :XXX string, with this exact case?

    • Do you want to MARK any line containing at least one word immediately followed by a :XXX string, with this exact case?

    • Do you want to EXTRACT any line containing at least one word immediately followed by a :XXX string, with this exact case?

    • Do you want to DELETE any line containing at least one word immediately followed by a :XXX string, with this exact case?


    On the other hand, you mentioned both the XXXX expression and the XXX one. Which one is relevant? Maybe the XXX part is generic and stands for a specific expression? So, in short, give us additional information!


    Regular expressions can match practically any kind of text! Just tell us exactly what text is needed. For instance:

    • I want to find any text between columns 30 and 40 :

    (?-s)^.{29}\K.{11}

    • I want to extract any line containing the string abc at least twice:

    (?-s)^.*abc.*abc.*

    • I want to mark any multi-lines text, beginning with <!-- START --> and ending with <!-- END --> :

    (?s)<!-- START -->.*?<!-- END -->

    • I want to delete the three last characters of the sixth field of any line, in a TSV file :

    ^(?:([^\t\r\n])+\t){5}(?1)+\K(?1){3}(?=\t)

    and so on…
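For anyone who wants to try these four ideas outside Notepad++, here is a rough sketch using Python's standard re module. It is a translation under assumptions, not the Notepad++ expressions themselves: Python's re has no \K and no (?1) subroutine call, so capture groups stand in for them, and all sample strings and variable names below are invented for illustration.

```python
import re

# 1. Text in columns 30 to 40 of each line (a capture group stands in for \K)
line = "-" * 29 + "COLS30to40!" + "rest of the line"
cols = re.findall(r"^.{29}(.{11})", line, re.MULTILINE)

# 2. Whole lines containing "abc" at least twice
#    (by default "." does not cross a newline, like (?-s) in Notepad++)
text = "abc once\nabc and abc again\nnothing here\n"
twice = re.findall(r"^.*abc.*abc.*$", text, re.MULTILINE)

# 3. Multi-line blocks from <!-- START --> to the nearest <!-- END -->
#    (re.DOTALL plays the role of (?s))
page = "a\n<!-- START -->\nkeep me\n<!-- END -->\nb\n"
blocks = re.findall(r"<!-- START -->.*?<!-- END -->", page, re.DOTALL)

# 4. Drop the last three characters of the sixth tab-separated field:
#    group 1 keeps everything up to those three characters
row = "f1\tf2\tf3\tf4\tf5\tsixth123\tf7"
trimmed = re.sub(
    r"^((?:[^\t\r\n]+\t){5}[^\t\r\n]+)[^\t\r\n]{3}(?=\t)",
    r"\1",
    row,
    flags=re.MULTILINE,
)

print(cols, twice, blocks, repr(trimmed), sep="\n")
```

As with the original expression, the fourth pattern only fires when the sixth field has at least four characters and is followed by another tab.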

    TIA,

    Best regards,

    guy038



  • Hello guy038,

    Thanks first for your help, and sorry for not being clear.

    Imagine I have a huge text file and I want to extract just the word that follows a specific keyword.

    Example -> my keyword is ECB and I want to extract the word right next to it (marked in bold). In this case it would be:
    First result line: unveils
    Second result line: Frankfurt
    Third result line: lacks
    Fourth result line: board

    When the ECB unveils the results of its grand strategy review this year, there will be at least one stark contrast with the U.S. Federal Reserve’s own exercise. Inequality in the labor market, a hot-button topic of the 2020s and a core part of the Fed’s conclusions, looks likely to get much more subdued treatment in ECB Frankfurt.

    That’s partly because the ECB lacks the Fed’s dual mandate for price stability and full employment. But it’s also because policymakers in Europe don’t have access to data to give them a full picture of inequality in the region, including whether racial and ethnic minorities are benefiting equally from monetary and fiscal stimulus.

    Bloomberg’s analysis of speeches by ECB board members shows that mentions of terms related to labor markets have declined, while references to climate change and a digital euro—both issues popular with President Christine Lagarde—have increased.



  • Hi, @martin-huh and All,

    Ah… OK ! So, here is the road map :

    • Open your huge file in Notepad++

    • Open the Mark dialog ( Ctrl + M )

    • SEARCH (?-i)(?<=ECB )\w+

    • Tick the three options Bookmark line, Purge for each search and Wrap around

    • Click on the Mark All button

    => The matching words, which follow the string ECB and a space character, should be highlighted in red

    • Now, click on the Copy Marked Text button

    • Open a new tab ( Ctrl + N )

    • Paste the clipboard contents ( Ctrl + V )

    Here you are ! You get the list of all these specific words
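If you would rather script this than click through the Mark dialog, here is a minimal Python sketch of the same search. It assumes Python's standard re module, which accepts this fixed-width look-behind and is case-sensitive by default, so the (?-i) prefix is simply dropped. The sample string is a shortened stand-in for the real file, and the variable names are made up.

```python
import re

# Shortened excerpts of the example text; a real script would read the file instead
sample = (
    "When the ECB unveils the results of its grand strategy review this year\n"
    "looks likely to get much more subdued treatment in ECB Frankfurt.\n"
    "That is partly because the ECB lacks the dual mandate\n"
    "analysis of speeches by ECB board members\n"
)

# Every run of word characters that comes right after "ECB " (exact case)
words = re.findall(r"(?<=ECB )\w+", sample)

print("\n".join(words))
```

On this sample the list holds unveils, Frankfurt, lacks and board, one per line, which is what Copy Marked Text would give you.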


    Now, if you prefer the list of all lines containing at least one of these words:

    • Right-click on the Bookmark margin and select the Copy Bookmarked Lines item ( or use the Search > Bookmark > Copy Bookmarked Lines option )

    • Again, open a new tab ( Ctrl + N )

    • Paste the clipboard contents ( Ctrl + V )
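The bookmarked-lines variant can also be sketched in a few lines of Python, again as an illustration rather than a replacement for the steps above. The helper name lines_with_keyword and the sample text are hypothetical.

```python
import re

# Same expression as in the Mark dialog, minus (?-i): Python is case-sensitive by default
pattern = re.compile(r"(?<=ECB )\w+")

def lines_with_keyword(text):
    """Return every line holding at least one word preceded by 'ECB '."""
    return [line for line in text.splitlines() if pattern.search(line)]

sample = (
    "The ECB lacks a dual mandate.\n"
    "No keyword on this line.\n"
    "Speeches by ECB board members.\n"
)
kept = lines_with_keyword(sample)

print("\n".join(kept))
```

Each line is kept once, however many matches it contains, just like a bookmarked line in Notepad++.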


    Notes :

    • The in-line modifier (?-i) forces the search to be case-sensitive, whether or not you've ticked the Match case option

    • The \w+ part matches a non-empty run of regex word characters

    • The (?<=ECB ) part is a look-behind assertion: a condition which must be true immediately before the word to match, but which is not part of the match ( note the space char before the closing parenthesis )

    • So the overall regex can be expressed, in plain English, as:

    Match any word which is preceded by the string "ECB ", with that exact case
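To see these notes in action, here is a small Python sketch (assuming the standard re module; the test string is invented). It checks that the look-behind text stays out of the match, and that only the exact-case keyword counts unless case is explicitly ignored.

```python
import re

text = "the ECB lacks, the ecb waits"

m = re.search(r"(?<=ECB )\w+", text)
matched = m.group()                    # the word only, without "ECB "
before = text[m.start() - 4:m.start()] # the four characters just before the match

# Case-sensitive by default: "ecb waits" is skipped
strict = re.findall(r"(?<=ECB )\w+", text)

# With re.IGNORECASE the look-behind accepts "ecb " too
loose = re.findall(r"(?<=ECB )\w+", text, re.IGNORECASE)

print(matched, repr(before), strict, loose)
```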

    Best regards,

    guy038