How to extract ....
-
Hello, I'm new to IT and regex, so maybe somebody can easily help me. I'm looking for the regular expression to filter useless lines/words out of a large .txt file with every kind of number, sign, word, whatever in it.
The expression should filter out the “XXXX” in lines shown like " WORD:XXX " or lines shown like “OTHERWORD:XXX”.
Thank you very much.
-
Hello @martin-huh,
Well, your goal is not very clear! You must be as accurate as possible, because regular expressions are a school of precision ;-))
- Do you want to FIND words immediately followed by a :XXX string, with this case?
- Do you want to MARK words immediately followed by a :XXX string, with this case?
- Do you want to EXTRACT words immediately followed by a :XXX string, with this case?
- Do you want to DELETE words immediately followed by a :XXX string, with this case?
And likewise for lines:
- Do you want to FIND any line containing at least one word immediately followed by a :XXX string, with this case?
- Do you want to MARK any line containing at least one word immediately followed by a :XXX string, with this case?
- Do you want to EXTRACT any line containing at least one word immediately followed by a :XXX string, with this case?
- Do you want to DELETE any line containing at least one word immediately followed by a :XXX string, with this case?
On the other hand, you spoke about both the :XXXX expression and the :XXX one. Which one is relevant? Maybe the XXX part is generic and stands for a specific expression!? So, in short, give us additional information!
Regular expressions can match practically any kind of text! Just tell us what exact text is needed. For instance:
- I want to find any text between columns 30 and 40:
(?-s)^.{29}\K.{11}
- I want to extract any line containing the string abc twice:
(?-s)^.*abc.*abc.*
- I want to mark any multi-line text beginning with <!-- START --> and ending with <!-- END -->:
(?s)<!-- START -->.*?<!-- END -->
- I want to delete the last three characters of the sixth field of any line in a TSV file:
^(?:([^\t\r\n])+\t){5}(?1)+\K(?1){3}(?=\t)
and so on…
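For readers who would rather script this than use the Notepad++ dialogs, here is a rough Python sketch of the first three examples. It is only an illustration under assumptions: the function names are made up for this post, and the patterns are adapted because Python's standard re module has no \K (a capturing group stands in for it) and treats "." as not matching newlines unless re.DOTALL is given. The fourth example relies on \K and the (?1) subroutine call, which the standard re module does not provide, so it is left out.

```python
import re

def columns_30_to_40(line):
    """Return the 11 characters in columns 30..40, or None if the line is too short."""
    # No \K in Python's re: capture the wanted part instead of resetting the match start.
    m = re.match(r".{29}(.{11})", line)
    return m.group(1) if m else None

def lines_with_abc_twice(text):
    """Return every line that contains the string abc at least twice."""
    return [ln for ln in text.splitlines() if re.search(r"abc.*abc", ln)]

def marked_blocks(text):
    """Return each shortest block running from <!-- START --> to <!-- END -->."""
    # re.DOTALL plays the role of (?s): "." may cross line breaks.
    return re.findall(r"<!-- START -->.*?<!-- END -->", text, flags=re.DOTALL)

if __name__ == "__main__":
    print(columns_30_to_40("x" * 29 + "ABCDEFGHIJK" + "tail"))
    print(lines_with_abc_twice("abc one abc\nonly abc here\nabcabc"))
    print(marked_blocks("a\n<!-- START -->\nkeep\n<!-- END -->\nb"))
```

The sample strings in the last three lines are invented test data, not taken from anyone's file.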
TIA,
Best regards,
guy038
-
-
Hello guy038,
Thanks first for your help, and sorry for not speaking clearly.
Imagine I have a huge text file and I want to extract just the word that follows a specific keyword.
Example: my keyword is ECB and I want to extract the word right next to it (marked in bold). In this case it would be:
First result line: unveils
Second result line: Frankfurt
Third result line: lacks
Fourth result line: board
When the ECB unveils the results of its grand strategy review this year, there will be at least one stark contrast with the U.S. Federal Reserve’s own exercise. Inequality in the labor market, a hot-button topic of the 2020s and a core part of the Fed’s conclusions, looks likely to get much more subdued treatment in ECB Frankfurt.
That’s partly because the ECB lacks the Fed’s dual mandate for price stability and full employment. But it’s also because policymakers in Europe don’t have access to data to give them a full picture of inequality in the region, including whether racial and ethnic minorities are benefiting equally from monetary and fiscal stimulus.
Bloomberg’s analysis of speeches by ECB board members shows that mentions of terms related to labor markets have declined, while references to climate change and a digital euro—both issues popular with President Christine Lagarde—have increased.
-
Hi, @martin-huh and All,
Ah… OK ! So, here is the road map :
- Open your huge file in Notepad++
- Open the Mark dialog (Ctrl + M)
- SEARCH (?-i)(?<=ECB )\w+ (with the Regular expression search mode selected)
- Tick the three options Bookmark line, Purge for each search and Wrap around
- Click on the Mark All button
=> The appropriate words, which follow the string ECB and a space char, should be highlighted in red
- Now, click on the Copy Marked Text button
- Open a new tab (Ctrl + N)
- Paste the clipboard contents (Ctrl + V)
Here you are ! You get the list of all these specific words
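If you would rather script this step, the same extraction can be sketched in Python. This is only an illustration, not part of the Notepad++ procedure: the function name words_after_ecb is made up, and the (?-i) prefix is left out because Python's re module is case-sensitive by default.

```python
import re

# Look-behind for "ECB " (note the trailing space), then one or more word characters.
PATTERN = re.compile(r"(?<=ECB )\w+")

def words_after_ecb(text):
    """Return every word that directly follows 'ECB ' in text, in order of appearance."""
    return PATTERN.findall(text)

if __name__ == "__main__":
    # Shortened excerpts of the sample text posted above.
    sample = (
        "When the ECB unveils the results of its grand strategy review this year.\n"
        "Much more subdued treatment in ECB Frankfurt.\n"
        "That's partly because the ECB lacks the Fed's dual mandate.\n"
        "Analysis of speeches by ECB board members.\n"
    )
    print(words_after_ecb(sample))  # ['unveils', 'Frankfurt', 'lacks', 'board']
```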
Now, if you prefer the list of all lines containing, at least, one of these key-words :
- Right-click on the Bookmark margin and select the Copy Bookmarked Lines item (or use the Search > Bookmark > Copy Bookmarked Lines option)
- Again, open a new tab (Ctrl + N)
- Paste the clipboard contents (Ctrl + V)
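The "whole lines" variant can be sketched in Python too. As before, this is a hedged illustration rather than what Notepad++ does internally, and lines_with_keyword is a hypothetical helper name: it keeps every line in which the pattern matches at least once.

```python
import re

def lines_with_keyword(text, pattern=r"(?<=ECB )\w+"):
    """Return the lines of text in which pattern matches at least once."""
    rx = re.compile(pattern)  # case-sensitive by default, like (?-i)
    return [ln for ln in text.splitlines() if rx.search(ln)]

if __name__ == "__main__":
    # Invented sample lines: only the first one has a word right after "ECB ".
    sample = "the ECB lacks a dual mandate\nno keyword here\nECB\necb board\n"
    for ln in lines_with_keyword(sample):
        print(ln)
```

A line that merely ends with ECB, or that contains ecb in lower case, is not kept, because the look-behind needs the exact string "ECB " followed by a word character.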
Notes :
- The in-line modifier (?-i) forces the search to be case-sensitive (non-ignore case), whether or not you’ve ticked the Match case option
- The \w+ represents the non-empty range of regex word characters to search for
- The (?<=ECB ) is a look-behind structure, so a condition which must be true before the word to match but which is not part of the match (note the space char before the closing parenthesis)
- So the overall regex can be expressed, in English, as:
Match any word which is preceded by the string "ECB ", with that exact case
Best regards,
guy038
-