[Notepad++] Remove everything except a specific string using regex
-
Hello everyone,
I need support with regular expressions.I have a very large query from which I need to extract only the names of the tables used in the query. The tables have different names, but fortunately, they all follow the same pattern: database.schema.table_name.
The table_name is always composed of 6 characters in total: 3 letters followed by 3 numbers. So the complete pattern is:
database.schema.[A-Za-z]{3}\d{3}.
I’d like to remove everything that is between one pattern and the next, replacing it with nothing, effectively removing joins and various conditions that are in between.
For example, if I have:
SELECT * FROM database.schema.abc123 INNER JOIN database.schema.def456 on id1 = id2 LEFT JOIN database.schema.ghi789 on id2 = id3 WHERE 1=1
I would like to have as a result:
database.schema.abc123 database.schema.def456 database.schema.ghi789
I made a first attempt using the regular expression:
database.schema.[A-Za-z]{3}\d{3}(.|\s)*?database.schema.[A-Za-z]{3}\d{3}
but it rightly includes in the selection also the string that I need to keep.
How can I set the regular expression to select only what is between two strings (different from each other but with the same pattern) excluding the strings of interest from the selection?
Thanks everyone for the support. 😊
-
@Francesco-Vespari I would do it as two separate search/replace passes.
The first removes all lines except those that start with
FROM
.
Search:(?-i)^((?!FROM ).*\R)+
Replace:(blank or nothing)
With that I’m doing
(?!FROM ).*\R
which removes one line at a time but wrapped that in an extra()+
as Notepad++ tends to take its time doing a delete regardless of how large it is. The()+
deletes multiple consecutive lines that don’t start withFROM
Once you have that then your original regular expression should work well as there is no extraneous noise in the file.
It can be done in one search/replace but it’s much harder to understand and maintain than the two-pass method that first removes the noise.
-
@Francesco-Vespari said in [Notepad++] Remove everything except a specific string using regex:
How can I set the regular expression to select only what is between two strings (different from each other but with the same pattern) excluding the strings of interest from the selection?
I’ll focus on the general idea, giving hints as to how to do it for your specific case. But there are a plethora of ways. For example, if in
A1BA2
you wanted to keep the “good”A1
andA2
and delete just the “bad” sectionB
. There would be two common ways I’d do such a task:- you could capture the A1 and A2 into groups, and use those groups (rather than empty string) in the replacement: FIND =
(A1)B(A2)
, REPLACE =$1$2
- you could use a combination of
\K
control flow escape (which says the stuff to the left of this symbol must match, but I don’t want it replaced by the replacement) and a lookahead assertion: FIND =A1\KB(?=A2)
, REPLACE = empty
(
A1
,B
, andA2
are meant to be placeholders, which would need to be real regexFor your specific case, the
B
would be your(.|\s)*?
and theA1
andA2
would be the surrounding stuff.A few other specific notes on your attempted regex:
database.schema.[A-Za-z]{3}\d{3}(.|\s)*?database.schema.[A-Za-z]{3}\d{3}
- the
.
between the tokens will match anything, not just a literal dot character. sodatabase-schema-
would also match; use\.
to match literal dot. (.|\s)*?
could be written as just.*?
–.
already matches a normal space. If you want that to also match across lines, you can use the. matches newline
checkmarked, or prefix your whole regex with(?s)
- since the
A1
andA2
portions of your regex are the same regex code, you can make use of numbered recursion – so my example regex could be(A1)\KB(?=(?1))
which puts the regexA1
into group#1, then the(?1)
inside the lookahead says “use the same regex as for group#1, but use it here”
- you could capture the A1 and A2 into groups, and use those groups (rather than empty string) in the replacement: FIND =
-
Hello, @francesco-vespari, @mkupper, @peterjones and all,
Here is my alternate solution :
SEARCH
(?-is)^(?:(?!data).)+\R|(?:(?!data).)+|(database\.schema\.[A-Za-z]{3}[0-9]{3})
REPLACE
?1$1
This solution keeps any empty line which already exists in the present text :
So, for example, given this INPUT text :
SELECT * FROM database.schema.ABC123 INNER JOIN database.schema.def456 on id1 = id2 LEFT JOIN database.schema.ghi789 on id2 = id3 WHERE 1=1 SELECT * FROM database.schema.ABC123 INNER JOIN database.schema.def456 on id1 = id2 LEFT JOIN database.schema.ghi789 on id2 = id3 WHERE 1=1 SELECT * FROM database.schema.ABC123 INNER JOIN database.schema.def456 on id1 = id2 LEFT JOIN database.schema.ghi789 on id2 = id3 WHERE 1=1
we would get this OUTPUT one :
database.schema.ABC123 database.schema.def456 database.schema.ghi789 database.schema.ABC123 database.schema.def456 database.schema.ghi789 database.schema.ABC123 database.schema.def456 database.schema.ghi789
Notes :
-
The first part of the search regex, after the modifiers
(?-is)
is^(?:(?!data).)+\R
which matches any complete line, with its line-break, ONLY IF it does NOT contain the stringdata
, with this case, at any position of current line -
The second alternative
(?:(?!data).)+
matches any part of line, which does NOT contain the stringdata
, with this case -
Until now, just note that no group has been defined because of the two non-capturing syntaxes
(?:...
-
Finally, the third alternative is the string that we want to keep, i.e. the string
(database\.schema\.[A-Za-z]{3}[0-9]{3})
, with this exact case, which is stored as group1
-
As we use a conditional replacement
?1$1
, which means : ONLY rewrite the group1
if this group is defined, we are just left with the expected text !
Best Regards,
guy038
-
-
This post is deleted!