[Notepad++] Remove everything except a specific string using regex

Francesco Vespari

Hello everyone,
I need support with regular expressions.

I have a very large query from which I need to extract only the names of the tables used in the query. The tables have different names, but fortunately, they all follow the same pattern: database.schema.table_name.

The table_name is always composed of 6 characters in total: 3 letters followed by 3 numbers. So the complete pattern is:

database.schema.[A-Za-z]{3}\d{3}.

I’d like to remove everything that is between one pattern and the next, replacing it with nothing, effectively removing joins and various conditions that are in between.

For example, if I have:

SELECT *
FROM database.schema.abc123
INNER JOIN database.schema.def456 on id1 = id2
LEFT JOIN database.schema.ghi789 on id2 = id3
WHERE 1=1

I would like to have as a result:

database.schema.abc123
database.schema.def456
database.schema.ghi789

I made a first attempt using the regular expression:

database.schema.[A-Za-z]{3}\d{3}(.|\s)*?database.schema.[A-Za-z]{3}\d{3}

but it rightly includes in the selection also the string that I need to keep.

How can I set the regular expression to select only what is between two strings (different from each other but with the same pattern) excluding the strings of interest from the selection?

Thanks everyone for the support. 😊

mkupper

@Francesco-Vespari I would do it as two separate search/replace passes.

The first removes all lines except those that start with FROM.
Search: (?-i)^((?!FROM ).*\R)+
Replace: (blank or nothing)

With that I’m doing (?!FROM ).*\R which removes one line at a time but wrapped that in an extra ()+ as Notepad++ tends to take its time doing a delete regardless of how large it is. The ()+ deletes multiple consecutive lines that don’t start with FROM

Once you have that then your original regular expression should work well as there is no extraneous noise in the file.

It can be done in one search/replace but it’s much harder to understand and maintain than the two-pass method that first removes the noise.

PeterJones

@Francesco-Vespari said in [Notepad++] Remove everything except a specific string using regex:

How can I set the regular expression to select only what is between two strings (different from each other but with the same pattern) excluding the strings of interest from the selection?

I’ll focus on the general idea, giving hints as to how to do it for your specific case. But there are a plethora of ways. For example, if in A1BA2 you wanted to keep the “good” A1 and A2 and delete just the “bad” section B. There would be two common ways I’d do such a task:

you could capture the A1 and A2 into groups, and use those groups (rather than empty string) in the replacement: FIND = (A1)B(A2) , REPLACE = $1$2
you could use a combination of \K control flow escape (which says the stuff to the left of this symbol must match, but I don’t want it replaced by the replacement) and a lookahead assertion: FIND = A1\KB(?=A2) , REPLACE = empty

(A1, B, and A2 are meant to be placeholders, which would need to be real regex

For your specific case, the B would be your (.|\s)*? and the A1 and A2 would be the surrounding stuff.

A few other specific notes on your attempted regex:

database.schema.[A-Za-z]{3}\d{3}(.|\s)*?database.schema.[A-Za-z]{3}\d{3}

the . between the tokens will match anything, not just a literal dot character. so database-schema- would also match; use \. to match literal dot.
(.|\s)*? could be written as just .*? – . already matches a normal space. If you want that to also match across lines, you can use the . matches newline checkmarked, or prefix your whole regex with (?s)
since the A1 and A2 portions of your regex are the same regex code, you can make use of numbered recursion – so my example regex could be (A1)\KB(?=(?1)) which puts the regex A1 into group#1, then the (?1) inside the lookahead says “use the same regex as for group#1, but use it here”

guy038

Hello, @francesco-vespari, @mkupper, @peterjones and all,

Here is my alternate solution :

SEARCH (?-is)^(?:(?!data).)+\R|(?:(?!data).)+|(database\.schema\.[A-Za-z]{3}[0-9]{3})

REPLACE ?1$1

This solution keeps any empty line which already exists in the present text :

So, for example, given this INPUT text :

SELECT *
FROM database.schema.ABC123
INNER JOIN database.schema.def456 on id1 = id2
LEFT JOIN database.schema.ghi789 on id2 = id3

WHERE 1=1

SELECT *
FROM database.schema.ABC123
INNER JOIN database.schema.def456 on id1 = id2
LEFT JOIN database.schema.ghi789 on id2 = id3
WHERE 1=1





SELECT *
FROM database.schema.ABC123
INNER JOIN database.schema.def456 on id1 = id2
LEFT JOIN database.schema.ghi789 on id2 = id3
WHERE 1=1

we would get this OUTPUT one :

database.schema.ABC123
database.schema.def456
database.schema.ghi789


database.schema.ABC123
database.schema.def456
database.schema.ghi789





database.schema.ABC123
database.schema.def456
database.schema.ghi789

Notes :

The first part of the search regex, after the modifiers (?-is) is ^(?:(?!data).)+\R which matches any complete line, with its line-break, ONLY IF it does NOT contain the string data, with this case, at any position of current line
The second alternative (?:(?!data).)+ matches any part of line, which does NOT contain the string data, with this case
Until now, just note that no group has been defined because of the two non-capturing syntaxes (?:...
Finally, the third alternative is the string that we want to keep, i.e. the string (database\.schema\.[A-Za-z]{3}[0-9]{3}), with this exact case, which is stored as group 1
As we use a conditional replacement ?1$1, which means : ONLY rewrite the group 1 if this group is defined, we are just left with the expected text !

Best Regards,

guy038

Francesco Vespari

This post is deleted!