• Login
Community
  • Login

[Notepad++] Remove everything except a specific string using regex

Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
regexregex group replacesql
5 Posts 4 Posters 1.2k Views
Loading More Posts
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • F
    Francesco Vespari
    last edited by Oct 10, 2024, 7:14 PM

    Hello everyone,
    I need support with regular expressions.

    I have a very large query from which I need to extract only the names of the tables used in the query. The tables have different names, but fortunately, they all follow the same pattern: database.schema.table_name.

    The table_name is always composed of 6 characters in total: 3 letters followed by 3 numbers. So the complete pattern is:

    database.schema.[A-Za-z]{3}\d{3}.
    

    I’d like to remove everything that is between one pattern and the next, replacing it with nothing, effectively removing joins and various conditions that are in between.

    For example, if I have:

    SELECT *
    FROM database.schema.abc123
    INNER JOIN database.schema.def456 on id1 = id2
    LEFT JOIN database.schema.ghi789 on id2 = id3
    WHERE 1=1
    

    I would like to have as a result:

    database.schema.abc123
    database.schema.def456
    database.schema.ghi789
    

    I made a first attempt using the regular expression:

    database.schema.[A-Za-z]{3}\d{3}(.|\s)*?database.schema.[A-Za-z]{3}\d{3}
    

    but it rightly includes in the selection also the string that I need to keep.

    How can I set the regular expression to select only what is between two strings (different from each other but with the same pattern) excluding the strings of interest from the selection?

    Thanks everyone for the support. 😊

    M P 2 Replies Last reply Oct 10, 2024, 7:28 PM Reply Quote 1
    • M
      mkupper @Francesco Vespari
      last edited by Oct 10, 2024, 7:28 PM

      @Francesco-Vespari I would do it as two separate search/replace passes.

      The first removes all lines except those that start with FROM.
      Search: (?-i)^((?!FROM ).*\R)+
      Replace: (blank or nothing)

      With that I’m doing (?!FROM ).*\R which removes one line at a time but wrapped that in an extra ()+ as Notepad++ tends to take its time doing a delete regardless of how large it is. The ()+ deletes multiple consecutive lines that don’t start with FROM

      Once you have that then your original regular expression should work well as there is no extraneous noise in the file.

      It can be done in one search/replace but it’s much harder to understand and maintain than the two-pass method that first removes the noise.

      1 Reply Last reply Reply Quote 1
      • P
        PeterJones @Francesco Vespari
        last edited by Oct 10, 2024, 7:43 PM

        @Francesco-Vespari said in [Notepad++] Remove everything except a specific string using regex:

        How can I set the regular expression to select only what is between two strings (different from each other but with the same pattern) excluding the strings of interest from the selection?

        I’ll focus on the general idea, giving hints as to how to do it for your specific case. But there are a plethora of ways. For example, if in A1BA2 you wanted to keep the “good” A1 and A2 and delete just the “bad” section B. There would be two common ways I’d do such a task:

        • you could capture the A1 and A2 into groups, and use those groups (rather than empty string) in the replacement: FIND = (A1)B(A2) , REPLACE = $1$2
        • you could use a combination of \K control flow escape (which says the stuff to the left of this symbol must match, but I don’t want it replaced by the replacement) and a lookahead assertion : FIND = A1\KB(?=A2) , REPLACE = empty

        (A1, B, and A2 are meant to be placeholders, which would need to be real regex

        For your specific case, the B would be your (.|\s)*? and the A1 and A2 would be the surrounding stuff.

        A few other specific notes on your attempted regex:

        database.schema.[A-Za-z]{3}\d{3}(.|\s)*?database.schema.[A-Za-z]{3}\d{3}

        • the . between the tokens will match anything, not just a literal dot character. so database-schema- would also match; use \. to match literal dot.
        • (.|\s)*? could be written as just .*? – . already matches a normal space. If you want that to also match across lines, you can use the . matches newline checkmarked, or prefix your whole regex with (?s)
        • since the A1 and A2 portions of your regex are the same regex code, you can make use of numbered recursion – so my example regex could be (A1)\KB(?=(?1)) which puts the regex A1 into group#1, then the (?1) inside the lookahead says “use the same regex as for group#1, but use it here”
        1 Reply Last reply Reply Quote 1
        • G
          guy038
          last edited by guy038 Oct 11, 2024, 6:03 AM Oct 11, 2024, 5:56 AM

          Hello, @francesco-vespari, @mkupper, @peterjones and all,

          Here is my alternate solution :

          SEARCH (?-is)^(?:(?!data).)+\R|(?:(?!data).)+|(database\.schema\.[A-Za-z]{3}[0-9]{3})

          REPLACE ?1$1

          This solution keeps any empty line which already exists in the present text :

          So, for example, given this INPUT text :

          SELECT *
          FROM database.schema.ABC123
          INNER JOIN database.schema.def456 on id1 = id2
          LEFT JOIN database.schema.ghi789 on id2 = id3
          
          WHERE 1=1
          
          SELECT *
          FROM database.schema.ABC123
          INNER JOIN database.schema.def456 on id1 = id2
          LEFT JOIN database.schema.ghi789 on id2 = id3
          WHERE 1=1
          
          
          
          
          
          SELECT *
          FROM database.schema.ABC123
          INNER JOIN database.schema.def456 on id1 = id2
          LEFT JOIN database.schema.ghi789 on id2 = id3
          WHERE 1=1
          

          we would get this OUTPUT one :

          database.schema.ABC123
          database.schema.def456
          database.schema.ghi789
          
          
          database.schema.ABC123
          database.schema.def456
          database.schema.ghi789
          
          
          
          
          
          database.schema.ABC123
          database.schema.def456
          database.schema.ghi789
          

          Notes :

          • The first part of the search regex, after the modifiers (?-is) is ^(?:(?!data).)+\R which matches any complete line, with its line-break, ONLY IF it does NOT contain the string data, with this case, at any position of current line

          • The second alternative (?:(?!data).)+ matches any part of line, which does NOT contain the string data, with this case

          • Until now, just note that no group has been defined because of the two non-capturing syntaxes (?:...

          • Finally, the third alternative is the string that we want to keep, i.e. the string (database\.schema\.[A-Za-z]{3}[0-9]{3}), with this exact case, which is stored as group 1

          • As we use a conditional replacement ?1$1, which means : ONLY rewrite the group 1 if this group is defined, we are just left with the expected text !

          Best Regards,

          guy038

          F 1 Reply Last reply Oct 15, 2024, 9:00 PM Reply Quote 3
          • F
            Francesco Vespari @guy038
            last edited by Oct 15, 2024, 9:00 PM

            This post is deleted!
            1 Reply Last reply Reply Quote 0
            3 out of 5
            • First post
              3/5
              Last post
            The Community of users of the Notepad++ text editor.
            Powered by NodeBB | Contributors