    Problem with regex in multi-line search

    • Richard Darwin

      Fellow Notepad++ Users,

      Could you please help me with the following search-and-replace problem I am having?

      Here is the debug info:

      Notepad++ v8.7.5   (32-bit)
      Build time : Dec 21 2024 - 05:11:15
      Path : I:\Binaries\Notepad++\notepad++.exe
      Command Line : 
      Admin mode : OFF
      Local Conf mode : OFF
      Cloud Config : OFF
      Periodic Backup : OFF
      Placeholders : OFF
      DirectWrite : ON
      Multi-instance Mode : multiInst
      File Status Auto-Detection : cdEnabledNew (for current file/tab only)
      Dark Mode : OFF
      OS Name : Windows 11 Home (64-bit)
      OS Version : 24H2
      OS Build : 26100.3476
      Current ANSI codepage : 1252
      Plugins : 
          mimeTools (3.1)
          NppConverter (4.6)
          NppExport (0.4)
      

      I am trying to massage some data about English syllable structure. I have parsed an IPA data file into phoneme classes: P represents [p]|[t]|[k], B represents the voiced counterparts, X represents [f]|[θ]|[s]|[ʃ], Ɣ the voiced fricatives, and so on. V obviously represents any vowel.

      What I want to do is find adjacent lines that are identical except that the second one ends in an additional [Ɣ], [VƔ], or [X], which marks a plural of the word on the line above or a third-person verb form. For example, the original text might have been ‘limp /ˈɫɪmp/’, ‘limps /ˈɫɪmps/’, which are massaged into ‘ˈLVNP’, ‘ˈLVNPX’. I want to replace that suffix with ‘=Z’ (a generalization of =Zp ‘plural’ and =Z3 ‘3rd.person.singular’).

      Here is the data I currently have (“before” data):

       
       BRVNBVˈNVPV
       BRVNVˈPVPVV
       --snip--
      ˌƔVXˈPVƔVPX
      ˌƔVƔVˈBVRVPV
      ˌƔVƔVˈPVVXVN
      ˌƔVƔVˈPVVXVNƔ
      ˌƔVƔVˈVNV
      ˌƔVƔVˈVNVV
      ˌƔVƔWVRVˈƔVVXVN
      ˌƔVˈBRVBV
      ˌƔVˈBRVX
      ˌƔVˈBRVXVX
      ˌƔVˈBVPV
      ˌƔVˈBVRV
      ˌƔVˈBVRVV
      ˌƔVˈBVRVVƔ
      ˌƔVˈBVX
      ˌƔVˈBVXVX
      

      Here is how I would like that data to look (“after” data):

       
       BRVNBVˈNVPV
       BRVNVˈPVPVV
       --snip--
      ˌƔVXˈPVƔVPX
      ˌƔVƔVˈBVRVPV
      ˌƔVƔVˈPVVXVN
      ˌƔVƔVˈPVVXVN=Z
      ˌƔVƔVˈVNV
      ˌƔVƔVˈVNVV
      ˌƔVƔWVRVˈƔVVXVN
      ˌƔVˈBRVBV
      ˌƔVˈBRVX
      ˌƔVˈBRVXVX
      ˌƔVˈBVPV
      ˌƔVˈBVRV
      ˌƔVˈBVRVV
      ˌƔVˈBVRVV=Z
      ˌƔVˈBVX
      

      To accomplish this, I have tried using the following Find/Replace expressions and settings:

      • Find What = ^(.+)$\n^\1Ɣ$
      • Replace With = \1\n\1=Z
      • Search Mode = REGULAR EXPRESSION
      • Dot Matches Newline = CHECKED or NOT CHECKED

      This regex is supposed to match the entire first line, the ‘\n’, and a second line with the suffix (initially just Ɣ). With ‘Dot matches Newline’ unchecked, this returns ‘Find: can’t find the text “^(.+)$\n^\1Ɣ$” in entire file’. With the setting checked, it either searches the entire file and selects the first line (blank) and all but the last three characters of the second line, or it says ‘Invalid Regular Expression’ and notes that ‘complexity exceeds predefined bounds’, depending on where I start the search (the file is 646k, 64k lines).
      Taking out the carets from the regex did not seem to have any effect; taking out the ‘$’s with Dot unchecked led to ‘can’t find text’, and with Dot checked led to ‘invalid regex’.

      This did not produce the output I desired, and I’m not sure why. Could you please help me understand what went wrong and help me find the solution?

      • Coises

        @Richard-Darwin

        Most likely you are editing a file with Windows line endings, which are \r\n, not just \n.
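
        If you want to confirm the line endings outside the editor, a rough check along these lines may help. It is only a sketch in Python, not anything Notepad++ does itself; the function name count_line_endings is made up for illustration, and it assumes the file is UTF-8 or a single-byte encoding such as codepage 1252 (not UTF-16).

```python
def count_line_endings(data: bytes) -> dict:
    """Count each line-ending style in raw file bytes.

    Assumes UTF-8 or a single-byte encoding, where the bytes 0x0D and 0x0A
    only ever mean CR and LF. (Hypothetical helper, for illustration only.)
    """
    crlf = data.count(b"\r\n")
    # Lone LF and lone CR are the totals minus those inside a CR LF pair.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": lf, "CR": cr}


# Two lines from the sample data, saved with Windows line endings.
sample = "ˌƔVˈBVRVV\r\nˌƔVˈBVRVVƔ\r\n".encode("utf-8")
print(count_line_endings(sample))  # {'CRLF': 2, 'LF': 0, 'CR': 0}
```

        For a real file you would pass in the result of open(path, "rb").read(); a non-zero CRLF count means a bare \n in the search expression cannot match the whole line break.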

        I would suggest:

        • Leave . matches newline unchecked — you don’t want that .+ to cross lines

        • Use \R to represent the break between lines — it matches \r, \n or \r\n

        • You don’t need the $ when it’s immediately followed by a line-ending character, nor the circumflex immediately after (though they don’t hurt, either)

        So, try: ^(.+)\R\1Ɣ$ and see if that matches as desired; leave . matches newline unchecked. If you are using Windows line endings, use \1\r\n\1=Z to replace. (On the status bar at the bottom, towards the right side, you’ll see either Windows (CR LF), Unix (LF) or Macintosh (CR), which tells you the current line ending setting for your file.)
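
        For comparison, here is roughly the same idea as a Python sketch. Treat it as an illustration under assumptions rather than a drop-in equivalent: Python's re module is not the Boost engine Notepad++ uses, it has no \R, and its ^ and $ only recognise \n, so the line break and the line boundaries are spelled out by hand. The suffix alternation (VƔ, Ɣ, X) is my extension based on the suffixes listed in the question; the expression above handles only Ɣ. The names SUFFIX, PATTERN and mark_suffixed_lines are invented for this example.

```python
import re

# Suffixes named in the question; the Notepad++ expression above covers only Ɣ.
SUFFIX = r"(?:VƔ|Ɣ|X)"

# (?<![^\r\n])   start of a line: start of text, or just after a CR or LF
# ([^\r\n]+)     group 1: the whole first line, never crossing a line break
# (\r\n|\n|\r)   group 2: the line break, whichever style the file uses
# \1 + SUFFIX    the same text again, followed by one of the suffixes
# (?=[\r\n]|\Z)  end of the second line
PATTERN = re.compile(
    rf"(?<![^\r\n])([^\r\n]+)(\r\n|\n|\r)\1{SUFFIX}(?=[\r\n]|\Z)"
)


def mark_suffixed_lines(text: str) -> str:
    """Replace the suffix on the second line of each matching pair with '=Z'.

    Group 2 is reused in the replacement, so the file's own line-ending
    style is preserved instead of being hard-coded as \\r\\n.
    """
    return PATTERN.sub(r"\1\2\1=Z", text)


before = "ˌƔVˈBVRV\r\nˌƔVˈBVRVV\r\nˌƔVˈBVRVVƔ\r\nˌƔVˈBVX\r\n"
print(repr(mark_suffixed_lines(before)))
```

        Capturing the line break and putting it back is the Python counterpart of choosing \1\r\n\1=Z for a Windows file. If you read the file with open(), pass newline="" so Python does not silently convert \r\n to \n before the pattern sees it.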
