    How to Print Pretty with missing close tags.

    • Doctor Rashir

      I am looking at a Quicken QFX log file that is in a sort of XML type format. The format has many missing End tags so this causes the XML Tools - Pretty Print to indent nearly forever.

      Is there a way to align the Start and End tags that are present?

      For example, in the following code, how do I align the bolded lines?

      <OFX>
      	<SIGNONMSGSRQV1>
      		**<SONRQ>**
      			<DTCLIENT>20250520104016.123[-7:MST]
      				<USERID>anonymous00000000000000000000000
      					<USERPASS>X
      						<GENUSERKEY>N
      							<LANGUAGE>ENG
      								<APPID>QWIN
      									<APPVER>2700
      									**</SONRQ>**
      						</SIGNONMSGSRQV1>
      						<INTU.BRANDMSGSRQV1>
      							<INTU.BRANDTRNRQ>
      								<TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026
      									<INTU.BRANDRQ>
      
      

      I am running Windows 11 (latest update) and Notepad++ v8.8.3.

      • PeterJones @Doctor Rashir

        @Doctor-Rashir said in How to Print Pretty with missing close tags.:

        I am looking at a Quicken QFX log file that is in a sort of XML type format. The format has many missing End tags so this causes the XML Tools - Pretty Print to indent nearly forever.

        Is there a way to align the Start and End tags that are present?

        XML Tools is designed to work with well-formed XML. Input that is not well-formed (i.e., has unclosed tags) is too much of an edge case for it. It’s doubtful any toolmaker could figure out a way to “pretty print” a seemingly random mixture of closed and unclosed tags in any meaningful way.

        If you were to unindent everything (Ctrl+A, then Shift+TAB until the indentation is gone, or search for ^\h+ and replace with nothing), and you knew in advance which tags (like <SONRQ>) have closing pairs, you could use the zone-of-text regex formula from our FAQ, as:

        • FIND = (?-si:<SONRQ\b|(?!\A)\G)(?s-i:(?!</SONRQ\b).)*?\K(?-si:^(?!\h*</SONRQ))
          REPLACE = \t
          REPLACE ALL

        If I do three steps (unindent, formula(SONRQ), and formula(SIGNONMSGSRQV1)), then with your example data, I get:

        <OFX>
        <SIGNONMSGSRQV1>
        	<SONRQ>
        		<DTCLIENT>20250520104016.123[-7:MST]
        		<USERID>anonymous00000000000000000000000
        		<USERPASS>X
        		<GENUSERKEY>N
        		<LANGUAGE>ENG
        		<APPID>QWIN
        		<APPVER>2700
        	</SONRQ>
        </SIGNONMSGSRQV1>
        <INTU.BRANDMSGSRQV1>
        <INTU.BRANDTRNRQ>
        <TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026
        <INTU.BRANDRQ>
        

        I don’t know how many other closed tags there are in your file, so I don’t know whether that’s practical for you. It’s the best I can come up with for now without invoking a full-on programming language. With one, it could be done on the contents of the Notepad++ window using a plugin like PythonScript, or at the command line in whatever language you prefer, without the file being open in Notepad++ (which would make it off-topic here).
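For the command-line route, a minimal sketch in plain Python might look like the following. It is not the FAQ regex: Python's `re` module does not support `\K`, `\G`, or `\h`, so the sketch walks the text line by line instead. The function names `unindent` and `indent_between` are made up for illustration, and the sketch assumes every tag already starts on its own line.

```python
import re


def unindent(text):
    """Strip leading spaces and tabs from every line."""
    return re.sub(r'^[ \t]+', '', text, flags=re.MULTILINE)


def indent_between(text, tag, indent='\t'):
    """Indent every line after <tag> up to, but not including, the next </tag>.

    Hypothetical helper: assumes one tag per line and a known tag name.
    """
    opener = re.compile(r'[ \t]*<' + re.escape(tag) + r'\b')
    closer = re.compile(r'[ \t]*</' + re.escape(tag) + r'\b')
    inside = False
    out = []
    for line in text.splitlines():
        if inside and closer.match(line):
            # The closing tag itself stays at the opener's level.
            inside = False
        elif inside:
            line = indent + line
        if opener.match(line):
            # Lines after this one belong to the tag's zone.
            inside = True
        out.append(line)
    return '\n'.join(out)
```

Calling `indent_between` once per known tag (for example `SONRQ`, then `SIGNONMSGSRQV1`) on the unindented text stacks the indentation, in the same spirit as running the Replace All once per tag.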

        I did try to make use of a numbered or named capture group in the BSR section and use a backreference to make the BSR and FR invoke those (see the FAQ for the meaning of BSR / ESR / FR), rather than having to know in advance the names of all the tags… but I couldn’t get those backreference versions to work.

        • Doctor Rashir @PeterJones

          @PeterJones
          I really appreciate what you’ve posted. There are many closed tags, and many unclosed ones.
          I’m just trying to analyze the error I’m encountering with Quicken. I’ll look at what you propose, but I have to decide whether it is worth handling every tag or only the ones important to my analysis of the log.

          Thanks again.

          • PeterJones @Doctor Rashir

            @Doctor-Rashir ,

            If you are willing to use the PythonScript plugin, the script below will do it. Installation instructions are in our PythonScript FAQ. I only tested with PythonScript 3, but I tried to write it so that it should also be compatible with the PythonScript 2 available in Plugins Admin; I recommend PythonScript 3.

            Script: PrettyPrintBadXML.py

            # encoding=utf-8
            """in response to https://community.notepad-plus-plus.org/topic/27254/
            
            This will take malformed XML (many/most tags with no closing tag) and
            pretty-print it so that each layer of closed tags indents its contents
            """
            from Npp import editor
            import re
            
            editor.beginUndoAction()
            
            sEOL = ('\r\n', '\r', '\n')[editor.getEOLMode()]
            
            # First, one tag per line, no indentation
            editor.rereplace(r'\s*<', sEOL + r'<', re.MULTILINE)
            
            # get rid of extra newlines at beginning and end (but final line will end with EOL, so N++ shows empty last line)
            editor.rereplace(r'\A\s+', '', re.MULTILINE)
            editor.rereplace(r'\v+\z', sEOL, re.MULTILINE)
            
            # figure out all the closing tags `</CLOSING>`
            closers = {}
            def trackClosingTags(m):
                global closers
                closers[m.group(1)] = True
            # tag names may contain dots (e.g. INTU.BRANDMSGSRQV1), which \w alone does not match
            editor.research(r'</([\w.]+)\s*>', trackClosingTags)
            
            for tag in closers.keys():
                f = r'(?-si:<{0}\b|(?!\A)\G)(?s-i:(?!</{0}\b).)*?\K(?-si:^(?!\h*</{0}))'.format(tag)
                editor.rereplace(f, '\t', re.MULTILINE)
            
            editor.endUndoAction()
            
            

            INPUT FILE:

            <OFX> <SIGNONMSGSRQV1> <SONRQ> <DTCLIENT>20250520104016.123[-7:MST] <USERID>anonymous00000000000000000000000 <USERPASS>X <GENUSERKEY>N <LANGUAGE>ENG <APPID>QWIN <APPVER>2700 </SONRQ> </SIGNONMSGSRQV1> <INTU.BRANDMSGSRQV1> <INTU.BRANDTRNRQ> <TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026 <INTU.BRANDRQ> <FAKE> <OTHER> <TAG> <FAKE> <OTHER> <EMBEDDED> <FAKE> <DEEPER> <OTHER> </DEEPER> <OTHER> </EMBEDDED> </TAG>
            

            OUTPUT:

            <OFX>
            <SIGNONMSGSRQV1>
            	<SONRQ>
            		<DTCLIENT>20250520104016.123[-7:MST]
            		<USERID>anonymous00000000000000000000000
            		<USERPASS>X
            		<GENUSERKEY>N
            		<LANGUAGE>ENG
            		<APPID>QWIN
            		<APPVER>2700
            	</SONRQ>
            </SIGNONMSGSRQV1>
            <INTU.BRANDMSGSRQV1>
            <INTU.BRANDTRNRQ>
            <TRNUID>19FFC8F0-7EF9-1000-BC8D-909811990026
            <INTU.BRANDRQ>
            <FAKE>
            <OTHER>
            <TAG>
            	<FAKE>
            	<OTHER>
            	<EMBEDDED>
            		<FAKE>
            		<DEEPER>
            			<OTHER>
            		</DEEPER>
            		<OTHER>
            	</EMBEDDED>
            </TAG>
            

            Essentially, what the script does:

            1. Puts each <XYZ> or </CCCC> on its own line, with no indentation
            2. Figures out all the </CCCC> closing tags (so it knows which tags need their contents indented)
            3. For each of those CCCC tags, does the indentation replacement I suggested in the last post.
               Since the indentation is cumulative, it nests properly (as shown with my TAG...EMBEDDED...DEEPER hierarchy, for example)
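For anyone who wants to experiment with these three steps outside Notepad++, here is a rough standalone sketch in plain Python. It is not the PythonScript above: instead of one cumulative regex replacement per tag, it makes a single pass and keeps a stack of open tags that are known to have a closer. The name `pretty_print_unclosed` and the assumption that tag names consist of word characters and dots are mine, not part of the original script.

```python
import re

# Assumption: tag names are word characters plus dots (e.g. INTU.BRANDRQ).
TAG_NAME = r'[\w.]+'


def pretty_print_unclosed(text, indent='\t'):
    """Indent the contents of every tag that has a closing tag somewhere."""
    # Step 1: one tag per line, no indentation.
    lines = re.sub(r'\s*<', '\n<', text).strip().split('\n')
    # Step 2: collect the names that appear as </NAME> anywhere.
    closers = set(re.findall(r'</(' + TAG_NAME + r')\s*>', text))
    # Step 3: single pass; the stack depth is the indentation level.
    stack = []
    out = []
    for line in lines:
        closing = re.match(r'</(' + TAG_NAME + r')', line)
        opening = re.match(r'<(' + TAG_NAME + r')', line)
        if closing and closing.group(1) in stack:
            # Drop back to the matching opener, discarding anything
            # that was left open above it.
            while stack.pop() != closing.group(1):
                pass
        out.append(indent * len(stack) + line)
        if opening and opening.group(1) in closers:
            stack.append(opening.group(1))
    return '\n'.join(out)
```

On the TAG...EMBEDDED...DEEPER portion of the input above, this should nest the same way as the script's output. Results may differ on messier input, for example when a tag name is closed in one place but left open in another.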

            The script is designed so that after you run the script, if you do Ctrl+Z to UNDO, it will go back to the state before you ran the script.

            If you would prefer to indent using spaces instead of the tab character, just change '\t' in the final editor.rereplace line to a string of spaces such as '    ' (four spaces), then save the script before running it.
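If the text has already been indented with tabs, the leading tabs can also be converted afterwards. A small plain-Python sketch of that idea, where `tabs_to_spaces` and its `width` parameter are hypothetical names chosen for illustration:

```python
import re


def tabs_to_spaces(text, width=4):
    """Replace each leading tab with `width` spaces; tabs elsewhere are kept."""
    return re.sub(
        r'^\t+',
        lambda m: ' ' * (width * len(m.group())),
        text,
        flags=re.MULTILINE,
    )
```

Only the tabs at the start of a line are touched, so any tab characters inside the tag values themselves are left as they are.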

            The PythonScript FAQ explains everything you need to know for how to install the plugin (either PythonScript 2 or 3 [I recommend 3]), how to create the script by copying from this post, and how to run it.


            note: the above script will also live at https://github.com/pryrt/nppStuff/blob/main/pythonScripts/nppCommunity/27xxx/p27254_PrettyPrintBadXml.py

            • Doctor Rashir @PeterJones

              @PeterJones

              Hey, I ran the script. The result looks much much better than before. But this file is an OFX (Open Financial Exchange) and is not truly XML. The sample I posted is only a small part. The rest contains private info so can’t be posted.

              I really appreciate that you spent this time. I think it will work great for my needs.
