    RegEx: Split each number of a string inside curly brackets into a separate line, add a prefix to it & remove all unnecessary data

    • PeterJones @Grimaldas Grydas

      @Grimaldas-Grydas said in RegEx: Split each number of a string inside curly brackets into a separate line, add a prefix to it & remove all unnecessary data:

      However, I think you didn’t have to be rude-ish to me, especially because judging by your reply, it is merely because of overlooking one (though important) markup.

      Sorry, I was not intending to be rude. I cannot hazard a guess at a regex that might work, because your examples don’t show indenting, and your explanation implies there is indentation; if we cannot see where it is, our solutions will likely not work if our guess at indentation doesn’t match yours.

      I see that while I was typing up this reply, you re-posted the data. Thank you.

      • Grimaldas Grydas @PeterJones

        @PeterJones

        Thank you for your help so far, anyway! Any responses and comments are always welcome and valuable!

        Concerning indentation, I had no idea it was of any importance in this case! So far, every RegEx I needed for these files could either safely ignore it entirely or simply match it with \h* or \h+, occasionally anchored with ^ when the start of the line mattered.

        The indentation here is really simple, however. The root entry (82={} in this case) is at level 0, on the margin. Most entries for this case are one TAB in, and a couple rare ones are two TABs in. I thought I could simply add these into the code myself after figuring out how to solve the main issue.

        The end result should have no indentation, so most steps for these files involve methodical removal of any differing “layout”. The end result should be basically as plain as .csv, or .json or similar data typically is.

        I feel it’s not necessary knowledge for this matter, and due to classified information, I cannot tell anything specific… but as mentioned before, I am further processing this into something human-readable. My methods vary from one case to another, but most can be done with a program like Excel, where information can be easily databased, modified with formulae etc.

        • Grimaldas Grydas

          There are a couple more things worth mentioning.

          I have tested numerous approaches, back and forth, including with a tester like regex101. The problem appears to be that I’m terrible with lookarounds and conditionals, so I just cannot wrap my mind around a possible solution. I also have a feeling that the entire process may be impossible with Notepad++ RegEx, though some parts could be done. This is why I sought help in the first place - many of you are far more skilled and may actually know of a solution!

          The original post is complex - I wanted to include all possible information there. However, what needs to be done, is really simple:

          1. We have a string, like header={ 01 0345 0647889 0887 }
          2. We need to capture the part header=
          3. Each number inside brackets (in this case: 01, 0345, 0647889 and 0887), needs to be split into a separate line
          4. End result should omit all brackets and spaces
          5. One of many lines of the end result should look like header=0345
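
For reference, the five steps above can be written down outside Notepad++ as a small Python sketch. It is only a way to pin down the expected result, not a replacement for the regex being asked for; the function name `expand_braced_numbers` is made up for this illustration, and the sketch assumes each entry sits on a single line.

```python
import re

def expand_braced_numbers(line):
    """Turn 'name={ n1 n2 ... }' into ['name=n1', 'name=n2', ...].

    Lines that do not have that shape are returned unchanged.
    """
    m = re.fullmatch(r"[ \t]*(\w+)=\{([^{}]*)\}[ \t]*", line)
    if m is None:
        return [line]
    name, body = m.group(1), m.group(2)
    numbers = body.split()
    # Only expand when every token is an integer or a decimal number.
    if not numbers or not all(re.fullmatch(r"\d+(?:\.\d+)?", n) for n in numbers):
        return [line]
    return [f"{name}={n}" for n in numbers]

print(expand_braced_numbers("header={ 01 0345 0647889 0887 }"))
# → ['header=01', 'header=0345', 'header=0647889', 'header=0887']
```

Leading indentation is dropped and any line of a different shape (for example date=1938.08.22) passes through untouched.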

          So far, I have a partial solution. I can easily split numbers into separate lines with SEARCH \h+(\d+\.*\d*) & REPLACE \r\n$1. However, this solution is vulnerable to errors because many other parameters have similar numbers as well. The original data has spaces elsewhere too, but I run this step at a later stage, when only these entries still contain spaces. There are literally thousands of these parameters, and many times more once the numbers are split, so doing this manually is prone to errors and practically not viable.

          However, this still ignores the brackets, which are the most crucial “identifier” of this case, and the numbers lack a header, which would be imperative to include as there are several parameters using similar numbers. This is why I am asking for help. I would like to know if it would be possible to do the whole process described. English is not my first language (it’s Finnish), so it’s a little difficult to explain this properly…

          It would be a piece of cake if the numbers were of regular length, if there were a fixed amount of numbers, or if there were a clear pattern overall. The real difficulty is in this irregularity, as numerous variations must be taken into account… I tried solutions like SEARCH \h+(?:(\w+)\h*=\h*{)\(?:\h+(\d+\.*\d*)) → REPLACE $1=$2 but this only matches the first number out of many, instead of matching all numbers until the closing curly bracket.

          In any case, I am sorry for bothering you all.

          • PeterJones @Grimaldas Grydas

            @Grimaldas-Grydas ,

            You may have to be patient. I’m pretty busy right now, so cannot look into it. But there are other regex experts who usually visit at least once per day. Hopefully, one of them will be able to look into it.

            • Terry R

              @Grimaldas-Grydas said in RegEx: Split each number of a string inside curly brackets into a separate line, add a prefix to it & remove all unnecessary data:

              Because of all these irregularities, combined with certain similarities with other parameters, all RegEx should be done preferably in one go… unless there is a foolproof solution with multiple steps that will not alter other parameters.

              I can’t see how it’s possible to do it with 1 regex. The primary issue is when processing the numbers inside the {} you cannot look behind with a variable length to find the xxx= to copy ahead for the next number found. So instead I think 2 regex will suffice. The first moves the xxx= to the end of the line as the look ahead can be of variable length. The second regex then completes the transformation.

              So the first regex to move the header to end of line (and remove indentation?) is:
              Find What: (?-s)^\h*(\w+=)(\{.+\})
              Replace With: \2\1

              The second regex will now copy the header by looking forward and capturing it for each number it encounters and rewrites that as a separate line. When it cannot find any more numbers on the line it will instead find the }xxx= sequence which it promptly deletes along with the line break. So we have
              Find What: (?-s)(?:\{?\h*)?(\d+(?:\.\d+)?)(?=[^}\r\n]+}(\w+=))|\h*\}\w+=\R
              Replace With: (?1\2\1\r\n)
              Please note that although I included (?-s) in the second regex it is in fact redundant as there are no . references made. It is something I strive to do when starting to compile a solution and sometimes I just leave it there even if not needed.
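
To see what the two steps do outside Notepad++, here is a rough Python emulation. It is a sketch under stated assumptions, not the Notepad++ behaviour itself: Python's re module has no \h or \R, so [ \t] and \r?\n stand in for them; the conditional replacement (?1\2\1\r\n) is imitated with a small function; and the output uses \n instead of \r\n. The names STEP1, STEP2 and terry_two_step are invented for this example.

```python
import re

# Step 1: move "name=" behind the braces and drop leading indentation.
STEP1 = re.compile(r"^[ \t]*(\w+=)(\{.+\})", re.MULTILINE)

# Step 2: first alternative = one number that is still followed by "}name=",
#         second alternative = the leftover " }name=" plus its line break.
STEP2 = re.compile(
    r"(?:\{?[ \t]*)?(\d+(?:\.\d+)?)(?=[^}\r\n]+\}(\w+=))"
    r"|[ \t]*\}\w+=\r?\n"
)

def terry_two_step(text):
    text = STEP1.sub(r"\2\1", text)
    # Imitates (?1\2\1\r\n): write "name=number" only when group 1 matched,
    # otherwise replace the match with nothing.
    return STEP2.sub(
        lambda m: f"{m.group(2)}{m.group(1)}\n" if m.group(1) else "", text
    )

sample = "header={ 01 0345 0647889 0887 }\nthat=2437\n"
print(terry_two_step(sample))
# header=01
# header=0345
# header=0647889
# header=0887
# that=2437
```

Note that the second alternative needs a line break after }name=, so in this emulation the last line of the input must end with one.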

              Now this definitely works (tested) with the small (non-indented) sample you provided in your 3rd posting, however since you made reference to possible indentation it is likely you may still need to change my regex. Note my first regex does attempt to remove the indentation, but I will leave it up to you if that’s successful before applying the second regex.

              Given the complexity of your data and issues around other lines that look similar this may not be the final solution, but rather a work in progress. Please do come back to us with the edge cases as @PeterJones mentions. His italicised text at the bottom of his first post here contains very important information.

              Terry

              • Alan Kilborn @Grimaldas Grydas

                @Grimaldas-Grydas said in RegEx: Split each number of a string inside curly brackets into a separate line, add a prefix to it & remove all unnecessary data:

                I think you didn’t have to be rude-ish to me,

                Hmm, I read it over and I didn’t see even a hint of rude-ish-ness in what Peter said. Maybe he was “direct” but certainly in no way “rude”.

                • guy038

                  Hello, @grimaldas-grydas, @peterjones, @terry-r, @alan-kilborn and All,

                  Here is my solution : A single regex S/R will be enough, but you’ll need to click twice on the Replace All button !

                  I also had to use a temporary character, absent in all your data ! I chose the ¤ character, present on my French keyboard. But you may adopt any simple character which is not present in your current file, as for instance, @, &, %, §, … :-)

                  So if we consider your INPUT text :

                  82={ # This is the root, used for each main entry. All parameters are placed under it. In this case, these are safe to ignore.
                  ### The section below has parameters which need our attention to be fixed with RegEx ###
                  xx={ 16835961 }
                  yyyy={ 16847062 67151971 74997 50388451 72836 83934207 50362874 16845543 81456 81771 67136455 33623075 16849442 100696613 82574 83286 83577 16852101 84199 33607712 }
                  zzz={ 79199 16848761 83893799 70029 76217 16854401 16839 16853836 50370644 145057 79338 81773 16849133 83891875 }
                  www={ 100693891 72513 16844226 33606062 16854968 16858108 33608429 16845608 67128408 33611952 50382602 67148972 67149505 50368894 78657 134238974 67119739 50362812 16833431 16852778 50353593 50378671 50383395 50386109 67120625 67126402 67136958 67145067 67145907 67151704 67158147 83897335 83898254 83921034 83921077 83927103 100681910 100691733 117474361 }
                  pppp={ 50350929 168.36935 33589252 }
                  rrrrr={ 322 482.865 }
                  ### Other stuff in the file looks like this ###
                  info_about_this=blah
                  header=85095
                  Header=words_with_underlines
                  date=1938.08.22
                  that=2437
                  dummy=funny
                  }
                  
                  • Now, open the Replace dialog ( Ctrl + H )

                    • SEARCH (?-s)^(\w+)={(.+)\h+}$|(^)?\h+(\d+(?:\.\d+)?)(?=.+¤(\w+))|¤.+

                    • REPLACE (?2\2¤\1)?4(?3:\r\n)\5=\4

                    • Tick the Wrap around option

                    • Un-tick all other options

                    • Click ONCE, only, on the Replace All button

                  => You should get this intermediate text:

                  82={ # This is the root, used for each main entry. All parameters are placed under it. In this case, these are safe to ignore.
                  ### The section below has parameters which need our attention to be fixed with RegEx ###
                   16835961¤xx
                   16847062 67151971 74997 50388451 72836 83934207 50362874 16845543 81456 81771 67136455 33623075 16849442 100696613 82574 83286 83577 16852101 84199 33607712¤yyyy
                   79199 16848761 83893799 70029 76217 16854401 16839 16853836 50370644 145057 79338 81773 16849133 83891875¤zzz
                   100693891 72513 16844226 33606062 16854968 16858108 33608429 16845608 67128408 33611952 50382602 67148972 67149505 50368894 78657 134238974 67119739 50362812 16833431 16852778 50353593 50378671 50383395 50386109 67120625 67126402 67136958 67145067 67145907 67151704 67158147 83897335 83898254 83921034 83921077 83927103 100681910 100691733 117474361¤www
                   50350929 168.36935 33589252¤pppp
                   322 482.865¤rrrrr
                  ### Other stuff in the file looks like this ###
                  info_about_this=blah
                  header=85095
                  Header=words_with_underlines
                  date=1938.08.22
                  that=2437
                  dummy=funny
                  }
                  

                  Now, click a SECOND time on the Replace All button

                  => And here is your expected OUTPUT text :

                  82={ # This is the root, used for each main entry. All parameters are placed under it. In this case, these are safe to ignore.
                  ### The section below has parameters which need our attention to be fixed with RegEx ###
                  xx=16835961
                  yyyy=16847062
                  yyyy=67151971
                  yyyy=74997
                  yyyy=50388451
                  yyyy=72836
                  yyyy=83934207
                  yyyy=50362874
                  yyyy=16845543
                  yyyy=81456
                  yyyy=81771
                  yyyy=67136455
                  yyyy=33623075
                  yyyy=16849442
                  yyyy=100696613
                  yyyy=82574
                  yyyy=83286
                  yyyy=83577
                  yyyy=16852101
                  yyyy=84199
                  yyyy=33607712
                  zzz=79199
                  zzz=16848761
                  zzz=83893799
                  zzz=70029
                  zzz=76217
                  zzz=16854401
                  zzz=16839
                  zzz=16853836
                  zzz=50370644
                  zzz=145057
                  zzz=79338
                  zzz=81773
                  zzz=16849133
                  zzz=83891875
                  www=100693891
                  www=72513
                  www=16844226
                  www=33606062
                  www=16854968
                  www=16858108
                  www=33608429
                  www=16845608
                  www=67128408
                  www=33611952
                  www=50382602
                  www=67148972
                  www=67149505
                  www=50368894
                  www=78657
                  www=134238974
                  www=67119739
                  www=50362812
                  www=16833431
                  www=16852778
                  www=50353593
                  www=50378671
                  www=50383395
                  www=50386109
                  www=67120625
                  www=67126402
                  www=67136958
                  www=67145067
                  www=67145907
                  www=67151704
                  www=67158147
                  www=83897335
                  www=83898254
                  www=83921034
                  www=83921077
                  www=83927103
                  www=100681910
                  www=100691733
                  www=117474361
                  pppp=50350929
                  pppp=168.36935
                  pppp=33589252
                  rrrrr=322
                  rrrrr=482.865
                  ### Other stuff in the file looks like this ###
                  info_about_this=blah
                  header=85095
                  Header=words_with_underlines
                  date=1938.08.22
                  that=2437
                  dummy=funny
                  }
                  

                  The nice thing is that if you try to click a THIRD time on the Replace All button, nothing else occurs ;-))

                  I must be out a couple of hours ! See you later for possible modifications and explanations on this regex S/R !

                  Best regards,

                  guy038

                  • Grimaldas Grydas @guy038

                    @guy038
                    Thank you so, so much for this! Your RegEx is doing exactly what I needed! I only did a small modification to it to permit matches with indents. I have yet to check how foolproof it is in the long run and whether it would be suitable for other, similar cases in other files I’m working on, but so far it is working perfectly!

                    To be exact, it works perfectly when it is done at a specific stage among a couple dozen other RegEx steps needed for this file, at the point when all other, less problematic cases of “xxx={yyyy}” strings have been fixed, leaving only those behind which need this specific step. However, that is not a problem at all - RegEx works in a way which requires specific order of steps sometimes, and even more so when there is higher complexity involved. In my projects it happens frequently, so I have to do a lot of trial and error to figure out the correct order of replaces. Moreover, these sorts of projects are incredibly interesting for me!

                    In case anyone needs the version I used with indent included - I simply added \h* after ^:
                    (?-s)^\h*(\w+)={(.+)\h+}$|(^)?\h+(\d+(?:\.\d+)?)(?=.+¤(\w+))|¤.+

                    @Terry-R
                    Thank you so much for your version as well! It seems to be working as well, though it is less stable and higher maintenance than the one by @guy038. However, it is still very useful as it has given me ideas and solutions for several other RegEx I am using for these files, so thank you!

                    @Alan-Kilborn
                    I think there is no need to dwell on that matter. There was no harm done whatsoever. How we perceive things is highly individual and biased, depending on the culture, personality and so on. In this case ‘rude’ is a bit extreme wording, hence I added “-ish” there. That comment of mine referred chiefly to the last phrase “I wouldn’t want to even take a stab at an answer yet.”, and the clearly annoyed ‘tone’ because of merely forgetting to add specific markup. Although incredibly helpful for readers, one could phrase such issues more politely.

                    • Grimaldas Grydas @guy038

                      @guy038
                      Sorry for multiple replies (again)! I forgot to ask, if it is not trouble, could you please explain your RegEx? These kinds of cases are beyond my current understanding and I’m really interested in learning and improving my skills! Also, this is an unusual and complex case, so someone else could find this useful as well!

                      Thank you again, for your help, time and patience! :-)

                      • Grimaldas Grydas

                        Also, there’s no rush, take your time, everyone! I’m sorry if I sounded like I was rushing you. I was just trying to write everything down while I remembered it.

                        Thank you for your help, everyone!

                        • guy038

                          Hi, @grimaldas-grydas and All,

                          To begin with, let me explain the general method used. We’re going to use a short line from your INPUT text, which must be processed :

                          pppp={ 50350929 168.36935 33589252 }
                          

                          The goal is to write the three numbers 50350929, 168.36935 and 33589252 , each one on a different line, and prefixed with the string pppp, located before the = sign, in order to get :

                          pppp=50350929
                          pppp=168.36935
                          pppp=33589252
                          

                          The problem is that when the regex engine catches, successively, each number, it does not know anymore the pppp string, located at the beginning of current line !

                          So my idea was to swap the list of numbers and the string pppp before the equal sign and separate these two ranges with a temporary char, not present in your data !

                          So, after a first regex S/R, we get the temporary text, below :

                           50350929 168.36935 33589252¤pppp
                          

                          With this new layout, when the regex engine matches a number ( integer / decimal ) it is fairly easy, with a look-ahead structure, to store, each time, the string after the temporary ¤ char, ending the current line !

                          Then, with a second regex S/R, we finally get our expected text :

                          pppp=50350929
                          pppp=168.36935
                          pppp=33589252
                          

                          Before we get into the details, it is IMPORTANT to point out that I found a case where my previous regex S/R did not work ! So, you’ll have to use the second version, below !
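
The post does not say which input broke the first version. One candidate, offered here only as an assumption, is a list whose last number has a single digit: the first version's look-ahead .+¤ demands at least one character between the number and ¤, while the second version's .*¤ does not. The Python sketch below compares just the number-matching alternative of both versions; [ \t] stands in for \h, and the names FIRST, FIXED and the sample line are invented for this illustration.

```python
import re

# Number-matching alternative only; the "¤.+" alternative is left out.
FIRST = re.compile(r"(^)?[ \t]+(\d+(?:\.\d+)?)(?=.+¤(\w+))", re.MULTILINE)
FIXED = re.compile(r"(^)?[ \t]+(\d+(?:\.\d+)?)(?=.*¤(\w+))", re.MULTILINE)

# Intermediate form of a hypothetical entry 'pp={ 12 5 }'.
line = " 12 5¤pp"

print([m.group(2) for m in FIRST.finditer(line)])  # → ['12']
print([m.group(2) for m in FIXED.finditer(line)])  # → ['12', '5']
```

With the first version the trailing 5 is never matched, so it would stay on the pp=12 line instead of getting its own pp=5 line.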

                          The complete regex S/R, where I added the \h* part that you mentioned and where I fixed the bug, is :

                          • SEARCH (?-s)^\h*(\w+)={(.+)\h+}$|(^)?\h+(\d+(?:\.\d+)?)(?=.*¤(\w+))|¤.+

                          • REPLACE (?2\2¤\1)?4(?3:\r\n)\5=\4

                          can be split into 2 consecutive regex S/R, which are completely independent :

                          • The Search/Replacement A, which creates the intermediate text :

                            • SEARCH (?-s)^\h*(\w+)={(.+)\h+}$

                            • REPLACE ?2\2¤\1

                          • The Search/Replacement B, which gets the expected and final text

                            • SEARCH (?-s)(^)?\h+(\d+(?:\.\d+)?)(?=.*¤(\w+))|¤.+

                            • REPLACE ?4(?3:\r\n)\5=\4
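
As a cross-check outside Notepad++, S/R A and S/R B can be imitated in Python as two separate passes. This is only a sketch under assumptions: Python's re module has neither \h nor the (?n…) conditional replacement, so [ \t] replaces \h and two small functions play the role of the replacement strings; \n is written instead of \r\n. Because B is compiled on its own here, its groups are numbered 1 to 3 rather than 3 to 5. All Python names are invented for this example.

```python
import re

SEARCH_A = re.compile(r"^[ \t]*(\w+)=\{(.+)[ \t]+\}$", re.MULTILINE)
SEARCH_B = re.compile(r"(^)?[ \t]+(\d+(?:\.\d+)?)(?=.*¤(\w+))|¤.+", re.MULTILINE)

def replace_a(m):
    # Same effect as \2¤\1 : the numbers, the temporary char, then the name.
    return f"{m.group(2)}¤{m.group(1)}"

def replace_b(m):
    if m.group(2) is None:
        # Second alternative (¤.+) matched: delete it.
        return ""
    # Same effect as (?3:\r\n): no line break for the first number of a line.
    lead = "" if m.group(1) is not None else "\n"
    return f"{lead}{m.group(3)}={m.group(2)}"

def run_a_then_b(text):
    intermediate = SEARCH_A.sub(replace_a, text)
    return SEARCH_B.sub(replace_b, intermediate)

print(SEARCH_A.sub(replace_a, "pppp={ 50350929 168.36935 33589252 }"))
# →  50350929 168.36935 33589252¤pppp
print(run_a_then_b("pppp={ 50350929 168.36935 33589252 }"))
# pppp=50350929
# pppp=168.36935
# pppp=33589252
```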

                          The groups defined by the A and B search regexes, numbered as in the combined regex ( so the groups of B are 3, 4 and 5 ), are :

                          
                          (?x-s) ^ \h* (\w+) = { (.+) \h+ } $
                                        ¯¯¯       ¯¯
                                        Gr 1     Gr 2
                          
                          
                          (?x-s) (^)? \h+ ( \d+(?: \. \d+ )? ) (?= .* ¤ (\w+) ) | ¤ .+
                                  ¯         ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯             ¯¯¯ 
                                Gr 3              Gr 4                  Gr 5	
                          

                          Note, that I use the free-spacing mode (?x) for a better readability and each regex contains the (?-s) in-line modifier which means that any regex . char will match a single standard character ( not EOL ones )

                          • In search regex A :

                            • The part ^\h*(\w+)= matches the word string, stored as group 1, after possible leading blank chars, till an = character

                            • The part {(.+)\h+}$ matches a literal { char, then any non-null range of chars, each number preceded with space(s), which is stored as group 2, till space char(s) and a closing } char, ending the current line

                          • In replacement regex A :

                            • ?2\2¤\1, which should be exactly expressed as (?2\2¤\1), is a conditional replacement syntax, which means that IF group 2 exists, it must rewrite the group 2 first, \2( i.e. the numbers only ), then the literal char ¤ and finally group 1 ( the string pppp )
                          • Now, the search regex B contains two alternatives :

                            • The first alternative (?-s)(^)?\h+(\d+(?:\.\d+)?)(?=.*¤(\w+))

                              • The middle part (\d+(?:\.\d+)?) matches any integer or decimal number, which is stored as group 4. Note the optional non-capturing group (?:\.\d+)? in the case of a decimal number

                              • The first part (^)?\h+ matches the blank char(s) preceding a number. Note that, if the leading blank char(s) begin the current line, the optional group 3, (^)?, is then defined

                              • The final part (?=.*¤(\w+)), is a look-ahead structure, not included in the final match, but which must be true in order to get an effective match. So current matched number must be followed by a range, possibly null, of characters till the temporary char ¤ and the ending string pppp

                            • The second alternative ¤.+, which is used when current parsing position of the regex engine is at the ¤ location, after the processed numbers. This second alternative, without any group, simply matches the temporary ¤ char and all subsequent chars of current line, and should be deleted in replacement !

                          • In replacement regex B :

                            • ?4(?3:\r\n)\5=\4, which should be exactly expressed as (?4(?3:\r\n)\5=\4), means that, IF group 4 exists ( the numbers ), it must :

                              • Execute, first, the (?3:\r\n) conditional replacement. This replacement does not include a THEN part and, only, the regex \r\n as an ELSE part, after the : char. So, this means that if group 3 does not exist ( number not at beginning of current line ) , it must insert a leading line-break !

                              • Write the group 5, \5, followed with a literal = sign

                              • Finally, write the group 4 ( current number matched by the first alternative of search regex B )

                            • Note that, when matching the second alternative ¤.+ of the search regex B, at end of current line, group 4 is not defined. So, no action occurs in replacement. Thus, concretely, this means that the string ¤pppp is deleted !


                          Remarks :

                          • The S/R A and B are independent. As a demonstration :

                            • When running the search regex A first, no ¤ character exists yet, so neither alternative of the search regex B can match

                            • When running the search regex B afterwards, the lines already rewritten by A no longer contain any { or } character, so the search regex A cannot match them either !

                          Thus, we can merge these two successive S/R in one regex S/R only ! You’ll note that :

                          • The redundant part (?-s), at beginning of regex S/R B, is omitted

                          • The replacement of S/R A, ?2\2¤\1, must be enclosed between parentheses, (?2\2¤\1), in order to not include the replacement section of S/R B

                          As a conclusion, the complete regex S/R, with the free-spacing mode in the search part, is :

                          • SEARCH (?x-s) ^ \h* ( \w+ ) = { ( .+ ) \h+ } $ | (^)? \h+ ( \d+ (?:\.\d+)? ) (?= .* ¤ ( \w+ ) ) | ¤ .+

                          • REPLACE (?2\2¤\1)?4(?3:\r\n)\5=\4

                          And outputs the expected text, after two consecutive clicks on the Replace All button !


                          As mentioned in my last post, if we try to click a third time on the Replace All button, luckily, nothing else occurs ! Why ? Easy : the final text no longer contains any line of the form name={ numbers } nor any ¤ character, so no alternative of the overall regex can match. Logical ;-))
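
The "two clicks, then nothing" behaviour can be reproduced with a Python sketch of the combined regex. The usual caveats apply: [ \t] stands in for \h, a function imitates the conditional replacement (?2\2¤\1)?4(?3:\r\n)\5=\4, \n replaces \r\n, and the shortened sample text and all Python names are made up for this illustration.

```python
import re

COMBINED = re.compile(
    r"^[ \t]*(\w+)=\{(.+)[ \t]+\}$"              # groups 1 and 2 (S/R A)
    r"|(^)?[ \t]+(\d+(?:\.\d+)?)(?=.*¤(\w+))"    # groups 3, 4 and 5 (S/R B)
    r"|¤.+",                                     # leftover ¤name, no group
    re.MULTILINE,
)

def replacement(m):
    out = ""
    if m.group(2) is not None:                   # (?2\2¤\1)
        out += f"{m.group(2)}¤{m.group(1)}"
    if m.group(4) is not None:                   # ?4(?3:\r\n)\5=\4
        lead = "" if m.group(3) is not None else "\n"
        out += f"{lead}{m.group(5)}={m.group(4)}"
    return out                                   # "" when only ¤.+ matched

def replace_all(text):
    # One call plays the role of one click on Replace All.
    return COMBINED.sub(replacement, text)

start = (
    "82={ # root\n"
    "\tpppp={ 50350929 168.36935 33589252 }\n"
    "\trrrrr={ 322 482.865 }\n"
    "date=1938.08.22\n"
    "}\n"
)
first = replace_all(start)
second = replace_all(first)
third = replace_all(second)

print(second)
print(third == second)  # → True
```

The root line 82={ … and the closing } are still present after the second pass, yet nothing matches: the first alternative needs the name, both braces and the numbers on one line, and the other two need a ¤.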

                          I just hope, @grimaldas-grydas, that these explanations help you a bit !

                          guy038
