    How to find numbers in multiline in Notepad++

      Neil Schipper @PeterJones

      @peterjones I agree that the problem statement is very confusing and incomplete. However, a ruleset that may be in play is:

      Extract 4 numbers from each record, as follows:

      1. the argument to the leading aaaa field, where the field name itself is optional
      2. the argument to the abcd field, which optionally contains a second number that is extracted and used as the 3rd number
      3. the argument to the efgh field, unless (it’s absent and) it was already extracted with abcd
      4. the argument to the xyz field

      All other fields and their arguments are ignored.

      I’m not entirely sure it’s within my regex powers to provide a solution, although I imagine some other folk here would enjoy the challenge.

      In any case, I wouldn’t strain myself to try until @gelle_marrisa makes an effort to clarify.

            gelle_marrisa @PeterJones

            @peterjones

            Thanks Peter & everyone for replying. [^\d]+ or \D+ worked perfectly. The output comes out as 1234567891|111|22|111|22|333|456|01010|8888899999. I can filter this and limit the characters to make the final output 1234567891|333|111|22.

            I was trying this regex to get \d{10}, and it selects all 10 digits, but I am not sure how to use it in multiline format. I tried this
            \d{10}
            \d{3}/\d{2}
            \d{3}
            or used \n for next line

            and it didn’t work. Obviously I made a fool of myself, as I am not familiar with regex syntax.

            But I am facing an issue again. Suppose there are 3-4 records like example one (as per the OP): with this regex (\D+), it combines all the letters in one row instead of separate rows like below

            1234567891|111|22|111|22|333|456|01010|8888899999|1234567891|111|22|111|22|333|456|01010|8888899999|1234567891|111|22|111|22|333|456|01010|8888899999

            If it can give me result like this
            1234567891|111|22|111|22|333|456|01010|8888899999
            1234567891|111|22|111|22|333|456|01010|8888899999
            1234567891|111|22|111|22|333|456|01010|8888899999

            I can filter it with character limits. Is there any solution to keep rows?

            I hope I am able to clarify my question this time. Love to everyone. :)

              PeterJones @gelle_marrisa

              @gelle_marrisa said in How to find numbers in multiline in Notepad++:

              or used \n for next line

              \n means the LF portion of the Windows-standard CRLF sequence; the CR portion is \r. So to match a Windows newline in your regular expression, you need to use \r\n. Alternatively, in Notepad++'s “Boost regular expressions”, \R will match a Windows-style \r\n, a Linux-style \n, or even an outdated Mac-style \r.

              So if you want to match 10 digits, a newline, 3 digits, a slash, 2 digits, a newline, and 3 digits, use \d{10}\R\d{3}/\d{2}\R\d{3}
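For anyone who wants to try the same pattern outside Notepad++, here is a minimal Python sketch. Python's built-in re module does not support \R, so the sketch assumes the explicit alternation (?:\r\n|\n|\r) as a stand-in. The name NEWLINE and the sample text are made up for illustration.

```python
import re

# Assumed stand-in for Boost's \R, which Python's re module does not provide.
NEWLINE = r"(?:\r\n|\n|\r)"

# 10 digits, newline, 3 digits, slash, 2 digits, newline, 3 digits.
pattern = re.compile(r"\d{10}" + NEWLINE + r"\d{3}/\d{2}" + NEWLINE + r"\d{3}")

# Hypothetical sample with Windows-style line endings.
sample = "1234567891\r\n111/22\r\n333\r\n"

match = pattern.search(sample)
print(repr(match.group(0)))
```

The same compiled pattern also matches when the sample uses bare \n line endings, which is the point of the alternation.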
                Neil Schipper @gelle_marrisa

                @gelle_marrisa said in How to find numbers in multiline in Notepad++:

                it combines all the letters in one row instead of separate rows

                any solution to keep rows

                So what you hope to achieve is one line of data per record, yes? Your file consists of records with slightly different formats (for example, some but perhaps not all records start with aaaa:: ), yes? So it will be necessary to be able to reliably determine when one record ends and the next one starts. You should try to describe all the record-boundary conditions, or, provide examples covering every type of record that you expect to be encountered and from which you want numbers extracted.

                111/22 or 111|22 (another format)

                An expression that will match either format is: \d{3}[/|]\d{2}
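A quick way to check this building block in isolation is the short Python sketch below. The character class syntax is the same in Python's re module as in Notepad++; the variable names are illustrative only.

```python
import re

# Inside a character class, | is a literal pipe, not alternation.
pattern = re.compile(r"\d{3}[/|]\d{2}")

found = pattern.findall("111/22 or 111|22 (another format)")
print(found)  # ['111/22', '111|22']
```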

                  gelle_marrisa @PeterJones

                  @peterjones
                  Thanks. It gives me a pattern error.
                  https://regex101.com/r/p5XPbT/1/

                    gelle_marrisa @Neil Schipper

                    @neil-schipper
                    This didn’t work either. Can you share an example in a regex tester?

                      gelle_marrisa @PeterJones

                      @peterjones
                      Suppose there is only 1 format for the abcd area (no slash): is there then a possibility of getting a solution? I can replace all the slashes first and then use the regex.
                      1234567891
                      abcd :
                      111|22
                      xyz :
                      333
                      product :
                      blablabla 456
                      code :
                      01010
                      serial :
                      8888899999

                        Neil Schipper @gelle_marrisa

                        @gelle_marrisa The regex I provided you was tested on the text quoted just above it. All it does is match either of the 6-character strings in the quote. I provided it on the assumption that you were trying to learn techniques to help solve your overall problem. It was not intended as a complete solution.

                        If you want help with a complete solution, you will need to read, with care, with attention, with seriousness, my remarks about the importance of being able to determine the start and the end of records in your data.

                          gelle_marrisa @Neil Schipper

                          @neil-schipper
                          Let’s forget the slash value. If there is only 111|22 in the 2nd or 3rd line, can we get the desired result?

                          \d{10}\R\d{3}/\d{2}\R\d{3} shows a pattern error.

                          I am a noob: I am not sure whether to use \d{3}[/|]\d{2} as a complete pattern, or whether I have to merge it with one of the previous patterns mentioned above.

                            Neil Schipper @gelle_marrisa

                            @gelle_marrisa

                            To convert this:

                            12
                            abcd :
                            115/22
                            xyz :
                            333
                            product :
                            blablabla 4567
                            code :
                            01010
                            serial :
                            34
                            
                            56
                            abcd :
                            116|22
                            xyz :
                            333
                            product :
                            blablabla 45678
                            code :
                            01010
                            serial :
                            78
                            
                            90
                            abcd :
                            117|22
                            xyz :
                            333
                            product :
                            blablabla 456789
                            code :
                            01010
                            serial :
                            12
                            
                            

                            into this:

                            12|115|22|333|4567|01010|34
                            56|116|22|333|45678|01010|78
                            90|117|22|333|456789|01010|12
                            

                            You can use:

                            F: (\d+)\D+(\d+)\D+(\d+)\D+(\d+)\D+(\d+)\D+(\d+)\D+(\d+)\D+
                            R: $1|$2|$3|$4|$5|$6|$7\r\n
                            Set cursor to the left of first number of first record
                            Execute Replace All

                            It will only work on your whole file if every record has exactly 7 numbers.
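The same find/replace pair can be tried in Python as a rough sketch. It assumes Python's re module instead of Notepad++'s Boost engine, so $1..$7 become \1..\7. The function name flatten and the single-record sample are illustrative.

```python
import re

# Seven capture groups, each followed by a run of non-digits.
find = re.compile(r"(\d+)\D+" * 7)

def flatten(text: str) -> str:
    # \1..\7 are Python's spelling of $1..$7 in the replacement.
    return find.sub(r"\1|\2|\3|\4|\5|\6|\7\r\n", text)

record_lines = ["12", "abcd :", "115/22", "xyz :", "333", "product :",
                "blablabla 4567", "code :", "01010", "serial :", "34", ""]
sample = "\r\n".join(record_lines) + "\r\n"

print(repr(flatten(sample)))
```

As with the Notepad++ version, a record that does not contain exactly 7 numbers would throw the grouping off for everything after it.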

                              guy038

                              Hello, @neil-schipper, @gelle_marrisa, @peterjones and All,

                              Another solution which does not depend on the number of lines of a section would be :

                              • SEARCH \D+((^\r\n)+|\z)|\D+

                              • REPLACE ?1\r\n:|

                              • Tick the Wrap around option

                              • Click on the Replace All button

                              Of course, I assume that each section is separated by, at least, one pure empty line

                              So, from this INPUT text :

                              12
                              abcd :
                              115/22
                              xyz :
                              333
                              product :
                              blablabla 4567
                              code :
                              01010
                              serial :
                              34
                              
                              
                              
                              
                              56
                              abcd :
                              116|22
                              xyz :
                              
                              
                              
                              
                              90
                              abcd :
                              product :
                              blablabla 456789
                              code :
                              serial :
                              12
                              

                              You would obtain this expected text :

                              12|115|22|333|4567|01010|34
                              56|116|22
                              90|456789|12
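For comparison, here is a hedged Python sketch of the same idea. It makes several assumptions that differ from the Notepad++ setup: line endings are plain \n instead of \r\n, Python's \Z stands in for Boost's \z, and a callback replaces the conditional replacement ?1\r\n:| because Python's re module has no such syntax. The names SEARCH and join_records are illustrative.

```python
import re

# Group 1 is set only when the gap ends with blank lines or at end of text.
SEARCH = re.compile(r"\D+((^\n)+|\Z)|\D+", re.MULTILINE)

def join_records(text: str) -> str:
    def repl(m: re.Match) -> str:
        # Callback standing in for the conditional replacement ?1\r\n:|
        return "\n" if m.group(1) is not None else "|"
    return SEARCH.sub(repl, text)

sample = (
    "12\nabcd :\n115/22\nxyz :\n333\nproduct :\nblablabla 4567\n"
    "code :\n01010\nserial :\n34\n\n\n\n\n"
    "56\nabcd :\n116|22\nxyz :\n\n\n\n\n"
    "90\nabcd :\nproduct :\nblablabla 456789\ncode :\nserial :\n12\n"
)
print(join_records(sample))
```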
                              

                              Note that if we try to factorize the search regex expression as below :

                              • SEARCH \D+(((^\r\n)+|\z)|)

                              • REPLACE ?2\r\n:|

                              This regex S/R does not work properly and gives this output :

                              12|115|22|333|4567|01010|34|56|116|22|90|456789|12
                              

                              So, why, in this new regex, does the case \D+(^\r\n)+ never occur ? For instance, after the number 34, ending the first section of my example ? Well, we have this range of chars : 34\r\n\r\n\r\n\r\n\r\n56. So :

                              • First, the regex \D+ matches \r\n\r\n\r\n\r\n\r\n but would need some backtracking in order for the first alternative \D+(^\r\n)+ to match this same range

                              • As the whole regex contains other alternatives, the regex engine, before backtracking, tries a match attempt with the second alternative. However, the regex \D+\z cannot match at this position !

                              • Finally, the regex engine tries the last empty alternative \D+() which, of course, matches the range \r\n\r\n\r\n\r\n\r\n

                              This explains why the gap between two sections is never detected with this second version of the regex S/R

                              Best Regards,

                              guy038

                                guy038

                                Hello, @neil-schipper, @gelle_marrisa, @peterjones and All,

                                My reasoning, at the end of my previous post, about the second form of regex \D+(((^\r\n)+|\z)|) is not correct ! Indeed, I said :

                                As the whole regex contains other alternatives, the regex engine, before backtracking, tries a match attempt with the second alternative

                                But, in that case, the correct search regex of my previous post \D+((^\r\n)+|\z)|\D+, which also contains an alternation, should show the same behavior and always choose the second alternative \D+ ?!

                                I’ve tried to find an explanation, without any success :-( Maybe one of you will be able to find a plausible one !


                                In brief, even simplifying the first version by omitting the \z case , and given this INPUT text, with a blank line after the last 12 number

                                12
                                abcd :
                                115/22
                                xyz :
                                333
                                product :
                                blablabla 4567
                                code :
                                01010
                                serial :
                                34
                                
                                
                                
                                
                                56
                                abcd :
                                116|22
                                xyz :
                                
                                
                                
                                
                                90
                                abcd :
                                product :
                                blablabla 456789
                                code :
                                serial :
                                12
                                
                                

                                Why does the regex S/R :

                                • SEARCH \D+(^\r\n)+|\D+

                                • REPLACE ?1\r\n:|

                                gives :

                                12|115|22|333|4567|01010|34
                                56|116|22
                                90|456789|12|
                                

                                And this second equivalent S/R :

                                • SEARCH \D+((^\r\n)+|)

                                • REPLACE ?2\r\n:|

                                gives this result :

                                12|115|22|333|4567|01010|34|56|116|22|90|456789|12|
                                

                                ???

                                BR

                                guy038

                                P.S. :

                                The problem does not come from the empty alternative. For instance, the regex abc(def|) does find either the string abcdef or just abc !
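One possible explanation is sketched below. It rests on an assumption about backtracking engines in general and is not a confirmed statement about Notepad++'s Boost engine. In \D+(^\r\n)+|\D+ the first top-level alternative is explored completely, including shrinking \D+, before the second one is tried. In \D+((^\r\n)+|) the empty branch sits after \D+, so it succeeds as soon as \D+ has taken its longest run and \D+ is never asked to give a line break back. Python's re module, also a backtracking engine, reproduces both results. The sketch uses \n line endings, a callback in place of the conditional replacement, and a shortened made-up sample.

```python
import re

def run(pattern: str, group: int, text: str) -> str:
    # Callback standing in for the conditional replacement ?N\r\n:|
    rx = re.compile(pattern, re.MULTILINE)
    return rx.sub(lambda m: "\n" if m.group(group) is not None else "|", text)

sample = "34\n\n\n56\nabcd :\n116|22\n\n\n90\n"

# Alternation at top level: \D+ backtracks inside the first alternative.
top_level = run(r"\D+(^\n)+|\D+", 1, sample)
# Alternation after \D+: the empty branch matches before any backtracking.
factored = run(r"\D+((^\n)+|)", 2, sample)

print(repr(top_level))
print(repr(factored))
```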

                                  Neil Schipper @guy038

                                  @guy038 said in How to find numbers in multiline in Notepad++:

                                  solution which does not depend on the number of lines of a section

                                  Very nice solution. I can see its applicability and am glad to know it so thanks for sharing.

                                  I won’t be much help on the follow-up discussion. I’m not even clear on what motivated you to go in this direction:

                                  if we try to factorize the search regex expression

                                  However, in trying to understand one building block of your newer regex, which includes a null in an OR subexpression, I encountered something confusing. I wanted to know “does a captured null return true or false?”

                                  So I ran ‘Replace All’ with F=(), R=?1dog:cat on a few cases.

                                  In the case of a new empty file, there are 0 matches. This seems wrong, although I wouldn’t be surprised if a more experienced regex person would say it’s correct and expected (because maybe in the docs it says “no text ==> no matches” or maybe, “a zero-length null only occurs before or after a character”).

                                  In the case of a file with the single character ‘p’ there are 2 matches and we get dogpdog which seems reasonable.
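The same experiment can be run in Python as a small sketch. This is Python's re module, not Notepad++, and the two do not agree on the empty-input case: Python reports one zero-length match on an empty string.

```python
import re

m = re.search(r"()", "p")
captured = m.group(1)                   # '' : the group is defined but empty

on_p = re.sub(r"()", "dog", "p")        # one match on each side of 'p'
on_empty = re.sub(r"()", "dog", "")     # Python finds one match here

print(repr(captured), on_p, on_empty)
```

In Python terms, a captured null is an empty string and not None, so an existence test on the group succeeds.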

                                    guy038

                                    Hello, @Neil-Schipper and All,

                                    I had never done this test :

                                    SEARCH ()

                                    REPLACE (?1dog:cat)

                                    Interesting ! You said :

                                    In the case of a new empty file, there are 0 matches. This seems wrong,…

                                    Well, your assertion is a bit philosophical : does an empty file contain a single empty string ( or an infinity ! ) ?

                                    Note that , in regex mode, the search of () ( an empty group 1 ) does show the ^ zero length match calltip, when applied to a new empty tab or a zero byte file !

                                    However, as you said, even a simple replacement with a dummy string, as for instance Test, does not occur and no text is inserted !

                                    Now, if I type in the phrase This is a test in a new tab and I use the regex S/R :

                                    SEARCH ()

                                    REPLACE ?1:|:x

                                    I get, after clicking on the Replace All button, with the Wrap around option ticked, the text :

                                    |T|h|i|s| |i|s| |a| |t|e|s|t|
                                    

                                    And, to my mind, all this is quite logical :

                                    • The group 1 is defined and contains an empty string

                                    • Technically, an empty string does exist between two characters, as well as before the first char and after the last. So each occurrence is changed into the | char

                                    Note that we can obtain the same result with this other regex S/R :

                                    SEARCH (.{0})

                                    REPLACE ?1:|:x

                                    and also with the more simple forms :

                                    SEARCH ()

                                    REPLACE |

                                    or

                                    SEARCH .{0}

                                    REPLACE |
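The equivalence of the two simple forms can also be checked with a short Python sketch, assuming Python's re module behaves like the Boost engine on this point (it does for this phrase).

```python
import re

phrase = "This is a test"

# An empty group and a zero-width .{0} both match at every position.
with_group = re.sub(r"()", "|", phrase)
with_zero_width = re.sub(r".{0}", "|", phrase)

print(with_group)
print(with_group == with_zero_width)
```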


                                    As we’re speaking about empty groups, I would like to mention a particular but important point when using conditional structures, in regex mode :

                                    Let’s consider this list :

                                    Ted=First Name
                                    25=Age
                                    Mary=First Name
                                    75=Age
                                    Elisabeth=First Name
                                    47=Age
                                    Bob=First Name
                                    62=Age
                                    

                                    Let’s introduce the conditional regex structure (?(1)Age|First Name) which means : if a group 1 has been previously defined, in the search regex, searches for the string Age else searches for the string First Name

                                    If we build the regex (?-si)^(\d*).*=(?(1)Age|First Name)$, you could say :

                                    • If a line begins with a number, the part \d* matches this number, the part .* matches an empty string, = matches the equal sign and the conditional block (?(1)Age|First Name) matches the string Age as the group 1 contains the number and is defined

                                    • If a line does not begin with a number, the part \d* matches an empty string, the part .* matches the first name, = matches the equal sign and the conditional block (?(1)Age|First Name) matches the string First Name as the group 1 is not defined and empty

                                    However, running this regex, against our text, it matches only the lines relative to the age and not all the lines. Why ?

                                    Well, what really represents the (\d*) group, after the ^ assertion :

                                    • If a line begins with some digits, no problem : group 1 is defined and contains the number

                                    • Now, if a line does not begin with digits, the group 1 is ALSO defined but contains an empty string

                                    Thus, in all cases the group 1 is defined, breaking the normal behaviour of the conditional part (?(1)Age|First Name)

                                    To get a functional overall regex, you need to change this non-optional group 1 (\d*) into an optional group, with a non-optional contents…, thanks to the syntax (\d+)?. Then, the search regex becomes :

                                    (?-si)^(\d+)?.*=(?(1)Age|First Name)$

                                    This time :

                                    • If a line begins with a number, the optional part (\d+)? matches this number and the group 1 is clearly defined and contains this number

                                    • But, if a line does not begin with a number the optional part (\d+)? matches nothing and the group 1 is not defined at all !

                                    You can verify that this final regex finds, as expected, all the lines of our text !

                                    Remark : Of course, we could have simply used the regex (?-si)^(\d+=Age|.+=First Name)$, without any conditional block !
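Python's re module supports the same (?(1)yes|no) conditional, so the difference between (\d*) and (\d+)? can be reproduced in a short sketch. The leading (?-si) is dropped because Python has no global flag-off syntax and both flags are off by default; the variable names are illustrative.

```python
import re

text = "\n".join([
    "Ted=First Name", "25=Age", "Mary=First Name", "75=Age",
    "Elisabeth=First Name", "47=Age", "Bob=First Name", "62=Age",
])

# (\d*) always participates, so the (?(1)...) test is always true.
always_set = re.compile(r"^(\d*).*=(?(1)Age|First Name)$", re.MULTILINE)
# (\d+)? stays unset on lines that do not start with a digit.
optional = re.compile(r"^(\d+)?.*=(?(1)Age|First Name)$", re.MULTILINE)

age_only = [m.group(0) for m in always_set.finditer(text)]
all_lines = [m.group(0) for m in optional.finditer(text)]

print(age_only)
print(all_lines)
```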


                                    This reasoning can be applied, as well, to conditional replacements ! For instance, given this text :

                                    Ted
                                    25
                                    Mary
                                    75
                                    Elisabeth
                                    47
                                    Bob
                                    62
                                    

                                    The following regex S/R :

                                    SEARCH (?-s)^(\d+)?.*$

                                    REPLACE (?1Age:First Name) : $&

                                    would give :

                                    First Name : Ted
                                    Age : 25
                                    First Name : Mary
                                    Age : 75
                                    First Name : Elisabeth
                                    Age : 47
                                    First Name : Bob
                                    Age : 62
                                    
                                    • If a number begins a line, group 1 is defined and the string Age, followed with \x20:\x20, is inserted right before the number

                                    • If a number does not begin a line, the group 1 is not defined at all. So the string First Name, followed with \x20:\x20, is inserted, this time, right before the first name

                                    And you’ll verify, that the similar version, with the non-optional group 1 (\d*) :

                                    SEARCH (?-s)^(\d*).*$

                                    REPLACE (?1Age:First Name) : $&

                                    gives wrong results, with the string "Age : " ALWAYS inserted :-((
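Python's re module has no conditional replacement syntax, so the sketch below assumes a callback as a stand-in for (?1Age:First Name) : $&. The function name label and the shortened list are illustrative. The sample has no trailing newline, which avoids an extra zero-length match at the end of the text.

```python
import re

text = "\n".join(["Ted", "25", "Mary", "75"])

def label(pattern: str) -> str:
    # Callback standing in for the replacement (?1Age:First Name) : $&
    def repl(m: re.Match) -> str:
        kind = "Age" if m.group(1) is not None else "First Name"
        return f"{kind} : {m.group(0)}"
    return re.sub(pattern, repl, text, flags=re.MULTILINE)

good = label(r"^(\d+)?.*$")   # group 1 unset on name lines
bad = label(r"^(\d*).*$")     # group 1 always set, possibly empty

print(good)
print(bad)
```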

                                    Best Regards,

                                    guy038

                                      Neil Schipper @guy038

                                      Good write up, @guy038. It’s good to know there’s a way to have a group conditionally defined as you showed.

                                      To get a functional overall regex, you need to change this non-optional group 1 (\d*) into an optional group, with a non-optional contents…, thanks to the syntax (\d+)?

                                      At first it seemed like this property of (spec+)? was an anomaly being exploited, or maybe an afterthought by the regex authors, but upon reflection there is some sense to it:

                                      In cases where (spec) has no match…

                                      • with (spec*) the (little man in the) machine says, "you asked for a capture group containing zero or more matches, so I’m giving you a capture group that contains null text; and a thing which contains surely must be defined."

                                      • with (spec+)? the (little man in the) machine says, "you asked for zero or one capture groups containing matched text, so I give you zero such groups, and a thing of which there are zero (in compsci) has no memory allocated and no address, i.e., is undefined."

                                      After realizing this, I wondered if some sticky situations might arise using this technique when there’s a sequence of these conditionally defined groups (ConDefGrps for short). Consider an expression in which all capture groups (CaptGrps) are also ConDefGrps, and, say the first ConDefGrp doesn’t match, so a CaptGrp isn’t defined, but, the second one does; since this latter one is the first CaptGrp that “comes to life”, wouldn’t its reference be 1 so that any conditional test on it (no matter if later in the same expression or in the substitution statement) would actually be testing for the existence of that second appearing, first defined CaptGrp?

                                      So I set up a test to check this.

                                      Consider a scheme in which a valid code consists of zero or more number 1’s, then 2’s, then 3’s, in that order, with at least one element present.

                                      An expression that only matches lines completely filled by a valid code is: ^(?=\S)([1]+)?([2]+)?([3]+)?$ but that’s not so interesting.

                                      Here’s an F/R pair that always captures the whole line whether it contains valid codes or not, and then writes it back with information about each group’s existence appended:

                                      F: ^([1]+)?([2]+)?([3]+)?.*$
                                      R: $0 - groups (?{1}1:.)(?{2}2:.)(?{3}3:.)

                                      When applied to this test data:

                                      1
                                      2
                                      3
                                      
                                      112
                                      1222222223
                                      2233111111
                                      4
                                      4123
                                      12z3
                                      1111111222223333
                                      31
                                      32
                                      1133
                                      

                                      we obtain:

                                      1 - groups 1..
                                      2 - groups .2.
                                      3 - groups ..3
                                       - groups ...
                                      112 - groups 12.
                                      1222222223 - groups 123
                                      2233111111 - groups .23
                                      4 - groups ...
                                      4123 - groups ...
                                      12z3 - groups 12.
                                      1111111222223333 - groups 123
                                      31 - groups ..3
                                      32 - groups ..3
                                      1133 - groups 1.3
                                      

                                      What the above demonstrates is that when a ConDefGrp is encountered in an expression, even though it may remain “undefined” (and return False in an existence test) it still consumes a group number allocated in the normal fashion.

                                      Thus, one need not worry that including multiple ConDefGrps might lead to ambiguity in the mapping of group to group number.
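The same check can be sketched in Python, where a group that did not participate in the match is reported as None but keeps its position. This uses Python's re module and a per-line match in place of the Notepad++ replacement; the helper name flags is illustrative.

```python
import re

rx = re.compile(r"^([1]+)?([2]+)?([3]+)?.*$")

def flags(line: str) -> str:
    m = rx.match(line)
    # A group that did not participate is None, yet keeps its number.
    return "".join(str(n) if m.group(n) is not None else "." for n in (1, 2, 3))

for line in ["1", "2", "3", "", "112", "2233111111", "4123", "12z3", "1133"]:
    print(f"{line} - groups {flags(line)}")
```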

                                        guy038

                                        Hi, @neil-schipper and All,

                                        To summarize :

                                        • With the syntax ^(1+)?•••••, group 1 must contain at least one 1. So, if no 1 can be found in the text, group 1 is not defined; the match still succeeds because the whole group is optional
                                          (? quantifier applied to the group)

                                        • With the syntax ^(1*)•••••, group 1 may or may not contain some 1s. So, if no 1 can be found in the text, group 1 is still defined, with empty contents
                                          (* quantifier inside the group)

                                        • With the syntax ^(1)*•••••, group 1 must contain exactly one 1. So, if no 1 can be found in the text, group 1 is not defined; the match still succeeds because the whole group is optional
                                          (* quantifier applied to the group)


                                        So, given the text :

                                        000000 |
                                        111111 |
                                        222222 |
                                        333333 |
                                        111222 |
                                        111133 |
                                        223333 |
                                        112233 |
                                        

                                        The regex S/R :

                                        SEARCH (?-s)^(1+)?(2+)?(3+)?.+

                                        REPLACE $0 groups (?{1}1:.)(?{2}2:.)(?{3}3:.)

                                        gives :

                                        000000 | groups ...
                                        111111 | groups 1..
                                        222222 | groups .2.
                                        333333 | groups ..3
                                        111222 | groups 12.
                                        111133 | groups 1.3
                                        223333 | groups .23
                                        112233 | groups 123
                                        

                                        The regex S/R :

                                        SEARCH (?-s)^(1*)(2*)(3*).+

                                        REPLACE $0 groups (?{1}1:.)(?{2}2:.)(?{3}3:.)

                                        gives :

                                        000000 | groups 123
                                        111111 | groups 123
                                        222222 | groups 123
                                        333333 | groups 123
                                        111222 | groups 123
                                        111133 | groups 123
                                        223333 | groups 123
                                        112233 | groups 123
                                        

                                        And the regex S/R :

                                        SEARCH (?-s)^(1)*(2)*(3)*.+

                                        REPLACE $0 groups (?{1}1:.)(?{2}2:.)(?{3}3:.)

                                        gives :

                                        000000 | groups ...
                                        111111 | groups 1..
                                        222222 | groups .2.
                                        333333 | groups ..3
                                        111222 | groups 12.
                                        111133 | groups 1.3
                                        223333 | groups .23
                                        112233 | groups 123
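
For anyone who wants to cross-check the three tables above without Notepad++, here is a rough sketch in Python. It assumes Python's `re` module treats non-participating groups the same way as the Boost engine does here (an unmatched group is `None`, an empty match is `''`). The names `LINES`, `PATTERNS` and `group_map` are made up for this example.

```python
import re

LINES = ["000000 |", "111111 |", "222222 |", "333333 |",
         "111222 |", "111133 |", "223333 |", "112233 |"]

# The three search expressions from the post; in Python, '.' does not
# match a newline by default, which corresponds to (?-s).
PATTERNS = {
    "(1+)?": r"^(1+)?(2+)?(3+)?.+",
    "(1*)":  r"^(1*)(2*)(3*).+",
    "(1)*":  r"^(1)*(2)*(3)*.+",
}

def group_map(pattern, line):
    """Mimic the (?{n}n:.) replacement: digit if group n is defined, '.' if not."""
    m = re.match(pattern, line)
    return "".join(str(i) if m.group(i) is not None else "." for i in (1, 2, 3))

for name, pattern in PATTERNS.items():
    print("Syntax", name)
    for line in LINES:
        print(" ", line, "groups", group_map(pattern, line))
```

Only the `(1*)` form reports all three groups as defined on every line, because a group that matches the empty string still counts as defined.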
                                        

                                        BR

                                        guy038

                                        • Alan KilbornA
                                          Alan Kilborn
                                          last edited by

                                          So this is a good discussion thread, but the choice to use literal 1, 2, 3 in the examples IMO wasn’t the best for the utmost clarity. :-)
