    Regex to find any lines that do NOT have a specific number of a character

    • Mark YorkovichM
      Mark Yorkovich
      last edited by

      I’m working with large pipe-delimited CSVs (110k+ lines) and need to find any lines in a file that do not have a specific number of pipes: one file should have 9 pipes (10 columns) per line and another file should have 16 pipes (17 columns) per line.

      How do I do that?

      • EkopalypseE
        Ekopalypse
        last edited by Ekopalypse

        @Mark-Yorkovich - sorry no answer just another question.

        Why the hell is this not working?

        Assuming the text

        1|2|3|4|5|6|7|8|9|10
        1|2|3|4|5|6|7|8|9
        1|2|3|4|5|6|7|8|9|10|11
        

        and using ^((.+?\|){9})(?!(.+?\|)) to find only lines with 10 columns.
        It matches the line with 11 columns, why?
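A likely explanation, sketched in Python's `re` module rather than in Notepad++ itself (an assumption: both are backtracking engines, and Python gives the same result on this test data). The lazy `.+?` is only lazy, not forbidden from crossing a pipe. When the negative lookahead rejects the first attempt, the engine backtracks into the ninth repetition and lets it grow past one more pipe, after which the lookahead has nothing left to object to.

```python
import re

# Eko's original pattern: the lazy ".+?" may still grow across a pipe when
# that is the only way for the overall match to succeed.
pattern = re.compile(r'^((.+?\|){9})(?!(.+?\|))')

line = '1|2|3|4|5|6|7|8|9|10|11'   # 10 pipes, 11 columns
m = pattern.match(line)

# The nine repetitions together swallow ten pipes, because the last
# repetition was stretched to "9|10|" during backtracking.
consumed = m.group(1)
print(consumed)              # 1|2|3|4|5|6|7|8|9|10|
print(consumed.count('|'))   # 10
print(m.group(2))            # 9|10|

# A line with only 8 pipes cannot satisfy nine repetitions at all.
too_short = pattern.match('1|2|3|4|5|6|7|8|9')
```

So the pattern really means "at least 9 pipes", not "exactly 9 pipes".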

        • Mark YorkovichM
          Mark Yorkovich
          last edited by

          @Ekopalypse said:

          ^((.+?\|){9})(?!(.+?\|))

          Yeah - you’re kinda hijacking my question. You’re looking for the complete opposite of what I am. Perhaps you could open your own post?

          Just a suggestion…

          • EkopalypseE
            Ekopalypse @Mark Yorkovich
            last edited by

            @Mark-Yorkovich

            Ja, I tried to solve it the other way round. If you can match the lines which do meet the requirement,
            you can mark them, invert the marking, and you are left with the lines which do not meet the
            requirement.

            • Alan KilbornA
              Alan Kilborn @Ekopalypse
              last edited by

              @Ekopalypse said:

              reverse the marking

              Eko means using the bookmarking feature of the Mark command to match the lines you aren’t interested in finding, then inverting the bookmarks (see the Search > Bookmarks menu) to get the lines you are interested in.

              • Alan KilbornA
                Alan Kilborn
                last edited by Alan Kilborn

                This expression seems to put some redmarking (and thus you could bookmark on this basis) on only those lines that have exactly 9 pipes:

                (?-s)^([^|\r\n]*?\|){9}(?!(?:.*?\|))

                Using [^...] without putting line-ending characters inside makes me nervous, so I’ve done so above; not sure they are relevant here. :)

                Obviously the {9} can be modified if nine is not really what is needed.

                I am confused by why Eko’s attempt does not work.
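For readers who want to try the negated-class idea outside Notepad++, here is a minimal Python sketch. The helper name `exact_pipe_regex` is hypothetical, and the pattern is adapted rather than copied: Python's `re` has no global `(?-s)` switch, so the lookahead uses `[^\r\n]*` to stay on one line. Because `[^|\r\n]*` can never cross a pipe, each repetition consumes exactly one field, and `{9}` can be swapped for any other count.

```python
import re

def exact_pipe_regex(n):
    """Build a pattern for lines containing exactly n pipes (hypothetical helper)."""
    # Each repetition is one field plus its pipe; the lookahead then insists
    # that no further pipe exists on the rest of the line.
    return re.compile(r'^(?:[^|\r\n]*\|){%d}(?![^\r\n]*\|)' % n, re.MULTILINE)

text = (
    "1|2|3|4|5|6|7|8|9|10\n"
    "1|2|3|4|5|6|7|8|9\n"
    "1|2|3|4|5|6|7|8|9|10|11\n"
)

hits = [m.group(0) for m in exact_pipe_regex(9).finditer(text)]
print(hits)   # ['1|2|3|4|5|6|7|8|9|']  -> only the 10-column line matched
```

Note that the match covers the line only up to its ninth pipe, which is enough for bookmarking the line.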

                • EkopalypseE
                  Ekopalypse @Alan Kilborn
                  last edited by Ekopalypse

                  @Alan-Kilborn

                  Alan, was this intentional?
                  (?-s)^([^|\r\n]*?\|){9}(?!(?:.*?\|))
                  or should it be
                  (?-s)^([^\|\r\n]*?\|){9}(?!(?:.*?\|))
                  (which by the way doesn’t seem to have any impact if used or not)

                  Now given your working example this works also
                  (?-s)^([^\|]*?\|){9}(?!.*?\|)

                  But I don’t understand why there is a need to make sure that a line
                  does not start with a pipe.

                  • dinkumoilD
                    dinkumoil @Mark Yorkovich
                    last edited by

                    @Ekopalypse

                    The following regex does the job: ^(?>.+?\|){9}(?!.+?\|). I’m not sure why, but it seems to be related to the lack of backtracking due to ?>, which turns group one into an atomic group (non-capturing, and never re-entered by backtracking).

                    @Mark-Yorkovich

                    In the Search & Replace dialog, go to the Mark tab:

                    Find what: ^(?>.+?\|){9}(?!.+?\|)
                    Bookmark line: ticked
                    Purge for each search: ticked
                    Wrap around: ticked
                    Regular expression: ticked

                    Click Mark All. Go to (menu) Search -> Bookmark -> Inverse Bookmark. Now all lines which do not contain exactly 9 pipe characters are bookmarked.

                    You can navigate to these lines with F2 (next bookmark) and SHIFT+F2 (previous bookmark).

                    You can also remove these lines by clicking (menu) Search -> Bookmark -> Remove bookmarked lines.

                    You can also do the opposite (removing not bookmarked lines) by clicking (menu) Search -> Bookmark -> Remove unmarked lines.
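If the file can also be checked outside the editor, the same "everything that does not have exactly 9 pipes" result can be had without a regex at all. This is only a sketch under that assumption; the function name `lines_with_wrong_pipe_count` is made up for illustration.

```python
def lines_with_wrong_pipe_count(lines, expected):
    """Return (line_number, pipe_count) for every line whose pipe count differs."""
    return [
        (number, line.count('|'))
        for number, line in enumerate(lines, start=1)
        if line.count('|') != expected
    ]

sample = [
    '1|2|3|4|5|6|7|8|9|10',
    '1|2|3|4|5|6|7|8|9',
    '1|2|3|4|5|6|7|8|9|10|11',
]

bad = lines_with_wrong_pipe_count(sample, 9)
print(bad)   # [(2, 8), (3, 10)]
```

For the second file mentioned in the question, `expected` would be 16 instead of 9.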

                    • Mark YorkovichM
                      Mark Yorkovich
                      last edited by

                      Eko’s expression works for me to find rows with 9 pipes/10 cols. Alan’s expression doesn’t match anything in my file, which has mostly 9 pipes/10 cols per row plus a few known rows with fewer than 9 pipes.

                      I’m trying to match rows with more or fewer than 9 pipes.

                      • Alan KilbornA
                        Alan Kilborn @Ekopalypse
                        last edited by

                        @Ekopalypse

                        Was removing the escaping of the | inside the [ and ] intentional? Yes, I suppose, since it has no special meaning there and doesn’t need escaping.

                        don’t understand why there is a need to make sure that a line does not start with a pipe

                        I think that with this type of data, fields could be empty, thus if the first field is empty a line would start with a pipe? But, is the regex really saying what I think you implied? I’m saying “not pipe” not just at the start of a line, but for in between fields as well. And I’m only doing it this way because your original attempt using a . expression fails (for some odd and as yet unknown reason). I think I’m getting confused.

                        • EkopalypseE
                          Ekopalypse @dinkumoil
                          last edited by

                          @dinkumoil

                          ok, I hope I finally understood this sentence

                          Match pattern independently of surrounding patterns, and don’t backtrack into it. Failure to match will cause the whole subject not to match.

                          which then means that my first attempt, the one I was questioning, did backtrack.
                          That makes your regex the one which I, and hopefully @Mark-Yorkovich, were looking for.

                          @Alan-Kilborn,
                          Alan, ja, I guess you are right.

                          @Mark-Yorkovich, so does this work on your data with the procedure described by
                          @dinkumoil?

                          • dinkumoilD
                            dinkumoil @Ekopalypse
                            last edited by dinkumoil

                            @Ekopalypse said:

                            ok, I hope I finally understood this sentence

                            I got the following hint at https://regex101.com/ when trying your regex:

                            A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you’re not interested in the data.

                            Then I read https://www.regular-expressions.info/atomic.html

                            Together, these made me give the atomic (non-capturing) group a try.

                            • Mark YorkovichM
                              Mark Yorkovich
                              last edited by

                              @dinkumoil
                              I followed your instructions, but I’m not getting any matches.

                              • EkopalypseE
                                Ekopalypse @Mark Yorkovich
                                last edited by Ekopalypse

                                @Mark-Yorkovich

                                Make sure your caret is on the first line if you have not ticked Wrap around.

                                • Mark YorkovichM
                                  Mark Yorkovich @Ekopalypse
                                  last edited by

                                  @Ekopalypse
                                  Yup, sure is. - No matches - double-checked my settings.

                                  To reiterate: My file is mostly 9 pipes/10 cols per line, but some have less and a few more than that and I need to find those.

                                  • dinkumoilD
                                    dinkumoil @Mark Yorkovich
                                    last edited by dinkumoil

                                    @Mark-Yorkovich

                                    Using the test data of @Ekopalypse, I generated a file of 146545 lines and did what I suggested above. I got the expected result.

                                    Be sure that the pipe character in your file is really a pipe character (code 124). There is another one (code 166 in Windows-1252 character encoding) which looks nearly identical:

                                    Pipe character: |
                                    The other one: ¦
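A quick way to check for the look-alike character programmatically, as a sketch: the function name `find_broken_bars` is hypothetical, and the lines are assumed to be already decoded to text.

```python
PIPE = '|'              # U+007C VERTICAL LINE, code 124
BROKEN_BAR = '\u00a6'   # U+00A6 BROKEN BAR, code 166 (byte 0xA6 in Windows-1252)

def find_broken_bars(lines):
    """Return the 1-based numbers of lines that contain a broken bar."""
    return [n for n, line in enumerate(lines, start=1) if BROKEN_BAR in line]

sample = ['a|b|c', 'a\u00a6b|c', 'plain text']
suspects = find_broken_bars(sample)
print(ord(PIPE), ord(BROKEN_BAR))   # 124 166
print(suspects)                     # [2]
```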

                                    • Mark YorkovichM
                                      Mark Yorkovich @dinkumoil
                                      last edited by

                                      Yup - they’re pipes.

                                      Here is a good sample of what I’m working with. Lines 1, 9, 10, 11, 16 thru 20 and 36, 37 are single-line records with 9 pipes and 10 columns. Lines 2 thru 8 are one record and together have 9 pipes/10 cols. Similarly, lines 12 through 15 are a single record, and lines 21 thru 35 are a single record.

                                      LOREM120|8 |3 |1 |1 |0 |0 |||INST020
                                      LOREM120|9 |1 |1 |0 |0 |0 ||Lorem Ipsum Dolor]
                                      LOREM: BS/BP

                                      LOREM IPSUM:
                                      Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.|
                                      IPSUM16|1 |1 |1 |1 |0 |0 |||3001479
                                      IPSUM16|1 |2 |1 |1 |0 |0 |||3003077
                                      IPSUM16|11 |0 |1 |0 |0 |0 |||
                                      IPSUM16|13 |0 |1 |0 |0 |0 ||Lorem ipsum dolor sit amet
                                      consectetur adipiscing elit,
                                      sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

                                      DOLOR53 1 1 1 2 0 0 3003084
                                      DOLOR53 2 3 1 1 0 0 Lorem ipsum
                                      DOLOR53 2 4 1 1 0 0 Lorem ipsum
                                      LOREM56 8 1 1 1 0 0 Lorem ipsum
                                      LOREM56 8 2 1 1 0 0 Lorem ipsum
                                      LOREM56 9 1 1 0 0 0 Lorem ipsum dolor sit amet

                                      consectetur adipiscing elit

                                      consectetur adipiscing elit
                                      consectetur adipiscing elit

                                      consectetur adipiscing elit
                                      Lorem ipsum dolor sit amet
                                      sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

                                      Lorem ipsum dolor sit amet
                                      consectetur adipiscing elit
                                      Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.|
                                      DOLOR19|1 |2 |1 |1 |0 |0 |||3003124
                                      LOREM01|1 |1 |1 |1 |1 |0 |||3003024

                                      Your suggested regex ^(?>.+?\|){9}(?!.+?\|) isn’t finding any matches on that.

                                      • EkopalypseE
                                        Ekopalypse @Mark Yorkovich
                                        last edited by

                                        @Mark-Yorkovich

                                        Because the regex assumed that all columns contain data: .+? needs at least one character before each pipe, so empty fields throw off the count.

                                        find: ^(?>.*?\|){9}(?!.*?\|) does not make that assumption.
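The difference between `.+?` and `.*?` on empty fields can be sketched in Python. Python's `re` only gained atomic groups in version 3.11, so this sketch emulates `^(?>FIELD){9}(?!FIELD)` with a plain loop that commits to each field match and never revisits it; the helper name `atomic_repeat_matches` is hypothetical.

```python
import re

def atomic_repeat_matches(field_pattern, line, times=9):
    """Emulate ^(?>FIELD){times}(?!FIELD) by committing to each field match."""
    field = re.compile(field_pattern)
    pos = 0
    for _ in range(times):
        m = field.match(line, pos)
        if m is None:
            return False
        pos = m.end()                        # commit: never revisit this field
    return field.match(line, pos) is None    # plays the role of the lookahead

row = 'LOREM120|8 |3 |1 |1 |0 |0 |||INST020'   # 9 pipes, two empty fields

# At an empty field, ".+?" must take the pipe itself as its one character
# and therefore consumes two pipes in a single repetition.
print(re.match(r'.+?\|', '||INST020').group())   # ||
print(re.match(r'.*?\|', '||INST020').group())   # |

print(atomic_repeat_matches(r'.+?\|', row))   # False: runs out of pipes early
print(atomic_repeat_matches(r'.*?\|', row))   # True: exactly 9 pipes found
```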

                                        • Mark YorkovichM
                                          Mark Yorkovich @Ekopalypse
                                          last edited by

                                          @Ekopalypse said:

                                          @Mark-Yorkovich
                                          because it was assumed that all columns contain data

                                          My bad. I didn’t give you all of the details of what I’m working with.

                                          find: ^(?>.*?\|){9}(?!.*?\|) does not make that assumption.

                                          This works.

                                          So at this point what I’d need to do, ideally, is to do a Find/Replace, finding all of the new line/line feed characters - only in those now-bookmarked lines - and replace them with some other character (spaces, dummy chars, whatever) to get each of those records to be on one line. Can I do a find/replace on just the bookmarked lines? Or perhaps, instead of the multi-step approach, is there a way to do this on the Replace tab, entering a regex in the Find what box that finds those lines and just replace the new line characters with dummy characters in one step?
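One possible approach outside the editor, offered only as a sketch: accumulate physical lines until the running pipe count reaches the expected number, then emit them as one record. The function name `join_broken_records` and the sample rows are made up for illustration, and the sketch assumes that line breaks never occur inside the last column (otherwise a record would be closed too early).

```python
def join_broken_records(lines, expected_pipes=9, filler=' '):
    """Merge consecutive lines until each record has the expected pipe count."""
    records = []
    parts, pipes = [], 0
    for line in lines:
        parts.append(line)
        pipes += line.count('|')
        if pipes >= expected_pipes:
            # Record is complete: replace the embedded line breaks with filler.
            records.append(filler.join(parts))
            parts, pipes = [], 0
    if parts:
        # Trailing lines that never reached the expected count are kept as-is
        # so that they can be inspected by hand.
        records.append(filler.join(parts))
    return records

# Hypothetical rows: the second record is split over four physical lines.
lines = [
    'A|1|2|3|4|5|6|7||X',
    'B|1|2|3|4|5|6|7|first',
    'second',
    '',
    'third|',
    'C|1|2|3|4|5|6|7||Y',
]

result = join_broken_records(lines)
print(len(result))   # 3
```

Records that end up with more than the expected number of pipes still need a manual look, for example with a plain pipe count per line.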

                                          • Alan KilbornA
                                            Alan Kilborn @Mark Yorkovich
                                            last edited by

                                            @Mark-Yorkovich said:

                                            Alan’s exp doesn’t match anything in my file

                                            Well, if I copy and paste your “lorem ipsum” data (above) into a new tab and then run my regex (above) on it, I get lines with exactly 9 pipes redmarked, which I thought was the goal (or the inverse of the goal):

                                            [Screenshot hosted on Imgur]

                                            So…I really don’t know where the disconnect is…

                                            The Community of users of the Notepad++ text editor.