    Show (or keep) subsets of a file

    14 Posts 4 Posters 66 Views
    • Mark Boonie

      I have a file that contains a great many blocks of text. Each block starts with a line containing <start string> and ends with a line containing <end string>. I’m only interested in those blocks that have <target string> somewhere in the block; there are far fewer occurrences of <target string> than <start string>, but still too many to process manually. What I’d like to be able to do is find each occurrence of <target string> and keep only those lines from the preceding <start string> to the next <end string>. I don’t really care whether everything else is hidden or deleted. Is there a way to do something like this? Thanks for any suggestions.

      • guy038

        Hello, @mark-boonie and All,

        Mark, I think there is a way to achieve what you want with regular expressions!

        Could you provide one or two examples of these <start string>......<end string> blocks?

        When posting, first hit the </> Code button to ensure that your text is copied literally.

        See you later,

        Best Regards,

        guy038

        • Mark Boonie

          Hi @guy038. I’m changing it a bit for business reasons, but basically it would look like this:

          *Block start                                                          
          00000000013FC200     00200280     00010000     00000000     00000000  
          00000000013FC210     00000002     CC5CDDA0     00000000     00000000  
          00000000013FC220     00000000     00000000     01266100     01266100  
          00000000013FC230     00808000     013FC2B8     00000000     00000000  
          00000000013FC240     0003D000     03A8A1A0     03A8A670     03A8A710  
          00000000013FC250     00000000     0003DD88     013FD020     00000000  
          00000000013FC260     00000000     00000000     00000000     0C000002  
          00000000013FC270     11804017     03A8A718     0E000000     00800000  
          00000000013FC280     40000020     013FC280     00000000     00000000  
          00000000013FC290     00000000     00000000     00000000     00000000  
          00000000013FC2A0     00000000     0C000002     10800011     00000000  
          00000000013FC2B0     06000000     00800000     40000020     013FC2B8  
          00000000013FC2C0     00000000     00000000     00000000     00000000  
          00000000013FC320     00000000     01421800     00000000     00000000  
          00000000013FC330     00000000     00000000     00000000     00000000  
          00000000013FC3C0     00000000     00000000     00000000     00000571  
          00000000013FC3D0     00000000     00000000     00000000     00000000  
          00000000013FC3F0     00000000     00000000     00000000     00000000  
                                                                                
          *Block end                                                            
          

          The first and last lines shown are the delimiter lines. The target string would vary, but obviously it’s another hex string. Thanks for any suggestions.

          • guy038

            Hi, @mark-boonie,

            Hmm, I’m a bit troubled by the example that you provided!


            Indeed, I had already found a regex solution that follows exactly what you said in your initial post.

            So, I created the sample text below:

            bla bla
            
            <start string>
            
            blo blo
            
            
            <target string> 
            
            <end string>
            
            bla bal
            blah blah
            
            <start string>
            <end string>
            
            <start string>
            
            <target string> 
            
            
            blu blu
            
            <end string>
            
            bla bla
            
            <start string>
            bla bla
            blah blah
            <end string>
            
            bla bla
            
            <start string>
            <target string> 
            <end string>
            
            bla bla
            

            Now :

            • Open the Replace dialog ( Ctrl + H )

            • Uncheck all box options

            • Find what: (?s)^<start string>((?!<start string>).)+?<target string>.+?^<end string>\R(*SKIP)(*F)|(?-s)^.*\R

            • Replace with: leave EMPTY

            • Check the Wrap around option

            • Select the Regular expression search mode

            • Click on the Replace All button

            => You should be left with this text :

            <start string>
            
            blo blo
            
            
            <target string> 
            
            <end string>
            <start string>
            
            <target string> 
            
            
            blu blu
            
            <end string>
            <start string>
            <target string> 
            <end string>
            

            You can verify that, in this OUTPUT:

            • Any text outside the <start string>....<end string> blocks has been deleted

            • Any <start string>....<end string> block that does not contain a <target string> line has been deleted, too

            • Only the blocks that start with a <start string> line, end with an <end string> line and contain a <target string> line remain

            • And note that, in this last case, all the lines of these blocks are kept, too!
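
For readers who want to try the same keep-or-delete idea outside Notepad++, here is a minimal Python sketch of the logic described above. It is an approximation under stated assumptions, not the regex from the post: Python's `re` module has no `(*SKIP)(*F)` and no `\R`, so a named group plus a replacement callback plays the role of "keep this match", and `\r?\n` (LF or CRLF endings assumed) stands in for `\R`. The name `keep_matching_blocks` and its parameters are hypothetical.

```python
import re

def keep_matching_blocks(text, start, end, target):
    """Delete every line except blocks running from a `start` line to an
    `end` line that contain `target` (hypothetical helper, for illustration)."""
    s, e, t = (re.escape(x) for x in (start, end, target))
    eol = r"\r?\n"  # assumption: LF or CRLF endings; stands in for \R
    pattern = re.compile(
        # Branch 1: a whole block containing the target, captured as "keep".
        rf"(?P<keep>(?s:^{s}(?:(?!{s}).)+?{t}.+?^{e}{eol}))"
        # Branch 2: any other complete line, which will be deleted.
        rf"|^.*{eol}",
        re.MULTILINE,
    )
    # Put kept blocks back unchanged; replace everything else with nothing.
    return pattern.sub(lambda m: m.group("keep") or "", text)

sample = (
    "bla bla\n"
    "<start string>\nblo blo\n<target string>\n<end string>\n"
    "<start string>\n<end string>\n"
    "bla bla\n"
)
print(keep_matching_blocks(sample, "<start string>", "<end string>", "<target string>"))
```

Like the regex it imitates, this sketch expects a line ending after every end line, so a final block whose end line is the very last line of the text, with no newline after it, is not kept.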


            However, with the text provided :

            • The delimiters *Block start and *Block end are different from those in your initial post, but this is not a problem

            • But the fact that the <target string> cannot be clearly identified is a BIG problem

            Indeed, from your example, how can I know whether this block of text must be kept or not?

            I probably miss something …

            BR

            guy038

            • Coises @guy038

              @Mark-Boonie said in Show (or keep) subsets of a file:

              I’m changing it a bit for business reasons

              @guy038 said:

              I probably miss something …

              I think the problem is that the block delimiters and target strings contain sensitive information and it would be irresponsible for the original poster to reproduce them in a public forum.

              @Mark-Boonie said:

              The first and last lines shown are the delimiter lines.

              Many characters have special meanings in regular expressions, so we must be careful telling you how to represent your strings if we don’t know exactly what they are.

              Can you tell us whether the delimiter strings are always the same? Do they always start at the beginning of a new line?

              If each delimiter string is not exactly the same, every time, we need to know enough details to determine the patterns they follow that differentiate them from other lines in the file.

              Do the delimiter strings contain any characters other than letters, numbers, spaces? Rather than get into every last detail, if they do contain characters other than letters, numbers and spaces, could they ever contain the specific sequence \E (backslash followed by capital E)? If they cannot, the strings can be enclosed in a \Q …\E pair to “quote” them so there is no need to worry about exactly what special characters need escaping.
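
As a rough illustration of why quoting matters, here is a small Python sketch. Python's `re` module does not accept `\Q…\E`, so `re.escape` is used as the closest equivalent; the helper names `compiles` and `literal_at_line_start` are hypothetical and only serve the example.

```python
import re

def compiles(pattern):
    """Return True if `pattern` is a valid Python regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

def literal_at_line_start(text):
    """Build a pattern that matches `text` literally at the start of a line."""
    return re.compile("^" + re.escape(text), re.MULTILINE)

delimiter = "*Block start"
# The bare asterisk is a quantifier with nothing to repeat, so this is invalid.
print(compiles(delimiter))
# Once escaped, the same string is matched character for character.
print(bool(literal_at_line_start(delimiter).search("Extra stuff\n*Block start\n")))
```

The point carries over to the Notepad++ case: once the delimiter is quoted as a whole, there is no need to work out which individual characters are special.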

              @Mark-Boonie said:

              The target string would vary, but obviously it’s another hex string.

              A hex string (involving only letters, numbers and possibly spaces) will be no problem. But we need to be precise about what “would vary” means.

              Do you mean that each time you do this search, you will start with a copy of the whole file and search for a specific target string?

              Or do you mean that there will be several different target strings, and you will want to get all the blocks that contain any of them into a single file?

              Or something else?

              @guy038 is already on this; I have no other recommendations. All I intend with this post is to clarify what information will be needed to implement his solution in the specific case at hand.

              • Mark Boonie

                I think what you provided should work – the syntax “<xxx>” means that I would replace that string with the specific string I was interested in, because at different times I’d be searching for different strings.

                However, I just tried your search string with two different substitutions without success. So, let’s look at a simple example. Start with this file:

                *Block start                                                          
                00000000013FC200     00200280     00010000     00000000     00000000  
                00000000013FC210     00000002     CC5CDDA0     00000000     00000000  
                                                                                      
                *Block end  
                
                Extra stuff
                
                *Block start                                                          
                00000000013FC200     00200280     00010000     00000000     00000000  
                00000000013FC210     00000002     CC5CDDA0     00000000     00000000  
                00000000013FC220     00000000     00000000     01266100     01266100  
                00000000013FC230     00808000     013FC2B8     00000000     00000000  
                
                *Block end  
                
                *Block start                                                          
                00000000013FC200     00200280     00020000     00000000     00000000  
                00000000013FC210     00000002     CC5CDDA0     00000000     00000000  
                00000000013FC220     00000000     00000000     01266100     01266100  
                00000000013FC230     00808000     013FC2B8     00000000     00000000  
                
                *Block end  
                
                *Block start                                                          
                00000000013FC200     00200280     00030000     00000000     00000000  
                00000000013FC210     00000002     CC5CDDA0     00000000     00000000  
                00000000013FC220     00000000     00000000     01266100     01266100  
                00000000013FC230     00808000     013FC2B8     00000000     00000000  
                
                *Block end  
                
                Extra stuff
                
                *Block start                                                          
                00000000013FC200     00200280     00010000     00000000     00000000  
                *Block end  
                

                In this case, the blocks that I want to find are those that contain “80 00010000”. This target occurs in blocks 1, 2, and 5. The string I’m searching for is:

                (?s)^*Block start((?!*BLOCK start).)+?80     00010000.+?^*Block end\R(*SKIP)(*F)|(?-s)^.*\R
                

                But when I fill out the Replace dialog box and select Replace All, I get an error that the regular expression is invalid. I don’t use regular expressions so I’m not sure what the errors are, but I suspect it has to do with the asterisks specified in the start and end strings. I found that they need to be escaped with ‘\’, resulting in:

                (?s)^\*Block start((?!\*BLOCK start).)+?80     00010000.+?^\*Block end\R(*SKIP)(*F)|(?-s)^.*\R
                

                This gave me a message that 36 occurrences were replaced, but the only thing I was left with was a single line containing “*Block end”. Are there other modifications to the search string that I need to make?

                • Mark Boonie

                  @Coises posed several questions that I’ll answer here:

                  Can you tell us whether the delimiter strings are always the same? Do they always start at the beginning of a new line?

                  For a single search, the delimiter strings for all blocks to be found are the same, not “either string1 or string2”. I would change the delimiter strings for different invocations. (“Yesterday I searched for blocks that were delimited by string1 and string2, but today, in another file, I need to search for blocks delimited by string3 and string4.”)
                  [Start of edit]
                  And yes, they are always at the beginning of a new line. It wasn’t clear to me if I could specify only enough to make the string unique or if I had to specify the entire line contents.
                  [End of edit]

                  If each delimiter string is not exactly the same, every time, we need to know enough details to determine the patterns they follow that differentiate them from other lines in the file.

                  I assume(!) that using the entire line for the start string and the end string should be sufficient; otherwise the file would be ill-formed because the delimiter strings wouldn’t really be delimiters. But yes, I see your next point.

                  Do the delimiter strings contain any characters other than letters, numbers, spaces? Rather than get into every last detail, if they do contain characters other than letters, numbers and spaces, could they ever contain the specific sequence \E (backslash followed by capital E)? If they cannot, the strings can be enclosed in a \Q …\E pair to “quote” them so there is no need to worry about exactly what special characters need escaping.

                  They would contain an asterisk, if I have to specify the entire line. If only a unique portion of the line is needed then I can skip specifying the asterisk, but I’m not sure if the syntax of the regular expression is trying to match the entire line or not. But they would never contain a backslash, so “\E” would not be in the delimiters (or in the blocks themselves). The \Q…\E syntax is something I’ll probably use every time, whether it’s needed or not.

                  A hex string (involving only letters, numbers and possibly spaces) will be no problem. But we need to be precise about what “would vary” means.

                  It would be your first case below, where I would have a single start/end string for each search, not the “either string1 or string2” case.

                  Do you mean that each time you do this search, you will start with a copy of the whole file and search for a specific target string?

                  Yes.

                  Or do you mean that there will be several different target strings, and you will want to get all the blocks that contain any of them into a single file?

                  No.

                  Or something else?

                  No.

                  • Mark Boonie

                    I noticed that I had misspecified one of the occurrences of the starting string – I had used uppercase where the starting string did not. I retried the search using the correct case and also the \Q…\E syntax mentioned by @Coises :

                    (?s)^\Q*Block start\E((?!\Q*Block start\E).)+?\Q80     00010000\E.+?^\Q*Block end\E\R(*SKIP)(*F)|(?-s)^.*\R
                    

                    But I got the same result where 36 occurrences were replaced and the resulting file contained only a single line of “*Block end”.

                    • Mark Boonie

                      Ooh, getting close. Apparently some of the lines had trailing blanks on them. (I inadvertently added them when I was “sanitizing” the file so I could post it.) After removing the trailing blanks, the previous search I showed correctly identified blocks 1 and 2, but it did not identify block 5. I’m trying to parse the regular expression to see why, but the learning curve is steep…
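
The trailing-blank effect can be reproduced in a small Python sketch. The most likely cause is that the pattern expects the line ending immediately after the end string, so blanks in between prevent the match. The helper `strip_trailing_blanks` is hypothetical, and `\r?\n` is an assumed stand-in for `\R`, which Python's `re` module does not support.

```python
import re

def strip_trailing_blanks(text):
    """Remove spaces and tabs that sit directly before a line ending
    or before the end of the text."""
    return re.sub(r"[ \t]+(?=\r?\n|\Z)", "", text)

# The end line must be followed directly by its line ending.
end_line = re.compile(r"^\*Block end\r?\n", re.MULTILINE)

dirty = "*Block start\n0001  \n*Block end  \nExtra stuff\n"
print(end_line.search(dirty) is not None)                         # blanks get in the way
print(end_line.search(strip_trailing_blanks(dirty)) is not None)  # cleaned text
```

Cleaning the text first keeps the search pattern simple; the alternative would be to allow optional blanks before the line ending in the pattern itself.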

                      • Mark Boonie

                        Okay, I found a workaround. As long as the ending string delimiting the last block isn’t the last line in the file, all blocks are located. I can make sure I add a trailing line before I run the Replace, so I shouldn’t have a problem. Thanks for your help. Back to work…

                        • PeterJones @Mark Boonie

                          @Mark-Boonie said in Show (or keep) subsets of a file:

                          As long as the ending string delimiting the last block isn’t the last line in the file

                          Use (\R|\Z) at the end to allow the match to end with a newline or the end of the file.

                          Most regulars here assume that all lines end with a newline.
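
A small Python sketch of the same idea, under the assumption that `\r?\n` can stand in for `\R` (which Python's `re` module lacks). Note that Python's `\Z` matches only at the very end of the string. The pattern names are hypothetical.

```python
import re

# Requires a line ending after the end string.
NEWLINE_ONLY = re.compile(r"^\*Block end\r?\n", re.MULTILINE)
# Accepts either a line ending or the end of the text.
NEWLINE_OR_EOF = re.compile(r"^\*Block end(?:\r?\n|\Z)", re.MULTILINE)

text = "*Block start\n00010000\n*Block end"  # no newline after the last line
print(NEWLINE_ONLY.search(text) is not None)
print(NEWLINE_OR_EOF.search(text) is not None)
```

The alternation has to be applied wherever the pattern consumes a line ending that might be the last one in the file, not only at the very end of the expression.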

                          • Mark Boonie @PeterJones

                            @PeterJones said in Show (or keep) subsets of a file:

                            (\R|\Z)

                            It didn’t quite work, @PeterJones, although it’s certainly possible that I messed up the syntax. I used this search string:

                            (?s)^\Q*Block start\E((?!\Q*Block start\E).)+?\Q80     00010000\E.+?^\Q*Block end\E\R(*SKIP)(*F)|(?-s)^.*(\R|\Z)
                            

                            And this file:

                            *Block start
                            00000000013FC200     00200280     00010000     00000000     00000001  
                            00000000013FC210     00000002     CC5CDDA0     00000000     00000000  
                                                                                                  
                            *Block end
                            
                            Extra stuff
                            
                            *Block start
                            00000000013FC200     00200280     00010000     00000000     00000002  
                            00000000013FC210     00000002     CC5CDDA0     00000000     00000000  
                            00000000013FC220     00000000     00000000     01266100     01266100  
                            00000000013FC230     00808000     013FC2B8     00000000     00000000  
                            
                            *Block end
                            
                            *Block start
                            00000000013FC200     00200280     00020000     00000000     00000003  
                            00000000013FC210     00000002     CC5CDDA0     00000000     00000000  
                            00000000013FC220     00000000     00000000     01266100     01266100  
                            00000000013FC230     00808000     013FC2B8     00000000     00000000  
                            
                            *Block end
                            
                            *Block start
                            00000000013FC200     00200280     00030000     00000000     00000004  
                            00000000013FC210     00000002     CC5CDDA0     00000000     00000000  
                            00000000013FC220     00000000     00000000     01266100     01266100  
                            00000000013FC230     00808000     013FC2B8     00000000     00000000  
                            
                            *Block end
                            
                            Extra stuff
                            
                            *Block start
                            00000000013FC200     00200280     00010000     00000000     00000005  
                            *Block end
                            
                            *Block start
                            00000000013FC200     00200280     00010000     00000000     00000006
                            *Block end
                            

                            Note that the last delimited block, with the ‘6’ as the last character before the ending string, is not found but should be.

                            • PeterJones @Mark Boonie

                              @Mark-Boonie ,

                              Sorry, I hadn’t noticed there was more than one \R in the original regex. You would have to use the same (\R|\Z) alternation in place of the \R just before the (*SKIP) as well, so that part becomes ^\Q*Block end\E(\R|\Z)(*SKIP)(*F).

                              • Mark Boonie @PeterJones

                                @PeterJones - Perfect! Thanks, everyone, for your help.

                                The Community of users of the Notepad++ text editor.