    Remove entries from second file

    • Michael Rebusify

      Hi,

      1. Yes, there are 2000 URLs in 2.txt; each is on its own line and they are all unique.
      2. Yes, all on each line.
      3. If I combine by putting all of 2.txt at the bottom of 1.txt then we’d have to remove all that were duplicates. That would remove the initial URL and the second one.

      I have to remove 10,000 URL’s but need to keep 2000 of them (2.txt).

      • Alan Kilborn

        So if one considers the data:

        one
        two
        three
        four
        five
        six
        seven
        eight
        nine
        ten
        -------
        five
        seven
        

        Note that the line of dashes is just there as a visual divider between 2 sections; here the sections could be considered the first, larger file at the top, and the second, smaller file at the bottom. Each section contains unique lines, but obviously there is going to be commonality between the sections.

        So a regular expression replacement operation using ^(.+?\R)(?=(?s).*?\1) as the search expression and an empty replace expression seems to remove the lines of the top section that also appear in the bottom section, leaving the bottom section intact:

        one
        two
        three
        four
        six
        eight
        nine
        ten
        -------
        five
        seven
        

        So, in theory, if one combines into one file the 10000 line section from the first file (placed at the top in the new file) and the 2000 line section (placed at the bottom in the new file) and runs the above replacement on the new file, it should do the job?

        Obviously, when the operation is complete, copy the top 8000 lines of the file to whatever file you need it in.
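
For anyone who would rather script this than run it inside Notepad++, here is a minimal Python sketch of the same idea. It is a translation, not the Notepad++ expression itself: Python's re module has no \R, so \n stands in for the line break (an assumption about the file's line endings), and the inline (?s) becomes a scoped (?s:...) group. The name remove_lines_found_later and the sample strings are made up for illustration.

```python
import re

# Python translation of ^(.+?\R)(?=(?s).*?\1); assumes "\n" line endings.
PATTERN = re.compile(r"^(.+?\n)(?=(?s:.*?)\1)", re.MULTILINE)


def remove_lines_found_later(text: str) -> str:
    """Delete each line whose text (line break included) occurs again further down."""
    return PATTERN.sub("", text)


# Larger "file" on top, a visual divider, then the smaller "file" at the bottom.
top = "one\ntwo\nthree\nfour\nfive\nsix\nseven\neight\nnine\nten\n"
bottom = "five\nseven\n"
combined = top + "-------\n" + bottom

result = remove_lines_found_later(combined)
print(result)
```

With this sample the lines five and seven disappear from the top part only, which matches the result shown above.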

        • Michael Rebusify

          Thank you!

          • guy038

            Hello, @michael-rebusify, @terry-r, @alan-kilborn and All,

            I was waiting for Terry’s reply first but, in the meantime, I had already worked out a suitable, slightly longer regex that could handle special cases: for instance, lines duplicated within the first part, before the separator line, with no similar line in the second part, which obviously should not be removed!

            As you have replied to @michael-rebusify, and now that we assume that no duplicate lines exist within each section, here is my shortened solution:

            SEARCH (?-s)^(.+)\R(?s)(?=.*^\1(\R|\z))|^%%%.+

            REPLACE Leave EMPTY

            Let’s examine, Alan, the differences with your search regex ^(.+?\R)(?=(?s).*?\1) :

            • Firstly, note that I added the alternative ^%%%.+, which matches the whole second section, separator line included, so that it is deleted too.

            • Secondly, I replaced \1 with ^\1(\R|\z), which requires a line in the first section to have an exact, whole-line equivalent in the second section, even if the last line of the second section does not end with a line break. See the example below to easily pin down the differences ;-))

            • Thirdly, the \R had to be placed outside group 1 so that the \z alternative could be used.

            • Fourthly, so that my second alternative is also covered by the (?s) modifier, I had to place (?s) before the positive look-ahead rather than inside it.


            So, let’s consider a new N++ tab containing :

            • All the File_1 contents

            • A line of at least 3 percent characters

            • All the File_2 contents, with the last line possibly without any line break

            Here is some example data:

            one
            two
            three
            four
            five
            six
            seven
            eight
            nine
            ten
            twenty-two
            %%%%%%%%%%
            twenty-two
            five
            nineteen
            seven
            

            After the regex S/R (?-s)^(.+)\R(?s)(?=.*^\1(\R|\z))|^%%%.+, you should get the expected contents of the new File_1 :

            one
            two
            three
            four
            six
            eight
            nine
            ten
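
To check this outside Notepad++, the search expression can be approximated in Python. This is only a sketch under stated assumptions: \n replaces \R (so the file is assumed to use plain \n line endings), Python's \Z plays the role of \z, and the mid-pattern (?s) is rewritten as scoped (?s:...) groups because Python's re module does not allow a global flag in the middle of a pattern. The name new_file_1 is invented for the example.

```python
import re

# Approximate Python equivalent of (?-s)^(.+)\R(?s)(?=.*^\1(\R|\z))|^%%%.+
# Assumptions: "\n" line endings; \Z is Python's end-of-string anchor.
PATTERN = re.compile(
    r"^(.+)\n(?=(?s:.*)^\1(?:\n|\Z))"  # a line with an exact whole-line twin further down
    r"|^%%%(?s:.+)",                   # or the separator line and everything after it
    re.MULTILINE,
)


def new_file_1(combined: str) -> str:
    """Return the File_1 part with every line that also occurs in File_2 removed."""
    return PATTERN.sub("", combined)


sample = (
    "one\ntwo\nthree\nfour\nfive\nsix\nseven\neight\nnine\nten\ntwenty-two\n"
    "%%%%%%%%%%\n"
    "twenty-two\nfive\nnineteen\nseven"  # last line has no line break, on purpose
)
cleaned = new_file_1(sample)
print(cleaned)
```

The final seven deliberately lacks a line break, so the \Z branch is what lets the first seven be removed; two survives because twenty-two does not start with two at the beginning of a line.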
            

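            With your version, Alan, ^(.+?\R)(?=(?s).*?\1), assuming the second seven word is the very end of file (no final line break), you would have obtained the text below after replacement: two is wrongly removed because it also matches the end of twenty-two, and the first seven is wrongly kept because the last seven is not followed by a line break.

            one
            three
            four
            six
            seven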
            eight
            nine
            ten
            %%%%%%%%%%
            twenty-two
            five
            nineteen
            seven
            

            Best regards,

            guy038

            • Alan Kilborn

              Sometimes I think we run the risk of “oversolving” and confusing an OP. Oftentimes it is best not to read extra things into a specification, especially when the original is described very well.

              • guy038

                Hi, @alan-kilborn,

                I totally agree with your statement. Nevertheless, your regex would be more exact with just one more ^ symbol added, giving:

                ^(.+?\R)(?=(?s).*?^\1)

                Test it against the very simple text, below :

                two
                ---
                twenty-two
                

                The version \1 would wrongly select the word two, whereas the version ^\1 correctly does not find any occurrence ;-))
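
A quick way to see the difference is to run both variants on that three-line text. The snippet below is a Python approximation rather than the Notepad++ regexes themselves: \n stands in for \R and (?s) is written as a scoped group, and the names loose and anchored are just labels chosen here.

```python
import re

# Without the extra ^, the back-reference may match in the middle of a later line.
loose = re.compile(r"^(.+?\n)(?=(?s:.*?)\1)", re.MULTILINE)
# With ^ before \1, the back-reference must begin at the start of a later line.
anchored = re.compile(r"^(.+?\n)(?=(?s:.*?)^\1)", re.MULTILINE)

text = "two\n---\ntwenty-two\n"

loose_result = loose.sub("", text)
anchored_result = anchored.sub("", text)

print(repr(loose_result))     # '---\ntwenty-two\n' : "two" was removed by mistake
print(repr(anchored_result))  # 'two\n---\ntwenty-two\n' : nothing removed
```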

                Cheers,

                guy038

                • Terry R @guy038

                  @guy038 said in Remove entries from second file:

                  I was waiting for Terry’s reply, first

                  @guy038 you don’t need to await my reply first. Sure I had intended to give a regex answer but other matters got in the way. I knew that those questions needed to be asked in order to get a full and correct understanding of the problem.

                  My regex would have been similar, although, since a lookahead can be problematic with large amounts of data, I try to avoid that situation like the plague. I would simply have combined both files and sorted them, putting the 2 “same” lines next to each other, and then used a regex to remove both. If the original order needed to be kept, line numbering would have been an option, although it adds more steps overall.

                  Good thing is the result the OP was seeking has been achieved and that’s all that matters.

                  Cheers
                  Terry

                  • Alan Kilborn @guy038

                    @guy038

                    \1 would wrongly select

                    Yes. Perhaps I should have invented some dummy url data instead of a simple number-to-word list; if I had done so there would not have been any entry that would have been a subset of another entry, like “two” is a subset of “twenty-two”. Again, I was considering the OP’s well-stated problem case.

                    • Ninon_1977

                      @Michael-Rebusify said in Remove entries from second file:

                      0 URL’s which contain 2000 that I need to remove. The 2000 that needs to be removed are in 2.txt.
                      Is it possible to run some kind of search/replace to perform this actions? Thanks in advance!

                      Hi Michael,
                      First I would copy your file, then I would record a macro where you delete every second row.
                      You can then run the macro till the end of the file…
                      Does that help you?

                      • guy038

                        Hello, @terry-r and All,

                        I said :

                        I was waiting for Terry’s reply, first,…

                        Because I think it is fairer to let the first person helping the OP develop their solution ;-)) Then, you can jump into the discussion, proposing alternative solutions, too!

                        Besides, I know that I’m really too eager to give my regex solutions and, very often, I probably prevent other people from offering their own solutions ;-))


                        Now, regarding your solution, you’re right about it : best to avoid large amounts of data inside the look-ahead structure ;-))

                        So, from the short example data, given in my previous post :

                        one
                        two
                        three
                        four
                        five
                        six
                        seven
                        eight
                        nine
                        ten
                        twenty-two
                        %%%%%%%%%%
                        twenty-two
                        five
                        nineteen
                        seven
                        

                        First, using the Edit > Column Editor... option to insert a number with leading zeros at column 12 or beyond, we would get:

                        one            01
                        two            02
                        three          03
                        four           04
                        five           05
                        six            06
                        seven          07
                        eight          08
                        nine           09
                        ten            10
                        twenty-two     11
                        %%%%%%%%%%     12
                        twenty-two     13
                        five           14
                        nineteen       15
                        seven          16
                        

                        And, after the Edit > Line Operations > Sort Lines Lexicographically Ascending option, we have:

                        %%%%%%%%%%     12
                        eight          08
                        five           05
                        five           14
                        four           04
                        nine           09
                        nineteen       15
                        one            01
                        seven          07
                        seven          16
                        six            06
                        ten            10
                        three          03
                        twenty-two     11
                        twenty-two     13
                        two            02
                        

                        Now, using the following regex S/R :

                        SEARCH (?-s)^(.+)\x20+\d+\R\1\x20+\d+\R? OR (?-s)^(.+)(\x20+\d+\R?)\1(?2)

                        REPLACE Leave EMPTY

                        We are left with :

                        %%%%%%%%%%     12
                        eight          08
                        four           04
                        nine           09
                        nineteen       15
                        one            01
                        six            06
                        ten            10
                        three          03
                        two            02
                        

                        Then, using a column-mode selection to move the numbers from the end to the beginning of each line and adding a column of spaces after them, we would obtain:

                        12 %%%%%%%%%%     
                        08 eight          
                        04 four           
                        09 nine           
                        15 nineteen       
                        01 one            
                        06 six            
                        10 ten            
                        03 three          
                        02 two            
                        

                        And, after a last ascending sort, we have :

                        01 one            
                        02 two            
                        03 three          
                        04 four           
                        06 six            
                        08 eight          
                        09 nine           
                        10 ten            
                        12 %%%%%%%%%%     
                        15 nineteen       
                        

                        Finally, after running this last regex S/R, which removes the leading numbers and trailing spaces as well as everything from the separator line to the very end of the file, we get our expected text:

                        SEARCH (?s)^\d+\h*%%%+.+|^\d+\h*|\h+$

                        REPLACE Leave EMPTY

                        one
                        two
                        three
                        four
                        six
                        eight
                        nine
                        ten
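
The same number, sort, remove pairs, re-sort sequence can also be sketched in a few lines of Python, for readers who want to avoid the look-ahead on large files. This is an illustration of the idea rather than a transcription of the steps above: the function name remove_shared_lines is invented, a source tag replaces the %%% separator line, and, as in the thread, each file is assumed to contain no duplicates of its own.

```python
from itertools import groupby


def remove_shared_lines(file_1: list[str], file_2: list[str]) -> list[str]:
    """Return File_1's lines, in original order, minus those that also occur in File_2."""
    # Number every line (like the Column Editor step) and remember its source file.
    tagged = [(text, idx, 1) for idx, text in enumerate(file_1)]
    tagged += [(text, len(file_1) + idx, 2) for idx, text in enumerate(file_2)]

    # Sort so that identical texts become neighbours.
    tagged.sort()

    # Keep only texts that occur once; pairs are dropped, like the regex S/R does.
    singles = []
    for _, group in groupby(tagged, key=lambda item: item[0]):
        members = list(group)
        if len(members) == 1:
            singles.append(members[0])

    # Restore the original order, then discard whatever came from File_2.
    singles.sort(key=lambda item: item[1])
    return [text for text, _, source in singles if source == 1]


first = ["one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "twenty-two"]
second = ["twenty-two", "five", "nineteen", "seven"]
print(remove_shared_lines(first, second))
```

Because the comparison is on whole lines, two is never confused with twenty-two here, and the work is dominated by the sort rather than by repeated scans of the rest of the file.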
                        

                        Cheers,

                        guy038

                        The Community of users of the Notepad++ text editor.