Match tags whose contents are repeated in multiple files

regex
7 Posts 4 Posters 2.7k Views
  • Robin Cruise, posted Apr 6, 2017, 11:19 AM

    Good day. I want to use regular expressions to match tags whose contents are repeated in multiple files. Can anyone help me? For example:

    <title>THIS IS THE ONE</title>

    <title>44 5464 blah blah</title>

    <title>bebe is more then a letter</title>

    <title>THIS IS THE ONE</title>

    <title>destroy the enigma64 Joker</title>

    My desired output, the result of the search in multiple files, should be:

    <title>THIS IS THE ONE</title>

    <title>THIS IS THE ONE</title>

    • Vasile Caraus, posted Apr 8, 2017, 8:06 PM, last edited Apr 8, 2017, 8:08 PM

      Try something like this:

      (?s)<title>([^<]*)</title>.*?<title>[^>]*>(?!\1)[^<]*</title>

      • Robin Cruise, posted Apr 9, 2017, 11:29 AM

        It’s not working.

        • gstavi, posted Apr 9, 2017, 12:24 PM, last edited Apr 9, 2017, 12:24 PM

          Do you have a background in computer science? Do you know anything about algorithms?
          Finding identical elements in a large set is a difficult problem.
          Most reasonable solutions require sorting the set so that identical elements become adjacent.
          Regular expressions by themselves won’t do the trick. They only handle the first step, extracting the tags. To find the duplicates you will need something like that.
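
A minimal Python sketch of the two steps described here: a regular expression extracts the title contents, and sorting then places identical titles next to each other so they can be grouped and counted. The names `TITLE_RE` and `duplicated_titles` are made up for this illustration, and the sample text is the one from the first post.

```python
import re
from itertools import groupby

# Step 1: the regular expression only extracts the contents of each <title> tag.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)


def duplicated_titles(text):
    """Return every title that occurs more than once in text."""
    # Step 2: sorting makes identical titles adjacent.
    titles = sorted(TITLE_RE.findall(text))
    # groupby collects each run of equal neighbours; keep runs longer than one.
    return [title for title, run in groupby(titles) if len(list(run)) > 1]


sample = """
<title>THIS IS THE ONE</title>
<title>44 5464 blah blah</title>
<title>bebe is more then a letter</title>
<title>THIS IS THE ONE</title>
<title>destroy the enigma64 Joker</title>
"""

print(duplicated_titles(sample))  # ['THIS IS THE ONE']
```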

          I don’t know awk, but extrapolating from this, I think that after you extract all titles into titles.txt, the following may work:
          awk 'seen[$0]++ == 2' titles.txt

          • gstavi, posted Apr 9, 2017, 2:03 PM

            Correcting myself: I didn’t look closely enough at the awk solution. I thought it printed a single copy of EVERY line, but it actually already prints the 2nd instance of duplicated lines, so it will work just as is.
            awk 'seen[$0]++ == 1' titles.txt
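
For anyone without awk at hand, the same counting idea can be sketched in Python, which also covers the extraction step across several files. This is only an illustration under assumptions: the function name `repeated_titles` is invented here, the files are assumed to be readable as UTF-8, and each `<title>` is assumed to contain no nested tags.

```python
import re
from pathlib import Path

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)


def repeated_titles(paths, encoding="utf-8"):
    """Yield (path, title) at the second occurrence of each title across all files."""
    seen = {}  # title -> number of times it has been seen so far
    for path in paths:
        text = Path(path).read_text(encoding=encoding, errors="replace")
        for title in TITLE_RE.findall(text):
            count = seen.get(title, 0)
            seen[title] = count + 1
            # Same test as awk's seen[$0]++ == 1: report a title only when
            # it shows up for the second time.
            if count == 1:
                yield path, title
```

Calling `list(repeated_titles([...]))` with a list of file paths returns one `(path, title)` pair per repeated title, where the path is the file in which the second occurrence was found.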

            • Vasile Caraus, posted Apr 9, 2017, 2:40 PM, last edited Apr 9, 2017, 2:41 PM

              The question is: how do I use awk on Windows?

              • guy038, posted Apr 9, 2017, 7:13 PM, last edited Apr 12, 2017, 6:14 PM

                Hello, @vasile-caraus,

                You can download some GNU tools for Win32 from the link below:

                https://code.google.com/p/gnu-on-windows/downloads/list

                (The downloaded GAWK version is v4.1.0.)

                The GAWK documentation can be downloaded from this link:

                http://www.gnu.org/software/gawk/manual/

                @vasile-caraus, GAWK is a very powerful Unix tool, but you’ll need some time to learn even its basic functions. For instance, the PDF reference manual is a 540-page file! But I’m sure it won’t take you long to find a short introduction to GAWK with a Google search!

                Cheers,

                guy038

                The Community of users of the Notepad++ text editor.