    Match tags whose contents are repeated in multiple files

    regex
    • Robin Cruise

      Good day. I want to use regular expressions to match tags whose contents are repeated in multiple files. Can anyone help me? For example:

      <title>THIS IS THE ONE</title>

      <title>44 5464 blah blah</title>

      <title>bebe is more then a letter</title>

      <title>THIS IS THE ONE</title>

      <title>destroy the enigma64 Joker</title>

      My desired output, the result of the search across multiple files, should be:

      <title>THIS IS THE ONE</title>

      <title>THIS IS THE ONE</title>

      • Vasile Caraus

        Try something like this:

        (?s)<title>([^<]*)</title>.*?<title>[^>]*>(?!\1)[^<]*</title>

        • Robin Cruise

          It’s not working.

          • gstavi

            Do you have a background in computer science? Do you know anything about algorithms?
            Finding identical elements in a large set is a difficult problem.
            Most reasonable solutions require sorting the set so that identical elements become adjacent.
            Regular expressions by themselves won’t do the trick. They are only the first step: extracting the tags. To find the duplicates, you will need something like that.

            I don’t know awk, but extrapolating from this, I think that after you extract all titles into titles.txt, the following may work:
            awk 'seen[$0]++ == 2' titles.txt
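
For readers without awk at hand, the two steps described in this post (extract the titles with a regular expression, then sort so that identical titles become adjacent) could be sketched in Python roughly as follows. This is only an illustration under assumptions, not a tested recipe for any particular set of files: the names `extract_titles` and `sorted_duplicates` are made up for this sketch, and the pattern assumes a title never contains a `<` character.

```python
import re
from itertools import groupby
from pathlib import Path

# Assumption: a title never contains '<', so [^<]* captures its whole content.
TITLE_RE = re.compile(r"<title>([^<]*)</title>")


def extract_titles(paths):
    """Return the content of every <title> tag found in the given files."""
    titles = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        titles.extend(TITLE_RE.findall(text))
    return titles


def sorted_duplicates(titles):
    """Sort the titles so equal ones are adjacent, then keep the repeated ones."""
    result = []
    for title, group in groupby(sorted(titles)):
        count = sum(1 for _ in group)
        if count > 1:
            result.append((title, count))
    return result


# Example call (hypothetical file names):
# sorted_duplicates(extract_titles(["a.html", "b.html"]))
```

Each returned pair holds a repeated title and the number of times it was found across all the files, so the original tags can be located again with an ordinary search for that text.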

            • gstavi

              Correcting myself: I didn’t look closely enough at the awk solution. I thought it prints a single copy of EVERY line, but it actually already prints the 2nd instance of each duplicated line, so it will work just as is.
              awk 'seen[$0]++ == 1' titles.txt
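
To make the one-liner's behaviour concrete: `seen[$0]++` compares the count from before the increment, so the condition is true exactly when a line is met for the second time. A rough Python equivalent, offered as a sketch only, is shown below. The name `second_occurrences` is made up for this illustration, and the input is assumed to hold one extracted title per line, as in titles.txt. Unlike the sorting approach mentioned earlier in the thread, it keeps a running count per line in a dictionary, so no sorting is needed.

```python
from collections import defaultdict


def second_occurrences(lines):
    """Yield each line the second time it is seen, like awk 'seen[$0]++ == 1'."""
    seen = defaultdict(int)
    for line in lines:
        # awk's post-increment compares the old count, then adds one.
        if seen[line] == 1:
            yield line
        seen[line] += 1


# Sample data taken from the first post of this thread.
titles = [
    "<title>THIS IS THE ONE</title>",
    "<title>44 5464 blah blah</title>",
    "<title>bebe is more then a letter</title>",
    "<title>THIS IS THE ONE</title>",
    "<title>destroy the enigma64 Joker</title>",
]
print(list(second_occurrences(titles)))
# → ['<title>THIS IS THE ONE</title>']
```

Note that each repeated title is reported once, however many times it occurs, rather than once per copy as in the desired output of the first post.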

              • Vasile Caraus

                The question is: how do I use awk on Windows?

                • guy038

                  Hello, @vasile-caraus,

                  You can download some GNU tools for Win32 from the link below:

                  https://code.google.com/p/gnu-on-windows/downloads/list

                  (The downloaded GAWK version is v4.1.0.)

                  The GAWK documentation can be downloaded from this link:

                  http://www.gnu.org/software/gawk/manual/

                  @vasile-caraus, GAWK is a very powerful Unix tool, but you’ll need some time to learn even its basic functions. For instance, the PDF reference manual is a 540-page file! But I’m sure it won’t take you long to “Google search” a short introduction to the GAWK tool!

                  Cheers,

                  guy038

                  The Community of users of the Notepad++ text editor.