    Match tags whose contents are repeated in multiple files

    Help wanted
    notepad++ regex
    • Robin Cruise

      Good day. I want to use regular expressions to match tags whose contents are repeated in multiple files. Can anyone help me? For example:

      <title>THIS IS THE ONE</title>

      <title>44 5464 blah blah</title>

      <title>bebe is more then a letter</title>

      <title>THIS IS THE ONE</title>

      <title>destroy the enigma64 Joker</title>

      My desired output, that is, the result of the search across multiple files, should be:

      <title>THIS IS THE ONE</title>

      <title>THIS IS THE ONE</title>

      • Vasile Caraus

        Try something like this:

        (?s)<title>([^<]*)</title>.*?<title>[^>]*>(?!\1)[^<]*</title>

        • Robin Cruise

          It’s not working.

          • gstavi

            Do you have a background in computer science? Do you know anything about algorithms?
            Finding identical elements in a large set is a difficult problem.
            Most reasonable solutions require sorting the set so that identical elements become adjacent.
            Regular expressions by themselves won’t do the trick. They are only the first step: extracting the tags. To find the duplicates, you will need something like that.

            I don’t know awk, but extrapolating from this, I think that after you extract all titles into titles.txt, the following may work:
            awk 'seen[$0]++ == 2' titles.txt
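
            For readers without awk at hand, the extract-then-compare idea described in this post can be sketched in Python. This is only an illustration under stated assumptions, not the method anyone in the thread used: the function name `find_repeated_titles` and the `{name: text}` input mapping are hypothetical, and in practice the text would be read from the files being searched. It uses a dictionary rather than sorting to group identical titles.

```python
import re
from collections import defaultdict

# Non-greedy match of the text between <title> and </title>.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def find_repeated_titles(documents):
    """Map each repeated title to the names of the documents containing it.

    `documents` is a hypothetical {name: text} mapping used for illustration.
    """
    locations = defaultdict(list)
    for name, text in documents.items():
        for match in TITLE_RE.finditer(text):
            # Step 1: the regex only extracts the tag contents.
            locations[match.group(1).strip()].append(name)
    # Step 2: keep only titles that occur more than once overall.
    return {title: names for title, names in locations.items() if len(names) > 1}
```

            Called with the five sample titles from the question, one per document, this should return only the title `THIS IS THE ONE` together with the two document names where it appears.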

            • gstavi

              Correcting myself: I didn’t look closely enough at the awk solution. I thought it printed a single copy of EVERY line, but it actually already prints the 2nd instance of each duplicated line, so it will work as is:
              awk 'seen[$0]++ == 1' titles.txt
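
              To make the counting logic of `seen[$0]++ == 1` concrete, here is a rough Python equivalent. It is a sketch for illustration only; the function name `second_occurrences` is hypothetical. The post-increment means the comparison uses the count before the current line is added, so the condition holds exactly once per duplicated line, when its second copy is read. By the same reasoning, `== 2` would select the third copy instead.

```python
from collections import Counter


def second_occurrences(lines):
    """Yield each duplicated line once, at the moment its second copy is seen."""
    seen = Counter()
    for line in lines:
        # Like awk's `seen[$0]++ == 1`: test the count *before* incrementing.
        if seen[line] == 1:
            yield line
        seen[line] += 1
```

              Lines that appear only once are never yielded, and a line that appears three or more times is still yielded only once.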

              • Vasile Caraus

                The question is: how do I use awk on Windows?

                • guy038

                  Hello, @vasile-caraus,

                  You can download some GNU tools for Win32 from the link below:

                  https://code.google.com/p/gnu-on-windows/downloads/list

                  (The downloaded GAWK version is v4.1.0.)

                  The GAWK documentation can be downloaded from this link:

                  http://www.gnu.org/software/gawk/manual/

                  @vasile-caraus, GAWK is a very powerful Unix tool, but you’ll need some time to learn even its basic functions. For instance, the PDF reference manual is a 540-page file! But I’m sure it won’t take you long to find a short introduction to GAWK with a Google search!

                  Cheers,

                  guy038
