Match tags whose contents are repeated in multiple files



  • Good day. I want to use regular expressions to match tags whose contents are repeated across multiple files. Can anyone help me? For example:

    <title>THIS IS THE ONE</title>

    <title>44 5464 blah blah</title>

    <title>bebe is more then a letter</title>

    <title>THIS IS THE ONE</title>

    <title>destroy the enigma64 Joker</title>

    My desired output, the result of the search across multiple files, should be:

    <title>THIS IS THE ONE</title>

    <title>THIS IS THE ONE</title>



  • Try something like this:

    (?s)<title>([^<]*)</title>.*?<title>[^>]*>(?!\1)[^<]*</title>



  • It’s not working.



  • Do you have a background in computer science? Do you know anything about algorithms?
    Finding identical elements in a large set is a difficult problem.
    Most reasonable solutions require sorting the set so that identical elements become adjacent.
    Regular expressions by themselves won’t do the trick. They are only the first step, extracting the tags. To find the duplicates you will need something like that.

    I don’t know awk, but extrapolating from this, I think that after you extract all the titles into titles.txt, the following may work:
    awk 'seen[$0]++ == 2' titles.txt
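
The two steps described in this post (extract the titles, then look for duplicates) can also be sketched in Python, which may be easier to run on Windows than awk. This is only a rough sketch under stated assumptions: the helper names `extract_titles` and `repeated_titles` are hypothetical, the files are assumed to be UTF-8, and the pattern assumes plain `<title>...</title>` tags with no attributes. It counts titles with a hash table rather than sorting them.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: tags look exactly like <title>...</title>, with no attributes.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)


def extract_titles(paths):
    """Return (path, title) pairs for every <title> tag in the given files."""
    found = []
    for path in paths:
        # Assumption: the files are UTF-8 encoded.
        text = Path(path).read_text(encoding="utf-8")
        for title in TITLE_RE.findall(text):
            found.append((str(path), title.strip()))
    return found


def repeated_titles(paths):
    """Keep only the (path, title) pairs whose title occurs more than once."""
    found = extract_titles(paths)
    counts = Counter(title for _, title in found)
    return [(path, title) for path, title in found if counts[title] > 1]
```

Calling `repeated_titles` with a list of file paths returns every occurrence of a repeated title together with the file it came from, which matches the output asked for in the first post (both copies of the repeated tag).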



  • Correcting myself: I didn’t look closely enough at the awk solution. I thought it printed a single copy of EVERY line, but it actually already prints the 2nd instance of each duplicated line, so it will work just as is:
    awk 'seen[$0]++ == 1' titles.txt
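
To see why the comparison value matters, here is a small Python sketch of what the awk expression does, as far as I understand it: `seen[$0]++` yields the count of earlier occurrences of the line and then increments it, so `== 1` fires on the second occurrence and `== 2` only on the third. The function name `lines_with_prior_count` is hypothetical and used only for illustration.

```python
def lines_with_prior_count(lines, prior_count=1):
    """Mimic awk 'seen[$0]++ == N': keep a line when it was seen exactly N times before."""
    seen = {}
    kept = []
    for line in lines:
        before = seen.get(line, 0)  # value before the increment, like seen[$0]++
        seen[line] = before + 1
        if before == prior_count:
            kept.append(line)
    return kept


titles = [
    "THIS IS THE ONE",
    "44 5464 blah blah",
    "bebe is more then a letter",
    "THIS IS THE ONE",
    "destroy the enigma64 Joker",
]
print(lines_with_prior_count(titles, 1))  # ['THIS IS THE ONE']
print(lines_with_prior_count(titles, 2))  # []
```

With the sample titles from the first post, comparing against 1 reports the repeated title once, while comparing against 2 reports nothing because no title appears three times.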



  • The question is: how do I use awk on Windows?



  • Hello, @vasile-caraus,

    You can download some GNU tools for Win32 from the link below:

    https://code.google.com/p/gnu-on-windows/downloads/list

    (The downloaded GAWK version is v4.1.0.)

    The GAWK documentation can be downloaded from this link:

    http://www.gnu.org/software/gawk/manual/

    @vasile-caraus, GAWK is a very powerful Unix tool, but you’ll need some time to learn even its basic functions. For instance, the PDF reference manual is a 540-page file! But I’m sure it won’t take you long to find a short introduction to GAWK with a Google search!

    Cheers,

    guy038

