
    Match tags whose contents are repeated in multiple files

    Help wanted · regex
    • Robin Cruise

      Good day. I want to use a regular expression to match tags whose contents are repeated across multiple files. Can anyone help me? For example:

      <title>THIS IS THE ONE</title>

      <title>44 5464 blah blah</title>

      <title>bebe is more then a letter</title>

      <title>THIS IS THE ONE</title>

      <title>destroy the enigma64 Joker</title>

      My desired output, the result of the search across multiple files, should be:

      <title>THIS IS THE ONE</title>

      <title>THIS IS THE ONE</title>

      • Vasile Caraus

        Try something like this:

        (?s)<title>([^<]*)</title>.*?<title>[^>]*>(?!\1)[^<]*</title>
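For comparison, here is a minimal sketch of how a back-reference can flag a repeated title inside a single block of text. It is written in Python rather than Notepad++'s search dialog, and the pattern and the function name `titles_repeated_later` are assumptions made for illustration, not the pattern posted above. Note that the posted pattern uses `(?!\1)`, a negative lookahead, which asks for text that does NOT repeat the first title; the sketch uses a positive lookahead instead. It only sees one buffer at a time, so it does not by itself compare titles across separate files.

```python
import re

# Hypothetical pattern for illustration (not the one posted in the thread):
# capture a title's content, then look ahead for a later <title> holding
# exactly the same content. (?s) lets "." span line breaks.
REPEAT_RE = re.compile(r"(?s)<title>([^<]*)</title>(?=.*?<title>\1</title>)")


def titles_repeated_later(text):
    """Return the content of each title that occurs again later in `text`.

    A title that occurs three times is reported twice, because both the
    first and the second occurrence are followed by another copy.
    """
    return REPEAT_RE.findall(text)


sample = """<title>THIS IS THE ONE</title>
<title>44 5464 blah blah</title>
<title>bebe is more then a letter</title>
<title>THIS IS THE ONE</title>
<title>destroy the enigma64 Joker</title>
"""

repeated = titles_repeated_later(sample)
print(repeated)
```

With the five example titles from the question, only the first `THIS IS THE ONE` satisfies the lookahead, so the list holds that one string.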

        • Robin Cruise

          It’s not working.

          • gstavi

            Do you have a background in computer science? Do you know anything about algorithms?
            Finding identical elements in a large set is a difficult problem.
            Most reasonable solutions require sorting the set so that identical elements become adjacent.
            Regular expressions by themselves won’t do the trick. They are only the first step: extracting the tags. To find the duplicates you will need something like that.

            I don’t know awk, but extrapolating from this, I think that after you extract all titles into titles.txt, the following may work:
            awk 'seen[$0]++ == 2' titles.txt

            • gstavi

              Correcting myself: I didn’t look closely enough at the awk solution. I thought it prints a single copy of EVERY line, but it actually already prints the 2nd instance of each duplicated line, so it will work just as is.
              awk 'seen[$0]++ == 1' titles.txt
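For readers without awk at hand, the same two-step idea (extract the titles, then count them) can be sketched in Python. This is only an illustration under assumptions: the function names `extract_titles`, `repeated_titles` and `report` are made up here, the files are assumed to be UTF-8 text, and each title is assumed to sit inside a plain `<title>...</title>` pair. The awk expression `seen[$0]++ == 1` prints a line the second time it is seen; the sketch keeps a count per title instead, so every copy can be reported.

```python
import re
from collections import Counter
from pathlib import Path

# Non-greedy match so each <title>...</title> pair is captured separately.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)


def extract_titles(text):
    """Return the stripped content of every <title>...</title> pair in `text`."""
    return [content.strip() for content in TITLE_RE.findall(text)]


def repeated_titles(texts):
    """Map each title seen more than once, across all `texts`, to its count."""
    counts = Counter(title for text in texts for title in extract_titles(text))
    return {title: n for title, n in counts.items() if n > 1}


def report(paths):
    """Read the given files and print every copy of each repeated title."""
    texts = [Path(p).read_text(encoding="utf-8") for p in paths]
    for title, n in repeated_titles(texts).items():
        for _ in range(n):
            print(f"<title>{title}</title>")
```

Calling `report(["a.html", "b.html"])` (hypothetical file names) would print one line per occurrence of each repeated title, which is the output shape asked for in the first post.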

              • Vasile Caraus

                The question is: how do I use awk on Windows?

                • guy038

                  Hello, @vasile-caraus,

                  You can download some GNU tools for Win32 from the link below:

                  https://code.google.com/p/gnu-on-windows/downloads/list

                  (The downloaded GAWK version is v4.1.0.)

                  The GAWK documentation may be downloaded from this link:

                  http://www.gnu.org/software/gawk/manual/

                  @vasile-caraus, GAWK is a very powerful Unix tool, but you’ll need some time even to learn its basic functions. For instance, the PDF reference manual is a 540-page file! But I’m sure it won’t take you long to find a short introduction to GAWK with a Google search!

                  Cheers,

                  guy038

                  The Community of users of the Notepad++ text editor.