    Remove unwanted Carriage Return

    • Alexander Hel

      Hi, first thank you for your help.

      I’m reading a Chinese novel with a translator tool, but the text file has unwanted carriage returns inside long paragraphs, which split each paragraph into several. I hope someone can help me solve this problem.


      • Neil Schipper @Alexander Hel

        @alexander-hel I’m going to assume there are no blank lines between text lines that belong to the same paragraph, and that there is a blank line between paragraphs that you want preserved. If so, we can replace each single newline that sits between non-newline text with a space.

        Use Ctrl+H to open the Replace dialog.

        Search mode: Regular Expression
        Find: ([\w|[[:punct:]]|[[:graph:]])\R([\w|[[:punct:]]|[[:graph:]])
        Repl: $1 $2 (if lines have no trailing space)
        Repl: $1$2 (if translator left in a trailing space)

        Now use Replace All to process the whole file, or Replace if you want to see how it works match by match. Use Ctrl+Z to undo.

        I expect there are more elegant solutions.

        If this does not meet your needs, you should supply sample text, preferably in a Literal Text Block as explained here.

        • Neil Schipper @Neil Schipper

          A perhaps more elegant find expression is (?<=[^\r\n])\R(?=[^\r\n]) in which case replacement text is nothing or a single space.
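
For readers working outside Notepad++, the same idea can be sketched in Python. This is only an illustration under stated assumptions: Python's `re` module has no `\R`, so the sketch spells the line break out as `\r\n|\r|\n`, and the name `join_wrapped_lines` is made up for this example.

```python
import re

# Python's re has no \R, so the line break is written out explicitly.
# Assumption: only \r\n, \r and \n occur as line breaks in the text.
SINGLE_BREAK = re.compile(r"(?<=[^\r\n])(?:\r\n|\r|\n)(?=[^\r\n])")


def join_wrapped_lines(text: str, joiner: str = " ") -> str:
    """Replace each lone line break that sits between two non-break characters.

    Blank lines (two or more consecutive breaks) are left untouched, so
    paragraph boundaries survive.
    """
    return SINGLE_BREAK.sub(joiner, text)
```

Passing `joiner=""` corresponds to the "replacement text is nothing" variant, for text where the translator left a trailing space on each line.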

          To the regular crowd of regex experts: I was thrown by the fact that these are invalid expressions:

          (?<!\R)\R(?!\R)
          (?<[^\R])\R(?=[^\R])

          My best explanation is that \R can be either 1 or 2 bytes, but a look-behind must be fixed width. Any comment?

          • Alexander Hel @Neil Schipper

            @neil-schipper said in Remove unwanted Carriage Return:

            A perhaps more elegant find expression is (?<=[^\r\n])\R(?=[^\r\n]) in which case replacement text is nothing or a single space.

            To the regular crowd of regex experts: I was thrown by the fact that these are invalid expressions:

            (?<!\R)\R(?!\R)
            (?<[^\R])\R(?=[^\R])

            My best explanation is that \R can be either 1 or 2 bytes, but a look-behind must be fixed width. Any comment?

            Hi, thank you for your reply. I tried the expression (?<=[^\r\n])\R(?=[^\r\n]) and it worked flawlessly.

            Thank you so much

            • guy038

              Hello, @alexander-hel, @neil-schipper, and All,

              Yes, Neil, you’re right: because look-behinds must have a fixed length, the form (?<!\R)\R(?!\R) is invalid, as \R may match one or two consecutive characters.

              As for your second attempt, each [^\R] part simply matches any character except:

              • the uppercase letter R, if the Match case option is set

              • the letters R and r, if the Match case option is not set


              One solution is to use a fixed-length pattern in the look-behind part:

              SEARCH (?<!\n|\r)\R(?!\R)

              Note that if the last line ends with a line break, that break is also matched: nothing follows it, and nothing is obviously different from \R, so the (?!\R) part succeeds.
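
The end-of-file behaviour just described can be checked with a small Python sketch. Assumptions: `\R` is written out as `\r\n|\r|\n` because Python's `re` lacks it, the look-behind uses the single-character class `[\r\n]` in place of `\n|\r`, and the pattern names and sample text are invented for this example.

```python
import re

BREAK = r"(?:\r\n|\r|\n)"  # stand-in for \R (assumption: no other break characters)

# Negative form, modelled on (?<!\n|\r)\R(?!\R)
NEGATIVE = re.compile(r"(?<![\r\n])" + BREAK + r"(?![\r\n])")
# Positive form, modelled on (?<=[^\r\n])\R(?=[^\r\n])
POSITIVE = re.compile(r"(?<=[^\r\n])" + BREAK + r"(?=[^\r\n])")

text = "one\ntwo\n\nthree\n"

# The negative look-ahead succeeds at end of text, so the final break is replaced.
negative_result = NEGATIVE.sub(" ", text)
# The positive look-ahead needs a real character after the break, so it is kept.
positive_result = POSITIVE.sub(" ", text)
```

In this sketch the negative form turns the final line break into a trailing space, while the positive form leaves it alone; the blank line between paragraphs is preserved by both.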

              Best Regards,

              guy038

              • guy038

                Hi, @neil-schipper and All,

                Oh… I didn’t see that you had already given the OP the right solution:

                (?<=[^\r\n])\R(?=[^\r\n])

                It is equivalent to my solution, apart from a line break at the very start or end of the file, which only my version matches:

                (?<!\n|\r)\R(?!\R)

                BR

                guy038

                • Neil Schipper @guy038

                  @guy038 Thanks, Guy.

                  I’m noticing now why the 2nd regex I presented as invalid, (?<[^\R])\R(?=[^\R]), is invalid: bad syntax due to missing = after <.

                  After that fix, it’s merely wrong (since the fancy \R construct is not decoded when appearing inside [] as you pointed out).

                  Best,
                  Neil

                  • astrosofista @Neil Schipper

                    @neil-schipper said in Remove unwanted Carriage Return:

                    (?<!\R)\R(?!\R)
                    (?<[^\R])\R(?=[^\R])
                    My best explanation is that \R can be either 1 or 2 bytes, but a look-behind must be fixed width. Any comment?

                    Since a negative look-behind must be fixed width, the way I get a valid expression is to replace \R with \v (any single vertical whitespace character), as follows:

                    (?<!\v)\R(?!\R)

                    • Neil Schipper @astrosofista

                      @astrosofista said:

                      replace \R by a \v

                      Confirmed, and good to know, thanks. It’s strange that both constructs encode the 1 or 2 byte newline sequences, but only \v is valid.

                      a negative look-behind must be fixed width

                      Both positive and negative look-behinds have this limitation (perhaps what you meant to say).

                      • PeterJones @Neil Schipper

                        @neil-schipper said in Remove unwanted Carriage Return:

                        Both positive and negative look-behinds have this limitation (perhaps what you meant to say).

                        Nope.

                        https://www.boost.org/doc/libs/1_78_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.lookahead

                        vs

                        https://www.boost.org/doc/libs/1_78_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.lookbehind

                        Notice that only the lookbehind has the “pattern must be of fixed length” caveat; the lookahead can be variable width.

                        And it’s very easy to test:
                        lookbehind
                        [image not preserved]

                        lookahead
                        [image not preserved]
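
The same contrast can be illustrated with Python's `re` module, which happens to impose a similar rule. This is an analogy only: Notepad++ uses Boost::regex rather than Python, and the helper name `compiles` is hypothetical.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Variable-width look-ahead (2 characters or 1): accepted.
lookahead_ok = compiles(r"x(?=\r\n|\n)")
# Variable-width look-behind (2 characters or 1): rejected by Python's re.
lookbehind_ok = compiles(r"(?<=\r\n|\n)x")
# Look-behind whose alternatives are both 1 character wide: accepted.
fixed_lookbehind_ok = compiles(r"(?<=\r|\n)x")
```

The third pattern mirrors the fixed-length trick used earlier in the thread, where `\n|\r` replaces `\R` inside the look-behind.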

                        • Neil Schipper @PeterJones

                          @peterjones My comment pertains to the two kinds of look-behind, positive (?<= and negative (?<! (and I’ve tested both), and makes no mention of the two kinds of look-aheads.

                          • Alan Kilborn

                            replace \R by a \v

                            This doesn’t make sense in the desired usage above.
                            A \R can be one or two characters, thus not fixed length.
                            A \v is always only one character, when it matches.

                            • Neil Schipper @Alan Kilborn

                              @alan-kilborn said in Remove unwanted Carriage Return:

                              A \v is always only one character, when it matches.

                              Good point. But @astrosofista’s trick does work with conventional UTF-8 \r\n line endings; it’s a more tolerant way of simply specifying \n, and would also handle other non-standard line ending formats.

                              Your comment reminds me that it’s risky to get too used to throwing \v around: if used in a matched text expression which will be replaced, it’s easy to inadvertently destroy a pristine file’s uniform line endings.
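
One way this can go wrong is sketched below in Python. Assumptions: Python's own `\v` means only the vertical-tab character, so the Perl-style class is emulated with an explicit character set named `VERTICAL` (a name invented here), and the sample text is made up.

```python
import re

# Emulates a Perl-style \v: any ONE vertical-whitespace character.
VERTICAL = "[\n\x0b\x0c\r\x85\u2028\u2029]"

# A file with uniform Windows (CRLF) line endings.
crlf_text = "one\r\ntwo\r\nthree\r\n"

# The intent is to join "one" and "two", but the single-character class
# consumes only the \r and leaves the \n of that pair behind.
damaged = re.sub("one" + VERTICAL, "one ", crlf_text)
```

After the substitution the text contains one bare `\n` alongside the remaining `\r\n` pairs, so the line endings are no longer uniform.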

                              • PeterJones @Neil Schipper

                                @neil-schipper ,

                                Sorry, I misread. Time to stop trying to think for the evening, apparently

                                • Alan Kilborn @Neil Schipper

                                  @neil-schipper said in Remove unwanted Carriage Return:

                                  @astrosofista’s \v trick does work with conventional UTF-8 \r\n line endings; it’s a more tolerant way of simply specifying \n, and would also handle other non-standard line ending formats.

                                  I don’t think it has value. If you need to match \r\n then \v\v will match it, but it will also match other things (that maybe aren’t wanted), so…no real point in it.

                                  • astrosofista @Neil Schipper

                                    @neil-schipper said in Remove unwanted Carriage Return:

                                    Both positive and negative look-behinds have this limitation (perhaps what you meant to say).

                                    Yes, that’s what I meant. My apologies for the possible misunderstanding.

                                    Anyway, although both lookbehinds share this limitation, they differ: for the positive lookbehind we have the \K operator at our disposal as an alternative, but there is no such operator for the negative one.

                                    It would be nice if this issue could be fixed sometime.
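
Python's standard `re` module has no `\K`, but the effect of a variable-length positive look-behind can be approximated there with a capture group that is written back in the replacement. The sketch below is only an approximation of what a search such as `(?:dog|puppy)\K\R` would express; the words, the pattern name and the function name are invented for this example.

```python
import re

# The alternatives have different lengths (3 and 5), which a look-behind in
# Python's re would reject; a capture group has no such restriction.
AFTER_WORD = re.compile(r"(dog|puppy)(?:\r\n|\r|\n)")


def join_after_words(text: str) -> str:
    """Replace a line break with a space only when it follows 'dog' or 'puppy'."""
    # \1 puts the captured word back, so only the line break is really replaced.
    return AFTER_WORD.sub(r"\1 ", text)
```

There is no comparably simple rewrite for the negative case, which is the asymmetry described above.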

                                    • Alan Kilborn @astrosofista

                                      @astrosofista said in Remove unwanted Carriage Return:

                                      It would be nice if this issue could be fixed sometime.

                                      It’s just a coincidence that \K can be used as a variable length positive lookbehind.
                                      Because there is no equivalent for a negative lookbehind, doesn’t mean there’s an issue that could be “fixed”.

                                      • PeterJones @Alan Kilborn

                                        @alan-kilborn said in Remove unwanted Carriage Return:

                                        @astrosofista said in Remove unwanted Carriage Return:

                                        It would be nice if this issue could be fixed sometime.

                                        Because there is no equivalent for a negative lookbehind, doesn’t mean there’s an issue that could be “fixed”.

                                        The lookbehind fixed-length restriction is caused by the Boost::regex library, not by anything Notepad++ does. So the issue would have to be fixed there.

                                        To find more of the history of Boost::regex, and how long they’ve known about that restriction, I went to https://www.boost.org/doc/libs/ (I kept cutting pieces off the 1.78 URL until I found a page that listed the other available versions) and looked through old versions until I found the earliest Boost::regex release in which I noticed the lookbehind syntax documented: v1.33.1 from 2004, where they already note that restriction. If they’ve known about that limitation since 2004 and never removed it, there’s probably a good technical reason that it’s too hard to implement, and it’s not likely to change anytime soon.

                                        • Alan Kilborn @PeterJones

                                          @peterjones said in Remove unwanted Carriage Return:

                                          there’s probably a good technical reason that it’s too hard to implement, and it’s not likely to change anytime soon.

                                          I think the limitation probably is rooted in runtime complexity for the engine that would provide a poor user experience (way too long for it to examine every possible match, potential for engine catastrophic overflow, etc.). Just my hunch.

                                          • Neil Schipper @Alan Kilborn

                                            @alan-kilborn @astrosofista @PeterJones

                                            Open ended variable length negative look-behinds using subexpressions like hello.* would have huge performance implications.

                                            Upper-bounded variable-length expressions like dog|puppy or \d{1,500} would be “easy-peasy” (i.e., computers doing exactly what computers are good at doing), and extremely useful.
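
For a short, bounded alternation such as dog|puppy, one workaround is to chain one fixed-width negative look-behind per alternative. The sketch below uses Python, whose `re` module also requires fixed-width look-behinds; the pattern name and sample text are invented, and it makes no claim about how Boost::regex could implement the feature.

```python
import re

# (?<!dog|puppy) is rejected by Python's re because the widths differ (3 vs 5),
# but two fixed-width look-behinds in a row express the same condition.
NOT_AFTER_DOG_OR_PUPPY = re.compile(r"(?<!dog)(?<!puppy)!")

sample = "dog! cat! puppy! bird!"

# Offsets of every "!" that is preceded by neither "dog" nor "puppy".
positions = [m.start() for m in NOT_AFTER_DOG_OR_PUPPY.finditer(sample)]
```

Here only the exclamation marks after "cat" and "bird" are reported. The approach does not scale to something like \d{1,500}, which would need one look-behind per possible length.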

                                            The Community of users of the Notepad++ text editor.