• Login
Community
  • Login

Finding and Removing LF within a comma delimited comments field

Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
4 Posts 3 Posters 1.8k Views
Loading More Posts
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • J
    John Ditzel
    last edited by Aug 6, 2019, 10:22 PM

    Hi, I have a comma delimited file with double quotes as a text qualifier. There are five columns. The last is a “comments” field where some users entered line feeds. That’s causing fragmented lines in my output file. I can’t fix the source, so I need to fix the output. A line of text basically goes:
    “John”,“Smith”,“Chicago”,“Illinois”,“Likes to bicycle and [LF]
    swim.”[LF]
    I tried to Find using ([^"])$\n as a Regular Expression and replace with SPACE, but for some reason that selects the entire file (then replaced with a blank space, ha! fail). I appreciate any help!!

    A 1 Reply Last reply Aug 7, 2019, 12:19 AM Reply Quote 2
    • A
      Alan Kilborn @John Ditzel
      last edited by Aug 7, 2019, 12:19 AM

      @John-Ditzel

      You have a real problem here. If I had this problem I might try replacing a non-double-quote (which I capture into group #1) followed by a \n with the group #1 character. Here’s how to tell Notepad++ to do it:

      Find what box: ([^"])\n
      Replace with box: \1
      Search mode: Regular expression

      I’m not sure your text always follows that specification, but from your description it appears as if it might.

      Actually, looking back at what you were trying, I must say you were pretty close to what I came up with.

      We always like to see those asking for help here show us what they’ve already tried and failed with, so hat’s off to ya for that! :)

      1 Reply Last reply Reply Quote 2
      • J
        John Ditzel
        last edited by Aug 7, 2019, 2:24 PM

        That worked perfectly! Thank you so much Alan!!

        1 Reply Last reply Reply Quote 1
        • G
          guy038
          last edited by guy038 Aug 13, 2019, 6:21 PM Aug 13, 2019, 2:38 PM

          Hello, @john-ditzel and All,

          I tried to re-visit the general problem, when a list of records, in a delimited file, is completely unformatted and may contain :

          • Line-break(s) and/or blank character(s), possibly consecutive, inserted before and/or **after any field separator symbol ( generally a comma, semi-colon or colon char )

          • Blank character(s), possibly consecutive, located at beginning or end of any field contents, right before and/or after the delimiter symbol ( generally a double or single quote char )

          • Line-break(s), possibly consecutive, located inside any field contents, between two standard characters, different from, both, the delimiter and the separator symbols

          • Additional line-break(s) and/or blank character(s), possibly consecutive, between two records

          • Unnecessary space characters char(s)( or tabulation chars, when not a delimiter ) inside field contents


          So, here is a way to get a neat formatted list :

          • Let SEP be the field separator ( The , character, in your case )

          • Let DEL be the starting AND ending character delimiter ( The " character, in your case )

          IMPORTANT : The separator character is NEVER part of any field value !


          • IF the separator is not the tabulation character, use these four generic regex S/R, below, in that order :

            • SEARCH  1 [\h\r\n]*(SEP)[\h\r\n]*|DEL\K[\h\r\n]*(?=DEL|\z)

            • REPLACE 1 ?1SEP:\r\n

            • SEARCH  2 [^DELSEP\h\r\n]\K(\h*\R+)+\h*(?=[^DELSEP\h\r\n])

            • REPLACE 2 \x20

            • SEARCH  3 (^|SEP)DEL\K[\h\r\n]+|[\h\r\n]+(?=DEL(SEP|$))

            • REPLACE 3 Leave EMPTY

            • SEARCH  4 (['-])\h+|\x20{2,}|\t+

            • REPLACE 4 ?1\1:\x20


          • IF the separator is the tabulation character, use these four generic regex S/R, below, in that order :

            • SEARCH  1 [\x20\r\n]*(\t)[\x20\r\n]*|DEL\K[\x20\r\n]*(?=DEL|\z)

            • REPLACE 1 ?1\t:\r\n

            • SEARCH  2 [^DEL\h\r\n]\K(\x20*\R+)+\x20*(?=[^DEL\h\r\n])

            • REPLACE 2 \x20

            • SEARCH  3 (^|\t)DEL\K[\x20\r\n]+|[\x20\r\n]+(?=DEL(\t|$))

            • REPLACE 3 Leave EMPTY

            • SEARCH  4 (['-])\x20+|\x20{2,}

            • REPLACE 4 ?1\1:\x20


          Practical example :

          Assuming the sample list, below, containing 3 records, and, really, very, very twisted !

          "John","Smith","Chicago","Illinois","Likes     to bicycle
          and    
          
                    
          swim.   "   
          	
             
                 "    Mary   
             Elisabeth   "  ,  "   
             
            Ditzel   
          
            "  ,"   Los
          
          
          Angeles"     ,    "California","    Love        skiing
          and
          
          snow-
                 boarding"
          			"Andrew    "	,  
          	"
          Mc		
          Millan  " ,    "San
          Francisco  "  ,  "  California"
          
          	     
              ,		
          
          		
             "  This    is   a		small	
          
          
          Test		"
                 
          			
          
          

          With the first regex S/R, below,

          • We delete any range of blank and/or EOL characters, located outside fields and which surrounds separators

          • We replace any range of blank and/or EOL characters; between two records, with a single line-break

          SEARCH [\h\r\n]*(,)[\h\r\n]*|"\K[\h\r\n]*(?="|\z)

          REPLACE ?1,:\r\n

          So, the text is changed into :

          "John","Smith","Chicago","Illinois","Likes     to bicycle
          and    
          
                    
          swim.   "
          "    Mary   
             Elisabeth   ","   
             
            Ditzel   
          
            ","   Los
          
          
          Angeles","California","    Love        skiing
          and
          
          snow-
                 boarding"
          "Andrew    ","
          Mc		
          Millan  ","San
          Francisco  ","  California","  This    is   a		small	
          
          
          Test		"
          

          Now, after running the second regex S/R, below, which replaces any range of blank and/or EOL chars, containing, at least, one EOL char, between 2 non-delimiter/blank/EOL characters, with a single space character :

          SEARCH [^",\h\r\n]\K(\h*\R+)+\h*(?=[^",\h\r\n])

          REPLACE \x20

          We get the changed list, below :

          "John","Smith","Chicago","Illinois","Likes     to bicycle and swim.   "
          "    Mary Elisabeth   ","   
             
            Ditzel   
          
            ","   Los Angeles","California","    Love        skiing and snow- boarding"
          "Andrew    ","
          Mc Millan  ","San Francisco  ","  California","  This    is   a		small Test		"
          

          Then, using this third regex S/R, which deletes any consecutive range of blank and/or EOL characters, beginning or ending any field :

          SEARCH (^|,)"\K[\h\r\n]+|[\h\r\n]+(?="(,|$))

          REPLACE Leave EMPTY

          We get :

          "John","Smith","Chicago","Illinois","Likes     to bicycle and swim."
          "Mary Elisabeth","Ditzel","Los Angeles","California","Love        skiing and snow- boarding"
          "Andrew","Mc Millan","San Francisco","California","This    is   a		small Test"
          

          From now on, we almost have our neat list, without unnecessary line-breaks ! Finally, with that fourth regex S/R, we normalize any consecutive range of blank characters to a single space character, except when right after a dash or apostrophe. In that case, the space char, wrongly added, by the second S/R above, is, then, deleted !

          SEARCH (['-])\h+|\x20{2,}|\t+

          REPLACE ?1\1:\x20

          And, here is our final and expected list :

          "John","Smith","Chicago","Illinois","Likes to bicycle and swim."
          "Mary Elisabeth","Ditzel","Los Angeles","California","Love skiing and snow-boarding"
          "Andrew","Mc Millan","San Francisco","California","This is a small Test"
          

          Regards

          guy038

          1 Reply Last reply Reply Quote 1
          3 out of 4
          • First post
            3/4
            Last post
          The Community of users of the Notepad++ text editor.
          Powered by NodeBB | Contributors