Community
    • Login

    How to locate text separators and replace them?

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    21 Posts 7 Posters 10.8k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • PeterJonesP
      PeterJones
      last edited by

      And as far as @guy038 's processing: I think the better first step would be to merge any line that doesn’t start with a 12-digit number with the line before it (using your magic never-used character… some unused punctuation/symbol, or perhaps some rare unicode character); once you do that, all lines should have the right number of commas… Though trying to write a regex that can distinguish between the commas of valid separators and the commas embedded in strings (even after the multiline merge), would be quite difficult. Though if anyone can, you could. :-)

      1 Reply Last reply Reply Quote 1
      • dinkumoilD
        dinkumoil @Andrew Staibuz
        last edited by dinkumoil

        @Andrew-Staibuz said in How to locate text separators and replace them?:

        so when a import the file in excel the result will be

        Since you want to process the data with Excel anyway, why do you not import your original CSV file into Excel? When you choose the right method to do that, you will be guided through that process by an Excel wizzard, which can be told to use the comma as field separator. It also allows to choose UTF-8 as the character encoding of your source file.

        [EDIT]
        It is also possible to script this task, have a look at my CSV Viewer script. It lacks the part of saving a copy of the input file with different field delimiters but that’s no difficult job, Google gave me for example this result.
        [/EDIT]

        1 Reply Last reply Reply Quote 3
        • Andrew StaibuzA
          Andrew Staibuz
          last edited by

          Thanks everyone for the suggestion!!!

          1 Reply Last reply Reply Quote 0
          • guy038G
            guy038
            last edited by

            Hi, @andrew-staibuz, @peterjones, @dinkumoil and All,

            Aaaaaaah, Peter, how silly I am :-( I’ve just forgot that a standard line could be split on several lines !. So, after deleting some EOL, I ended up with a neat file of 16,324 lines, which contains 13,696 standard lines (with 17 comma delimiters )

            Of course, the final file does not contain lines, with less than 17 commas. However, it remains, as expected, 2,627 lines which contain possible commas in zones, delimited between 2 double-quotes !

            So, I’m still inverstigating for the way to just replace comma delimiters, outside these double-quotes delimited zones. But, Andrew, I am quite confident that I can come up with the solution very soon ;-))

            See you later,

            guy038

            1 Reply Last reply Reply Quote 2
            • Andrew StaibuzA
              Andrew Staibuz
              last edited by

              I Hope you could find a solution soon.
              Meanwhile I have chosen to open the csv with Execel using the import data function and setting up the parametres.
              Then manually apply the correction needed.

              1 Reply Last reply Reply Quote 0
              • PeterJonesP
                PeterJones
                last edited by PeterJones

                I wonder if you could use a different sequence:

                1. open with excel
                2. export to a new CSV, but with a different separator character (see this page for instructions to change what separator character Excel will use for exporting CSV). Don’t forget to change the separator character back when done, in case it messes up some other program
                3. open the file in Notepad++ to do other edits, with commas now only existing in the real data, rather than as separators
                4. make your edits
                5. since your separator character was unique, you should just be able to change that character back to comma inside Notepad++
                6. assuming you didn’t play around with the nesting of quotes, your file should still be valid CSV when you’re done
                1 Reply Last reply Reply Quote 2
                • guy038G
                  guy038
                  last edited by guy038

                  Hi, @andrew-staibuz, @peterjones, @dinkumoil and All,

                  Perfect, Andrew ! I managed to modify your file and get normal records, delimited in 18 columns, with 17 pipe chars ( | ), while keeping the commas in the areas between double quotation marks

                  As I explained in my short previous post, I first will delete some exedentary EOL chars in order to get complete lines.

                  Note that all S/R are done with :

                  • The option Regular expression search mode

                  • The option Wrap Around

                  • A click on the Replace All button

                  Here are the different steps :

                  • First, any blank lines or lines with a single space char are replaced with a single space char

                    • SEARCH \R\x20?\R

                    • REPLACE \x20

                  => 2,840 replacements

                  • Now, if any line contains less than 17 commas, we changed its EOL char with a single space char

                    • SEARCH ^\d+,([^,\r\n]*,){0,15}[^,\r\n]+\K\R

                    • REPLACE \x20

                  => 243 replacements

                  • Then, if any EOL char is NOT followed with a number of digits and a comma ( Beginning of a standard line ) NOR with the very end of file, we changed this EOL, again, with a single space char

                    • SEARCH \R(?!\d+,|\Z)

                    • REPLACE \x20

                  => 292 replacements


                  At this point, we could think that your file is quite neat ! Yet, few weird things remain, concerning consecutive double-quotes ( " ) sequences :-(( So, to simplify the problem, I thought that it was better to replace, in some particular cases, two consecutive " chars with a temporary dummy character. I chose the bullet character • ( Unicode \x{2022} )

                  For instance :

                  • The search of """" ( four consecutive double-quotes ) find 1 record, below :
                  394318123243,"""Commander, the operation went as planned. We produced vast quantities of the Water of Peace and placed them in them water extractors. When they workd, this water of Piece entered them undergraund deposits. It didnt work purfect as mars wata supply is veri lou but it stil afekted a lots of peoples. We fil gud as we never biin frends liek dis befor. All is wel comander no wore!!!!!"""" It seems this Water of Peace reacted chemically with another element in the Martian waters and now not only affects the natural aggressiveness of human beings, but it also drastically reduces their overall intelligence.  <effect><count_idiots> Colonists gain the Idiot and the Enthusiast traits. Mars has gained <resource(terr_water,'WaterTP')>. The genius loses Sanity.",,,,new,,,,,Trunk\Mars/Dlc/armstrong/Presets/StoryBit/MartianPeace_FollowUp4.lua(00032),StoryBit MartianPeace_FollowUp4 Text,,,,243013,0,758 )
                  

                  After examination, I thought that the sensible replacement was ""• ( 1 manual replacement )

                  • Some lines contain text like ...,123,"abc def ""ghi jkl""",456,.... Like above, for these configurations , it’s sensible to replace the inner 2 double-quotes of the expression with the **bullet • char, leading to the more simple syntax ...,123,"abc def •ghi jkl•",456,... !

                  • So :

                    • SEARCH ""([^"\r\n]+)""

                    • REPLACE •\1•

                  => 339 replacements

                  • Now, the search of """ ( three consecutive double-quotes ) find 1 record, below :
                  453668807094,"""Er... say what now?",,,,new,,,,,Trunk\Mars/Data/StoryBit/BadBoyGenius_Success.lua(00019),,,narrator,narrator,238196,,20
                  

                  After examination, I thought that the sensible replacement was "• ( 1 manual replacement )

                  • At last, the search of "" ( two consecutive double-quotes ) find 4 records, in 2 lines, below :
                  6265,"EsoCorp is having obvious issues with their mysterious rovers as they seem to be malfunctioning one after the other. Instead of admitting the faults in their predictions about the rough Martian conditions, the EsoCorp team has sent us a message which, quote, ""gives the Martian Colony Command the rare privilege of sending one of our more analytical vehicles to look into the most certainly small miscalculation that causes these temporary annoyances.” End quote. Furthermore - they go on to remind us that, by choosing to analyze their broken vehicles, we have an understanding not to dig beyond any obvious logistical shortcomings in their design and agree to full and utter discretion. <effect>An Anomaly has appeared next to a malfunctioned vehicle.",,"EsoCorp is having obvious issues with their mysterious rovers as they seem to be malfunctioning one after the other. Instead of admitting the faults in their predictions about the rough Martian conditions, the EsoCorp team has sent us a message which, quote, ""gives the Martian Colony Command the rare privilege of sending one of our more analytical vehicles to look into the most certainly small miscalculation that causes these temporary annoyances.” End quote. Furthermore - they go on to remind us that, by choosing to analyze their broken vehicles, we have an understanding not to dig beyond any obvious logistical shortcomings in their design and agree to full and utter discretion. <effect>An Anomaly has appeared next to a malfunctioning vehicle.","EsoCorp is having obvious issues with their mysterious rovers as they seem to be malfunctioning one after the other. Instead of admitting the faults in their predictions about the rough Martian conditions, the EsoCorp team has sent us a message which, quote, ""gives the Martian Colony Command the rare privilege of sending one of our more analytical vehicles to look into the most certainly small miscalculation that causes these temporary annoyances.” End quote. Furthermore - they go on to remind us that, by choosing to analyze their broken vehicles, we have an understanding not to dig beyond any obvious logistical shortcomings in their design and agree to full and utter discretion. <effect>An Anomaly has appeared next to a malfunctioned vehicle.",changed,,,,,Trunk\Mars/Data/Scenario/Mystery_6.lua(00173),Scenario Mystery 6 text,,,,223739,,3
                  
                  935811765411,"Damn... I... I did not think of this. Actually, I did, I just thought that... I give up. I'll behave."" <effect>Stolen funding was refunded.",,,,new,,,,,Trunk\Mars/Data/StoryBit/BadBoyGenius_Success.lua(00016),StoryBit BadBoyGenius_Success Text,,,,238196,,140
                  

                  After examination, again, I thought that the sensible replacement was • ( 4 manual replacements )

                  Of course, Andrew, it would be better to have a look to these 6 lines manually changed, in order to verify if my choices were correct or no ;-))

                  As a verification, the search of "{2,} ( any list of consecutive double-quotes ) does give 0 occurrence ;-)). Now, you’re sure that exiting double-quotes begins or ends a field and that exists, at least, one char between these double-quotes ;-))

                  Your file, with correct syntax, has 16,323 lines + 1 header line


                  Now, we turn to the main S/R replacements :

                  • First, we’ll replace any comma, in zones between double-quotes, with a dummy char, non-existent in your file. I choosed the @ character

                    • SEARCH (,"[^"\r\n]*),([^,"\r\n]*")

                    • REPLACE \1@\2

                  IMPORTANT You’have to click several times, on the Replace All button, till the message Replace All: 0 occurences were replaced. is displayed. So :

                  = 3,483 + 1,446 + 705 + 390 + 232 + 137 + 89 + 54 + 39 + 27 + 19 + 17 + 8 + 4 + 3 + 3 + 0 = 6,656 replacements, after 17 clicks on the Replace All button

                  • Then, we change any comma delimiter with an other dummy character ( I chosed the Pipe symbol | ) :

                    • SEARCH ,

                    • REPLACE |

                  => 277,508 replacements

                  • So, now, we can change our temporary @ in order to get, again, commas, inside the double-quotes zones :

                    • SEARCH @

                    • REPLACE ,

                  => 6,656 replacements

                  • And finally, we change our temporary • bullet char, to get the original two double-quotes :

                  • SEARCH •

                  • REPLACE ""

                  => 684 replacements

                  Et voilà !

                  As a final check, let’s count the number of lines, containing 18 fields, separated with 17 new delimiters |, in our modified file, with the regex :

                  COUNT ^([^|\r\n]*\|){17}[^|\r\n]*$

                  Luckily, we get the expected result : 16,324 occurrences, which is the exact number of lines ;-))


                  If you prefer, Andrew, here is my e-mail address, below. So, just e-mail me and I 'll send you back the NEW-English.csv file ;-))

                  Best Regards

                  guy038

                  P.S. :

                  If you’d like some additional information on the regexes used, just tell me. Just enough for today ;-))

                  1 Reply Last reply Reply Quote 2
                  • guy038G
                    guy038
                    last edited by guy038

                    @andrew-staibuz, @peterjones, @dinkumoil and All,

                    I’ve found a small error, relative to an unique line, beginning with 394318123243|"""Commander, the operation went as planned............, which contained, initially, 4 consecutive double-quotes (")

                    I previously proposed to change the """" string with """,". However, in order that this line has exactly 18 fields, the correct replacement should be ""• !

                    Of course, I also updated my previous post and some replacement numbers ;-))

                    Cheers,

                    guy038

                    1 Reply Last reply Reply Quote 2
                    • Andrew StaibuzA
                      Andrew Staibuz
                      last edited by

                      Thank you everyone for the replay.
                      I have followed The guy038 suggestion and seems works.

                      1 Reply Last reply Reply Quote 0
                      • Patrick ChatelleP
                        Patrick Chatelle
                        last edited by Patrick Chatelle

                        This seemed like it should be top of list, but here goes

                        I need to replace comma “,” with a pipe “|”, but only if a comma immediately follow a 2 digit decimal.

                        So
                        ,49.30,49.30,.00,.00,.00,
                        needs to be
                        |49.30|49.30|.00|.00|.00|

                        notepad++ (Regular expression radio button clicked)
                        in the “Find what:” box -> (?:[.\d\d]{0},)
                        in the “Replace with:” box -> |

                        add “\d” for more decimals.

                        1 Reply Last reply Reply Quote 0
                        • First post
                          Last post
                        The Community of users of the Notepad++ text editor.
                        Powered by NodeBB | Contributors