How to locate text separators and replace them?



  • Based on that file, it turns out the problem is not malformed CSV. The problems is that there are multilined cells. The google drive link you posted rendered it as a reasonable CSV, with the proper multiline cells. For example, the first line that @guy038 used as an example, line 12851, which is actually part of ID=641279561181

    641279561181,"Your security advisor reports that Mr. Altmann somehow found out he was being watched. He managed to steal a space suit and disappear. Security officers found a package and a note in his apartment:
    
    ""Commander, I have my reasons to believe that you will grasp what I have to say. I am an ambassador from another civilization which conquered space long ago. My mission here is done and I assure you that I have the best possible impression of your people. I shall advocate for you in the Galactic Council. Now I know I have to disappear for the good of everyone. Don't look for me, for you will not find me. 
    As a parting gift, I'm leaving this growth substance. Spread it throughout Mars and let it flourish.""
    
    Your botanists confirm that the substance is some sort of a moss which can slowly replicate.
    
    <effect>John Altmann disappeared. Mars has gained <resource(terr_veg,'VegetationTP')>.",,,,new,,,,,Trunk\Mars/Dlc/armstrong/Presets/StoryBit/TheManFromMars_FollowUp3.lua(00018),StoryBit TheManFromMars_FollowUp3 Text,,,,243013,0,891
    

    and the complete version of the example you mangled,

    504780607315,"Today the sun rose above a scene so spectacularly normal, that it was everything but. Blueish skies hanging above the arid, grassy fields of what might as well have been the great plains of Utah... But was actually here, on Mars!
    
    Our restless efforts to purify the Martian atmosphere and make it suitable for humans have never achieved a more significant milestone. We are halfway there! But there is a lot of work remaining to be done.",,,,new,,,,,Trunk\Mars/Dlc/armstrong/Presets/PopupNotificationPreset.lua(00087),PopupNotificationPreset AtmosphereCleared text,,,,240619,0,437
    

    My plan of attack for such a file would be … to use a proper CSV parser, like Excel or one of the free spreadsheet programs (Google Spreadsheet, or Open Office Calc, or LibreOffice Calc, or pre-formed libraries for whatever programming language you are using, rather than rolling your own incorrect CSV-parser in that language…) not a regex. Because a proper parser sees all those correctly, and you don’t have to translate any of the separators.

    I’m not sure what you’re really trying to do, because replacing the separator won’t really help you: it’s already correctly formed for a CSV spreadsheet (I just made sure OpenOffice Calc could read it just fine, and properly recognized the multi-line cells). So I can only assume after “fixing” it with Notepad++, you are trying to run it through some other program you or a friend/coworker/online-hacker have written that doesn’t properly handle well-formed multiline cells in CSV text. But by going down that route, you are just going to corrupt good data, not fix it.

    So, rather than the XY-problem of “give me a regex that will change the separators in this CSV”, tell us what you’re actually trying to accomplish, and we can give hints whether that would be better done in Notepad++ or in some other tool.



  • And as far as @guy038 's processing: I think the better first step would be to merge any line that doesn’t start with a 12-digit number with the line before it (using your magic never-used character… some unused punctuation/symbol, or perhaps some rare unicode character); once you do that, all lines should have the right number of commas… Though trying to write a regex that can distinguish between the commas of valid separators and the commas embedded in strings (even after the multiline merge), would be quite difficult. Though if anyone can, you could. :-)



  • @Andrew-Staibuz said in How to locate text separators and replace them?:

    so when a import the file in excel the result will be

    Since you want to process the data with Excel anyway, why do you not import your original CSV file into Excel? When you choose the right method to do that, you will be guided through that process by an Excel wizzard, which can be told to use the comma as field separator. It also allows to choose UTF-8 as the character encoding of your source file.

    [EDIT]
    It is also possible to script this task, have a look at my CSV Viewer script. It lacks the part of saving a copy of the input file with different field delimiters but that’s no difficult job, Google gave me for example this result.
    [/EDIT]



  • Thanks everyone for the suggestion!!!



  • Hi, @andrew-staibuz, @peterjones, @dinkumoil and All,

    Aaaaaaah, Peter, how silly I am :-( I’ve just forgot that a standard line could be split on several lines !. So, after deleting some EOL, I ended up with a neat file of 16,324 lines, which contains 13,696 standard lines (with 17 comma delimiters )

    Of course, the final file does not contain lines, with less than 17 commas. However, it remains, as expected, 2,627 lines which contain possible commas in zones, delimited between 2 double-quotes !

    So, I’m still inverstigating for the way to just replace comma delimiters, outside these double-quotes delimited zones. But, Andrew, I am quite confident that I can come up with the solution very soon ;-))

    See you later,

    guy038



  • I Hope you could find a solution soon.
    Meanwhile I have chosen to open the csv with Execel using the import data function and setting up the parametres.
    Then manually apply the correction needed.



  • I wonder if you could use a different sequence:

    1. open with excel
    2. export to a new CSV, but with a different separator character (see this page for instructions to change what separator character Excel will use for exporting CSV). Don’t forget to change the separator character back when done, in case it messes up some other program
    3. open the file in Notepad++ to do other edits, with commas now only existing in the real data, rather than as separators
    4. make your edits
    5. since your separator character was unique, you should just be able to change that character back to comma inside Notepad++
    6. assuming you didn’t play around with the nesting of quotes, your file should still be valid CSV when you’re done


  • Hi, @andrew-staibuz, @peterjones, @dinkumoil and All,

    Perfect, Andrew ! I managed to modify your file and get normal records, delimited in 18 columns, with 17 pipe chars ( | ), while keeping the commas in the areas between double quotation marks

    As I explained in my short previous post, I first will delete some exedentary EOL chars in order to get complete lines.

    Note that all S/R are done with :

    • The option Regular expression search mode

    • The option Wrap Around

    • A click on the Replace All button

    Here are the different steps :

    • First, any blank lines or lines with a single space char are replaced with a single space char

      • SEARCH \R\x20?\R

      • REPLACE \x20

    => 2,840 replacements

    • Now, if any line contains less than 17 commas, we changed its EOL char with a single space char

      • SEARCH ^\d+,([^,\r\n]*,){0,15}[^,\r\n]+\K\R

      • REPLACE \x20

    => 243 replacements

    • Then, if any EOL char is NOT followed with a number of digits and a comma ( Beginning of a standard line ) NOR with the very end of file, we changed this EOL, again, with a single space char

      • SEARCH \R(?!\d+,|\Z)

      • REPLACE \x20

    => 292 replacements


    At this point, we could think that your file is quite neat ! Yet, few weird things remain, concerning consecutive double-quotes ( " ) sequences :-(( So, to simplify the problem, I thought that it was better to replace, in some particular cases, two consecutive " chars with a temporary dummy character. I chose the bullet character ( Unicode \x{2022} )

    For instance :

    • The search of """" ( four consecutive double-quotes ) find 1 record, below :
    394318123243,"""Commander, the operation went as planned. We produced vast quantities of the Water of Peace and placed them in them water extractors. When they workd, this water of Piece entered them undergraund deposits. It didnt work purfect as mars wata supply is veri lou but it stil afekted a lots of peoples. We fil gud as we never biin frends liek dis befor. All is wel comander no wore!!!!!"""" It seems this Water of Peace reacted chemically with another element in the Martian waters and now not only affects the natural aggressiveness of human beings, but it also drastically reduces their overall intelligence.  <effect><count_idiots> Colonists gain the Idiot and the Enthusiast traits. Mars has gained <resource(terr_water,'WaterTP')>. The genius loses Sanity.",,,,new,,,,,Trunk\Mars/Dlc/armstrong/Presets/StoryBit/MartianPeace_FollowUp4.lua(00032),StoryBit MartianPeace_FollowUp4 Text,,,,243013,0,758 )
    

    After examination, I thought that the sensible replacement was ""• ( 1 manual replacement )

    • Some lines contain text like ...,123,"abc def ""ghi jkl""",456,.... Like above, for these configurations , it’s sensible to replace the inner 2 double-quotes of the expression with the **bullet char, leading to the more simple syntax ...,123,"abc def •ghi jkl•",456,... !

    • So :

      • SEARCH ""([^"\r\n]+)""

      • REPLACE •\1•

    => 339 replacements

    • Now, the search of """ ( three consecutive double-quotes ) find 1 record, below :
    453668807094,"""Er... say what now?",,,,new,,,,,Trunk\Mars/Data/StoryBit/BadBoyGenius_Success.lua(00019),,,narrator,narrator,238196,,20
    

    After examination, I thought that the sensible replacement was "• ( 1 manual replacement )

    • At last, the search of "" ( two consecutive double-quotes ) find 4 records, in 2 lines, below :
    6265,"EsoCorp is having obvious issues with their mysterious rovers as they seem to be malfunctioning one after the other. Instead of admitting the faults in their predictions about the rough Martian conditions, the EsoCorp team has sent us a message which, quote, ""gives the Martian Colony Command the rare privilege of sending one of our more analytical vehicles to look into the most certainly small miscalculation that causes these temporary annoyances.” End quote. Furthermore - they go on to remind us that, by choosing to analyze their broken vehicles, we have an understanding not to dig beyond any obvious logistical shortcomings in their design and agree to full and utter discretion. <effect>An Anomaly has appeared next to a malfunctioned vehicle.",,"EsoCorp is having obvious issues with their mysterious rovers as they seem to be malfunctioning one after the other. Instead of admitting the faults in their predictions about the rough Martian conditions, the EsoCorp team has sent us a message which, quote, ""gives the Martian Colony Command the rare privilege of sending one of our more analytical vehicles to look into the most certainly small miscalculation that causes these temporary annoyances.” End quote. Furthermore - they go on to remind us that, by choosing to analyze their broken vehicles, we have an understanding not to dig beyond any obvious logistical shortcomings in their design and agree to full and utter discretion. <effect>An Anomaly has appeared next to a malfunctioning vehicle.","EsoCorp is having obvious issues with their mysterious rovers as they seem to be malfunctioning one after the other. Instead of admitting the faults in their predictions about the rough Martian conditions, the EsoCorp team has sent us a message which, quote, ""gives the Martian Colony Command the rare privilege of sending one of our more analytical vehicles to look into the most certainly small miscalculation that causes these temporary annoyances.” End quote. Furthermore - they go on to remind us that, by choosing to analyze their broken vehicles, we have an understanding not to dig beyond any obvious logistical shortcomings in their design and agree to full and utter discretion. <effect>An Anomaly has appeared next to a malfunctioned vehicle.",changed,,,,,Trunk\Mars/Data/Scenario/Mystery_6.lua(00173),Scenario Mystery 6 text,,,,223739,,3
    
    935811765411,"Damn... I... I did not think of this. Actually, I did, I just thought that... I give up. I'll behave."" <effect>Stolen funding was refunded.",,,,new,,,,,Trunk\Mars/Data/StoryBit/BadBoyGenius_Success.lua(00016),StoryBit BadBoyGenius_Success Text,,,,238196,,140
    

    After examination, again, I thought that the sensible replacement was ( 4 manual replacements )

    Of course, Andrew, it would be better to have a look to these 6 lines manually changed, in order to verify if my choices were correct or no ;-))

    As a verification, the search of "{2,} ( any list of consecutive double-quotes ) does give 0 occurrence ;-)). Now, you’re sure that exiting double-quotes begins or ends a field and that exists, at least, one char between these double-quotes ;-))

    Your file, with correct syntax, has 16,323 lines + 1 header line


    Now, we turn to the main S/R replacements :

    • First, we’ll replace any comma, in zones between double-quotes, with a dummy char, non-existent in your file. I choosed the @ character

      • SEARCH (,"[^"\r\n]*),([^,"\r\n]*")

      • REPLACE \1@\2

    IMPORTANT You’have to click several times, on the Replace All button, till the message Replace All: 0 occurences were replaced. is displayed. So :

    = 3,483 + 1,446 + 705 + 390 + 232 + 137 + 89 + 54 + 39 + 27 + 19 + 17 + 8 + 4 + 3 + 3 + 0 = 6,656 replacements, after 17 clicks on the Replace All button

    • Then, we change any comma delimiter with an other dummy character ( I chosed the Pipe symbol | ) :

      • SEARCH ,

      • REPLACE |

    => 277,508 replacements

    • So, now, we can change our temporary @ in order to get, again, commas, inside the double-quotes zones :

      • SEARCH @

      • REPLACE ,

    => 6,656 replacements

    • And finally, we change our temporary bullet char, to get the original two double-quotes :

    • SEARCH

    • REPLACE ""

    => 684 replacements

    Et voilà !

    As a final check, let’s count the number of lines, containing 18 fields, separated with 17 new delimiters |, in our modified file, with the regex :

    COUNT ^([^|\r\n]*\|){17}[^|\r\n]*$

    Luckily, we get the expected result : 16,324 occurrences, which is the exact number of lines ;-))


    If you prefer, Andrew, here is my e-mail address, below. So, just e-mail me and I 'll send you back the NEW-English.csv file ;-))

    tguy.038@gmail.com

    Best Regards

    guy038

    P.S. :

    If you’d like some additional information on the regexes used, just tell me. Just enough for today ;-))



  • @andrew-staibuz, @peterjones, @dinkumoil and All,

    I’ve found a small error, relative to an unique line, beginning with 394318123243|"""Commander, the operation went as planned............, which contained, initially, 4 consecutive double-quotes (")

    I previously proposed to change the """" string with """,". However, in order that this line has exactly 18 fields, the correct replacement should be ""• !

    Of course, I also updated my previous post and some replacement numbers ;-))

    Cheers,

    guy038



  • Thank you everyone for the replay.
    I have followed The guy038 suggestion and seems works.


Log in to reply