How to locate text separators and replace them?



  • Hello,
    I have a csv file coding in UTF-8 format that contain a lot of string separate from comma.
    Comma is also included in the text as normal charachter.

    I want to replace the comma that work like a separator with another like semicolon (";") and keep the comma within the text unchanged.
    It’s possible using the regular expression?

    e.g.
    The original string is:
    504780607315,"Today the sun rose above a scene so spectacularly normal, that it was everything but. Blueish skies hanging above the arid, grassy fields of what might as well have been the great plains of Utah… But was actually here, on Mars!, oldtext, other

    As you can see the first the fifth and the sixth commas are a delimiter character, the second, the third and fourth instead are the commas within the text.

    what I want to get is:

    504780607315;"Today the sun rose above a scene so spectacularly normal, that it was everything but. Blueish skies hanging above the arid, grassy fields of what might as well have been the great plains of Utah… But was actually here, on Mars!; oldtext; other

    so when a import the file in excel the result will be:

    Column A :504780607315
    Column B : "Today the sun rose above a scene so spectacularly normal, that it was everything but. Blueish skies hanging above the arid, grassy fields of what might as well have been the great plains of Utah… But was actually here, on Mars!
    Column C: oldtext
    Column D: other

    I hope someone can help me
    Thanks in advance
    Andrea



  • @Andrew-Staibuz

    from the given example this might do the job
    find what: ^(.*?),(.*),(.*?),(.*?)$
    replace with:\1;\2;\3;\4



  • @Ekopalypse said in How to locate text separators and replace them?:

    \1;\2;\3;\4

    Thank you for your replay but it seems doesn’t work proprely



  • @Andrew-Staibuz said in How to locate text separators and replace them?:

    @Ekopalypse said in How to locate text separators and replace them?:

    \1;\2;\3;\4

    Thank you for your replay but it seems doesn’t work proprely

    Did you remember to enable regular expression mode in the search-and-replace dilog?

    Can you explain what didn’t work properly? What did that regex generate, and how did it not match what you want? (using the </> button to markup example text will remove ambiguity)

    We’d love to help, but we need you to help us help you by supplying more information



  • because when I tried @Ekopalypse’s regex on the following two lines of data:

    504780607315,"Today the sun rose above a scene so spectacularly normal, that it was everything but. Blueish skies hanging above the arid, grassy fields of what might as well have been the great plains of Utah… But was actually here, on Mars!, oldtext, other
    123456789012,"this text has a bunch of commas,,,,,,,,,,,which are all real, but the last two commas, coming soon, should be preserved, oldtext, other
    

    I got

    504780607315;"Today the sun rose above a scene so spectacularly normal, that it was everything but. Blueish skies hanging above the arid, grassy fields of what might as well have been the great plains of Utah… But was actually here, on Mars!; oldtext; other
    123456789012;"this text has a bunch of commas,,,,,,,,,,,which are all real, but the last two commas, coming soon, should be preserved; oldtext; other
    

    Which is exactly what I would expect from your description of the problem.

    So either you’re doing something wrong, or you haven’t given us an accurate representation of your text, or your definition of “doesn’t work” includes something different than “the regex didn’t transform the text the way I described it should”

    If by “it doesn’t work”, you mean that Excel didn’t read the CSV the way you intended it to:

    1. Did you tell Excel that you wanted semicolon as the separator?

    2. Did you know that your CSV is horribly mis-formed, even after the replacement?

      • In a CSV file, if there’s a quote at the start of a given column, compliant CSV parsers (including Excel) will expect a closing quote as well; anything between those quotes will be considered part of that one column, including any separator characters embedded in the text.
      • So 12345,"hello,,,,,,world",three has three columns: 12345, then hello,,,,,,world, then finally three.
      • But if your text is missing the end-quote (and the example text you showed is missing the end-quote – unless the forum mangled your text because you forgot to use the </> button in the forum editor to mark the text as “code”, so that it wouldn’t interpret it as Markdown), then I think Excel will continue looking on the next line for the end-quote. (I’m not sure, and I don’t have Excel on my home machine, so cannot test).


  • I even noticed in the forum’s code-highlighting of the example text, it knows what a CSV looks like:

    malformed:
    504780607315,"Today the sun rose above a scene so spectacularly normal, that it was everything but. Blueish skies hanging above the arid, grassy fields of what might as well have been the great plains of Utah… But was actually here, on Mars!, oldtext, other
    

    vs

    correct CSV:
    504780607315,"Today the sun rose above a scene so spectacularly normal, that it was everything but. Blueish skies hanging above the arid, grassy fields of what might as well have been the great plains of Utah… But was actually here, on Mars!", oldtext, other
    

    Notice how it stops being green after the end-quote.

    Or, a shorter example:

    malformed,"this quote opened it,blah,blah
    yada,",notice that yada comma is still part of the string started on the previous row, new column, new column
    vs
    correct,"this quote is complete",blah,blah
    yada,notice that here, yada is not green
    


  • First of all thanks you both for your help.
    Surely I haven’t given you an accurate representation of my text because it’s more complicated than I expected.
    So below you can find the link to take a look to the file I want to edit (that is the localization file of Surviving Mars).
    I want to translate it in my own language and do it in the simplest way possible.
    So I open it with Notepad ++
    Then I try to transform comma in semicolon to better separate the text from the separator.
    Only at this point I import the file in Excel to put in a column the data.
    As you can see not all the strings have a quote at the start and at the end.
    Some strings contain comma that aren’t separator.
    Not all the strings are edit in the same way.

    I hope that explanation help you

    [https://drive.google.com/file/d/1_vHlq3znkYVYfAAbuw4R2n0I1wp5iYj6/view?usp=sharing](link url)



  • @Andrew-Staibuz
    page that you are shared is not actually shared to public
    we have not permission
    https://drive.google.com/file/d/1_vHlq3znkYVYfAAbuw4R2n0I1wp5iYj6/view?usp=sharing



  • sorry now is pubblic



  • Hi, @andrew-staibuz and All

    Seemingly, your table has 22,538 rows + the header row. This header contains the 18 columns, below

    ID
    Text
    Translation
    Old Text
    Old Translation
    Status
    Gender
    Critical Warnings
    Non-critical Warnings
    Platform Warnings
    Location
    Context
    Section
    Actor
    Voice Actor
    Revision
    Old Revision
    Edit Distance
    

    So, a standard record, which does not contain any comma in text, should contain 18 colums, separated by 17 comma delimiter. Your file contains 13,126 of these standard records

    For theses ones, it’s not very difficult to change any comma delimiter with a semicolon delimiter. However, note that your file already contains 114 semicolon :-((. So, I would suggest you to use, for instance, the Tabulation character ( \t ) as a delimiter

    We’ll proceed in two steps :

    A) The following regex S/R will add a dummy character ( @ ) , non-existant in your file, at the end of any of these standard lines

    SEARCH ^([^,\r\n]*?,){17}[^,\r\n]+$

    REPLACE $0@

    B) The following regex S/R will replace any comma with a tabulation char , only if the line ends with the dummy char @ and will, also, delete the dummy character @, when all commas of the current line have been replaced

    SEARCH (,)(?=.*@)|@

    REPLACE ?1\t


    Now, the bad news is that the other records, of your file, do not have a regular structure ! For instance, here are, below, a few non-standard lines, sorted alphabetically. Rather difficult to link the parts of each line to the different areas of the header !?

    -------------------------------------------------------------------------------------------------------------------------------------------------------
    Line 12,851    ""Commander, I have my reasons to believe that you will grasp what I have to say. I am an ambassador from another civilization which conquered space long ago. My mission here is done and I assure you that I have the best possible impression of your people. I shall advocate for you in the Galactic Council. Now I know I have to disappear for the good of everyone. Don't look for me, for you will not find me. 
    Line 12,685    ""Why me?""
    Line 11,980    ","
    Line 11,981    ",,,OK,,,,,Trunk\Mars/Dlc/armstrong/Presets/PopupNotificationPreset.lua(00065),,,narrator,narrator,239549,0,
    Line  5,573    'An anonymous yet reliable source disclosed to us that you ordered and succeeded in the deliberate destruction of at least one of the alien vessels. My country condemns any premature aggressive actions which could taint the reputation of humankind, especially when addressing such a delicate matter. What gave you the authority to act in such a risky manner without an official resolution from the Council of the United Nations?'","They handle most of the questions with ease and require little input from you. And then you are confronted with a direct question from one of the delegates.
    Line  6,519    'Yes, I’m aware. I think she might be evolving into something bigger.'
    -------------------------------------------------------------------------------------------------------------------------------------------------------
    Line 20,816    (Hold) Infopanel shortcuts",,,,new,,,,,Trunk\Mars/Lua/XTemplates/GamepadControlsOverview.lua(00375),XTemplate GamepadControlsOverview Text,,,,234062,,43
    -------------------------------------------------------------------------------------------------------------------------------------------------------
    Line  2,566    - 5 extra starting technologies
    Line 14,132    - <em>Advanced Orbital Probe</em> - scans five sectors
    Line 14,142    - Additional Rockets are significantly cheaper
    -------------------------------------------------------------------------------------------------------------------------------------------------------
    Line  3,106    1. Select the Spacebar and <right_click> on one of the work slots in the first 2 shifts until the slot is shown in red.
    Line  2,950    4. Press <ButtonA> near the marked area above the main base as an end point of the route.",,,,new,,,,,Trunk\Mars/Data/OnScreenHint.lua(00801),OnScreenHint Tutorial_2_TranportRoutes gamepad_text,,,,230265,,347
    -------------------------------------------------------------------------------------------------------------------------------------------------------
    Line 15,426    17,"Period, base (s)",,,,new,,,,,Trunk\Mars/Lua/Buildings/Anomaly.lua(00407),,,,,221724,,16
                   ...........................
                   ..  Increasing numbers,  ..
                   ...........................
    Line  9,803    999445241150,"Now an entire Dome suffers from depression and project it back towards the Empath, making things even worse. Your head thumps with the pain and despair felt by the empath <DisplayName> while you’re trying to think out a solution to the crisis. 
    -------------------------------------------------------------------------------------------------------------------------------------------------------
    Line 19,431    </if><if(skipped)>The following mods will not be loaded because of missing or incompatible dependencies:
    Line  2,665    <ButtonA> - land the Rocket",,,,new,,,,,Trunk\Mars/Data/OnScreenHint.lua(00026),OnScreenHint HintRocket gamepad_text,,,,223160,,136
    Line  3,843    <ColonistName(colonist)>’s life has been a journey towards a vision, a journey which brought us all towards the possibility of living on the Red Planet, one small step after at a time. <ColonistName(colonist)> didn’t try to make their mark in human history, for there is no room for attempts in success. <ColonistName(colonist)> knew what the calling of fate was, knew the risks and the opportunities and made the best of it, for the good of us all.
    Line 20,600    <CompletedText></if>",,,,new,,,,,Trunk\Mars/Lua/XTemplates/ChallengeListItem.lua(00009),XTemplate ChallengeListItem RolloverText,,,,236165,,110
    Line  8,183    <DisplayName> has been nominated for a Nobel Prize in medical research, followed by a flood of donations and new applicants wanting to become part of our colony.
    Line 19,250    <MaintenanceText>",,Building condition deteriorates over time. Martian dust and disasters contribute to deterioration of outside buildings. Deteriorated buildings will need to be serviced by a Drone and to be supplied with their required maintenance resource or they will malfunction.<newline><newline>Current deterioration<right><percent(MaintenanceProgress)><newline><left>Last serviced<right><LastMaintenance>,The condition of buildings deteriorates over time. Martian dust and disasters contribute to deterioration of outside buildings. Deteriorated buildings will need to be serviced by a Drone and supplied with their required maintenance resource or they will malfunction.<newline><newline>Current deterioration<right><percent(MaintenanceProgress)><newline><left>Last serviced<right><LastMaintenance>,changed,,,,,Trunk\Mars/Lua/RequiresMaintenance.lua(00539),XTemplate sectionMaintenance RolloverText,,,,233244,,72
    Line 20,835    <ShortcutName('actionSpeedDown')> -  decrease game speed
    Line 21,747    <UIRefundRes> will be refunded upon salvage.",,,,new,,,,,Trunk\Mars/Lua/XTemplates/ipMultiSelect.lua(00119),XTemplate ipMultiSelect RolloverText,,,,240619,0,72
    Line  1,420    <center><image UI/Common/rollover_line.tga 2000> <left>
    Line  8,569    <colonists_pissed_1>% of Religious Colonists lose <morale_loss> Morale for <sols(pissed_time)> Sols.
    Line  2,322    <cut_if_not_platform('desktop')>
    Line 20,476    <dependencies>
    Line 12,314    <drones_lost> Drones have been lost.
    Line  4,784    <effect>  We discovered 30 Rare Metals at the anomaly site","After minimal processing, the resources can be transported and put to good use for the benefit of the Colony.
    Line 11,811    <em>Water</em> is increased by evaporation from <em>Lakes</em> and by completing <em>Special Projects</em>.
    Line 20,268    <formatedbuildinginfo('ElectricitySwitch')>",,,,new,,,,,Trunk\Mars/Lua/X/BuildMenu.lua(00916),,,,,226204,,58
    Line  9,111    <funding(offered_funding)> in exchange for <base_price_reduction>% of all revenue earned from any Rare Metal exports we make in the future. ",,,,new,,,,,Trunk\Mars/Data/StoryBit/MutualInterests.lua(00032),StoryBit MutualInterests Text,,,,234708,,167
    Line 13,490    <goal>You will be put through a series of trials. Metatron will create waves of anomalies which will test your ingenuity and resolve. Adapt and survive!
    Line 10,677    <grey>""All fungi are edible. Some fungi are only edible once."" 
    Line 11,167    <grey>'We should not just consume hydrocarbon fuel but use it to develop nuclear energy, hydro power and renewable energy sources.'
    Line 11,325    <grey>A friendly and supportive community is the key to making life easier in the harsh reality of the Red Planet.</grey>",,"Residential buildings Upgrade (<em>Home Collective</em>) - increases the Service Comfort of the building by 10.
    Line 10,712    <grey>“One touch of nature makes the whole world kin.”
    Line 20,464    <hard>
    Line  4,752    <hint> Check the map for the new locations","On July 7th, 1988 the Soviet Union launched two Proton-K rockets carrying the Phobos-1 and Phobos-2 probes on a course to Mars. Of the two craft only Phobos-2 reached its final destination, but contact with it was lost just before it could deliver its precious payload.
    Line 20,599    <if(Completed)>
    Line 10,880    <if(has_dlc('contentpack3'))>New Building: <em>Large Oxygen Tank</em> (<buildinginfo('OxygenTank_Large')>) - stores very large amounts of oxygen.</if>
    Line 15,338    <left>Collaboration loss<right><EstimatedDailyLoss>
    Line 13,696    <list>",,,,new,,,,,Trunk\Mars/Dlc/gagarin/Code/RivalColonies.lua(00613),,,,,235024,,60
    Line 19,429    <mods>
    Line 12,293    <morale_percent>% Engineer colonists lose <morale_loss> Morale for <sols(morale_sols)> Sols.",,,,new,,,,,Trunk\Mars/Dlc/armstrong/Presets/StoryBit/ThoseDirtyShuttles_FollowUp_opt.lua(00170),StoryBit ThoseDirtyShuttles_FollowUp_opt Text,,,,242226,0,320
    Line  8,563    <new_saints> more Colonists have become Saints.",,,,new,,,,,Trunk\Mars/Data/StoryBit/ForeignerInAForeignLand_FollowUp2.lua(00088),StoryBit ForeignerInAForeignLand_FollowUp2 Text,,,,238668,0,601
    Line    672    <platform_specific_text>",,,,new,,,,,Trunk\CommonLua/ParadoxMenu.lua(00157),,,,,238668,,90
    Line  7,049    <reg_param1> has sent a request that we start accepting refugees. They hope this drastic measure to get people to safety will highlight the absurdity of regional governments not taking in refugees and put pressure on them by their own populace.
    Line 11,078    <right>Wilson Greatbatch</grey><left>",,,,new,,,,,Trunk\Mars/Data/TechPreset.lua(02278),TechPreset FusionAutoregulation description,,,,228384,,315
    Line 21,399    <right_click> Cancel Biorobot <newline><center>Ctrl + <left_click> Construct five Biorobots",,,,new,,,,,Trunk\Mars/Lua/XTemplates/customDroneFactory.lua(00044),XTemplate customDroneFactory RolloverHint,,,,230711,,123
    Line  8,584    <security_happy>% of Security Officers gain <morale_gain> Morale for <sols(happy_time)> Sols.",,,,new,,,,,Trunk\Mars/Data/StoryBit/ForeignerInAForeignLand_FollowUp2.lua(00217),StoryBit ForeignerInAForeignLand_FollowUp2 Text,,,,238668,0,343
    Line 19,433    <skipped>
    Line xx,xxx    <skipped>",,,,new,,,,,Trunk\CommonLua/Classes/Mod.lua(01802),,,,,243121,0,97
    Line 20,468    <soft>
    Line  9,698    <sponsor_name> has suspended the construction of any new Fusion Reactors until investigations are complete.
    Line  6,350    >>> You want me to ask you a question?
    -------------------------------------------------------------------------------------------------------------------------------------------------------
    Line 12,113    A Marsquake has just started.",,,,new,,,,,Trunk\Mars/Dlc/armstrong/Presets/StoryBit/BabyVolcano.lua(00135),StoryBit BabyVolcano Text,,,,243013,0,433
    Line  2.822    Zoom - <middle_click> or <em>PgUp/PgDown</em>
    -------------------------------------------------------------------------------------------------------------------------------------------------------
    Line 13,508    ‘Adapt, expand, evolve.’ These are its words.
    Line  6,515    “Yes, I’m aware. I think she might be evolving into something bigger.”
    -------------------------------------------------------------------------------------------------------------------------------------------------------
    

    For instance, the line 9,803, above, contains only 2 commas !!


    Even worse, assuming, for instance, the standard line 12,687, below, containing 17 commas, as expected :

    ""'Cause you're the Commander, of course"" - she looks disappointed how silly adults are. ""Will you come?""",,,,new,,,,,Trunk\Mars/Dlc/armstrong/Presets/StoryBit/Pyramid_FollowUp_Xenoform_1a.lua(00033),StoryBit Pyramid_FollowUp_Xenoform_1a Text,,,,242226,0,341
    

    It contains 5 double-quote chars ( " ). Obviously, not an even number :-(( So, how to easily separate the different zones ?

    Best Regards,

    guy038



  • Based on that file, it turns out the problem is not malformed CSV. The problems is that there are multilined cells. The google drive link you posted rendered it as a reasonable CSV, with the proper multiline cells. For example, the first line that @guy038 used as an example, line 12851, which is actually part of ID=641279561181

    641279561181,"Your security advisor reports that Mr. Altmann somehow found out he was being watched. He managed to steal a space suit and disappear. Security officers found a package and a note in his apartment:
    
    ""Commander, I have my reasons to believe that you will grasp what I have to say. I am an ambassador from another civilization which conquered space long ago. My mission here is done and I assure you that I have the best possible impression of your people. I shall advocate for you in the Galactic Council. Now I know I have to disappear for the good of everyone. Don't look for me, for you will not find me. 
    As a parting gift, I'm leaving this growth substance. Spread it throughout Mars and let it flourish.""
    
    Your botanists confirm that the substance is some sort of a moss which can slowly replicate.
    
    <effect>John Altmann disappeared. Mars has gained <resource(terr_veg,'VegetationTP')>.",,,,new,,,,,Trunk\Mars/Dlc/armstrong/Presets/StoryBit/TheManFromMars_FollowUp3.lua(00018),StoryBit TheManFromMars_FollowUp3 Text,,,,243013,0,891
    

    and the complete version of the example you mangled,

    504780607315,"Today the sun rose above a scene so spectacularly normal, that it was everything but. Blueish skies hanging above the arid, grassy fields of what might as well have been the great plains of Utah... But was actually here, on Mars!
    
    Our restless efforts to purify the Martian atmosphere and make it suitable for humans have never achieved a more significant milestone. We are halfway there! But there is a lot of work remaining to be done.",,,,new,,,,,Trunk\Mars/Dlc/armstrong/Presets/PopupNotificationPreset.lua(00087),PopupNotificationPreset AtmosphereCleared text,,,,240619,0,437
    

    My plan of attack for such a file would be … to use a proper CSV parser, like Excel or one of the free spreadsheet programs (Google Spreadsheet, or Open Office Calc, or LibreOffice Calc, or pre-formed libraries for whatever programming language you are using, rather than rolling your own incorrect CSV-parser in that language…) not a regex. Because a proper parser sees all those correctly, and you don’t have to translate any of the separators.

    I’m not sure what you’re really trying to do, because replacing the separator won’t really help you: it’s already correctly formed for a CSV spreadsheet (I just made sure OpenOffice Calc could read it just fine, and properly recognized the multi-line cells). So I can only assume after “fixing” it with Notepad++, you are trying to run it through some other program you or a friend/coworker/online-hacker have written that doesn’t properly handle well-formed multiline cells in CSV text. But by going down that route, you are just going to corrupt good data, not fix it.

    So, rather than the XY-problem of “give me a regex that will change the separators in this CSV”, tell us what you’re actually trying to accomplish, and we can give hints whether that would be better done in Notepad++ or in some other tool.



  • And as far as @guy038 's processing: I think the better first step would be to merge any line that doesn’t start with a 12-digit number with the line before it (using your magic never-used character… some unused punctuation/symbol, or perhaps some rare unicode character); once you do that, all lines should have the right number of commas… Though trying to write a regex that can distinguish between the commas of valid separators and the commas embedded in strings (even after the multiline merge), would be quite difficult. Though if anyone can, you could. :-)



  • @Andrew-Staibuz said in How to locate text separators and replace them?:

    so when a import the file in excel the result will be

    Since you want to process the data with Excel anyway, why do you not import your original CSV file into Excel? When you choose the right method to do that, you will be guided through that process by an Excel wizzard, which can be told to use the comma as field separator. It also allows to choose UTF-8 as the character encoding of your source file.

    [EDIT]
    It is also possible to script this task, have a look at my CSV Viewer script. It lacks the part of saving a copy of the input file with different field delimiters but that’s no difficult job, Google gave me for example this result.
    [/EDIT]



  • Thanks everyone for the suggestion!!!



  • Hi, @andrew-staibuz, @peterjones, @dinkumoil and All,

    Aaaaaaah, Peter, how silly I am :-( I’ve just forgot that a standard line could be split on several lines !. So, after deleting some EOL, I ended up with a neat file of 16,324 lines, which contains 13,696 standard lines (with 17 comma delimiters )

    Of course, the final file does not contain lines, with less than 17 commas. However, it remains, as expected, 2,627 lines which contain possible commas in zones, delimited between 2 double-quotes !

    So, I’m still inverstigating for the way to just replace comma delimiters, outside these double-quotes delimited zones. But, Andrew, I am quite confident that I can come up with the solution very soon ;-))

    See you later,

    guy038



  • I Hope you could find a solution soon.
    Meanwhile I have chosen to open the csv with Execel using the import data function and setting up the parametres.
    Then manually apply the correction needed.



  • I wonder if you could use a different sequence:

    1. open with excel
    2. export to a new CSV, but with a different separator character (see this page for instructions to change what separator character Excel will use for exporting CSV). Don’t forget to change the separator character back when done, in case it messes up some other program
    3. open the file in Notepad++ to do other edits, with commas now only existing in the real data, rather than as separators
    4. make your edits
    5. since your separator character was unique, you should just be able to change that character back to comma inside Notepad++
    6. assuming you didn’t play around with the nesting of quotes, your file should still be valid CSV when you’re done


  • Hi, @andrew-staibuz, @peterjones, @dinkumoil and All,

    Perfect, Andrew ! I managed to modify your file and get normal records, delimited in 18 columns, with 17 pipe chars ( | ), while keeping the commas in the areas between double quotation marks

    As I explained in my short previous post, I first will delete some exedentary EOL chars in order to get complete lines.

    Note that all S/R are done with :

    • The option Regular expression search mode

    • The option Wrap Around

    • A click on the Replace All button

    Here are the different steps :

    • First, any blank lines or lines with a single space char are replaced with a single space char

      • SEARCH \R\x20?\R

      • REPLACE \x20

    => 2,840 replacements

    • Now, if any line contains less than 17 commas, we changed its EOL char with a single space char

      • SEARCH ^\d+,([^,\r\n]*,){0,15}[^,\r\n]+\K\R

      • REPLACE \x20

    => 243 replacements

    • Then, if any EOL char is NOT followed with a number of digits and a comma ( Beginning of a standard line ) NOR with the very end of file, we changed this EOL, again, with a single space char

      • SEARCH \R(?!\d+,|\Z)

      • REPLACE \x20

    => 292 replacements


    At this point, we could think that your file is quite neat ! Yet, few weird things remain, concerning consecutive double-quotes ( " ) sequences :-(( So, to simplify the problem, I thought that it was better to replace, in some particular cases, two consecutive " chars with a temporary dummy character. I chose the bullet character ( Unicode \x{2022} )

    For instance :

    • The search of """" ( four consecutive double-quotes ) find 1 record, below :
    394318123243,"""Commander, the operation went as planned. We produced vast quantities of the Water of Peace and placed them in them water extractors. When they workd, this water of Piece entered them undergraund deposits. It didnt work purfect as mars wata supply is veri lou but it stil afekted a lots of peoples. We fil gud as we never biin frends liek dis befor. All is wel comander no wore!!!!!"""" It seems this Water of Peace reacted chemically with another element in the Martian waters and now not only affects the natural aggressiveness of human beings, but it also drastically reduces their overall intelligence.  <effect><count_idiots> Colonists gain the Idiot and the Enthusiast traits. Mars has gained <resource(terr_water,'WaterTP')>. The genius loses Sanity.",,,,new,,,,,Trunk\Mars/Dlc/armstrong/Presets/StoryBit/MartianPeace_FollowUp4.lua(00032),StoryBit MartianPeace_FollowUp4 Text,,,,243013,0,758 )
    

    After examination, I thought that the sensible replacement was ""• ( 1 manual replacement )

    • Some lines contain text like ...,123,"abc def ""ghi jkl""",456,.... Like above, for these configurations , it’s sensible to replace the inner 2 double-quotes of the expression with the **bullet char, leading to the more simple syntax ...,123,"abc def •ghi jkl•",456,... !

    • So :

      • SEARCH ""([^"\r\n]+)""

      • REPLACE •\1•

    => 339 replacements

    • Now, the search of """ ( three consecutive double-quotes ) find 1 record, below :
    453668807094,"""Er... say what now?",,,,new,,,,,Trunk\Mars/Data/StoryBit/BadBoyGenius_Success.lua(00019),,,narrator,narrator,238196,,20
    

    After examination, I thought that the sensible replacement was "• ( 1 manual replacement )

    • At last, the search of "" ( two consecutive double-quotes ) find 4 records, in 2 lines, below :
    6265,"EsoCorp is having obvious issues with their mysterious rovers as they seem to be malfunctioning one after the other. Instead of admitting the faults in their predictions about the rough Martian conditions, the EsoCorp team has sent us a message which, quote, ""gives the Martian Colony Command the rare privilege of sending one of our more analytical vehicles to look into the most certainly small miscalculation that causes these temporary annoyances.” End quote. Furthermore - they go on to remind us that, by choosing to analyze their broken vehicles, we have an understanding not to dig beyond any obvious logistical shortcomings in their design and agree to full and utter discretion. <effect>An Anomaly has appeared next to a malfunctioned vehicle.",,"EsoCorp is having obvious issues with their mysterious rovers as they seem to be malfunctioning one after the other. Instead of admitting the faults in their predictions about the rough Martian conditions, the EsoCorp team has sent us a message which, quote, ""gives the Martian Colony Command the rare privilege of sending one of our more analytical vehicles to look into the most certainly small miscalculation that causes these temporary annoyances.” End quote. Furthermore - they go on to remind us that, by choosing to analyze their broken vehicles, we have an understanding not to dig beyond any obvious logistical shortcomings in their design and agree to full and utter discretion. <effect>An Anomaly has appeared next to a malfunctioning vehicle.","EsoCorp is having obvious issues with their mysterious rovers as they seem to be malfunctioning one after the other. Instead of admitting the faults in their predictions about the rough Martian conditions, the EsoCorp team has sent us a message which, quote, ""gives the Martian Colony Command the rare privilege of sending one of our more analytical vehicles to look into the most certainly small miscalculation that causes these temporary annoyances.” End quote. Furthermore - they go on to remind us that, by choosing to analyze their broken vehicles, we have an understanding not to dig beyond any obvious logistical shortcomings in their design and agree to full and utter discretion. <effect>An Anomaly has appeared next to a malfunctioned vehicle.",changed,,,,,Trunk\Mars/Data/Scenario/Mystery_6.lua(00173),Scenario Mystery 6 text,,,,223739,,3
    
    935811765411,"Damn... I... I did not think of this. Actually, I did, I just thought that... I give up. I'll behave."" <effect>Stolen funding was refunded.",,,,new,,,,,Trunk\Mars/Data/StoryBit/BadBoyGenius_Success.lua(00016),StoryBit BadBoyGenius_Success Text,,,,238196,,140
    

    After examination, again, I thought that the sensible replacement was ( 4 manual replacements )

    Of course, Andrew, it would be better to have a look to these 6 lines manually changed, in order to verify if my choices were correct or no ;-))

    As a verification, the search of "{2,} ( any list of consecutive double-quotes ) does give 0 occurrence ;-)). Now, you’re sure that exiting double-quotes begins or ends a field and that exists, at least, one char between these double-quotes ;-))

    Your file, with correct syntax, has 16,323 lines + 1 header line


    Now, we turn to the main S/R replacements :

    • First, we’ll replace any comma, in zones between double-quotes, with a dummy char, non-existent in your file. I choosed the @ character

      • SEARCH (,"[^"\r\n]*),([^,"\r\n]*")

      • REPLACE \1@\2

    IMPORTANT You’have to click several times, on the Replace All button, till the message Replace All: 0 occurences were replaced. is displayed. So :

    = 3,483 + 1,446 + 705 + 390 + 232 + 137 + 89 + 54 + 39 + 27 + 19 + 17 + 8 + 4 + 3 + 3 + 0 = 6,656 replacements, after 17 clicks on the Replace All button

    • Then, we change any comma delimiter with an other dummy character ( I chosed the Pipe symbol | ) :

      • SEARCH ,

      • REPLACE |

    => 277,508 replacements

    • So, now, we can change our temporary @ in order to get, again, commas, inside the double-quotes zones :

      • SEARCH @

      • REPLACE ,

    => 6,656 replacements

    • And finally, we change our temporary bullet char, to get the original two double-quotes :

    • SEARCH

    • REPLACE ""

    => 684 replacements

    Et voilà !

    As a final check, let’s count the number of lines, containing 18 fields, separated with 17 new delimiters |, in our modified file, with the regex :

    COUNT ^([^|\r\n]*\|){17}[^|\r\n]*$

    Luckily, we get the expected result : 16,324 occurrences, which is the exact number of lines ;-))


    If you prefer, Andrew, here is my e-mail address, below. So, just e-mail me and I 'll send you back the NEW-English.csv file ;-))

    tguy.038@gmail.com

    Best Regards

    guy038

    P.S. :

    If you’d like some additional information on the regexes used, just tell me. Just enough for today ;-))



  • @andrew-staibuz, @peterjones, @dinkumoil and All,

    I’ve found a small error, relative to an unique line, beginning with 394318123243|"""Commander, the operation went as planned............, which contained, initially, 4 consecutive double-quotes (")

    I previously proposed to change the """" string with """,". However, in order that this line has exactly 18 fields, the correct replacement should be ""• !

    Of course, I also updated my previous post and some replacement numbers ;-))

    Cheers,

    guy038



  • Thank you everyone for the replay.
    I have followed The guy038 suggestion and seems works.


Log in to reply