How to locate text separators and replace them?

PeterJones

And as far as @guy038 's processing: I think the better first step would be to merge any line that doesn’t start with a 12-digit number with the line before it (using your magic never-used character… some unused punctuation/symbol, or perhaps some rare unicode character); once you do that, all lines should have the right number of commas… Though trying to write a regex that can distinguish between the commas of valid separators and the commas embedded in strings (even after the multiline merge), would be quite difficult. Though if anyone can, you could. :-)

dinkumoil

@Andrew-Staibuz said in How to locate text separators and replace them?:

so when a import the file in excel the result will be

Since you want to process the data with Excel anyway, why do you not import your original CSV file into Excel? When you choose the right method to do that, you will be guided through that process by an Excel wizzard, which can be told to use the comma as field separator. It also allows to choose UTF-8 as the character encoding of your source file.

[EDIT]
It is also possible to script this task, have a look at my CSV Viewer script. It lacks the part of saving a copy of the input file with different field delimiters but that’s no difficult job, Google gave me for example this result.
[/EDIT]

Andrew Staibuz

Thanks everyone for the suggestion!!!

guy038

Hi, @andrew-staibuz, @peterjones, @dinkumoil and All,

Aaaaaaah, Peter, how silly I am :-( I’ve just forgot that a standard line could be split on several lines !. So, after deleting some EOL, I ended up with a neat file of 16,324 lines, which contains 13,696 standard lines (with 17 comma delimiters )

Of course, the final file does not contain lines, with less than 17 commas. However, it remains, as expected, 2,627 lines which contain possible commas in zones, delimited between 2 double-quotes !

So, I’m still inverstigating for the way to just replace comma delimiters, outside these double-quotes delimited zones. But, Andrew, I am quite confident that I can come up with the solution very soon ;-))

See you later,

guy038

Andrew Staibuz

I Hope you could find a solution soon.
Meanwhile I have chosen to open the csv with Execel using the import data function and setting up the parametres.
Then manually apply the correction needed.

PeterJones

I wonder if you could use a different sequence:

open with excel
export to a new CSV, but with a different separator character (see this page for instructions to change what separator character Excel will use for exporting CSV). Don’t forget to change the separator character back when done, in case it messes up some other program
open the file in Notepad++ to do other edits, with commas now only existing in the real data, rather than as separators
make your edits
since your separator character was unique, you should just be able to change that character back to comma inside Notepad++
assuming you didn’t play around with the nesting of quotes, your file should still be valid CSV when you’re done

guy038

Hi, @andrew-staibuz, @peterjones, @dinkumoil and All,

Perfect, Andrew ! I managed to modify your file and get normal records, delimited in 18 columns, with 17 pipe chars ( | ), while keeping the commas in the areas between double quotation marks

As I explained in my short previous post, I first will delete some exedentary EOL chars in order to get complete lines.

Note that all S/R are done with :

The option Regular expression search mode
The option Wrap Around
A click on the Replace All button

Here are the different steps :

First, any blank lines or lines with a single space char are replaced with a single space char
- SEARCH \R\x20?\R
- REPLACE \x20

=> 2,840 replacements

Now, if any line contains less than 17 commas, we changed its EOL char with a single space char
- SEARCH ^\d+,([^,\r\n]*,){0,15}[^,\r\n]+\K\R
- REPLACE \x20

=> 243 replacements

Then, if any EOL char is NOT followed with a number of digits and a comma ( Beginning of a standard line ) NOR with the very end of file, we changed this EOL, again, with a single space char
- SEARCH \R(?!\d+,|\Z)
- REPLACE \x20

=> 292 replacements

At this point, we could think that your file is quite neat ! Yet, few weird things remain, concerning consecutive double-quotes ( " ) sequences :-(( So, to simplify the problem, I thought that it was better to replace, in some particular cases, two consecutive " chars with a temporary dummy character. I chose the bullet character • ( Unicode \x{2022} )

For instance :

The search of """" ( four consecutive double-quotes ) find 1 record, below :

394318123243,"""Commander, the operation went as planned. We produced vast quantities of the Water of Peace and placed them in them water extractors. When they workd, this water of Piece entered them undergraund deposits. It didnt work purfect as mars wata supply is veri lou but it stil afekted a lots of peoples. We fil gud as we never biin frends liek dis befor. All is wel comander no wore!!!!!"""" It seems this Water of Peace reacted chemically with another element in the Martian waters and now not only affects the natural aggressiveness of human beings, but it also drastically reduces their overall intelligence.  <effect><count_idiots> Colonists gain the Idiot and the Enthusiast traits. Mars has gained <resource(terr_water,'WaterTP')>. The genius loses Sanity.",,,,new,,,,,Trunk\Mars/Dlc/armstrong/Presets/StoryBit/MartianPeace_FollowUp4.lua(00032),StoryBit MartianPeace_FollowUp4 Text,,,,243013,0,758 )

After examination, I thought that the sensible replacement was ""• ( 1 manual replacement )

Some lines contain text like ...,123,"abc def ""ghi jkl""",456,.... Like above, for these configurations , it’s sensible to replace the inner 2 double-quotes of the expression with the **bullet • char, leading to the more simple syntax ...,123,"abc def •ghi jkl•",456,... !
So :
- SEARCH ""([^"\r\n]+)""
- REPLACE •\1•

=> 339 replacements

Now, the search of """ ( three consecutive double-quotes ) find 1 record, below :

453668807094,"""Er... say what now?",,,,new,,,,,Trunk\Mars/Data/StoryBit/BadBoyGenius_Success.lua(00019),,,narrator,narrator,238196,,20

After examination, I thought that the sensible replacement was "• ( 1 manual replacement )

At last, the search of "" ( two consecutive double-quotes ) find 4 records, in 2 lines, below :

6265,"EsoCorp is having obvious issues with their mysterious rovers as they seem to be malfunctioning one after the other. Instead of admitting the faults in their predictions about the rough Martian conditions, the EsoCorp team has sent us a message which, quote, ""gives the Martian Colony Command the rare privilege of sending one of our more analytical vehicles to look into the most certainly small miscalculation that causes these temporary annoyances.” End quote. Furthermore - they go on to remind us that, by choosing to analyze their broken vehicles, we have an understanding not to dig beyond any obvious logistical shortcomings in their design and agree to full and utter discretion. <effect>An Anomaly has appeared next to a malfunctioned vehicle.",,"EsoCorp is having obvious issues with their mysterious rovers as they seem to be malfunctioning one after the other. Instead of admitting the faults in their predictions about the rough Martian conditions, the EsoCorp team has sent us a message which, quote, ""gives the Martian Colony Command the rare privilege of sending one of our more analytical vehicles to look into the most certainly small miscalculation that causes these temporary annoyances.” End quote. Furthermore - they go on to remind us that, by choosing to analyze their broken vehicles, we have an understanding not to dig beyond any obvious logistical shortcomings in their design and agree to full and utter discretion. <effect>An Anomaly has appeared next to a malfunctioning vehicle.","EsoCorp is having obvious issues with their mysterious rovers as they seem to be malfunctioning one after the other. Instead of admitting the faults in their predictions about the rough Martian conditions, the EsoCorp team has sent us a message which, quote, ""gives the Martian Colony Command the rare privilege of sending one of our more analytical vehicles to look into the most certainly small miscalculation that causes these temporary annoyances.” End quote. Furthermore - they go on to remind us that, by choosing to analyze their broken vehicles, we have an understanding not to dig beyond any obvious logistical shortcomings in their design and agree to full and utter discretion. <effect>An Anomaly has appeared next to a malfunctioned vehicle.",changed,,,,,Trunk\Mars/Data/Scenario/Mystery_6.lua(00173),Scenario Mystery 6 text,,,,223739,,3

935811765411,"Damn... I... I did not think of this. Actually, I did, I just thought that... I give up. I'll behave."" <effect>Stolen funding was refunded.",,,,new,,,,,Trunk\Mars/Data/StoryBit/BadBoyGenius_Success.lua(00016),StoryBit BadBoyGenius_Success Text,,,,238196,,140

After examination, again, I thought that the sensible replacement was • ( 4 manual replacements )

Of course, Andrew, it would be better to have a look to these 6 lines manually changed, in order to verify if my choices were correct or no ;-))

As a verification, the search of "{2,} ( any list of consecutive double-quotes ) does give 0 occurrence ;-)). Now, you’re sure that exiting double-quotes begins or ends a field and that exists, at least, one char between these double-quotes ;-))

Your file, with correct syntax, has 16,323 lines + 1 header line

Now, we turn to the main S/R replacements :

First, we’ll replace any comma, in zones between double-quotes, with a dummy char, non-existent in your file. I choosed the @ character
- SEARCH (,"[^"\r\n]*),([^,"\r\n]*")
- REPLACE \1@\2

IMPORTANT You’have to click several times, on the Replace All button, till the message Replace All: 0 occurences were replaced. is displayed. So :

= 3,483 + 1,446 + 705 + 390 + 232 + 137 + 89 + 54 + 39 + 27 + 19 + 17 + 8 + 4 + 3 + 3 + 0 = 6,656 replacements, after 17 clicks on the Replace All button

Then, we change any comma delimiter with an other dummy character ( I chosed the Pipe symbol | ) :
- SEARCH ,
- REPLACE |

=> 277,508 replacements

So, now, we can change our temporary @ in order to get, again, commas, inside the double-quotes zones :
- SEARCH @
- REPLACE ,

=> 6,656 replacements

And finally, we change our temporary • bullet char, to get the original two double-quotes :
SEARCH •
REPLACE ""

=> 684 replacements

Et voilà !

As a final check, let’s count the number of lines, containing 18 fields, separated with 17 new delimiters |, in our modified file, with the regex :

COUNT ^([^|\r\n]*\|){17}[^|\r\n]*$

Luckily, we get the expected result : 16,324 occurrences, which is the exact number of lines ;-))

If you prefer, Andrew, here is my e-mail address, below. So, just e-mail me and I 'll send you back the NEW-English.csv file ;-))

Best Regards

guy038

P.S. :

If you’d like some additional information on the regexes used, just tell me. Just enough for today ;-))

guy038

@andrew-staibuz, @peterjones, @dinkumoil and All,

I’ve found a small error, relative to an unique line, beginning with 394318123243|"""Commander, the operation went as planned............, which contained, initially, 4 consecutive double-quotes (")

I previously proposed to change the """" string with """,". However, in order that this line has exactly 18 fields, the correct replacement should be ""• !

Of course, I also updated my previous post and some replacement numbers ;-))

Cheers,

guy038

Andrew Staibuz

Thank you everyone for the replay.
I have followed The guy038 suggestion and seems works.

Patrick Chatelle

This seemed like it should be top of list, but here goes

I need to replace comma “,” with a pipe “|”, but only if a comma immediately follow a 2 digit decimal.

So
,49.30,49.30,.00,.00,.00,
needs to be
|49.30|49.30|.00|.00|.00|

notepad++ (Regular expression radio button clicked)
in the “Find what:” box -> (?:[.\d\d]{0},)
in the “Replace with:” box -> |

add “\d” for more decimals.