Clean a file using Regex

Mauricio Salazar

Hi everybody, someone already help me with a Regex issue here, butI need toapply that solution into a comma delimited csv.

The following image shows the working solution:

Alt text

But the file is comma separated and this solution does not work:

Alt text

The problem is that when I am trying to load the file it gives me an error.

Any help will be appreciated

Best Regards

guy038

Hi Mauricio,

Seemingly, you would like to change, in your CSV file, each comma by an underscore, only if this comma does NOT occur in the zone, containing text ANYWORD or ANY. That is to say, from your picture, between the columns 26 and 45. Right ?

Well, here is, below, the general method for matching a single character, OUTSIDE an unique zone of text, beginning at column c1 and ending at column c2 :

So, let’s suppose your well delimited text, below :

abcd,04,,11111111,    22,ANYWORD ,ANYWORD,ANY,QZERTY,001,,Last Field,
fghi,02,,22222222,    22,ANY ,ANY, WORDS, ANY,AZERTY,999,,Last Field,
klmn,09,,33333333,    22,WORDS,ANY, WORDS,ANY,TEST_1,123,,Last Field,

and that the comma separator will be replaced by the underscore character ( _ )

Firstly, we’ll use two characters, NOT used yet, in your file, as the boundaries of the zone to exclude, from the search of the comma. I chose the characters #, as starting delimiter and @, as ending delimiter

To add them, temporarily, just use the following regex S/R :

SEARCH : ^(.{c1-1})(.{c2-c1+1})

REPLACE : \1#\2@

Of course, you must replace the expressions c1-1 and c2-c1+1, with their values, to obtain an integer number. In your case, as c1 = 26 and c2 = 45, we obtain the SEARCH regex : ^(.{25})(.{20})

After moving back to the beginning of the sample text, and clicking on the Replace All button, you should obtain the text below :

abcd,04,,11111111,    22,#ANYWORD ,ANYWORD,ANY@,QZERTY,001,,Last Field,
fghi,02,,22222222,    22,#ANY ,ANY, WORDS, ANY@,AZERTY,999,,Last Field,
klmn,09,,33333333,    22,#WORDS,ANY, WORDS,ANY@,TEST_1,123,,Last Field,

Note that the # boundary is located after the comma of the fourth field and that the @ boundary is located before the comma separator of the fifth field

Finally, we’ll use a second regex S/R that will replace the comma separator by the underscore character, ONLY IF this comma occurs OUTSIDE the zone, delimited by the # and @ boundaries. In the same time, it, also, deletes these temporary boundaries !

SEARCH : ,(?=.*#)|,(?!.*@)|(#|@)

REPLACE : (?1:_)

This time, after clicking on the Replace All button, we get the final text, below :

abcd_04__11111111_    22_ANYWORD ,ANYWORD,ANY_QZERTY_001__Last Field_
fghi_02__22222222_    22_ANY ,ANY, WORDS, ANY_AZERTY_999__Last Field_
klmn_09__33333333_    22_WORDS,ANY, WORDS,ANY_TEST_1_123__Last Field_

As expected, the comma has NOT been replaced, if located between column 26 and 45, included and the temporary boundaries # and @ has been deleted, too !

Notes :

The (?=.*#) is a positive look-ahead, which tests if exists, further on the current line, a # character
The (?!.*@) is a negative look-ahead, which tests if NOT exist, further on the current line, a @ character
Remark that the two boundaries, # and @, define, automatically, three zones :
- A starting zone, where the look-ahead (?=.*#) is TRUE and the look-ahead (?!.*@) is FALSE
- A middle zone, where the look-ahead (?=.*#) is FALSE and the look-ahead (?!.*@) is FALSE, too
- An ending zone, where the look-ahead (?=.*#) is FALSE and the look-ahead (?!.*@) is TRUE
So, the part ,(?=.*#) OR the part ,(?!.*@) look for a comma, ONLY IF the condition, of its following look-ahead, is TRUE
Finally, the part (#|@) also looks for, either, the # or @ character, which have to be deleted

The replacement regex is named a conditional replacement. Its general form is (?nXXXXX:YYYYY) and means :

If group n exists, then replaces the matched search by the XXXXX part
If group n does NOT exist, then replaces the matched search by the YYYYY part
The part XXXXX and YYYYY may contain any character, escape sequence, group, or, even, an other conditional replacement
The part XXXXX and/or YYYYY may, also, be EMPTY
In our case, the replacement regex (?1:|) means :
- If group 1 exists, which stands for a boundary match, # or @, deletes it as the part XXXXX does NOT exist
- If group 1 does NOT exist, then a comma, which satisfies the condition of one of the two look-aheads, has been matched and, therefore, must be replaced by the YYYYY part of the conditional replacement. That is to say by an underscore character, _

Best Regards

guy038

guy038

Hi all,

In my previous post, I gave a general method to replace a specific character by an other, everywhere in lines of a delimited text, except for a range, between column c1 and c2. I now give you an extension of that method to SEVERAL fixed zones to exclude !

I mean :

^---------- Zone 1 to exclude ------------ Zone 2 to exclude -------------------- Zone 3 to exclude ------------$

So, let’s suppose the original text, below :

abcd,04,,11111111,    22,ANYWORD ,ANYWORD,ANY,QZERTY,001,,5555,,AN,Y ANY,pqrst,00x,ANYWORD ,ANYWORD,9A9   ,Last Field,
fghi,02,,22222222,    22,ANY ,ANY, WORDS, ANY,AZERTY,999,,6666,,ANY AN,Y,uvwxy,01y,ANY ,ANY, WORDS,,7Z3   ,Last Field,
klmn,09,,33333333,    22,WORDS,ANY, WORDS,ANY,TEST-1,123,,7777,,ANY,,ANY,zabcd,02z,,ORDS,ANY, WORDS,3H5   ,Last Field,

I defined 3 zones to exclude, where the comma character will NOT be changed, while the S/R process :

The zone 1, which starts at column 26 and ends at column 52 => S1= 26 and E1 = 52
The zone 2, which starts at column 65 and ends at column 72 => S2= 65 and E2 = 72
The zone 3, which starts at column 84 and ends at column 99 => S3= 84 and E3 = 99

As previously explained, we, temporarily, add the # or @ boundaries, in order to delimit these 3 zones, with the general S/R, below :

^(.{S1-1})(.{E1-S1+1})(.{S2-E1-1})(.{E2-S2+1})(.{S3-E2-1))(.{E3-S3+1})..............(.{Sn-En-1-1})(.{En-Sn+1})

With the given values of S1 through E3 above, we get the following S/R :

SEARCH : ^(.{25})(.{27})(.{12})(.{8})(.{11})(.{16})

REPLACE : \1#\2@\3#\4@\5#\6@

which gives us the delimited text, below, with the boundaries :

abcd,04,,11111111,    22,#ANYWORD ,ANYWORD,ANY,QZERTY@,001,,5555,,#AN,Y ANY@,pqrst,00x,#ANYWORD ,ANYWORD@,9A9   ,Last Field,
fghi,02,,22222222,    22,#ANY ,ANY, WORDS, ANY,AZERTY@,999,,6666,,#ANY AN,Y@,uvwxy,01y,#ANY ,ANY, WORDS,@,7Z3   ,Last Field,
klmn,09,,33333333,    22,#WORDS,ANY, WORDS,ANY,TEST-1@,123,,7777,,#ANY,,ANY@,zabcd,02z,#,ORDS,ANY, WORDS@,3H5   ,Last Field,

Then, running the second regex S/R, below :

SEARCH : ,(?=[^@]*#)|,(?![^#]*@)|(#|@)

REPLACE : (?1:_)

we obtain the final text :

abcd_04__11111111_    22_ANYWORD ,ANYWORD,ANY,QZERTY_001__5555__AN,Y ANY_pqrst_00x_ANYWORD ,ANYWORD_9A9   _Last Field_
fghi_02__22222222_    22_ANY ,ANY, WORDS, ANY,AZERTY_999__6666__ANY AN,Y_uvwxy_01y_ANY ,ANY, WORDS,_7Z3   _Last Field_
klmn_09__33333333_    22_WORDS,ANY, WORDS,ANY,TEST-1_123__7777__ANY,,ANY_zabcd_02z_,ORDS,ANY, WORDS_3H5   _Last Field_

As expected, all the commas, located from column 26 till column 52, from column 65 till column 72 and from column 84 till column 99, have NOT been changed into an underscore character !

Notes :

In comparison to the previous regexes, only the look-aheads of the second S/R, are slightly different :
- The positive look-ahead (?=[^@]*#) verifies that, from the cursor location, a # character, can be found further, on the current line scanned, without any @ character, between the cursor location and the # location
- The negative look-ahead (?![^#]*@) verifies that, from the cursor location, a @ character, cannot be found further, on the current line scanned, without any # character, between the cursor location and the @ location

Cheers,

guy038