Finding multiple words inside extra quotes

TexasKCFan

Here is the results I get with the solution you gave me. The correct result along with incorrect results from the end of one line to the beginning of the next. I’m trying to figure out what to add so it won’t pick up the false results that include CRLF.

Terry R

@TexasKCFan said in Finding multiple words inside extra quotes:

Here is the results I get with the solution you gave me.

Sorry it was not a solution. I was just pointing out that your requirement to exclude certain characters had been coded incorrectly, and how it might be coded. I didn’t do any tests.

I haven’t been following this thread so don’t feel that I could provide one, others who have already presented some ideas might be the better ones to extend those solutions for your latest request.

Terry

TexasKCFan

My apologies Terry I thought you were Peter. Thank you for the input!

TexasKCFan

(,"[^\r\n][^"]*)("[\w\s]*)(")([^"]*",)

This is the solution that worked. I had to include the \r\n earlier in the line of code so it ignored the CRLF.
This is now the desired result.

Thank you @PeterJones and @Terry-R for your help!

guy038

Hello @texaskcfan, @Peterjones, @terry-r and All,

@texaskcfan, if I assume that :

Your list contains one record per line
The field separator is the comma, located right after the previous field and right before the next field
Any field is, either :
- Any text, between two double quotes, beginning a line, possibly preceded with blanks characters and followed with a comma, as "xxxxxx",
- Any text, between two double quotes, itself surrounded by two commas, as ,"xxxxxx",
- Any text, between two double quotes, preceded by a comma and ending a line or the very end of current file, as ,"xxxxx"CRLF

The following regex, containing a recursive pattern, will find all the zones between two commas, which contains an even number of " ( double-quote characters ) :

(?x) (?:^\h*|,) (?: ( " (?: [^"\r\n,]++ | (?1) )* " ) )+? (?=,|\R|\z)

For instance, given this sample, that you’ll copy in a new tab :

"0","0","","","","","","","","","","","","",""

"Field1","Field2","Field3","Field4","Field5","Field6","Field7","Field8","123 Main "St AptD"","","Dallas","TX","12345","","","","","","","","","","","","","","","","","","","","","","",""

"Field1","Field2","Field3","Field4","Field5","Field6","Field7","Field8","This fie"ld is NOT correct","","Dallas","TX","12345","","","","","","","","","","","","","","","","","","","","","","",""


," abcde","","ijk   "123"," 987 "This is"a small""pie"""ce of"text for"  tests" !!","","abc"de"fgh"ij","12345"

" abcde","","ijk   "123"," 987 "This is"a small""pie"""ce of"text for"  tests" !!","","abc"de"fgh"ij","12345"


"","","",","","","",""

","","","","",""

"   """    """   """    "
"   " "    """   """    "

1234567890","","","","",""

      " abcde","","ijk   "123"," 987 "This is"a small""pie"""ce of"text for"  tests" !!","","abc"de"fgh"ij","12345"

Open the Mark dialog ( Ctrl + M )
Paste the above regex in the Find what: zone
Preferably, tick the Purge for each search option
Possibly, tick the Wrap around option
Select the Regular expression search mode
Click on the Mark All button

=> All the fields ,........, containing an even number of " ( thus any correct field ), are marked in red style. This means that any zone, still unmarked, contains an odd number of double quotes, probably indicating one " character, too many or too few ;-))

Note that, for a correct scanning process of the text, the regex marks from the comma, before each zone "....." till the ", right before the next comma !

You may, as well, use the Find dialog to visualize each correct field, one a a time !

Here is a snapshot of the sample, after the mark operation :

As you can see, all the fields non marked are incorrect in some ways and need examination.

Note, also, that it’s impossible build up a regex in order to get the opposite logic, i.e. which would match the general case of any zone, between two commas, containing an odd number of double quotes inside !

Best Regards

guy038

guy038

Hi, @texaskcfan, @Peterjones, @terry-r and All,

@guy038 said :

Note, also, that it’s impossible build up a regex in order to get the opposite logic, i.e. which would match the general case of any zone, between two commas, containing an odd number of double quotes inside !

Well… I was wrong :-( As usual , just when I woke up, I understood the way to do ! We simply must add the two control verbs sequence (*SKIP)(*F) at the end of the previous regex and add an alternative which selects all standard chars sequence till the nearest comma , character ;-)), giving the final regex :

MARK (?x) (?:^\h*|,) (?: ( " (?: [^"\r\n,]++ | (?1) )* " ) )+? (?=,|\R|\z) (*SKIP) (*F) | (?-s) .+? (?=,|\R|\z)

With the same sample as in my previous post, we get, this time :

Now, it’s even more easy to focus our attention on the problem areas !

BR

guy038

guy038

This post is deleted!

guy038

@texaskcfan, @peterjones, @terry-r and All,

I improved my regex in order to satisfy all the possible cases and get a coherent behavior :

MARK (?x) (?:^\h*|,) (?: ( " (?: [^"\r\n,]++ | (?1) )* " ) )+? (?=,|\R|\z) (*SKIP) (*F) | (?:^\h*|,) (?: [^"\r\n]+ (?=,|\R|\z) (*SKIP) (*F) | [^,\r\n]+? (?=,|\R|\z) )

So, to summarize, this regex will mark, from beginning of line, any possible comma and its next zone, till the nearest comma, excluded, or till the end of line / file, ONLY IF this zone contains an odd number of double quotes "

Just use this sample, below, to test the regex against !

# With 0, 1, 2, 3, 4, 5 or 6 DOUBLE quote(s) and WITHOUT any char between DOUBLE-quotes
#  => Matches when 1, 3 or 5 DOUBLE quote(s) ( INCORRECT zones ) :

,,
,",
,"",
,""",
,"""",
,""""",
,"""""",

# With 0, 1, 2, 3, 4, 5 or 6 DOUBLE quote(s) and with some WORD chars between DOUBLE-quotes
#  => Matches when 1, 3 or 5 DOUBLE quote(s) ( INCORRECT zones ) :
 
,,
,",
,"123456",
,"123"456",
,"12"34"56",
,"12"34"56"78",
,"12"34"56"78"90",

,ABCXYZ,
,"ABCXYZ,
,"ABCXYZ",
,"ABC"XYZ",
,"ABC""XYZ",
,"ABC"""XYZ",
,"ABC""""XYZ",

,ABCXYZ,
,ABCXYZ",
,"ABCXYZ",
,"ABC"XYZ",
,""ABCXYZ"",
,""ABC"XYZ"",
,"""ABCXYZ""",

# With an EVEN number of DOUBLE quotes ( 4 ) for the STARTING, MIDDLE and ENDING field of a record
#  => CORRECT zones, NOT matched :

"12"345678"90",
,"123"4567"890",
,"1234"56"7890"

# With an ODD number of DOUBLE quotes ( 3 ) for the STARTING, MIDDLE and ENDING field of a record
#  => INCORRECT zones, matched :

"12345"67890",
,"12345678"90",
,"12"34567890"

# WITHOUT any DOUBLE-quote => All are CORRECT zones, so NOT matched ( indeed ZERO is an EVEN number ! )

,
,,
ABCXYZ
ABCXYZ,
,ABCXYZ,
,ABCXYZ

And we get the picture :

Best Regards,

guy038

TexasKCFan

@guy038 I can’t express how much I appreciate this additional detail! One search string to identify all inconsistencies would be a HUGE time saver. Since many of the files I work with have a couple of hundred thousand lines, my process is to identify the lines with issues, copy them and send them back to the client so they can correct them.

One note on the structure of my data…a successful line is 38 fields in this structure:

"Field 1", "Field 2","Field 3"CRLF

Data wrapped in a set of double quotes is acceptable regardless of punctuation (a lot of foreign addresses have multiple periods and commas). The issues are caused by extra single or double quotes without the properly associated commas, incorrectly causing the fields to be misaligned. Either one word, or a group of words contains extra single or double quotes without the correctly placed comma creates extra fields. Single quotes can cause an also because the system’s looking for the non-existent matching quote.

"Field "1"",
"Fiel'd 1",
"Field 1 "but with additional spaces and characters" and extra set of quotes",

When I run (?x) (?:^\h*|,) (?: ( " (?: [^"\r\n,]++ | (?1) )* " ) )+? (?=,|\R|\z) (*SKIP) (*F) | (?-s) .+? (?=,|\R|\z) I see that I’m getting non-error results like:

,"123 Main St, Apt Q"

While this example is not really correct (Apt Q should be in the next field), it doesn’t cause an issue per se because the whole thing is correctly enclosed in one set of double quotes and there are no extra quotes.

I think the expected error output would be something like this:

"123 Main "St", Apt Q"
"123 "Main St, Apt Q""
"123 Main S't, Apt Q"

If every line starts with an opening quote, ends with a closing quote, followed by CRLF and all the quotes in between have matching quote with a comma delimiter then I have the correct number of fields and my only concern is when they put the data in the wrong fields which is beyond the scope of this post. :)

Is it possible to make the start and end of the search string double quotes instead of commas so we can identify extra quotes (single or double) that don’t belong?

I’m brand new to the regex world and I truly appreciate the assistance!

guy038

Hi, @texaskcfan, @peterjones, @terry-r and All,

Well, of course, my previous search is rather academic and not adapted to your own data. It just showed a way to search for an odd number of double quotes in fields defined as below :

A field begins after a double quote at beginning of line or after the string ","
This same field ends before a next string "," or before an end of line or the very end of current file

And my goal was to demonstrate that this kind of search can be achieved, only with recursive regexes. Indeed, if you lack this kind of regexes, you need to find out one regex which will match zones with 1 double quote ONLY, an other regex, which will match zones with 3 double quotes ONLY, and so on…

But, seemingly, it’s not exactly why you’re posting here !. However, I’m rather confused in some ways, by your last post :

When you say :

I think the expected error output would be something like this:

“123 Main “St”, Apt Q”
“123 “Main St, Apt Q””
“123 Main S’t, Apt Q”

Do you mean that, given a single record of 8 fields :

CRLFField_1","Field_2","Field_3 with ERROR 1","Field_4","Field_5","Field_6 with ERROR 2","Field_7","Field_8"CRLF

You expect this output :

.... PREVIOUS record ....
"Field_1","Field_2"CRLF
"Field_3 with ERROR 1"CRLF
"Field_4","Field_5"CRLF
"Field_6 with ERROR 2"CRLF
"Field_7","Field_8"CRLF
.... NEXT record ....

In other words do you plan to isolate erroneous fields ?

Note that, at this time of the discussion, I don’t care about the errors themselves ! We’ll track them, later, by appropriate regexes !

Secondly, does your sample text :

"123 Main "St", Apt Q"
"123 "Main St, Apt Q""
"123 Main S't, Apt Q"

Represent 3 fields, one per line, delimited with the outer " characters ?

Represent several fields per line. For instance, "123 "Main St, Apt Q"" would have two fields 123 "Main St and Apt Q", separated with the comma ?

Thirdly, if we consider any field, with delimiter = beginning or end of line or the string ",", can you show us the different types of strings, containing single and/or double quotes and/or commas, in between, which may occur and which are the ones that you would like to detect as errors ?
More generally, which further treatment do you expect, when meeting these errors ?

@texaskcfan, it’s important to note that regexes are extremely sensitive to real data used and, although I give some generic regexes for general purposes, in this forum, a regular expression is generally adapted to a specific user and for specific data ;-))

I probably miss a lot of things about your data and your work-flow :-( So, just enlighten me !

See you later,

Best regards,

guy038

P.S. : The best would be that you provides a part of your file, containing a fairly range of records, which identifies the zones to be detected as erroneous !

Finally, if your data is rather confidential, just send me an e-mail at :

TexasKCFan

@guy038 My apologies for the confusion. Thank you for the general knowledge for the community!

Given a single record of 8 fields, I would expect to see it exactly as you posted but minus the CRLF at the beginning and with an opening quote.

"Field_1","Field_2","Field_3 with ERROR 1","Field_4","Field_5","Field_6 with ERROR 2","Field_7","Field_8"CRLF

Yes, I plan to isolate erroneous fields. What I meant to say is these would be the three different examples of errors I would be able to identify in my data. My sample text represented three versions of errors I would find in a single field. A full line would be 38 fields in the format just included above.

“123 Main “St”, Apt Q” - Example of a single word incorrectly encased in double quotes `"St"`
“123 “Main St, Apt Q”” - Example of a phrase incorrectly encased in double quotes `"Main St, Apt Q"`
“123 Main S’t, Apt Q” - Example of an incorrect single quote `S't`

Yes, my goal is to identify the errors so I can copy the rows and bring them to the client’s attention so the can fix them. I am not correcting the errors, simply identifying them during my pre-processing QA. If I process the files with these errors, the records will not load so I’m attempting to provide additional value to the client. :-)

I will be able to use the solutions above to identify the issues. I will put together a range of records for you to review and post later.

Thank you so much for your help! :-))

prahladmifour

Hello,@TexasKCFan
Please try this code, To Finding multiple words inside extra quotes
Regex:

/^[^"]*("[^"]*"[^"]*)*"[^"]*search[^"]*"/i

I hope this code will be useful to you.
Thank you.

guy038

Hello, @texaskcfan, @peterjones, @terry-r and All,

While searching a solution, I realized that the comma delimiter can also be found inside fields themselves ! This is really problematic :(

So, in order to get a neat field delimiter, we should, temporarily, change the , delimeter by, let’s say, the ¤ character if this char is not used in your file ! From now on, I suppose that it’s the case :

Thus, we first change the field delimiter of your file with the simple normal S/R :

SEARCH ","

REPLACE "¤"

Note that you may replace the ¤ char with any char not used yet !

And, at the very end of the process, you’ll just perform this normal S/R to get the right syntax, again :

SEARCH ¤

REPLACE ,

Now, given your 3 error cases that you provided, here is my first try towards a complete solution ! I’m using the free-spacing mode (?x) for a better readability !

MARK (?x) (?:^"|"¤") \K ([^"'¤\r\n]*?) ("|') (?1) \2? (?1) (?="¤"|"\R|"\z) without the outer " chars marked

OR

MARK (?x) (?:^|"¤) \K " ([^"'¤\r\n]*?) ("|') (?1) \2? (?1) " (?=¤"|\R|\z) with the outer " chars marked

If we try these two regexes, against this sample text, below :

"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"123 Main "St", Apt Q"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"123 "Main St, Apt Q""¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"123 Main S't, Apt Q"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"

"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"123 Main "St, Apt Q"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"123 Main St, Apt Q""¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"123 Main S't, A'pt Q"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"

"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"
"123 "Main St, Apt Q""¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"123 Main "St", Apt Q"¤"Field_7"¤"Field_8"¤"Field_9"¤"123 Main S't, Apt Q"
"Field_1"¤"Field_2"¤"Field_3"¤"Field_4"¤"Field_5"¤"Field_6"¤"Field_7"¤"Field_8"¤"Field_9"¤"Field_10"

we get :

In that sample, I simply consider lines of 10 fields
In lines 4, 7 and 10, you’ll recognize your 3 error cases
In lines 14, 17 and 20, I deleted one double quote and added one single quote, which are also correctly detected as a possible error by the regex. But, If you do not need the detection of these extra cases, I won’t have any problem to modify the regex
In line 24, I added a record with the 3 error cases, together
From lines 29 to 50, results are identical to those above, but the outer double-quotes, containing fields, are also marked !

In this text, it’s very easy to move from one error’s zone to an other one :

Use the F2 and the Shift + F2 shortcuts to move from one erroneous record to the next or previous one
Use the Ctrl + 0 and Ctrl + Shift + 0 shortcuts, on the main keyboard, to move from one erroneous field to the next or previous one

You’ll remark, that I haven’t spoken yet, about a method to isolate all the erroneous fields, one per line. Indeed, @texaskcfan, I would like, first, to get your feeling about keeping records as they are ( excepted for the temporary field delimiter change ! ) and simply navigating between all error’s zones, with the shortcuts mentioned above !

Best Regards,

guy038

TexasKCFan

@guy038 This solution is perfect and returns all the errors I am looking for. Thank you!