How to change all invalid characters to spaces

skevich21

A client provides me with some data that I need to run on a database. Unfortunately, the data has invalid characters in it.
I need to change all invalid characters to spaces.
In the below sample text, I selected Encoding >>> Convert to ANSI
Then I chose Search >>> Replace
For “Find What” I pasted the invalid character: Â.
For “Replace With,” I typed a space.
Instead of being replaced with spaces, the characters were replaced with this: (it’s a light-orange dot, not visible here).

Here is some sample text before the Replace:
ALTER TABLEÂ AR.AR_CATEGORY_BUDGET_COPYÂ ADD COLUMNÂ line_numberÂ Â INTEGER defaultÂ 0;

Here is the text after the Replace. It has light-orange dots now, where the previous invalid character was. The orange dots are not visible here:
ALTER TABLE AR.AR_CATEGORY_BUDGET_COPY ADD COLUMN line_number INTEGER default 0;

Please, how can I convert the invalid characters to spaces?
Thank you!

Scott Sumner

@skevich21 said:

it’s a light-orange dot

This sounds to me like you have the visible whitespace feature turned on. If that’s the case, then what you are seeing truly are just plain spaces. Test by hitting the spacebar a few times in your text and see if it generates more of the orange dots, or something different…

PeterJones

To further prove Scott’s theory,

in a Search box (without replace), search for the space character (by typing the space in the Find What); it should highlight each of the orange dots in turn with each Find Next. If it does, I would be quite confident those are really spaces.
View > Show Symbol – you will likely see that Show Whitespace and TAB or Show All Characters is selected (if the latter, you’ll also see CR and LF in black boxes, which I’m sure you would have mentioned if you were seeing them). If one is selected, choose it again to de-activate it. I am betting that the orange dots will go away, and the spaces will look like spaces to you.

What surprises me, however, is that you didn’t mention seeing the orange dots in the other spaces that were in the original text. The question in my mind is then whether you did not notice them, or whether the original text contain some other character rather than a space, which I think is even more problematic.

I noticed you ran “convert to ANSI”… do you happen to know what encoding the data came to you in? It looks, given the placement in the SQL, that those would normally be newlines, and I’ve never seen an encoding where a newline mapped to Â when read as ANSI. Hmm, I wonder if that’s the code point that the specific editor it came from used to represent a “soft return” – a newline not meant to indicate a paragraph break.

Actually, it’s probably Unicode, in UTF-8 encoding, where the code points U+0080 to U+07FF are indicated by the two-byte binary sequence 110xxxxx 10xxxxxx: if the first 5 xes were 00010, then the first octet would be 0xC2, which would be interpreted (in ANSI encoding) as Â … but in such a circumstance, I would expect another strange character to follow, and you implied only one strange character. given that first octet, the it would be in the range U+0080 to U+00BF. Yes, this list shows that U+0085 is the “Next line (NEL)” character, which would make sense. But then I would have expected you to see “ÂENQ”, where the ENQ was in a black box.

If you’ve got the original, unedited file, you might want to try Settings > Preferences > Encoding ☑UTF-8 + ☑Apply to opened ANSI files and then re-open the original, and see what happens.

skevich21

@Scott-Sumner Yes, you are right, when I hit the space bar, more orange dots appear. And when I do a search on a space, the orange dots are “found.”

skevich21

@PeterJones Yes, you are right, when I search for a space, the orange dots are “found.”
I do have Show All Characters selected, and do see the carriage return/line feeds. Yes, you are right, when I turned off Show All Characters, the orange dots went away, and the spaces look like spaces again.
Your observance that the original text did not show orange dots where there are spaces is spot on. Even if I search for spaces, it does find the areas where there are not orange dots as spaces.
BUT, there are areas that look like spaces (but no orange dots) that the Find facility skips over, telling me those are the invalid characters. Those areas are also what appear to be complained about when I attempt to run this invalid code from my client.

I do not know what encoding the data came in from the client.

I do not have Settings > Preferences > Encoding, but I do have Encoding > Convert to UTF-8, so I highlighted all the data and tried that. I saved it, re-opened it, and it still has invalid characters in it.

I welcome any suggestions for how to change all invalid characters to spaces.

The parts that the Find facility skips (when searching for a space) are after the following words in the below: TABLE, COPY, number, default

ALTER TABLE AR.AR_CATEGORY_BUDGET_COPY ADD COLUMN line_number INTEGER default 0;

Thank you!!!

skevich21

HOW TO FIX:
The key is to switch it to HTML.

Paste the text into Notepad++
Set Language to HTML (Language > H > HTML)
Search for spaces. For the first “space” that it skips over, an invalid character, select and copy it.
Search and replace: For the Search for: Paste in the invalid character that it skipped. For the Replace: Type in a real space.

Scott Sumner

@skevich21

You are probably encountering a special “space” called a “no-break space”. See here for more info.

You likely don’t need the “html” step you mentioned.

skevich21

@Scott-Sumner That link you sent is spot-on.
\xa0 is what I could have put in the “search for”, and then replace it with a “real” space, if I do not want to convert to HTML.

Thanks!

guy038

Hello, @skevich21 and All,

I’m pretty sure that the Scott suggestion is the right one !

Now, when I’m confronted with a weird or an unknown ( to me ! ) character, I generally use this tiny on-line tool, below :

http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

By default, this tool supposes that you would type in the Unicode code-point of a single character. So :

Select, instead, the radio-button Character
Then, copy/paste an UNIQUE unknown character, in the Type in a representation of a Unicode character: field
And, simply, click on the Go button

=>

You should get the glyph and/or the name, if not printable, of this character, in the first two lines of the list ( Character and Character name )
The Unicode code-point of that character is displayed in the Hex code-point row
Note also that the hexadecimal sequence of bytes, used to represent this character in an UTF-8 encoded file, is shown in the Hex UTF-8 bytes row

Of course, you may enter any kind of data, corresponding with your radio-button choice !

If you want to search for that specific character, from within Notepad++ :

If the Normal search mode is selected, copy/paste that character in the Find what: zone ( Any Unicode char is allowed, even those outside the BMP ( Basic Multilingual plane )
If the Regular expression search mode is active, enter the regex \x{code-point}, where code-point stands for four hexadecimal digits, between \x{0000} and \x{FFFD}. Note that code-points, beyond FFFF, are NOT allowed !

BTW, @scott-sumner and @terry-r, I verified that, unfortunately, with the present version of our regex engine, the meta-char dot . does not match any character, with Unicode code-point over \x{ffff} :-(( Again an issue !.. … Though, I found out a work-around to localize these chars, outside the BMP, with the regex (?![\x{0000}-\x{ffff}])., which wrongly matches an empty string but still allows you to know where to find them :-))

But, as Scott often says, you’re just [A-Za-z] persons, who, probably, don’t have to care about the 137,374 characters of the latest v11.0.0 version of Unicode, released on June 2018, don’t you ? Refer to the link, below for further information :

http://www.unicode.org/versions/Unicode11.0.0/

Secondly, @skevich21, knowing the Unicode code-point of this character, I generally use the web link, below, to find out which Unicode block this character belongs to ( Just type in its code-point in the Find char by hex code field and click on the Go button )

http://www.unicode.org/charts/

Best Regards,

guy038

P.S. :

For instance, if you type in the code-point 12345, you’ll get the CUNEIFORM SIGN URU TIMES KI character, belonging to the Cuneiform block ( 12000-123FF ) and some info about it

You’ll even see the surrogate pairs ( two 16-bits words ), used in an UTF-16 encoded file, to represent any character, with code-point over \x{FFFF}, outside the BMP ( Basic Multilingual Plane )

And, if you go to the http://www.unicode.org/charts/ , you’ll see the exact representation of this cuneiform character !

PeterJones

As a single website alternative for the character->code-point and code-point->block translations:

paste the character at http://www.fileformat.info/info/unicode/char/search.htm
click on the name
the character page will tell you about it, including the name of the block
- click on that block name for more about that block
- click on the “fonts that support …” link to find a reasonable list of fonts that have a glyph for that character

guy038

Hello, @skevich21, @peterjones, @scott-sumner, @terry-r and All,

Peter, before posting I had tested the Fileformat.Info site, which does provide a great amount of information ;-))

However, I realized that it you try to paste a character, which has an Unicode code-point over \x{FFFF}, it does not find any character :-(( So, I preferred not to mention it, in my last post !

For instance, using the character, given as an example, at the end, your can type in the value 12345, at http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi? and click on the Go button

=> The white square 𒍅 should be displayed, in the Character line

Now, simply, select and copy this 𒍅 character
And paste it at http://www.fileformat.info/info/unicode/char/search.htm

After clicking on the Search button, you’ll notice that it answers : Search 𒍅 => O hits

Now move back to the http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi? site
- Choose the first radio button Character
- Enter any single character, for instance, the letter A and valid
- Finally, delete the A char and paste the character 𒍅, instead

=> After validation, it does find the expected character :-))

Of course, I’m pretty sure that there must be other websites, which properly manage these characters, either !

Cheers,

guy038