Convert quote marks straight to curly on html pages

Dazel andpointcom

Hello fellow Notepad++ Users,

I’m using Notepad++ as an HTML editor, and I’ve been very lax about my typography!
Do you know of an easy way to convert all straight quotation marks ("this") to curly ones (“this”), but only in plain-text sections, of course (excluding HTML tags)?
I have a thousand .HTML files to edit.

I see Notepad++ is able to distinguish plain text from code, because it colors it differently. If there was a way to search/replace only “non-colored” text, I could manage. There might be a more specific tool for that purpose, though.

Thank you.

PeterJones

@Dazel-andpointcom said in Convert quote marks straight to curly on html pages:

Do you know of an easy way to convert all straight quotation marks ("this") to curly ones (“this”), but only in plain-text sections, of course (excluding HTML tags)?

“Easy” is in the eye of the beholder.

I have a thousand .HTML files to edit.

You really should be using a content management system, rather than using a text editor to edit all the pages one-at-a-time or even in bulk actions like Notepad++'s Find in Files feature.

I see Notepad++ is able to distinguish plain text from code, because it colors it differently. If there was a way to search/replace only “non-colored” text, I could manage. There might be a more specific tool for that purpose, though.

The syntax highlighting feature is not coupled with the search-and-replace feature, so it’s not possible to say “replace X with Y, but only in the text that the lexer highlights as Language: HTML > Style: DEFAULT” (that is, the “plain text” in your HTML, which is rendered with whatever color/style you have set in Settings > Style Configurator). To make the search lexer-aware, the “Lexilla” library that Notepad++ uses for syntax highlighting would have to be written differently, and Notepad++ would have to hook up its search engine library with the Lexilla library in a complicated way; I doubt that the developers of either library would add the features needed to tie the two together.

However, regular expressions (aka “regex”, a fancy mode of search that can be enabled in Notepad++) can do quite a lot of stuff, including “search for X and replace with Y, but only when between C and D”. And in fact, our resident regex guru @guy038 has written up a generic regular expression formula that replaces only in a specific zone or a variant of that formula that allows for nested zones like HTML/XML allow. In those formulas, FR (“find regex”) corresponds to the text that I called X. RR (“replacement regex”) corresponds to Y, BSR (“Begin Search-Region Regex”) corresponds to C, and ESR (“End Search-Region Regex”) corresponds to D.

So if you knew that your straight quotes that you wanted to convert to curly quotes were always inside paragraph tags, I would use the variant customized for HTML tags, and use p wherever the placeholder TAG showed up in that customized formula.

However, if your quotes are inside a variety of tags (like P, SPAN, DIV, H2, …), then it might be easier to use the non-customized version, and use BSR as > and ESR as </ (because plain text will be between the > of the opening tag and the </ of the corresponding closing tag).
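For anyone who wants to see the zone idea outside Notepad++, here is a minimal Python sketch. It is not @guy038's formula, only an illustration of the same restriction: a replacement is applied only to text that sits between a > and the next <. The function name replace_in_text_zones is made up for this example, and the sketch assumes no attribute value contains a literal > character.

```python
import re

def replace_in_text_zones(html, find, repl):
    """Apply find -> repl only to text that sits between a '>' and the next '<'."""
    def fix_zone(match):
        # group(1) is the '>' that opens the zone, group(2) is the zone itself;
        # the '<' that ends the zone is only looked at, not consumed.
        return match.group(1) + re.sub(find, repl, match.group(2))
    return re.sub(r'(>)([^<>]*)(?=<)', fix_zone, html)

sample = '<p class="intro">He said "hi"</p>'
print(replace_in_text_zones(sample, '"', '*'))
```

The quotes around the attribute value are left alone because they sit inside the tag, not between a > and a <.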

A note on your find: since your replacements “ and ” are two separate characters, but the ASCII " you are searching for is just a single character, your FR will have to be smarter than just " . I might suggest a two-step process: first, use FR = "(?=\w) (which says “find an ASCII " character followed by a word character (a letter, a digit, or _)”) and RR = “ (because a quote mark immediately followed by an alphanumeric is likely the start of your quote); second, use FR = (?<=\S)" (which looks for a " preceded by anything that’s not a space or newline, so a letter, number, or punctuation mark right before the " indicates end-of-quote) and RR = ” to end quotes.
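The two-step find/replace can be tried on a plain string before touching any files. The sketch below uses Python's re module rather than Notepad++, with the same two expressions; the function name curl_double_quotes is just a label for this example, and it assumes the input is plain text (on raw HTML it would also hit attribute quotes).

```python
import re

def curl_double_quotes(text):
    # Step 1: a straight quote directly before a word character opens a quotation.
    text = re.sub(r'"(?=\w)', '\u201c', text)
    # Step 2: any remaining straight quote directly after a non-space character closes one.
    text = re.sub(r'(?<=\S)"', '\u201d', text)
    return text

print(curl_double_quotes('She said "hello there," then left.'))
```

The order matters: step 2 only sees the quotes that step 1 did not already convert.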

The regular expression mode is available in the Find in Files interface, so you could use it on your thousands of HTML files. However, I suggest running it as a normal replacement in a single file, first. And make sure you have a backup of your thousands of files before trying anything recommended by anyone on the internet (you, not we, are responsible for not losing your own data).

Coises

Just a couple points to add to Peter Jones’ reply:

Using > and </ as delimiters would include the content of <SCRIPT> and <STYLE> tags… so if you have those tags, they will need to be excluded. No doubt there’s a way to add that qualification to the regex, but I’ll leave those details to the masters.
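One way to picture that exclusion, outside Notepad++: split the page into protected pieces (whole SCRIPT and STYLE elements, then any single tag) and plain text, and only touch the plain text. This is a rough Python sketch, not a Notepad++ regex; the names PROTECTED and transform_text_only are invented for the example, and str.upper merely stands in for whatever quote conversion is wanted.

```python
import re

# Pieces to leave untouched: whole <script>/<style> elements first, then any single tag.
PROTECTED = re.compile(
    r'(<script\b.*?</script\s*>|<style\b.*?</style\s*>|<[^>]*>)',
    re.IGNORECASE | re.DOTALL,
)

def transform_text_only(html, transform):
    parts = PROTECTED.split(html)
    # With one capturing group, re.split puts plain text at even indexes
    # and the protected pieces at odd indexes.
    for i in range(0, len(parts), 2):
        parts[i] = transform(parts[i])
    return ''.join(parts)

page = '<p>"Hi"</p><script>var s = "x";</script>'
print(transform_text_only(page, str.upper))
```

Only the text inside the paragraph is changed; the script content passes through as-is.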

If you want to do curly quotes, you probably want curly apostrophes and single quotes (as used inside double quotes), too; the aesthetics of having one and not the other would surely be more jarring than having them all straight. Telling left single quotes apart from right single quotes and apostrophes is, as I recall, even trickier than sorting out double quotes.

I would be very surprised if there is a 100% accurate way of doing this algorithmically. If you care enough to care about curly quotes as opposed to straight quotes, you’re going to need to proofread every changed page. Consider that before you decide whether it’s worth doing.

Mark Olson

While I think that there are some simple cases where HTML can be altered effectively with regex, I’ll just copypasta this without further comment.

[Image: dont parse html with regex.PNG]

Dazel andpointcom

Haha! Thank you for the detailed answers. I was hoping for a simpler solution than regex, as I understand very little about it. However, it went surprisingly well…

It’s a very basic solution, not 100% accurate, but close enough. I edited by hand the dozen or so quotes that were left next to an HTML tag (<a>, <li>, <i>).
I’ll leave it there in case somebody else has the same need in the future:

Step 0: Make backup!

Step 1:
\s" replaced by \s“ (catches all opening quotes after a space, so no HTML “code” should be hit, and replaces them)

Step 2:
“([^"]*)" replaced by “\1” (replaces closing quotes)

I use the French version:

Step 1:
\s" replaced by \s«

Step 2:
«([^"]*)" replaced by «\1 »
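For anyone who would rather test the two steps on a sample string first, here is a rough Python equivalent. It is a sketch, not what Notepad++ runs: the function name curl_quotes_two_pass and its parameters are invented here, the whitespace in step 1 is captured and put back explicitly, and step 2 assumes the remaining quote to match is a straight one.

```python
import re

def curl_quotes_two_pass(text, opening='\u201c', closing='\u201d'):
    # Step 1: a straight quote right after whitespace becomes an opening quote.
    # The whitespace is captured and re-emitted so it is not lost.
    text = re.sub(r'(\s)"', lambda m: m.group(1) + opening, text)
    # Step 2: after each opening quote, the next straight quote becomes the closing one.
    pattern = re.escape(opening) + r'([^"]*)"'
    return re.sub(pattern, lambda m: opening + m.group(1) + closing, text)

sample = '<p class="x">He said "yes" twice.</p>'
print(curl_quotes_two_pass(sample))
# French variant, with the space before the closing guillemet as in the steps above
print(curl_quotes_two_pass(sample, '\u00ab', ' \u00bb'))
```

The attribute quotes in class="x" survive because the first one follows = rather than whitespace, which is the same reason the approach mostly avoids HTML code.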

And for the apostrophes:
([a-zA-Z]|[à-ü]|[À-Ü])'([a-zA-Z]|[à-ü]|[À-Ü]) replaced by \1’\2 (every apostrophe between two letters, to avoid hitting things in javascript, accented letters included for French)
Then by hand, I replaced the few that were left (no regex):
'<i> --> ’<i>
'<a --> ’<a
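The apostrophe rule can be checked the same way in Python. This sketch is a slight variant of the expression above: it uses lookarounds instead of the two capture groups, so the neighbouring letters are not consumed, and the names LETTER, APOSTROPHE and curl_apostrophes are made up for the example.

```python
import re

# Same letter ranges as above: ASCII letters plus the accented ranges used for French.
LETTER = r'[a-zA-Zà-üÀ-Ü]'
# A straight apostrophe with a letter on both sides.
APOSTROPHE = re.compile(rf"(?<={LETTER})'(?={LETTER})")

def curl_apostrophes(text):
    return APOSTROPHE.sub('\u2019', text)

print(curl_apostrophes("l'été, aujourd'hui; var s = 'x';"))
```

The quotes around 'x' stay straight because neither of them has a letter on both sides, which is what keeps most JavaScript strings safe.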

@ Mark Olson
No idea what a parser is, nor XML… Anyway my typography is saved now.