Wordcount splitting words on apostrophes



  • Hello!

    My Notepad++ has been giving me weird results for word count for a while now. I finally figured out today that the issue is apostrophes: the word counter seems to treat apostrophes as whitespace. I have a document containing only the word it’s, and the program says there are two words instead of one.

    Is there any setting I can change to fix this, or should I file an issue on GitHub? I’m not sure if it’s just doing this because I messed up a setting somewhere or what. I just updated to the latest version in case that was the problem, but I am still seeing this.

    Thanks!



  • @Stewart-Baker

    It does this because it doesn’t consider an apostrophe to be a “word character”. You can certainly open a GitHub issue if you’d like to.

    You might try this as an alternate method for obtaining word count:

    Pull up the Find window, select the Regular expression search mode, search for \S+, and press the Count button. Here’s a demo of it counting your it's as 1 word instead of 2:

    [Screenshot: Count result showing 1 match for it's]

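    BTW, \S+ may not be appropriate in all instances. What it means is: “Consider a match to be the longest run of characters you can find between traditional whitespace characters”. As a result, free-standing punctuation, such as a dash with a space on each side, is also counted as a word.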
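    To make the behaviour of \S+ concrete, here is a minimal sketch in Python. It uses Python's re module rather than the regex engine inside Notepad++, so treat it as an approximation; the function name count_non_whitespace_runs is made up for this illustration.

```python
import re

# \S+ matches a maximal run of non-whitespace characters.
NON_WHITESPACE_RUN = re.compile(r"\S+")


def count_non_whitespace_runs(text: str) -> int:
    """Return how many whitespace-delimited chunks the text contains."""
    return len(NON_WHITESPACE_RUN.findall(text))


if __name__ == "__main__":
    print(count_non_whitespace_runs("it's"))  # 1
    # The free-standing "--" is counted too, which may not be what you want.
    print(count_non_whitespace_runs("well-known (yes) -- really"))  # 4
```

    The second call shows the caveat: anything surrounded by whitespace counts, whether or not it looks like a word.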



  • Hello, @stewart-baker, @alan-kilborn and All,

    Here is an alternative to @alan-kilborn’s solution:

    Open the Find dialog ( Ctrl + F )

    • SEARCH [\w'’]+

    • Tick the Wrap around option

    • Untick all the other checkbox options

    • Select the Regular expression search mode

    • Click on the Count button or use the default Alt + T shortcut


    The regex [\w'’]+ makes the regex engine treat the two Unicode characters APOSTROPHE ' ( \x{0027} ) and RIGHT SINGLE QUOTATION MARK ’ ( \x{2019} ) as word characters, in addition to the usual \w characters!
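    As a rough illustration of the same character class outside Notepad++, here is a small Python sketch. It assumes Python's re module, whose \w is close to but not identical with the one in Notepad++, and the name count_words is invented for the example.

```python
import re

# Word characters plus APOSTROPHE (U+0027) and
# RIGHT SINGLE QUOTATION MARK (U+2019).
WORD_WITH_APOSTROPHES = re.compile(r"[\w'\u2019]+")


def count_words(text: str) -> int:
    """Count runs of word characters and apostrophes."""
    return len(WORD_WITH_APOSTROPHES.findall(text))


if __name__ == "__main__":
    print(count_words("it's"))       # 1
    print(count_words("it\u2019s"))  # 1
    # Free-standing punctuation such as "--" is not counted here.
    print(count_words("don't stop -- it\u2019s fine"))  # 4
```

    One side effect worth knowing: an apostrophe standing on its own also matches the class, so it would be counted as a word.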

    In addition, you may find it interesting to have a look at this other post of mine about the Summary feature, especially the first part:

    https://community.notepad-plus-plus.org/post/59069

    Best Regards,

    guy038



  • Just for reference, the “word count” function (found via View menu > Summary… and then looking at Words: in the output) in Notepad++ uses this regular expression to determine what is a word:

    [^\x20\t\\.,;:!?()+\r\n\-\*/=\]\[{}&~"'`|@$%<>\^]+

    Note that because this is a regex of the form [^...], the list of characters says what is NOT a word character rather than what IS a word character.

    So we basically have these characters which will terminate counting some bit of text as a “word”:

    \x20\t\\.,;:!?()+\r\n\-\*/=\]\[{}&~"'`|@$%<>\^

    I took some liberties with the \x20 and the \t by changing them from a literal space and tab character, so that they are more easily seen.

    Anyway, I see an apostrophe in there, so that’s what is causing it's to be counted as two words.

    But how does this handle non-ASCII characters?
    If we copy the “it’s” from the OP (the bolded “it’s”) and paste it into a Notepad++ tab and then run the “word count” function, we see that it shows Words: 1. Success? Yes, but really No. :-(
    This is the Unicode “special” apostrophe, RIGHT SINGLE QUOTATION MARK (U+2019), and it isn’t accounted for in Notepad++'s expression for what is not a word character, so the count of 1 is only an accident.
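    The difference between the two apostrophes can be reproduced with a short Python sketch. The pattern below is an assumption: it is the expression quoted above, written with the square brackets escaped and a plain ASCII apostrophe, and it runs under Python's re module rather than inside Notepad++. The name count_npp_style_words is hypothetical.

```python
import re

# Assumed form of the quoted expression: every character listed inside
# [^...] ends a word. Note that U+2019 is not in the list.
NPP_STYLE_WORD = re.compile(
    r"""[^ \t\\.,;:!?()+\r\n\-\*/=\]\[{}&~"'`|@$%<>\^]+"""
)


def count_npp_style_words(text: str) -> int:
    """Count runs of characters that are not in the terminator list."""
    return len(NPP_STYLE_WORD.findall(text))


if __name__ == "__main__":
    print(count_npp_style_words("it's"))       # 2: ASCII apostrophe splits
    print(count_npp_style_words("it\u2019s"))  # 1: U+2019 is not a terminator
```

    Under these assumptions the ASCII apostrophe splits the word while the curly one does not, which matches the counts described in this post.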

    Perhaps the conclusion to be drawn is that Notepad++ is not a great counter of words. :-)



  • Thanks for all the comments!

    “Perhaps the conclusion to be drawn is that Notepad++ is not a great counter of words. :-)”

    This does, alas, seem to be the conclusion… :) I will live!

    (ETA: I didn’t realize it had turned my non-smart apostrophes into the special ones in the forum post. In the text files, I just use standard apostrophes. Oh well!)

