I want to sort text phrases by their number of words



  • I would like the program to select or identify sentences automatically (from one period to the next, or from a period to a question mark or exclamation mark) and to order the sentences by the number of words each one has, so that the sentence with the fewest words is at one end and the one with the most words is at the other.

    For example:

    Normal text:
    The day was very difficult today. I hope the next few days get better. Do you hope the same? I hope you hope the same as me!

    Sorted text:

    1. Do you hope the same?
    2. The day was very difficult today.
    3. I hope the next few days get better.
    4. I hope you hope the same as me!

    It could also be:
    Do you hope the same? The day was very difficult today. I hope the next few days get better. I hope you hope the same as me!

    I don’t know how to do this. Can someone help me?

    Thanks in advance



  • You cannot do it natively in Notepad++, using only built-in tools, in the truly generic situation you described. I doubt this sequence of operations is general-purpose enough for any text editor to include it natively, or even as a plugin.

    The two biggest problems:

    1. Defining beginning/end of a sentence reasonably: You can find periods with regular expressions (regex). But unfortunately, while periods can be used to end sentences, they can also be used for abbreviations both before and in between letters (Dr. vs M.D.), and many other places that are not the end of the sentence. Unlike when I was a kid, in modern text files, you cannot rely on “period space space” being unambiguously “end of sentence”, since many text files and style guides use single spaces after sentence-ending periods. And don’t get me started on parsing end-of-sentence when quotations are involved. A well-trained A.I. (or a reasonably-educated human) can usually get it right, but there isn’t a reasonable regex out there which will incorporate all the nuances. (The author said, “I hope Dr. Bob lives on Private Dr. in the same town where Harry Potter lives.” The programmer asked, “Can you see why this would be hard to parse in regex?”)
    2. Counting words in a group of text is not a feature built into Notepad++.

    So, even if you make simplistic assumptions and can get a regex to parse all of your sentences, you still cannot count words in each of those sentences without invoking a programming language – at which point it becomes mostly a programming challenge.
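
    As a rough illustration of problem 1, here is a minimal Python sketch (Python is only a stand-in here; Notepad++ offers nothing like it natively). The helper name `naive_split` is hypothetical, and the rule it applies, "split after any period, question mark, or exclamation mark that is followed by whitespace", is the simplistic assumption under discussion, not a recommended parser.

```python
import re

# The example from the post above, with straight quotes for simplicity.
TEXT = (
    'The author said, "I hope Dr. Bob lives on Private Dr. in the same town '
    'where Harry Potter lives." The programmer asked, "Can you see why this '
    'would be hard to parse in regex?"'
)

def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by whitespace."""
    return re.split(r"(?<=[.?!])\s+", text)

pieces = naive_split(TEXT)
for piece in pieces:
    print(piece)
```

    A human reader sees two sentences here. The naive rule breaks after both occurrences of "Dr." and misses the real boundary, because the closing quotation mark sits between the period and the following space.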

    This is not a code-writing service. This is not a general programming forum for asking programming questions. So from that perspective, it’s not on-topic for a Notepad++ forum.

    However, I had an idea while writing this up of a way to do it fully inside Notepad++, given some limiting assumptions:

    1. A period, question mark, or exclamation point are the only valid sentence enders.
    2. Those three characters are only at the end of a sentence and never inside a sentence (so no abbreviations).
    3. A “word” is defined as any sequence of non-whitespace characters (so “code-writing” is one word)
    4. There are no quotations.
    5. No sentence will have more than 10 words

    Sequence:

    1. Edit > Select All (^A)
    2. Edit > Line Operations > Join (^J)
    3. Search > Replace
      Find What: (?<=[\?\!\.])\s*
      Replace With: \r\n
      Search Mode = regular expression (assumed from here on out)
      Hit Replace All
    4. Search > Replace
      Find What: ^(\S+\h*)((?1))?((?1))?((?1))?((?1))?((?1))?((?1))?((?1))?((?1))?((?1))?$
      Replace With: (?{10}10\::(?{9}09\::(?{8}08\::(?{7}07\::(?{6}06\::(?{5}05\::(?{4}04\::(?{3}03\::(?{2}02\::(?{1}01\::()))))))))))$0
    5. Edit > Line Operations > Sort as Integers Ascending
    6. Search > Replace
      Find What: ^\d{2}:
      Replace With: empty string
    7. Edit > Select All (^A)
    8. Edit > Line Operations > Join (^J)

    START =

    The day was very difficult today. I hope the next few days get better. Do you hope the same? I 
    hope you hope the same as me! Hello, world! This sentence is just ten teeny-tiny words long, 
    silly goose! No?
    

    After #3 =

    The day was very difficult today.
    I hope the next few days get better.
    Do you hope the same?
    I hope you hope the same as me!
    Hello, world!
    This sentence is just ten teeny-tiny words long, silly goose!
    No?
    

    After #4 =

    06:The day was very difficult today.
    08:I hope the next few days get better.
    05:Do you hope the same?
    08:I hope you hope the same as me!
    02:Hello, world!
    10:This sentence is just ten teeny-tiny words long, silly goose!
    01:No?
    

    After #5 =

    
    01:No?
    02:Hello, world!
    05:Do you hope the same?
    06:The day was very difficult today.
    08:I hope the next few days get better.
    08:I hope you hope the same as me!
    10:This sentence is just ten teeny-tiny words long, silly goose!
    

    After #6 =

    
    No?
    Hello, world!
    Do you hope the same?
    The day was very difficult today.
    I hope the next few days get better.
    I hope you hope the same as me!
    This sentence is just ten teeny-tiny words long, silly goose!
    

    After #8 =

     No? Hello, world! Do you hope the same? The day was very difficult today. I hope the next few days get better. I hope you hope the same as me! This sentence is just ten teeny-tiny words long, silly goose!
    

    --------------------

    Quick explanation of the regex in #4: look for 1 to 10 “words”, each made up of non-whitespace characters followed by zero or more horizontal spaces. The replacement uses nested conditionals: if group 10 matched, prefix with 10:, else if group 9 matched, prefix with 09:, … else 01:. A line that doesn’t match at all (for example, one with more than 10 words) gets no prefix.

    If you wanted to match more than 10 “words”, you’d have to have more matches in the FIND, and more nested conditionals in the REPLACE. I went to 10 as a proof of concept. (I’m sure yours goes to 11.)
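
    For comparison, the same eight-step sequence can be sketched in a few lines of Python, under assumptions 1 to 4 above (assumption 5 is not needed, since a programming language can count past 10). This is only a sketch of the idea, not part of the Notepad++ procedure; the function name `sort_by_word_count` is made up for the example, and `\s+` is used instead of `\s*` in the split so that it behaves the same on every Python 3 version.

```python
import re

def sort_by_word_count(text):
    # Steps 1-2: collapse all line breaks so the text is one long line.
    joined = " ".join(text.split())
    # Step 3: start a new "line" after each '.', '?' or '!'.
    sentences = re.split(r"(?<=[.?!])\s+", joined)
    # Steps 4-5: count whitespace-separated words and sort on that count.
    # sorted() is stable, so sentences with equal counts keep their order.
    ordered = sorted(sentences, key=lambda s: len(s.split()))
    # Steps 6-8: there is no prefix to strip here; just rejoin the sentences.
    return " ".join(ordered)

start = (
    "The day was very difficult today. I hope the next few days get better. "
    "Do you hope the same? I\n"
    "hope you hope the same as me! Hello, world! This sentence is just ten "
    "teeny-tiny words long,\n"
    "silly goose! No?\n"
)
print(sort_by_word_count(start))
```

    Run on the START text, this should print the same paragraph as the "After #8" result, minus the leading space.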



  • Hello @thunderdog, @peterjones and All,

    If we assume that :

    • A sentence runs from the first non-blank character up to a dot, a question mark or an exclamation mark

    • The last sentence of a line also ends with a dot, a question mark or an exclamation mark

    • Your text does not contain any tab character ( \t ) NOR the # character

    Here is my regex solution !


    Taking your example, below, as the input text :

    The day was very difficult, today. I hope : the next few days get better. Do you hope the same? I hope you hope the same; as me!
    

    Then, the following regex S/R, after TWO successive clicks on the Replace All button, with the Regular expression option ticked :

    SEARCH (?-s)\w+[^\w\r\n]*(?=.*\t)|\h*(.+?[.?!])(?!#)

    REPLACE ?1\1\t\1#\r\n:#

    will change your text into :

    ######	The day was very difficult, today.#
    ########	I hope : the next few days get better.#
    #####	Do you hope the same?#
    ########	I hope you hope the same; as me!#
    

    After a classical sort ( Edit > Line Operations > Sort lines lexicographically ascending ), you get :

    #####	Do you hope the same?#
    ######	The day was very difficult, today.#
    ########	I hope : the next few days get better.#
    ########	I hope you hope the same; as me!#
    

    Finally, the simple regex S/R, below, gives you your expected text !

    SEARCH #+|\t

    REPLACE Leave EMPTY

    Do you hope the same?
    The day was very difficult, today.
    I hope : the next few days get better.
    I hope you hope the same; as me!
    

    I guess you’ve already grasped the concept ! Next time, we can dissect this regular expression and see how it works ;-))

    Note, also, that the two working characters ( \t and # ) can be changed, independently, if necessary.

    Ah, also, regarding the first regex S/R, I made sure that a 3rd press, on the Replace All button, will not find any more occurrences !
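
    To see why a plain lexicographic sort is enough, here is a small Python sketch of the decoration idea. It is a simplification, not the regex itself: the helper names `decorate` and `undecorate` are hypothetical, the trailing # that the regex appends is left out, and the sort is assumed to compare characters by code point, as Python's `sorted` does.

```python
import re

def decorate(sentence, marker="#", separator="\t"):
    """Prefix a sentence with one marker per \\w+ word, then the separator."""
    word_count = len(re.findall(r"\w+", sentence))
    return marker * word_count + separator + sentence

def undecorate(line, separator="\t"):
    """Drop everything up to and including the first separator."""
    return line.split(separator, 1)[1]

sentences = [
    "The day was very difficult, today.",
    "I hope : the next few days get better.",
    "Do you hope the same?",
    "I hope you hope the same; as me!",
]

# The tab (code point 9) sorts before '#' (code point 35), so a shorter run
# of markers always sorts ahead of a longer one.
ordered = [undecorate(line) for line in sorted(decorate(s) for s in sentences)]
for sentence in ordered:
    print(sentence)
```

    Under that assumption, if the two working characters are changed, the separator still has to sort before the marker for the shortest sentence to come first.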

    See you later !

    Best Regards,

    guy038



  • @guy038 said in I want to sort text phrases by their number of words:

    Wow. I take back “You cannot do it natively in Notepad++”, because yours fits the original spec without the quantity assumption that I made. Great job!

    and @guy038’s

    [.?!]

    vs my

    [\?\!\.]

    As I said recently, I tend to over-escape characters. I thought I had improved, but… there you go. ;-(

    “Next time, we can dissect this regular expression and see how it works”

    I won’t give spoilers to the OP… but to figure out what was really going on, I had to do it with a bunch of single replace-once clicks, so I could see.

    I really like that it works without restrictions on quantity.

    And, to prove that I understood it, I got it down to a single regex with three replace all clicks, by using a second alternation and a second capture group in the search, and another conditional in the replace:
    SEARCH (?-s)^(?=#)(?:#+\t(.*)#$)|\w+[^\w\r\n]*(?=.*\t)|\h*(.+?[.?!])(?!#)
    REPLACE ?2\2\t\2#\r\n:?1\1:#

    That was fun. :-)



  • Hi, @thunderdog, @peterjones and All,

    Peter, nice shot, too ! As a sort was necessary, I simply thought about a second regex S/R. But you’re right, we can perfectly well use a composite regex to get the whole job done ;-))

    So, the road map is :

    • Perform the regex S/R, twice

    • Run the alphabetic ( Unicode ) ascending sort

    • Perform the same regex S/R, once again

    Note that I also slightly shortened your search regex ! Here is the final solution :


    Assuming the initial text of @PeterJones, normalized in two lines, which does end with either a dot, an exclamation mark or a question mark !

    The day was very difficult today. I hope the next few days get better. Do you hope the same? I hope you hope the same as me!
      Hello, world! This sentence is just ten teeny-tiny words long, silly goose! No?
    

    SEARCH (?-s)^#+\t(.+)#$|\w+[^\w\r\n]*(?=.*\t)|\h*(.+?[.?!])(?!#)

    REPLACE ?2\2\t\2#\r\n:?1\1:#

    After two clicks on the Replace All button, we obtain :

    ######	The day was very difficult today.#
    ########	I hope the next few days get better.#
    #####	Do you hope the same?#
    ########	I hope you hope the same as me!#
    
    ##	Hello, world!#
    ###########	This sentence is just ten teeny-tiny words long, silly goose!#
    #	No?#
    

    Now, we click on the Edit > Line Operations > Sort Lines Lexicographically Ascending option and get :

    
    #	No?#
    ##	Hello, world!#
    #####	Do you hope the same?#
    ######	The day was very difficult today.#
    ########	I hope the next few days get better.#
    ########	I hope you hope the same as me!#
    ###########	This sentence is just ten teeny-tiny words long, silly goose!#
    

    Let’s go back to the Replace dialog again and use the same S/R

    SEARCH (?-s)^#+\t(.+)#$|\w+[^\w\r\n]*(?=.*\t)|\h*(.+?[.?!])(?!#)

    REPLACE ?2\2\t\2#\r\n:?1\1:#

    After one final click on the Replace All button, here is your expected text :

    
    No?
    Hello, world!
    Do you hope the same?
    The day was very difficult today.
    I hope the next few days get better.
    I hope you hope the same as me!
    This sentence is just ten teeny-tiny words long, silly goose!
    

    Wow… Awesome !
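
    The three passes can be retraced outside Notepad++ with a Python sketch. It is an approximation, not the exact Notepad++ behaviour: Python's `re` module has no `\h` and no leading `(?-s)`, so `\h` is written as `[ \t]`, `\r\n` becomes `\n`, and `re.sub` with a callback stands in for Replace All with the conditional replacement. The names `replace_once`, `replace_all` and `sort_lines` are made up for the example.

```python
import re

PATTERN = re.compile(
    r"^#+\t(.+)#$"                # group 1: a decorated line, ready to be stripped
    r"|\w+[^\w\n]*(?=.*\t)"       # a word that still has a tab later on its line
    r"|[ \t]*(.+?[.?!])(?!#)",    # group 2: a raw sentence, not yet decorated
    re.MULTILINE,
)

def replace_once(match):
    if match.group(2) is not None:       # ?2\2\t\2#\r\n
        sentence = match.group(2)
        return f"{sentence}\t{sentence}#\n"
    if match.group(1) is not None:       # ?1\1
        return match.group(1)
    return "#"                           # otherwise: one marker per word

def replace_all(text):
    return PATTERN.sub(replace_once, text)

def sort_lines(text):
    # Stand-in for the ascending lexicographic line sort.
    return "\n".join(sorted(text.split("\n")))

text = (
    "The day was very difficult today. I hope the next few days get better. "
    "Do you hope the same? I hope you hope the same as me!\n"
    "  Hello, world! This sentence is just ten teeny-tiny words long, "
    "silly goose! No?"
)

step1 = replace_all(text)     # each sentence becomes: sentence, tab, sentence, #
step2 = replace_all(step1)    # each word before the tab becomes one #
step3 = sort_lines(step2)
result = replace_all(step3)   # the decoration is stripped again
sentences = [line for line in result.split("\n") if line]
for sentence in sentences:
    print(sentence)
```

    Note that `\w+` counts "teeny-tiny" as two words, so the last sentence gets 11 markers here, whereas the `\S+` definition used earlier in the thread counts 10.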

    And if we want to get all that text in a single paragraph, just apply this last regex S/R !

    SEARCH ^\R|(\R)

    REPLACE ?1\x20

    giving :

    No? Hello, world! Do you hope the same? The day was very difficult today. I hope the next few days get better. I hope you hope the same as me! This sentence is just ten teeny-tiny words long, silly goose!
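
    A rough Python equivalent of this last S/R, as a sketch only: `\R` is not available in Python's `re` module, so `\r?\n` is assumed as a substitute, and the function name `join_lines` is hypothetical.

```python
import re

JOIN_PATTERN = re.compile(r"^\r?\n|(\r?\n)", re.MULTILINE)

def join_lines(text):
    # A line break at the very start of a line (an empty line) is deleted;
    # any other line break is captured by group 1 and becomes a space.
    return JOIN_PATTERN.sub(lambda m: " " if m.group(1) else "", text)

print(repr(join_lines("\nNo?\nHello, world!\nDo you hope the same?\n")))
```

    The final line break also turns into a space, so the joined paragraph ends with one trailing blank.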
    

    Best Regards,

    guy038

