Join lines from clipboard data

Jonathan Wendt

So I’m not really sure if this is possible, but if someone knows a way it would make my life much easier.

Essentially I have to copy lots of data from PDFs to write the data into JSON format. Unfortunately copying the text from the PDF introduces many unnecessary line breaks. My current workflow, which I’d like to cut down, goes like this:

Copy lines from PDF
Paste into separate Notepad++ file
Ctrl+J to join the lines
Re-select everything and cut
Paste the data where it needs to go in the JSON

Ultimately this process is very tedious. What I’d like to do is be able to copy the lines from the PDF I want, and then be able to join the lines in the clipboard data and paste it in the JSON in one go. Is this possible at all? Thank you very much in advance.

Alan Kilborn

@Jonathan-Wendt

If you’re willing to install and use the PythonScript plugin for Notepad++, then a shorter workflow is possible. That workflow would be:

Copy lines from PDF
Move caret to insertion point in the JSON file in its Notepad++ tab
Run the script (could be tied to a key-combination for easy-running)

The script would look like this:

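# -*- coding: utf-8 -*-

from Npp import notepad, editor

# Create the hidden scratch editor only once; later runs reuse it.
try:
    editor3h  # third editor, hidden
except NameError:
    editor3h = notepad.createScintilla()

if editor.canPaste() and editor3h.canPaste():
    editor3h.selectAll()
    editor3h.paste()                 # clipboard contents go into the hidden editor
    editor3h.rereplace('\r\n', ' ')  # join the lines: each CR+LF becomes a space
    editor3h.selectAll()
    editor3h.copy()                  # joined text goes back onto the clipboard
    editor.paste()                   # paste at the caret in the active document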
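
For readers without Notepad++ at hand, the text transformation at the heart of the script can be tried in plain Python. This is only a sketch: join_lines is a hypothetical helper name, not part of PythonScript, and it uses str.replace rather than the plugin's rereplace. The sample text is made up.

```python
def join_lines(text):
    """Replace every CR+LF pair with a single space (hypothetical helper)."""
    return text.replace('\r\n', ' ')


# Text as it might arrive from a PDF copy on Windows (made-up sample).
clipboard_text = 'first part\r\nsecond part\r\nthird part'
joined = join_lines(clipboard_text)
print(joined)
```

Text that uses bare LF line endings would pass through unchanged, which matches the '\r\n' pattern used in the script.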

PeterJones

@Jonathan-Wendt said in Join lines from clipboard data:

Copy lines from PDF
Paste into separate Notepad++ file
Ctrl+J to join the lines
Re-select everything and cut
Paste the data where it needs to go in the JSON

Why not just paste it where you want it, and do the join there? That cuts out 2 of the 5 steps.

But it can be done with a script using a plugin like PythonScript… which I was working on when @Alan-Kilborn posted his. I guess I was too slow.

Jonathan Wendt

@Alan-Kilborn This was EXACTLY what I was looking for! Thank you so much.

guy038

Hello, @alan-kilborn and All,

I was intrigued with the statement :

    editor3h  # third editor, hidden

Do you mean that editor3h is an object which represents a virtual document, where you can perform any manipulation, independently from editor1 representing the main view document and editor2 representing the secondary view document ?

Now, I have no problem with the commands below, because it’s obvious that, after an S/R, wherever this data is located, you need to select the modified results, copy them to the clipboard and paste them somewhere (at the insertion point of the JSON file, in this specific example):

    editor3h.rereplace('\r\n', ' ')
    editor3h.selectAll()
    editor3h.copy()
    editor.paste()

But, as you said that the first step was to copy PDF file contents ( …in the clipboard ), I don’t see the purpose of the first editor3h.selectAll() command. To my mind, the editor3h.paste() command, which would copy the PDF contents of the clipboard to the virtual editor document, should be enough !?

From a Python beginner ;-))

Best Regards,

guy038

Alan Kilborn

@guy038 said in Join lines from clipboard data:

Do you mean that editor3h is an object which represents a virtual document, where you can perform any manipulation, independently from editor1 representing the main view document and editor2 representing the secondary view document ?

Short answer to long question:
Yes. :-)

I don’t see the purpose of the first editor3h.selectAll() command

To my mind, the editor3h.paste() command, … should be enough !?

You are probably only thinking about the first time the script is run.
On first running, editor3h does not exist, so it is created.
At that point there is no text in its document, so the select-all does effectively nothing.
On second and further runs of the script, editor3h already exists – we don’t need to have N++ go through the overhead of creating a new one each time, we can just reuse the old one.
But in this case, there is already text in the document (from the previous run).
Selecting all of the text and then doing a paste effectively causes editor3h to then only contain the data from the paste.
I could have done it in other ways, e.g., editor3h.setText("") as well.
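
The create-once, reuse-later pattern can be imitated outside Notepad++. The sketch below assumes, as described above, that names defined by a script survive between runs; it simulates that by executing the same source twice in one shared namespace. make_buffer and the list it returns are hypothetical stand-ins for notepad.createScintilla() and the hidden editor.

```python
creations = []  # records each time a new scratch buffer is built


def make_buffer():
    """Hypothetical stand-in for notepad.createScintilla()."""
    creations.append('created')
    return []


SCRIPT = '''
try:
    scratch            # already defined by an earlier run?
except NameError:
    scratch = make_buffer()

del scratch[:]             # discard leftovers (the role of selectAll + paste)
scratch.append(clipboard)  # the buffer now holds only the new data
'''

# One shared namespace, standing in for state that survives between runs.
namespace = {'make_buffer': make_buffer}

for clipboard in ('first copy', 'second copy'):
    namespace['clipboard'] = clipboard
    exec(SCRIPT, namespace)

print(len(creations), namespace['scratch'])
```

The buffer is built on the first run only, and after the second run it holds just the second clipboard value.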

BTW, I could have dispensed with the editor3h technique altogether and just pasted into the user’s current document and manipulated from there. There are a lot of ways to do things (TMTOWTDI). But that way tends to get a bit messier because when you do an editor.paste() and you need to manipulate what you pasted, the size of what you pasted – so you know where to find the data in the doc – is not provided automatically and takes some effort to calculate. Now if text remained selected upon pasting, it would be easy.

guy038

Hi, @Alan-kilborn and All,

My God ! Of course, I didn’t think about successive runs ! And it’s quite logical :

The editor3h.selectAll() command first selects all contents of editor3h, if any
The editor3h.paste() command then replaces this possible selection with the clipboard’s contents

Thanks, also, for your additional information :

BR

guy038

Alan Kilborn

@guy038

:-)

Perhaps this editor3h technique could help you with the regex stuff you do. A short script is probably easier to work with than macros (for those cases where you are building up several regex replacements in a row). The above script seems to contain all the examples you’d need.

One thing to be careful of is, when using Python strings to contain regexes: Sometimes you want “raw” strings, and sometimes plain strings. I used a plain string above, with '\r\n', which encodes a CR and a LF (because the backslash is an “escape”). Often with regexes, though, you want backslashes to actually be themselves, in which case you do your string with an r out front, e.g. r'...' or r"...".

These should be equivalent, but which one is easier to read when you’d rather be thinking about regex content, rather than proper backslashing?:

myregex1 = "\\(.*?\\)"  # find literal ( followed by some chars followed by literal )
myregex2 = r"\(.*?\)"  # same, but easier to read and think about
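
The equivalence is easy to check. This sketch uses Python's own re module as a stand-in for the Boost engine that Notepad++ uses, which is an assumption made for the sake of a runnable example; the sample text is made up.

```python
import re

myregex1 = "\\(.*?\\)"   # plain string: every backslash doubled
myregex2 = r"\(.*?\)"    # raw string: backslashes typed once

same = (myregex1 == myregex2)
found = re.findall(myregex2, 'f(x) + g(y, z)')
print(same, found)
```

Both names end up holding the identical seven-character pattern, so either one finds the same parenthesised groups.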

Alan Kilborn

I used a plain string above, with ‘\r\n’, which encodes a CR and a LF (because the backslash is an “escape”)

Probably that was a bad example, to illustrate my point about raw strings versus not, in Python. Why?

Because editor.rereplace('\r\n', ' ') and editor.rereplace(r'\r\n', ' ') will do exactly the same thing (for different reasons) when run.

In the first case, without the r, the literal characters for carriage-return and line-feed will be searched for; in the second case, they are first interpreted – as regular expression \r and \n, respectively. In the end it is the same effect, but…bad example for illustrating the point.

Also, I probably should have used r'\R', anyway – and that one is really obvious that we are handing over interpretation to the regex engine, because there is no such \R character!
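
The "same effect, different reasons" point can be seen in a short sketch. It uses Python's re module in place of Boost, which is an assumption; the \R variant is left out because it is described above as something the regex engine in Notepad++ interprets, and this sketch does not assume Python's re shares it.

```python
import re

plain = '\r\n'   # Python turns this into 2 characters: CR and LF
raw = r'\r\n'    # 4 characters: backslash, r, backslash, n

text = 'one\r\ntwo\r\nthree'
via_plain = re.sub(plain, ' ', text)
via_raw = re.sub(raw, ' ', text)
print(len(plain), len(raw), via_plain == via_raw)
```

The two patterns differ in length, yet both replacements produce the same joined text.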

guy038

Hi, @alan-kilborn,

As you just spoke about normal and raw strings, this reminded me of something weird in the Python world !

I used your script, below, to test the statistics of some files, including .exe ones

https://community.notepad-plus-plus.org/post/61801

In the initial script, the search regex is simply '\w+'. But I needed the regex '(\x00?[A-Za-z0-9_])+' and it did not work. I had to use a raw string, i.e. the regex r'(\x00?[A-Za-z0-9_])+' !

For instance,

In a new tab, type the simple string @ABCDE
Run the tiny script below, executed from the console (maybe, Alan, it’s not a good Python construction, but it works !!)

console.clear() ; editor.research ('(\x40?\w)+', lambda m: console.write (m.group(0)))

It correctly writes the string @ABCDE on the console

Now, change the @ char with the NUL char, with the help of the Character Panel
Run this similar script :

console.clear() ; editor.research ('(\x00?\w)+', lambda m: console.write (m.group(0)))

This time, we get an error !

Then, run this third try, using a raw string :

console.clear() ; editor.research (r'(\x00?\w)+', lambda m: console.write (m.group(0)))

Wow ! It works as expected and displays the NUL char followed by the string ABCDE on the console

I hope, I’m not disturbing you too much ;-))

See you later,

Cheers ( almost the right moment, in France ! )

guy038

Alan Kilborn

@guy038

Without looking into your specific examples just now – I’ll do that later, after my own “Cheers!” moments – I’ll say:

If you follow the rules, you get what is expected. If you don’t follow the rules, you get “unexpected” behavior – which could sometimes be what you expect, or often not!

The approximate rules are:

If you are going to use a regular string, i.e., "..." and that string is going to contain a literal backslash, you must double that backslash (to escape it).
If you are going to use a “raw” string, i.e., r"...", then you do NOT (typically!) double any backslashes. This would be the preferred way to do it for someone that writes a lot of literal paths in their source code, or someone that works heavily with regex.
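
A path makes a handy illustration of the two rules. The sketch below is plain Python with a made-up path; the third string shows what happens when neither rule is followed.

```python
doubled = "C:\\temp\\new"   # plain string, backslashes doubled
raw = r"C:\temp\new"        # raw string, backslashes typed once
careless = "C:\temp\new"    # plain string, not doubled: \t and \n are escapes

print(doubled == raw)
print('\t' in careless, '\n' in careless)
```

The first two strings are identical, while the careless one silently contains a tab and a newline instead of backslashes.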

Of course, there’s a bit more to the story, but that will come in a later posting…

There’s also “byte strings” and “unicode strings”, denoted b"..." and u"..." respectively, but these are mainly (but not always) Python3 things, and Python3 for PythonScript is still in beta, so we won’t really discuss them.

Also note that we aren’t going too far off the topic of N++ with this, just enough for scriptwriters – definitely “on-topic”.

Alan Kilborn

@guy038

So back to your examples.
It all does make sense.
Here’s how:

The first example is '(\x40?\w)+'.

What happens here is that Python encodes \x40 as a single character and passes it to the research function. Because \x40 is @ it works.

Had you done, r'(\x40?\w)+' instead, Python would NOT have touched the \x40 part, and would have passed it as 4 distinct characters. The research function would have passed these same 4 characters into the Boost regex engine, and inside there it would have seen the notation \x40 and converted 4 characters into one, equivalent to @. So the difference is WHERE the encoding of the four characters is done.

Thus, for the first example, it is successful either way, with a normal Python string or a raw string. It just gets there by different routes.
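
The two routes can be made visible in plain Python. The sketch assumes Python's re module as a stand-in for Boost; both understand the \x40 notation. In the plain string the \w is written as \\w only to keep newer Python versions from warning about an unknown escape; the resulting characters are the same.

```python
import re

plain = '(\x40?\\w)+'   # Python decodes \x40 to '@' before re sees it
raw = r'(\x40?\w)+'     # re receives the four characters \x40 and decodes them

text = '@ABCDE'
print('@' in plain, '@' in raw)
print(re.search(plain, text).group(0), re.search(raw, text).group(0))
```

Only the plain string contains a literal @, yet both patterns match the whole sample text.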

The second example is '(\x00?\w)+'. Here the \x00 gets encoded into a single NUL byte when Python decodes the string literal, and because C strings are (often) NUL-terminated, meaning the NUL signals the end of the string, only a single ( – what comes before the NUL in your string – is seen by the regex engine. The regex engine reports that this is clearly a badly formed regex, and says:

RuntimeError: Unmatched marking parenthesis ( or \(.

When we move to the third example (related to the second), specifically: r'(\x00?\w)+', you are again shifting the decoding to late in the game – when Boost gets hold of it. Thus, here Boost sees (\x00?\w)+ – each individual character in that – and things work. The string won’t get NUL-terminated early. Boost gets a chance at the whole string.

Hopefully this makes sense.
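
The truncation can be imitated in plain Python. as_c_string is a hypothetical helper that mimics code which stops reading at the first NUL; it is a model of the explanation above, not the plugin's actual implementation. As before, \\w in the plain string only avoids an unknown-escape warning and yields the same characters as \w.

```python
def as_c_string(s):
    """Hypothetical helper: what code that stops at the first NUL would see."""
    return s.split('\x00', 1)[0]


plain = '(\x00?\\w)+'   # Python has already turned \x00 into a real NUL
raw = r'(\x00?\w)+'     # still the four characters \x00, no NUL inside

print(repr(as_c_string(plain)))
print(as_c_string(raw) == raw)
```

The plain string is cut down to a lone opening parenthesis, while the raw string survives intact.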

The bottom line is that Python will use some of the same “escape sequences” that regex will use, if you give it a chance to. By “give it a chance to”, I mean by using a plain string with the syntax "..." (note, no little r preceding).

If you want Python to leave your strings alone, so that they are passed fully as typed, to something else, e.g., the regex engine, wrap them as a raw string, i.e., r"...".

So, what happens to the \w in the first example, when Python sees it? There is no r out front, so Python scans the string and can do things to it. Well, the answer is that Python does not know what \w is, so it sends it on as 2 characters, \ and w. Thus this goes on to Boost just as intended.

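There’s yet another caveat. In an r"..." string, any \ is normally just a literal character, but a raw string cannot end with a single backslash (more precisely, with an odd number of them), because the final backslash still stops the closing quote from ending the string. Thus if you type a=r'abc\' at the console, you’ll get an error:

SyntaxError: EOL while scanning string literal

Doubling the final backslash makes the error go away:

a = r'abc\\'

but note that both backslashes are kept, so this string ends with two backslashes, not one. This most often comes up when specifying a directory, where a doubled trailing separator is usually harmless, for example:

mydir = r"c:\mydir1\mysubdir2\\"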
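
A quick check in plain Python shows what the doubled form actually stores, together with one common way to end a string with exactly one backslash. The variable names are illustrative only.

```python
doubled = r'abc\\'        # accepted, but keeps BOTH backslashes
single = r'abc' + '\\'    # raw part plus one backslash from a plain string

print(len(doubled), len(single))
print(doubled.endswith('\\\\'), single.endswith('\\'))
```

The raw string with the doubled ending is five characters long, one more than the concatenated version.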

So just some explanation about this; hopefully it helps in some way.

guy038

Hi, @alan-kilborn and All,

Ah, very interesting insight, indeed ;-))

Yes, it’s sometimes not easy to know when characters are interpreted, if different nested processes are involved !

So, globally, in most situations, it seems better to always use raw strings. I mean… when using regexes, of course

BR

guy038

Alan Kilborn

@guy038 said in Join lines from clipboard data:

it seems better to always use raw strings. I mean… when using regexes, of course

I would say this is true.

Probably most normal Pythoners don’t use a lot of backslashes in their strings, so the simple "..." syntax works fine.

And you certainly can use "..." with regex as well, but just remember to “double up” every backslash if you do. (But that can make your regexes a nightmare to look at).