Need to produce a very basic formatted RTF from OCRed text / complex RTF file



  • I am converting old paper journals to RTF files. I scan, OCR, convert to PDF and then also want an RTF file of selected articles in the journals.

    It all works fine until I create the RTF file. The file that is generated is too complex for my boss.

    What the file needs is:

    • Bold and italics - ‘\b’ and ‘\i’ to start, ‘\b0’ and ‘\i0’ to end the formatted text
    • Paragraph break - ‘\par’
    • Hyphens - Nonbreaking ‘\_’ and Hyphenation point ‘\-’
    • Nonbreaking spaces ‘\~’
    • Character escape ‘\'xx’, e.g. ‘\'97’ for em dash (long dash)
    • and also a basic prolog and font table, which I can manually insert
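
For reference, here is a minimal Python sketch of the kind of file described by the list above: a basic prolog, a one-entry font table, and a body that uses only `\b`, `\i`, `\par`, `\~`, `\_`, `\-` and `\'xx`. The helper names `escape_text` and `build_rtf`, the font choice, and the sample text are assumptions for illustration; the sketch also assumes the Windows-1252 code page, where the em dash is byte 0x97.

```python
def escape_text(text):
    """Escape plain text for an RTF body, assuming the Windows-1252 code page."""
    special = {
        "\\": "\\\\",      # literal backslash
        "{": "\\{",        # literal braces
        "}": "\\}",
        "\u00a0": "\\~",   # nonbreaking space
        "\u2011": "\\_",   # nonbreaking hyphen
        "\u00ad": "\\-",   # optional (soft) hyphen
    }
    out = []
    for ch in text:
        if ch in special:
            out.append(special[ch])
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \'xx hex escape; raises UnicodeEncodeError outside Windows-1252
            out.append("\\'%02x" % ch.encode("cp1252")[0])
    return "".join(out)


def build_rtf(paragraphs, font="Times New Roman"):
    """Wrap ready-made RTF paragraph fragments in a basic prolog and font table."""
    prolog = "{\\rtf1\\ansi\\ansicpg1252\\deff0{\\fonttbl{\\f0\\froman " + font + ";}}\n"
    body = "".join(p + "\\par\n" for p in paragraphs)
    return prolog + "\\f0 " + body + "}"


# Hypothetical sample content: a bold title and a line with an em dash and italics.
title = "\\b " + escape_text("Journal of Examples") + "\\b0"
line = escape_text("A long dash \u2014 and ") + "\\i " + escape_text("italic text") + "\\i0"
document = build_rtf([title, line])
print(document)
```

The printed text can be saved with a `.rtf` extension and opened in WordPad to check that nothing beyond the listed commands is needed.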

    I do the OCR with OmniPage Ultimate. I have a few options:

    • Save as a Plain text RTF -
      • Best option so far - has all I need and nothing extra, except there are no text formatting (bold and italic) commands
      • I need to manually insert the bold and italic commands
        • guy038 provided this terrific RegEx sequence to easily insert tokens for bold and italic, and then I do a find/replace to end up with the RTF formatting I need: [RegEx by guy038](https://notepad-plus-plus.org/community/topic/12774/select-inline-text-and-then-add-to-beginning-end-of-selection/7)
      • Only problem is, I have about 10,000-12,000 of these to insert
    • Save as a Formatted Text RTF -
      • Inserts way too many additional RTF commands
      • Manually trying to delete these commands is more tedious than adding the bold and italic commands
    • Copy from the OmniPage text editor and paste as RTF into WordPad
      • The RTF code is much cleaner than saving directly as a formatted text RTF, but still too complex

    Does anyone have any experience or suggestions on how to better accomplish what I need?

    Some of my other ideas include:

    • Copy the text from the OmniPage text editor and then paste as RTF into a simpler, less feature-rich editor
      • My reasoning is that only the basic RTF commands I listed at the beginning will be recognised by the simple editor and inserted into the document, while the unrecognised ones, which I don’t want, are lost in the paste process
    • When I open the RTF in NPP with the userDefineLang_RTF.xml file, much of the RTF code is highlighted. Can I use that to my advantage to:
      • Delete the highlighted RTF commands?
        • The bold and italic RTF commands are highlighted too, but perhaps that can be overcome by adjusting the userDefineLang_RTF.xml or using RegEx to change those codes temporarily to something else that will not be deleted
    • Use a different method or program to generate the RTF file I need?
    • Use an additional program to process the RTF output and strip away all the RTF commands except those I need / specify?
    • Is there a way to implement another RegEx to resolve my situation?
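
The "strip everything except what I specify" idea can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the whitelist `KEEP` and the function name `keep_only` are hypothetical, and the sketch is meant for the body text only. It does not understand groups, so a font table, stylesheet or other header group would leave stray text behind and should be replaced by a hand-written prolog first.

```python
import re

# Control words to keep (hypothetical whitelist based on the commands listed earlier).
KEEP = {"b", "i", "par"}

# First alternative: control symbols such as \~ \_ \- \\ \{ \} and \'xx hex escapes.
# Second alternative: a control word, optional numeric parameter, optional space delimiter.
TOKEN = re.compile(r"\\(?:'[0-9a-fA-F]{2}|[^a-zA-Z])|\\([a-zA-Z]+)(-?\d+)? ?")


def keep_only(rtf_body, keep=KEEP):
    """Drop every control word whose name is not in `keep`; leave control symbols alone."""
    def replace(match):
        name, param = match.group(1), match.group(2)
        if name is None:
            return match.group(0)                     # control symbol: pass through
        if name in keep:
            # Re-emit with an explicit space so it cannot merge with the text that follows.
            return "\\" + name + (param or "") + " "
        return ""                                     # unwanted control word: drop it
    return TOKEN.sub(replace, rtf_body)


sample = "\\b\\fs24 Bold\\b0\\f1  and\\~more\\'97\\par"
print(keep_only(sample))
```

Kept control words are always written with a trailing space because deleting a neighbour such as `\fs24 ` would otherwise turn `\b\fs24 Bold` into `\bBold`, which an RTF reader would parse as one unknown control word.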

    I am open to suggestions. I have been working on this for a long time and am about ready to output Plain Text RTFs, insert the tokens for bold and italic, and use guy038’s RegEx to insert the RTF formatting for bold and italic. All the other RTF commands listed above are already in the Plain Text RTF I generate now.

    Help me geniuses!!

    Thanks in advance,
    Blair



  • @Blair-Brenner

    I’m not an RTF expert, so forgive my ignorance.
    What sounds logical to me is to solve the issue by using
    Save as a Formatted Text RTF,
    because then you only have to identify the tags(?) you don’t need and
    do a find/replace all with an empty string. If there are many but unique tags, then
    you might think of using regular expression alternation syntax to achieve this
    in one go.
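
A rough Python sketch of the alternation idea, for illustration only. The names in `UNWANTED` are a hypothetical list; the real one would come from looking at what the Formatted Text RTF actually contains. The sketch works in two steps because deleting a tag together with its trailing space can glue the preceding tag to the text (`\b\fs24 Title` would become `\bTitle`), so every control word first gets an explicit space delimiter.

```python
import re

# Hypothetical list of control words to remove.
UNWANTED = ["fs", "lang", "kerning", "cf"]

# Control symbols (backslash + non-letter, e.g. \~ or \\) are matched first and passed
# through, so an escaped backslash is never mistaken for the start of a control word.
SYMBOL = r"(\\[^a-zA-Z])"

# Step 1: give every control word an explicit space delimiter.
DELIMIT = re.compile(SYMBOL + r"|(\\[a-zA-Z]+(?:-?\d+)?) ?")

# Step 2: one alternation matching any unwanted control word plus its delimiter.
REMOVE = re.compile(SYMBOL + r"|\\(?:" + "|".join(UNWANTED) + r")(?:-?\d+)? ")


def remove_tags(rtf):
    """Delete the unwanted control words from `rtf` in one go."""
    rtf = DELIMIT.sub(lambda m: m.group(1) or m.group(2) + " ", rtf)
    return REMOVE.sub(lambda m: m.group(1) or "", rtf)


print(remove_tags("\\b\\fs24 Title\\b0\\lang1033\\par"))
```

Because step 2 requires the trailing space, a short name such as `fs` cannot accidentally match the start of a longer control word.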

    Cheers
    Claudia



  • @Blair-Brenner I’m not an RTF expert either, but until recently I was unaware of a tab that I’ve probably seen subconsciously hundreds of times. Just thought I’d point out that the Find / Replace dialog box has a tab called Find in Files (also accessible through the Search menu -> Find in Files… Ctrl+Shift+F). If these thousands of replacements are spread across multiple files, perhaps this feature could make the job easier?
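
Outside Notepad++, the same "one replacement across many files" idea can be sketched in Python. This is a minimal illustration, not a substitute for the Find in Files dialog: the function name `replace_in_files`, the `*.rtf` glob and the Windows-1252 encoding are assumptions, and files are overwritten in place, so it should only be tried on copies.

```python
import re
from pathlib import Path


def replace_in_files(folder, pattern, replacement, glob="*.rtf", encoding="cp1252"):
    """Apply one regex replacement to every matching file; return {file name: count}."""
    regex = re.compile(pattern)
    counts = {}
    for path in sorted(Path(folder).glob(glob)):
        text = path.read_text(encoding=encoding)
        new_text, n = regex.subn(replacement, text)
        if n:
            # Only rewrite files that actually changed.
            path.write_text(new_text, encoding=encoding)
            counts[path.name] = n
    return counts
```

A call such as `replace_in_files("articles", r"\\fs\d+ ", "")` (with `articles` as a hypothetical folder name) would return how many replacements were made in each file.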