Editing 600 mega XML file

17 Posts 7 Posters 76.1k Views
  • R
    Ricardo @Maor Bachar
    last edited by Jul 30, 2015, 11:04 PM

    @Maor-Bachar said:

    seem like it doesn’t have XML style?

    I think they unfortunately restrict extra coloring schemes to their paid version: http://www.editpadpro.com/cscs.html

    • A
      AJ Baxter
      last edited by Aug 4, 2015, 12:08 PM

      What size XML can you load in N++ ?

      • R
        Ricardo
        last edited by Ricardo Aug 4, 2015, 9:17 PM Aug 4, 2015, 9:17 PM

        I don’t think file type matters, but…
        If I have no plugins and no other file opened, I can open files with maximum size around 362MB, before it displays “File is too big to be opened” error.
        But with my plugins enabled, the limit lowers to about 150MB.

        • R
          Ricardo
          last edited by Ricardo Aug 7, 2015, 7:42 AM Aug 7, 2015, 3:01 AM

          Hello, for people needing an x64 build of Notepad++, I made one here.

          Notes:

          1. This is an easy unofficial build – not tested by devs.
          2. You need 64-bit OS.
          3. It is not compatible with 32-bit plugins (i.e. all plugins currently available). You need to rename your /plugins folder or move it elsewhere.
          4. It can open very large files. Important: make sure “Word Wrap” is disabled before opening such files.
          5. The 7z package above contains only the binaries you need to replace. Please back up your current 32-bit files first (rename or move them).
          • D
            David Bailey
            last edited by Aug 7, 2015, 7:54 PM

            I really think that anyone needing to edit a 600MB XML file should be thinking of alternatives! I mean those files contain structures, and you really need to use something that will respect that structure - which isn’t a text editor.

            I don’t know what this file contains, but it might be useful to arrange that it was stored as a number of much smaller files. Alternatively, I guess you could load its structure into Mathematica, manipulate it, and write it out again!
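As a rough illustration of a tool that respects the structure, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree.iterparse`, which reads the XML as a stream instead of loading it whole. The function name `count_records` and the tag name passed to it are assumptions made for this example; nothing in the thread says what the 600MB file actually contains.

```python
import xml.etree.ElementTree as ET


def count_records(source, record_tag):
    """Count <record_tag> elements without keeping the whole tree in memory.

    `source` may be a file name or a binary file object.
    """
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a handle on it.
    _event, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            count += 1
            # Detach everything parsed so far so memory use stays bounded.
            root.clear()
    return count
```

The body of the `if` is where a per-record change, or writing each record out to one of several smaller files, could go. A call such as `count_records("big.xml", "item")` uses a hypothetical file name and tag.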

            I don’t like to see a tool like NP++ being pushed to perform basically ridiculous tasks.

            David

            • K
              Klaus Lehmann
              last edited by Aug 9, 2015, 3:51 PM

              hi david
              excuse me.
              be happy, that You never had to edit a 10GB-xml-file.
              for that, Notepad++ is NOT your friend ;-)
              but I want to give Ricardo’s 64-bit edition a try! it’s new to me ;-)
              Yours klaus

              • D
                David Bailey
                last edited by David Bailey Aug 9, 2015, 5:15 PM Aug 9, 2015, 5:12 PM

                Klaus,

                “be happy, that You never had to edit a 10GB-xml-file.”

                I am indeed - but I am also happy that I have never tried to solder electronic components with a blow torch, or fry an egg on a smoothing iron! Even if I had, I wouldn’t try to make suggestions on a blow torch forum regarding improvements to blow torches that might make that more feasible!

                I think a lot of software gradually bloats out - both in bytes and in terms of complexity. Ultimately valuable software can die that way! I think NP++ has remained consistently focussed on providing solutions for those who need to edit normally sized text files - particularly program source - and trying to add on outlandish capabilities would be a severe distraction.

                Switching to 64-bits would probably only be the first step in a project to edit 10 GB data files, because I am sure there must be many processes inside NP++ that depend on being able to scan across an entire file in a sensible amount of time.

                It might be more constructive if you described how you got into this mess, and someone might be able to offer some constructive suggestions!

                As a preliminary suggestion, I would suggest that you write a C program to read the file, and recognise the data you want to change. You could then modify the program to actually perform a change (but I would back up your file before you start :) ).
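The two-step approach described here (first recognise the data, then change it) can be sketched as follows. This is Python rather than C, purely to keep the example short, and the function name `replace_in_file`, its parameters and the file names are all assumptions for illustration. It writes to a new file and never touches the source, so the original acts as the backup.

```python
import re


def replace_in_file(src_path, dst_path, pattern, replacement, encoding="utf-8"):
    """Copy src to dst line by line, applying a regex substitution.

    Returns the number of lines that were changed.
    """
    regex = re.compile(pattern)
    changed = 0
    # newline="" keeps the original line endings (LF or CRLF) untouched.
    with open(src_path, "r", encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:  # only one line is held in memory at a time
            new_line, hits = regex.subn(replacement, line)
            if hits:
                changed += 1
            dst.write(new_line)
    return changed
```

One caveat: this only stays light on memory if the file has reasonably short lines. An XML file written as a single line would still be read in one piece.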

                David

                • R
                  Ricardo
                  last edited by Aug 10, 2015, 12:42 AM

                  woot
                  For editing a 10GB file, I guess the best option would be an editor that doesn’t load the entire file into memory, but reads it directly from disk. Otherwise, you would need a lot of RAM.
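The idea of reading straight from disk can be shown with a small Python sketch: seek to a byte offset and read only the window that would be displayed. The function name `read_window` and the default window size are assumptions for illustration, not how any particular editor works.

```python
def read_window(path, offset, size=64 * 1024):
    """Return up to `size` bytes of the file, starting at byte `offset`.

    Memory use depends on `size`, not on the size of the file.
    """
    with open(path, "rb") as f:
        f.seek(offset)       # jump without reading what comes before
        return f.read(size)  # shorter than `size` near the end of the file
```

An editor built this way would also need to track line boundaries and pending edits, which this sketch leaves out.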

                  • G
                    guy038
                    last edited by guy038 Aug 10, 2015, 1:31 PM Aug 10, 2015, 1:25 PM

                    Hello Maor Bachar and All,

                    Why don’t you try the old but excellent scripting tool GAWK.exe? It runs on both Unix and Windows machines (with the appropriate executable, of course).

                    You can get the latest Windows version of gawk.exe (v4.1.0, AFAIK) at the address below:

                    https://code.google.com/p/gnu-on-windows/downloads/detail?name=gawk-4.1.0-bin.zip&can=2&q=

                    To get an overview of the main features of gawk and what’s new in v4.1.0, follow this link:

                    http://www.drdobbs.com/open-source/gnu-awk-this-is-not-your-fathers-awk/240158351

                    And here is the link to download the latest reference manual for GAWK v4.1.x, in various formats:

                    http://www.gnu.org/software/gawk/manual/

                    The PDF version is quite recent: April 2015!


                    To test it, I created a 1 GB file. (I didn’t create a 10 GB file, as my old Win XP laptop, with only two 40 GB partitions, would not accept it!) But when I go back to work (I’m presently on holiday), I’ll be able to build bigger files!

                    Briefly, I first created, in N++, a file of about 200 MB, named xxx.txt. Then I copied this file five times into the All.txt file, in a DOS window, with the commands:

                    type xxx.txt>All.txt
                    type xxx.txt>>All.txt
                    type xxx.txt>>All.txt
                    type xxx.txt>>All.txt
                    type xxx.txt>>All.txt
                    

                    I ended the All.txt file with a last line:

                    echo END of the FILE>>All.txt
                    

                    Line lengths range from 0 to 150 characters. In the end, the All.txt file contains about 18 498 000 lines, for 1026 MB!
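For anyone without a CMD window at hand, the same test file could be built with a short Python sketch. The function name `build_test_file` and its parameters are assumptions for illustration. The default trailer ends with CRLF because the Windows `echo` command ends its output with one.

```python
import shutil


def build_test_file(src_path, dst_path, copies=5, trailer=b"END of the FILE\r\n"):
    """Write `copies` back-to-back copies of src into dst, then a final line."""
    with open(dst_path, "wb") as dst:
        for _ in range(copies):
            with open(src_path, "rb") as src:
                # copyfileobj works in chunks, so the source is never
                # loaded into memory in one piece.
                shutil.copyfileobj(src, dst)
        dst.write(trailer)
```

Raising `copies` is an easy way to produce larger test files, disk space permitting.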


                    I used the simple gawk script below, named Script.txt. It’s IMPORTANT to note that this file is ANSI encoded.

                    #----------------------------------------------------------------------------------------------------------------#
                    #  SYNTAX to RUN, in a CMD windows, from the FOLDER, where are the 3 files GAWK.exe, Script.txt and All.txt      #
                    #            ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯                        ¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯¯¯     ¯¯¯¯¯¯¯      #
                    #                                                                                                                #
                    #  SYNTAX :                                                                                                      #
                    #  ¯¯¯¯¯¯                                                                                                        #
                    #                                                                                                                #
                    #    echo [Regex|String]|gawk -f Script.txt - All.txt[ File2.txt[ File3[ ...]]][ >[>]See.txt]                    #
                    #----------------------------------------------------------------------------------------------------------------#
                    
                    {
                      if ( pattern == "" )
                        {
                    
                          #---------------------------------------------------------------------------#
                          #  The STANDARD INPUT -  is read, BEFORE the USER FILE(S), so ALL the TEXT  #
                          #    of the DOS COMMAND 'echo' will be STORED in the VARIABLE 'pattern'     #
                          #                                                                           #
                          #   Note : No BLANK, at END of "echo" command, unless part of the REGEX     #
                          #                                                                           #
                          #  IF the PATTERN is NOT initialized  :                                     #
                          #    SET the VARIABLE 'total' to the value 0                                #
                          #    SET the VARIABLE 'pattern' to the value of the LINE field $0           #
                          #      = ALL the TEXT of the DOS COMMAND 'echo'                             #
                          #    SKIP to the NEXT line to READ                                          #
                          #---------------------------------------------------------------------------#
                    
                          total = 0 ; pattern = $0
                    
                          next
                        }
                    }
                    
                    #------------------------------------------------------------------------------#
                    #    IF the CURRENT line MATCHES the PATTERN of the VARIABLE 'pattern' :       #
                    #      INCREMENT, by ONE, the VARIABLE 'total' [ and PRINT the CURRENT line ]  #
                    #    SKIP to the NEXT line to READ, in ALL cases                               #
                    #------------------------------------------------------------------------------#
                    
                    {
                      if ($0 ~ pattern) { ++total }
                    
                    #   OR     if ($0 ~ pattern) { ++total ; print }
                    
                      next
                    }
                    
                    
                    #-------------------------------------------------------------------------------------------------------------------#
                    #  AFTER ALL the USER files are READ, DISPLAYS the NUMBER of lines, MATCHING the PATTERN of the VARIABLE 'pattern'  #
                    #-------------------------------------------------------------------------------------------------------------------#
                    
                    END { print "\n  Number of lines MATCHING \x22" pattern "\x22  : " , total }
                    

                    The command echo .*|gawk -f Script.txt - all.txt gives me the number 18 498 231, which is the number of lines matching the regex .*

                    The command echo END of the FILE|gawk -f Script.txt - all.txt gives me the number 1, showing that it correctly read the last (18 498 231st) line of the 1 GB file All.txt!

                    On my old Win XP laptop, with only 1 GB of RAM (2 × 512 MB), I got the result from each command in about 1 min 30 s! Not too bad, is it?

                    Of course, my script just counts the matches but, with GAWK, you can manipulate files in multiple ways: do calculations, searches/replacements, or … It’s a very powerful tool, although tiny in size: 223 246 bytes for the v4.1 version.

                    You’ll just have to skim the gawk documentation a bit!

                    And, as Ricardo suggested, it seems to read files directly from disk :-) => no size limit, seemingly
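For comparison, the counting logic of the gawk script above can be approximated in Python. This is only a sketch: the function name `count_matching_lines` is made up for the example, the `latin-1` default is an assumption standing in for "ANSI", and Python's `re` syntax differs in places from gawk's regular expressions, so complex patterns may not behave identically.

```python
import re


def count_matching_lines(path, pattern, encoding="latin-1"):
    """Count the lines of a file that contain a match for `pattern`."""
    regex = re.compile(pattern)
    total = 0
    with open(path, "r", encoding=encoding) as f:
        for line in f:  # the file is streamed, one line at a time
            # Like `$0 ~ pattern` in gawk, a match anywhere in the line counts.
            if regex.search(line.rstrip("\r\n")):
                total += 1
    return total
```

With the file described above, `count_matching_lines("All.txt", ".*")` would be expected to return the total number of lines, since `.*` also matches empty lines.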

                    Best Regards,

                    guy038

                    • K
                      Klaus Lehmann @Ricardo
                      last edited by Sep 12, 2015, 3:38 PM

                      @Ricardo
                      I think crisp (64bit) can do it!
                      but it costs 100-250 per Year!

                      and for my biggest files (10GB files full of rubbish XML gore), I have to use: VEDIT Pro (64-bit)
                      attention: high price! $240 (no joke!)

                      Yours klaus
