Sillybean

Import HTML Pages

This plugin will import a directory of files as either pages or posts, according to configurable settings. You may specify the HTML tag containing the content you want to import (e.g. <div id="content"> or <body>) or the name of a Dreamweaver template region (e.g. “Main Content”).

If importing pages, the directory hierarchy will be preserved. Directories containing the specified file types will be imported as empty parent pages. Directories that do not contain the specified file types will be ignored.

As files are imported, the resulting IDs, permalinks, and titles will be displayed. On completion, the importer will provide a list of Apache redirects that can be used in your .htaccess file to seamlessly transfer visitors from the old file locations to the new WordPress posts or pages.
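As a rough illustration of the redirect list described above, the sketch below builds `Redirect 301` lines (Apache mod_alias syntax) from a mapping of old file paths to new permalinks. The function name `build_redirects`, the sample paths, and the example.com domain are hypothetical; the list the plugin actually prints may be formatted differently.

```python
def build_redirects(mapping):
    """Return Apache mod_alias lines that send old file URLs to new permalinks.

    `mapping` is a dict of old URL path -> new absolute URL (illustrative data).
    """
    lines = []
    for old_path, new_url in sorted(mapping.items()):
        # "Redirect <status> <old URL path> <new URL>" is standard mod_alias syntax.
        lines.append(f"Redirect 301 {old_path} {new_url}")
    return lines


if __name__ == "__main__":
    sample = {
        "/about/index.html": "http://example.com/about/",
        "/about/team.html": "http://example.com/about/team/",
    }
    print("\n".join(build_redirects(sample)))
```

The resulting lines can be pasted into an .htaccess file, provided the host allows mod_alias directives there.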

Options:

  • import pages or posts
  • specify content and title as HTML tags or Dreamweaver template regions
  • remove a common phrase (such as the site name) from imported titles
  • specify file extensions to import (e.g. html, htm, php)
  • specify directories to exclude (e.g. images, css)
  • if importing pages, specify whether your top-level files should become top-level pages or children of an existing page
  • choose tags, categories, and custom taxonomies
  • choose status, author, and timestamp
  • use meta descriptions as excerpts
  • choose whether to clean up bad (Word, FrontPage) HTML

Requires PHP 5.

Download at wordpress.org »

Support note: If you have a problem importing your files, please provide a screenshot of your import settings and an example of one of your files. Without these things, it’s difficult to diagnose your problem.


Files to be imported

Imported pages

Options screen

new: clean up Word HTML option

Results: imported pages and rewrite rules

188 Responses to “Import HTML Pages”

  1. RHCdG says:

    Hi Stephanie,

    I appreciate the time you’re putting in to help everybody out, including me. I did, however, send you a sample HTML file through the email form on your website; perhaps you did not receive it? Or maybe it did not give you all the information you need? Are there any other settings you need to know to fix this problem? Perhaps it would be useful if you could list the exact details or settings you need to solve this problem?

    Thanks,
    Rutger

    • Stephanie says:

      Hey, Rutger. I’m sorry, I don’t seem to have received your email. However, I think I know what the problem is. Did you by any chance erase the preset options for directories to skip? If so, put this back in:

      .,..

      (Period, comma, two periods.) I think I need to hardcode that into the plugin; otherwise, if you take it out and you’re on a UNIX-based OS, the importer will trip over all the hidden system files.

      Try that out and let me know how it goes!
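To make the skip list concrete: the field is comma-separated, so `.,..` is two entries, `.` and `..`, which are the current-directory and parent-directory entries that PHP’s `scandir()` returns alongside real files. Additional directory names are appended after further commas rather than prefixed. The sketch below is an assumption about how such a field could be parsed and applied, written in Python for illustration; `parse_skip_list` and `filter_entries` are hypothetical names, not the plugin’s own code.

```python
def parse_skip_list(field):
    """Split a comma-separated skip field into a set of names, ignoring blanks."""
    return {name.strip() for name in field.split(",") if name.strip()}


def filter_entries(entries, skip):
    """Drop every directory entry whose name appears in the skip set."""
    return [name for name in entries if name not in skip]


if __name__ == "__main__":
    skip = parse_skip_list(".,..,images,css")
    # A listing shaped like PHP scandir() output, which includes "." and "..".
    listing = [".", "..", "about.html", "css", "images", "news"]
    print(filter_entries(listing, skip))
```

If `.` and `..` are not skipped, a recursive walk keeps re-entering the same directory and its parent, which would be consistent with the hundreds of empty pages reported in this thread.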

      • Julien says:

        I don’t know about Rutger, but I had that bug. I was trying everything and was feeling kind of desperate… and adding .,.. in the «Directories to skip» field solved it!

        It’s a very disturbing bug, as it adds hundreds of empty pages each time you try an import. (It’s nice that you point to the Mass deleter plugin right there in the plugin setup page!)

        So, yes, you probably should add that preset somewhere else, or maybe just a line of text next to the field.

        Thanks for that great plugin, keep up the good work.

  2. Stephanie – Thank you for this fabulous tool. I’m using it to import hundreds of pages from an old Radio Userland site into a WordPress blog. This presented a few new problems: multiple posts on the same date are stored on a single HTML page, and the post date has to be retrieved either from the URL or from a bolded item in the page.

    I was able to modify your code to handle this by pulling the code that handles each new page out into its own function, and from there extracting the “identify the post and insert it” code into a second new function, which I then call in a loop. Happy to send it for your review/inclusion if you like. It’s more or less clean.

    • Stephanie says:

      Thanks, Chris, that would be great! I was just thinking earlier today that I need to create some more flexible options for dates and bylines, but it’ll be about two months before I can devote the time.

  3. RHCdG says:

    Hi Stephanie,

    Do you mean I should put .,.. before each directory name in the box under “Skip directories with these names”?

    Like so:

    .,..audio, .,..pdf, .,..pics

    ?

    I tried this, and the import was different than before but I got scared when it started displaying “cgi-bin” hundreds of times during the import… I hit the Stop-button on my browser, and I now have 820 “planned”, i.e. empty pages.

    I know from others in this thread that your tool works miracles, but how I wish I would get it to work for my site!

    Thanks,

    Rutger

    PS I did upgrade to the new version of the plugin!

  4. RHCdG says:

    Hi Stephanie,

    I just emailed you a screenshot; thanks again for your help!

    Rutger

  5. lopo says:

    Hi,
    nice plugin, but I have this problem:

    I have one HTML page, and that page contains several articles.

    The plugin works fine for the first article, but the others are ignored.

    Example:

    article1
    content

    article2
    content

    article3
    content

    Can someone help me?
    Thanks in advance.

  6. David says:

    Hi Stephanie,
    Super great tool. Thanks so much! Quick question – even though I include <br> in the Allowed HTML field, I still can’t get the tool to keep my BRs. I’m using version 1.21. Thanks for any pointers!
    David

  7. Kent says:

    I’m getting these error messages. Please help. I have a lot of pages to import into my wordpress site! Thanks!

    Warning: scandir() [function.scandir]: URL file-access is disabled in the server configuration in D:\Hosting. . . plugins\import-html-pages\html-import.php on line 554

    Warning: scandir(http://www.. . . .com/azrefiles/) [function.scandir]: failed to open dir: no suitable wrapper could be found in D:\Hosting. . .plugins\import-html-pages\html-import.php on line 554

    Warning: scandir() [function.scandir]: (errno 0): No error in D:\Hosting. . .import-html-pages\html-import.php on line 554

    Warning: Invalid argument supplied for foreach() in D:\Hosting. . .import-html-pages\html-import.php on line 555

  8. Desire Dupas says:

    Hi Stephanie!

    Such a great tool :-) ))

    Unfortunately, I have the same problem as Kent (http://sillybean.net/code/wordpress/html-import/comment-page-3/#comment-19774).

    When I try to parse something, the plugin tells me:

    Warning: scandir(http://pagesdor.truvo.be/search/Soins_infirmiers_%C3%A0_domicile.html) [function.scandir]: failed to open dir: not implemented in D:\xampplite\htdocs\DocTel\wp-content\plugins\import-html-pages\html-import.php on line 395

    Warning: scandir() [function.scandir]: (errno 0): No error in D:\xampplite\htdocs\DocTel\wp-content\plugins\import-html-pages\html-import.php on line 395

    Warning: Invalid argument supplied for foreach() in D:\xampplite\htdocs\DocTel\wp-content\plugins\import-html-pages\html-import.php on line 396

    Any idea where this comes from?

    I used your plugin a few months ago and it worked fine (http://sillybean.net/code/wordpress/html-import/comment-page-2/#comment-16531).

    I don’t know why it fails now.

    Please help :-) ))

  9. Kent says:

    Actually, I just figured it out. Instead of using the URL to the source files, you need to use the absolute hosting path. It should be something like D:\Hosting\[and then some numbers]. Good luck.

    Stephanie, thanks for this great tool. I also liked the output of the list of necessary redirects. I then used the 301 redirect plugin to input those.

    Regards,

    Kent
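The distinction Kent describes (a web address versus a path on the server’s disk) can be checked before running an import. The sketch below is an illustration only, with a hypothetical helper `is_local_directory`: it rejects anything that looks like a URL and accepts only a directory that exists on the local filesystem.

```python
import os
from urllib.parse import urlparse


def is_local_directory(value):
    """Return True only for an existing local directory, never for a URL."""
    scheme = urlparse(value).scheme.lower()
    if scheme in ("http", "https", "ftp"):
        return False  # a web address, not a path on the server's disk
    return os.path.isdir(value)
```

A value such as `http://www.example.com/files/` fails this check, which mirrors the “URL file-access is disabled” and “no suitable wrapper” warnings quoted earlier in the thread.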

  10. Kym says:

    Hi there. I ran the plugin but ran into this error:

    Call to a member function xpath() on a non-object in /home/lulu/public_html/wordpress/wp-content/plugins/import-html-pages/html-import.php on line 593

    The plugin seems to be running but the contents won’t import:

    “Could not import 10000-neopoints-winners.html. You should copy its contents manually.
    Could not import 13goingon30.html. You should copy its contents manually.
    Could not import 13goingon30_sol.html. You should copy its contents manually.
    Could not import 200mpeanutdash.html. You should copy its contents manually.
    Could not import 200mpeanutdash_sub1.html. You should copy its contents manually.”

  11. Kym says:

    Quick update – I am able to import pages now by editing the “select content by” area.

    I have “select content by HTML” checked and changed the HTML recognition tag to p (to represent <p>).

    The only problem is that I have many <p> tags in my HTML pages. The importer only imports the first paragraph and skips the rest.

    Example page http://www.pinkpt.com/pages/200mpeanutdash.html

    Is there a way to just have it import the entire HTML page?

    Thanks in advance for such a time-saving plugin. You’re a life saver. :-)
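The behaviour Kym reports is what one would expect if the importer takes only the first element matching the chosen tag; that is an assumption here, not something the post confirms. Under that assumption, choosing a container such as `body` captures everything, while `p` captures one paragraph. The sketch below shows the difference using Python’s standard `xml.etree.ElementTree` on a small well-formed sample; `first_match_text` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

PAGE = """<html><body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body></html>"""


def first_match_text(markup, tag):
    """Return the text of the first element named `tag`, or None if absent."""
    root = ET.fromstring(markup)
    element = root.find(f".//{tag}")
    if element is None:
        return None
    return "".join(element.itertext()).strip()


if __name__ == "__main__":
    print(first_match_text(PAGE, "p"))     # only the first <p>
    print(first_match_text(PAGE, "body"))  # everything inside <body>
```

Real-world HTML is often not well-formed XML, so a production importer would need a more forgiving parser than this one.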

  12. Faydra Deon says:

    This may help someone else:

    If you’re not certain what your full path is on your server, you can find that when you log into your Control Panel of your host.

    Also, I was having trouble bringing in the HTML pages until I actually put the folder with the files inside the folder where the Import HTML Pages plugin exists. The tip says to make sure the files are on the same server as WordPress, but I had to move the folder into the plugin’s folder before it would work correctly. My host is GoDaddy, just in case anyone thinks it might be a hosting issue that causes me to have to do this.

    Thanks, Stephanie, for this plugin. I will definitely be making a donation.

  13. Steven says:

    Hi,

    Can’t get this to work. Either I get hundreds upon hundreds of pages with content “.” or nothing happens and no file is imported. I don’t see any way of attaching screenshots here, but nothing happens with .,.., in the exclude field and the server path copied faithfully from the FTP utility. As for when the hundreds and hundreds happen, well, I’m not sure; it’s just a major pain deleting them.

    When are you coming out with a single-file version?

    • steph says:

      I’m sorry you’re having trouble. The Mass Page Remover plugin (linked from the importer screen’s tips section) should make it easy to remove those extra pages.

      I’ll work on the next major version this summer. I plan to include the single file version and the image processing.

      • Steven says:

        This really needs a fundamental reworking for usability. This reliance on .., etc. and one server or the other, is absurd from a usability standpoint. One should not have to be a developer to figure out how to use this, or how to get it to work. When it creates hundreds of unwanted blank html pages, which cannot be deleted except manually (because the remove html page plugin does not work), I’d have to say this plugin was 50-50, and therefore unusable.

  14. Bill says:

    This is a great plugin. I have one question — when importing by tag, is it possible to remove the wrapping tag that gets imported?

    For example, if I am importing , I end up getting in my WordPress page (in the source).

    I thought that simply adding /text() to the xpath around line 635 in your code would solve this. It does not! Any help? Thanks.

  15. Bill says:

    My comment got messed up because i included html in it.

    I am searching for:

    [div id="body"]

    and that ends up in my WordPress page.
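One likely reason `/text()` does not help Bill: in XPath, `text()` selects only the text nodes that are direct children of an element, so any nested markup is left out. Dropping the wrapper while keeping its contents means serializing the children instead. The sketch below shows that idea in Python with `xml.etree.ElementTree`; it is not the plugin’s code, and `inner_markup` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def inner_markup(element):
    """Serialize an element's text and children, leaving out its own tags."""
    parts = [element.text or ""]
    for child in element:
        # tostring() includes the child's tail text, so nothing is lost.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


if __name__ == "__main__":
    root = ET.fromstring('<div id="body"><p>Hello <b>world</b></p> and more</div>')
    print(inner_markup(root))
```

The same approach would apply in PHP by looping over the matched node’s children and concatenating their serialized output.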

  16. Sven says:

    Being able to import .txt as well as HTML files would be great!

    • steph says:

      Sven, I’ll email you. That might require a completely separate plugin, which is fine, but I’ll need some sample files to work with.

  17. Simon says:

    what is the path for a localhost install on a mac ?

    • steph says:

      It depends. If you’re using MAMP, it’s /Applications/MAMP/htdocs. If you’re using the built-in Apache server, it’s /Library/WebServer/Documents unless you’ve changed it in /etc/apache2/httpd.conf.

  18. FRank Mabna says:

    Hello Stephanie – the html page import plugin deletes all special characters from my imported html text, such as the German ö, ä, and ü.

    Is there a way to allow those special characters and import them correctly?

    Thanks,
    Frank M.

    • steph says:

      Frank, I’m sorry it’s not working correctly. Try commenting out (or removing) line 568 (“if (function_exists(‘mb_convert_encoding’))…”) and see if that works better.

  19. Leon says:

    Hello Stephanie, the plugin works really well except it is deleting some characters, e.g. â

  20. Leon says:

    The characters didn’t come through in the comment properly.

    The characters that don’t work are, e.g., the left single quotation mark and right single quotation mark, but the apostrophe works OK (although it was turned into a left quote in the previous comment).

    Thanks

    Leon

    • steph says:

      I’ve noticed that WordPress chokes on curly quotes in general. You could try commenting out (or removing) line 568 (“if (function_exists(‘mb_convert_encoding’))…”) and see if that works better.
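The cause of the missing characters is not established in this thread. One common way umlauts and curly quotes disappear is a charset mismatch: bytes saved in one encoding are decoded as another, and the invalid sequences are discarded. The sketch below demonstrates that failure mode and a simple fallback in Python; it is an assumption about the general problem, not a description of what line 568 of the plugin does, and `decode_page` is a hypothetical helper.

```python
def decode_page(raw, declared="utf-8"):
    """Decode bytes as the declared charset, falling back to Latin-1 on failure."""
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return raw.decode("latin-1")


if __name__ == "__main__":
    raw = "Größe für Übung".encode("latin-1")
    print(raw.decode("utf-8", errors="ignore"))  # umlauts silently dropped
    print(decode_page(raw))                      # umlauts preserved
```

Checking the charset declared in the source files’ meta tags against the encoding they were actually saved in is a reasonable first diagnostic step.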

  21. phil lidgerton says:

    Hi Stephanie

    I am trying to use your plugin and getting stuck on the beginning directory. I have tried everything to get this working without any joy.
    I have uploaded the HTML file onto the WordPress server. I have tried every combination I can think of, but it just won’t import it. Any help would be appreciated.

    phil

  22. phil lidgerton says:

    I just noticed I need PHP 5 for this to work. How do I know if I have it, and where can I get it if I don’t?

    phil

  23. patrick tyremblay says:

    Hi,

    I’ve tried the plugin, but I have problems with tags.

    The plugin always removes the /iframe closing tag.

    thank you.

  24. patrick tyremblay says:

    Hi,

    I have also tried to save my HTML page as a PHP page, without any success. The only thing that is removed in the created page is the tag.

    I have tried to add iframe to the supported tags when choosing the cleaning option for Word documents.

    My pages are not Word documents. They are clean pages.

    thank you.

  25. patrick tyremblay says:

    i’m sorry for my previous post. i’m talking about the tag.

  26. Doug says:

    Hello there and well met
    I am at my wit’s end… I am a person who knows how to write out my HTML.
    My web host has incorporated WordPress for us to use. This is their take on the matter: ‘We do not provide training, so users should be somewhat familiar with CMS and cPanel.’

    I am admittedly frustrated with WordPress. I want to try to use it, but it is aggravating to me.
    I have a series of HTML pages I want to bring into the website.

    I have attempted to configure Import HTML Pages using SETTINGS – HTML Page Import Options – Beginning directory:
    /public_html/wp-content/plugins/import-html-pages
    This should be a full path from the server root, on the same server where WordPress is running now.

    but it errors out with the following:

    Warning: scandir(/public_html/wp-content/plugins/import-html-pages) [function.scandir]: failed to open dir: No such file or directory in /home/graceofc/public_html/wp-content/plugins/import-html-pages/html-import.php on line 554

    Warning: scandir() [function.scandir]: (errno 13): Permission denied in /home/graceofc/public_html/wp-content/plugins/import-html-pages/html-import.php on line 554

    Warning: Invalid argument supplied for foreach() in /home/graceofc/public_html/wp-content/plugins/import-html-pages/html-import.php on line 555

    /wp-content/plugins/import-html-pages

    I have asked the web hosting person if they could help, and they tried but could not.

    PLEASE, if anyone would be kind enough to help me with this plugin and also WordPress, I would gladly talk to them on the phone (US only) or through email. I am sorry, but I would not be able to pay someone.

    • steph says:

      Doug, I think you’re going to have a hard time using this if you’re not familiar with WordPress yet. It’s not really a beginner’s plugin.

      That said, it looks to me as though you’ve entered the path to the plugin directory as your beginning directory. It should instead be the path to your HTML files. Judging by the paths in the error messages, it should be something like:
      /home/graceofc/public_html/folder-where-my-html-files-are

  27. Parneix says:

    Hi Stephanie,

    I’m contemplating the possibility of using your plugin to import my Tumblr blog into WordPress. I’m aware there are already XML exporters, but they do not export media, only links. So all my stuff is still on Tumblr’s servers. I’d rather have all my media (pictures, etc.) stored on my self-hosted WordPress blog.

    Now, Tumblr has a great backup utility which allows anyone to download a full HTML backup of his or her Tumblr blog locally (it creates a folder on his or her machine). It’s fast and it works like a charm. More importantly, it backs up everything: pictures and music are downloaded, along with drafts, tags, queued posts, etc.

    How do I get started with this? Do I have to upload my HTML backup folder to my wordpress site?

    I’m sure someday someone will come up with an easy export/import solution, but for now I’m looking for workarounds.

    Thanks a lot,

    P.

    • steph says:

      I don’t think this would work very well for Tumblr. For one thing, my importer doesn’t (yet) handle media uploads, so you’d get the HTML with paths to the old file locations.

      • Parneix says:

        OK, Stephanie. That’s useful information regarding what I intended to do. Thanks anyway for taking the time to answer me. Have a good summer.

        P.

  28. Loren says:

    What a great find Stephanie:

    I’ve got hundreds of pages to import into WP 3.0. Trouble is, nothing happens when I try to import. No messages, no errors, nothing shows as imported into the HTML Page Import window, and of course no pages are created.

    I’ve double checked the “beginning directory”. As best as I can determine I’ve got it right. Perhaps not?

    I’ve got the “.,..,” in the “skip directories…” input field.

    My host is using PHP 5.2.11.

    What else can I check to get your terrific plugin working for me?

    Best,
    Loren

    • Loren says:

      UPDATE: Definitely the setup at the hosting company I use.

      I moved the HTML files to a staged WP installation I have at GoDaddy, which I use for testing. I installed the Import HTML Pages plugin and it performed perfectly once I got the path correct. My mistake there was obvious based on the error message I received. I had to play a bit with the settings to get titles right, etc. I expected that, and this took a fraction of the time it takes to cut and paste content from an HTML file into WP.

      From there, it was an easy task to export the content into an XML file, and import into the new site.

      I don’t have time to “bird dog” what server setup causes this problem which would be interesting to know. I’m looking for solutions. Found this one, and it works great! Far from a disaster, nor does it create more problems than it solves.

      Thanks for providing this plugin Stephanie!

  29. Steven says:

    One glance at these comments, this plugin is a disaster, creating more problems than it solves. The few positive remarks are fawning on its creator in the hope she’ll help them make it work.

    It’s a dog, a usability nightmare, and a failure.

    Either redo it right, or drop it from the WP list.

    • steph says:

      Steven, I took your comments on the UI to heart, and I’ll work on that in the next version. Thank you for that. However, please do not hijack every comment thread to air your complaints. Many people have used the plugin successfully. A few, like Loren, have server configurations that prevent it from working properly.

      Constructive criticism, like your first comment, is welcome. Any further pointless screeds will be deleted.

  30. JR Oakes says:

    Steph,

    Thanks for this script. Apparently Steven is a tool. This worked like a charm for us and saved a ton of time. I have put in a request that you be donated to. It takes a lot to put yourself out there with a plugin like this. I want to make it work with the Custom Permalinks script by Michael Tyson. That way you can import it and forget it and google is happy. Any tips you can offer before I begin would be appreciated. If this works I will submit back to you.

  31. Norm says:

    Hey Steph,
    sounds like a great plugin. I’ve been pulling my hair out trying to get Dreamweaver content into WP.
    Do you have more directions? I’m still lost between what’s in the plugin and what you have above.
    Just trying to upload single pages.
    thanks

  32. steph says:

    Norm, it really just does not work with single files right now. I’m going to try to fix that as soon as I get a breather, but my travel schedule is crazy at the moment. Use it on a directory of files, and it should be fine.

  33. Maor says:

    Hi,

    I have hundreds of static html pages which I’d like to import to WP.
    Each file has data for the content, but there are some parts of the data that I’d like to go to several custom fields.
    Is that possible with this plugin to import part of the data to custom fields?
    If not, I’d be happy to know how can I use the core base of this plugin to achieve such a capability.

    p.s. – I’ve posted this question also in the WP forums –
    http://wordpress.org/support/topic/425368

    Many thanks,
    Maor

  34. Adi says:

    I installed this plug in. I tried everything possible and nothing happens. I don’t even get any error messages to see what might be the mistake.

    All it says is 0 pages transferred… Can someone please help me out here?

    Thank You

  35. Loren says:

    Check my post above Adi, on June 23-24. I was having similar problems, and I detailed what worked for me.

  36. Tomek says:

    Hi,

    Any news on a version that will upload the images as well? I’m learning WP as I transfer an old site to WP, and this seems like something that would be interesting. I’d be happy to test betas etc. if need be.

    Thanks,

    Tomek

  37. Jay Collier says:

    For anyone who is having problems, check to make sure there are no brackets in your content and title tag fields.

    Even though the instructions clearly say to remove them, I left them in the title field and saw both of the problems reported earlier: proliferating posts and xml errors.
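Jay’s tip amounts to entering `title` rather than `<title>`. As a small illustration of that clean-up, the sketch below reduces whatever was typed to a bare tag name. The helper `normalize_tag_name` is hypothetical, and the post does not say whether the plugin performs any such normalization itself, which is presumably why the instructions ask for the brackets to be removed.

```python
def normalize_tag_name(value):
    """Reduce input such as '<title>' or ' </H1> ' to the bare, lowercase tag name."""
    return value.strip().strip("<>/ ").lower()


if __name__ == "__main__":
    for typed in ("<title>", " </H1> ", "div"):
        print(repr(typed), "->", normalize_tag_name(typed))
```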

  38. Jay Collier says:

    Working great! Thanks for this important plugin, Stephanie.

  39. Alistair says:

    I just thought I might be able to help a bit. I was also getting the warning messages but did a couple of things. I moved all the files into a folder and placed it in the same folder as the import plugin as someone mentioned above.

    I still got the warning errors but realised where it says no such directory and then the path, the first part of this was the path that I wasn’t using, for example /home4/example/public_html/

    I was using a path from /public_html/…… where I should have been using the path from /home4/example/public_html/

    I hope this makes sense.

    To maybe make it plainer, check the error message for the path you should be using.
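Alistair’s suggestion can be mechanized: PHP warnings end with “in <file> on line <n>”, and that file path is absolute. The sketch below pulls the path out of such a message and trims it to the directory that contains `wp-content`. The function name and the sample message are hypothetical, and the HTML files may live elsewhere under the same prefix, so treat the result as a starting point for the beginning directory rather than the answer.

```python
import re


def wordpress_root_from_warning(message):
    """Extract the directory containing wp-content from a PHP warning message."""
    match = re.search(r" in (\S+) on line \d+", message)
    if match is None:
        return None
    path = match.group(1)
    index = path.find("/wp-content/")
    return path[:index] if index != -1 else None


if __name__ == "__main__":
    warning = (
        "Warning: scandir(/public_html/old) [function.scandir]: failed to open dir: "
        "No such file or directory in /home4/example/public_html/wp-content/plugins/"
        "import-html-pages/html-import.php on line 554"
    )
    print(wordpress_root_from_warning(warning))
```

This sketch assumes forward-slash paths without spaces; Windows-style paths would need a different pattern.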

  40. Hi Steph:
    I am using “Beginning WordPress 3″ (which is great!) as my guide to converting an old static HTML website to WordPress. I would like to use the “Import HTML pages plugin”, but it fails to find the directories with the HTML files. Both the HTML files and WordPress are on the same server at netfirms.com.
    I have the beginning directory as /www/natasar.org (this is where the old home page is, with a number of subfolders). WordPress is installed in /www/natasar.org/wp/.

    The first error message is:
    Warning: scandir(/www/natasar.org) [function.scandir]: failed to open dir: No such file or directory in /mnt/w0337/d42/s41/b026f150/www/natasar.org/wp/wp-content/plugins/import-html-pages/html-import.php on line 554

    At this point I’m stuck. Do you have any suggestions? Thanks in advance for any help.

    • steph says:

      The beginning directory needs to include the full path from the server root, so in this case it should probably be /mnt/w0337/d42/s41/b026f150/www/natasar.org.

      (Glad you like the book!)

  41. Hi Steph, do you have a CVS server set up yet? I am going to try this plugin for my own purposes but make it a bit more user-friendly when time permits, and I would like to submit it back to you using an SVN or CVS server setup.

    Obviously, for beginners, an explanation of relative paths from local documents is needed as well. One of the issues I foresee with Word docs is that some file owners lock them down against exporting the images.

  42. NickStrong says:

    Hi Stephanie.
    I am testing/sandboxing the plugin with a small sub-set of my large 90+ page / 1,000 image site. Am getting the pages to import ok (will need a fair amount of tweaking, but that’s ok) having cleaned them up drastically beforehand.
    However, the big bugbear is not bringing in the images -
    [QN1] I gather this is a feature not yet included, and not my own error somewhere along the line . . ?
    [QN2] Assuming this is the case, would it be possible to set all imported pages to look for their images in one place (=folder_A), use WP to import all images into the Gallery (=folder_B), and then do a Search and Replace on the code to substitute folder_B for folder_A . . ? Would this work, would WP actually recognise and properly incorporate the images within the pages . . ?
    .
    Would much appreciate your thoughts/comments/suggestions/workarounds. Nick

    • NickStrong says:

      Update. [I should add that I'm working offline]
      I actually changed all the src/href’s throughout the site to static ones, of the form src=”x:/xxx/xxx/wordpress/wp-content/images/xxx.jpg” before importing the pages. I also imported all the images, in WP Gallery, into that /images/ folder. When I imported the pages all the images are present and correct. So that’s good.
      .
      However . . None of my internal site links work. A link to pagexxx.html becomes broken because that imported page is now referenced/called ‘page=xxx’ or ‘pagename’.
      Assistance would be appreciated. I must be doing something wrong, somewhere.

      • NickStrong says:

        ok, I realise this is the function of the .htaccess file. Would 100 filename redirects be too many there? It would be good if the plugin could (say) create a page slug from the original page file name, then the Permalink could be set to use this and pre-existing links would still work . . (??)
        I’m going to try changing all my page titles to page file names, then importing.

      • NickStrong says:

        ok, changing html page titles to page file names, then importing, works.
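Nick’s workaround relies on the page slug matching the old file name. The sketch below shows one way a slug could be derived directly from a file name; `slug_from_filename` is a hypothetical helper and does not reproduce WordPress’s own title sanitization rules exactly.

```python
import os
import re


def slug_from_filename(path):
    """Derive a permalink-style slug from a file name, ignoring its extension."""
    stem = os.path.splitext(os.path.basename(path))[0]
    # Collapse every run of characters other than a-z and 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", stem.lower())
    return slug.strip("-")


if __name__ == "__main__":
    for name in ("pages/200mpeanutdash_sub1.html", "About Us.HTM"):
        print(name, "->", slug_from_filename(name))
```

Existing links that still end in `.html` would continue to need redirects in .htaccess, since the slug itself carries no extension.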

  43. Pat J says:

    Just a note — if you have a file of zero size in among the files to be imported, the plugin will fail with an error indicating that neither fopen nor file_read_contents is available. Removing the zero-size file — or renaming it to something the plugin will ignore — cures the problem.
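A quick pre-flight check along the lines Pat describes: the sketch below walks a directory tree and lists files with an importable extension whose size is zero, so they can be removed or renamed before the import. The function name and the default extension list (taken from the options above) are illustrative choices.

```python
import os


def find_empty_files(root, extensions=("html", "htm", "php")):
    """Yield paths under `root` with a matching extension and a size of zero bytes."""
    suffixes = tuple("." + ext.lower() for ext in extensions)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            if name.lower().endswith(suffixes) and os.path.getsize(path) == 0:
                yield path


if __name__ == "__main__":
    for empty in find_empty_files("."):
        print("zero-size file:", empty)
```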

  44. JohnG says:

    Enjoying your book. I’m trying to determine if WP is right for my current sites that are in Joomla and Dreamweaver and your book is giving me confidence that it is.

    Quick question on the plugin…is it compatible with WordPress 3? Just asking as the installer says it hasn’t been tested with 3.0.1.

  45. Dominic says:

    Hi Stephanie,

    Thank you for writing this plugin. It has enabled me to migrate a large legacy site to WordPress in just a couple of days. I can certainly confirm that it works with WP V3.0.

    All the best,

    Dominic

  46. Derek Lim says:

    Hi Stephanie,

    Your plugin is an absolute lifesaver. I was prepared to have to fork out large sums of money to convert my legacy site (1000+ articles) to WordPress. I will probably still have to do lots of work to get rid of unnecessary code/formatting, but 70% of the job is already done by your plugin.

    It definitely works with WP 3.01 and with no issues thus far.

    Derek

  47. woodwolf says:

    Can’t see download button on http://wordpress.org/extend/plugins/import-html-pages/
    Can you give me a link ?

    • steph says:

      Woodwolf, the WordPress site administrators have been working on the plugin repository for the last few weeks. The missing download buttons were a temporary glitch. You should be able to download the plugin now.
