Sillybean

Import HTML Pages

This plugin will import a directory of files as either pages or posts, according to configurable settings. You may specify the HTML tag containing the content you want to import (e.g. <body>, <div id="content"> or <td width="732">) or the name of a Dreamweaver template region (e.g. “Main Content”).
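
To make the "tag containing the content" idea concrete, here is a minimal Python sketch of pulling the inner HTML out of one element, such as <div id="content">. It is not the plugin's code (the plugin is written in PHP); the names RegionExtractor and extract_region are made up for illustration, and the sketch assumes the target tag is properly closed. HTML comments inside the region are dropped.

```python
from html.parser import HTMLParser


class RegionExtractor(HTMLParser):
    """Collect the inner HTML of the first element matching a tag and attribute."""

    def __init__(self, tag, attr=None, value=None):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.tag, self.attr, self.value = tag.lower(), attr, value
        self.depth = 0      # 0 means we are outside the target element
        self.done = False   # True once the first match has been closed
        self.parts = []

    def _matches(self, tag, attrs):
        if tag != self.tag:
            return False
        return self.attr is None or dict(attrs).get(self.attr) == self.value

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag == self.tag:
                self.depth += 1  # nested element with the same name
        elif not self.done and self._matches(tag, attrs):
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> never change the nesting depth.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # closing tag of the target itself
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def extract_region(html, tag, attr=None, value=None):
    """Return the inner HTML of the first matching element, or '' if none."""
    parser = RegionExtractor(tag, attr, value)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

For example, extract_region(page, "div", "id", "content") keeps only what sits inside that div, and extract_region(page, "body") keeps the whole body.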

If importing pages, the directory hierarchy will be preserved. Directories containing the specified file types will be imported as empty parent pages. Directories that do not contain the specified file types will be ignored.
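
The directory rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the plugin's actual logic: plan_pages is a made-up name, and the source does not say what happens to a directory that has no matching files itself but has subdirectories that do, so this sketch simply skips the parent page for it and keeps walking.

```python
import os


def plan_pages(root, extensions=("html", "htm"), skip_dirs=("images", "css")):
    """Return (relative_path, kind) pairs, where kind is 'parent' or 'page'."""
    wanted = tuple("." + ext.lower() for ext in extensions)
    plan = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
        matches = sorted(f for f in filenames if f.lower().endswith(wanted))
        if not matches:
            continue  # no importable files here: the directory is ignored
        rel = os.path.relpath(dirpath, root)
        if rel != ".":
            plan.append((rel, "parent"))  # empty parent page for the directory
        for name in matches:
            plan.append((os.path.normpath(os.path.join(rel, name)), "page"))
    return plan
```

Calling plan_pages on a site folder gives you a top-down list in which each directory's parent page appears before the pages inside it.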

As files are imported, the resulting IDs, permalinks, and titles will be displayed. On completion, the importer will provide a list of Apache redirects that can be used in your .htaccess file to seamlessly transfer visitors from the old file locations to the new WordPress posts or pages.
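
As a rough illustration of what such a redirect list could look like, the Python sketch below turns a mapping of old file paths to new permalinks into Apache Redirect lines (the mod_alias directive). The exact format the importer prints is not shown here, so treat this as an assumption; redirect_rules and the example.com URLs are placeholders.

```python
def redirect_rules(mapping, status=301):
    """Build one Apache 'Redirect' line per old path -> new URL pair."""
    lines = []
    for old_path, new_url in mapping.items():
        old = "/" + old_path.lstrip("/")  # Redirect expects a path starting with "/"
        if " " in old:
            old = f'"{old}"'  # quote arguments that contain spaces
        lines.append(f"Redirect {status} {old} {new_url}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(redirect_rules({
        "about/team.html": "http://example.com/about/team/",
        "index.html": "http://example.com/",
    }))
```

The printed lines can be pasted into .htaccess; a 301 status tells browsers and search engines the move is permanent.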

Options:

  • import pages or posts
  • specify content and title as HTML tags or Dreamweaver template regions
  • remove a common phrase (such as the site name) from imported titles
  • specify file extensions to import (e.g. html, htm, php)
  • specify directories to exclude (e.g. images, css)
  • if importing pages, specify whether your top-level files should become top-level pages or children of an existing page
  • choose tags, categories, and custom taxonomies
  • choose status, author, and timestamp
  • use meta descriptions as excerpts
  • choose whether to clean up bad (Word, FrontPage) HTML

Requires PHP 5.

Download at wordpress.org »

Support note: If you have a problem importing your files, please provide a screenshot of your import settings and an example of one of your files. Without these things, it’s difficult to diagnose your problem.

Screenshots:

  • Files to be imported
  • Imported pages
  • Options screen (new: clean up Word HTML option)
  • Results: imported pages and rewrite rules

122 Responses to “Import HTML Pages”

  1. RHCdG says:

    Hi Stephanie,

    I appreciate the time you’re putting in to help everybody out, including me. I did, however, send you a sample HTML file through the email form on your website; perhaps you did not receive it? Or maybe it did not give you all the information you need? Are there any other settings you need to know to fix this problem? Perhaps it would be useful if you could list the exact details or settings you need to solve this problem?

    Thanks,
    Rutger

    • Stephanie says:

      Hey, Rutger. I’m sorry, I don’t seem to have received your email. However, I think I know what the problem is. Did you by any chance erase the preset options for directories to skip? If so, put this back in:

      .,..

      (Period, comma, two periods.) I think I need to hardcode that into the plugin; otherwise, if you take it out and you’re on a UNIX-based OS, the importer will trip over all the hidden system files.

      Try that out and let me know how it goes!

  2. Stephanie – Thank you for this fabulous tool. I’m using it to import 100s of pages from an old Radio Userland site into a WordPress blog. This presented a few new problems: multiple posts on the same date are stored on a single HTML page, and the post date has to be retrieved either from the URL or a bolded item in the page.

    I was able to modify your code to handle this, pulling out the code that handles each new page onto its own, and from there extracting the “identify the post and insert it” code into a second new function, which I then installed into a loop. Happy to send it for your review/inclusion if you like. It’s more or less clean.

    • Stephanie says:

      Thanks, Chris, that would be great! I was just thinking earlier today that I need to create some more flexible options for dates and bylines, but it’ll be about two months before I can devote the time.

  3. RHCdG says:

    Hi Stephanie,

    Do you mean I should put .,.. before each directory name in the box under “Skip directories with these names”?

    Like so:

    .,..audio, .,..pdf, .,..pics

    ?

    I tried this, and the import was different than before but I got scared when it started displaying “cgi-bin” hundreds of times during the import… I hit the Stop-button on my browser, and I now have 820 “planned”, i.e. empty pages.

    I know from others in this thread that your tool works miracles, but how I wish I would get it to work for my site!

    Thanks,

    Rutger

    PS I did upgrade to the new version of the plugin!

  4. RHCdG says:

    Hi Stephanie,

    I just emailed you a screenshot; thanks again for your help!

    Rutger

  5. lopo says:

    Hi,
    Nice plugin, but I have this problem:

    I have one HTML page, and in this page I have several articles.

    The plugin works fine for the first article, but the others are ignored.

    Example:

    article1
    content

    article2
    content

    article3
    content

    Can someone help me?
    Thanks in advance.

  6. David says:

    Hi Stephanie,
    Super great tool. Thanks so much! Quick question – even though i include in the Allowed HTML field, still I can’t get the tool to keep my BRs. I’m using Version 1.21. Thanks for any pointers!
    David

    • David says:

      What I meant to say is:
      Hi Stephanie,
      Super great tool. Thanks so much! Quick question – even though I include the BR tag in the Allowed HTML field, I still can’t get the tool to keep my BRs. I’m using Version 1.21. Thanks for any pointers!
      David
