Converting RTF to Plain text (Unicode)

Post by 星际浪子 » 27 Jun 2011 14:03

Assume you've got a queue of 47,600 Rich Text Format (RTF) files, and you want to convert them all to plain text.
Indeed, I came across that problem today.

Let's work out some of the possibilities:

1. Do it by hand

If we open each RTF file in WordPad and save it in another format by hand, we'll be busy for an eternity. Good luck.
Next option?

2. Using AutoIt

I discovered AutoIt some time ago. AutoIt makes it fairly easy to automate all kinds of window operations.
If you want to see an example, click here (it solves a minesweeper game).

"AutoIt is a freeware BASIC-like scripting language designed for automating the Windows GUI and general scripting." (quoted from AutoItScript.com)

With a simple script it's a piece of cake to open an RTF file and save it as a (Unicode) plain text file. Exactly what I wanted!

; Mode 2 matches any substring of the window title
AutoItSetOption("WinTitleMatchMode", 2)
; Assuming WordPad has already been opened
WinActivate("- WordPad")
; Read mydirectory\ for RTF files
$hFilesFolders = _FileListToArrayEx("mydirectory")
For $c = 1 To $hFilesFolders[0]
    ; Wait until the WordPad window is ready to accept keystrokes
    WinWaitActive("- WordPad")
    ; _FileListToArrayEx() returns paths like mydirectory\1.rtf;
    ; we only want 1.rtf, which is the last element ($rtf[0] holds the count)
    $rtf = StringSplit($hFilesFolders[$c], "\")
    ; Send() sends keystrokes to the WordPad window
    ; CTRL+O > 1.rtf > TAB > R > ENTER opens the RTF file
    Send("^o" & $rtf[$rtf[0]] & "{TAB}r{ENTER}")
    ; ALT+B > S > TAB > U > ENTER > ENTER saves it as a Unicode plain text file
    ; (these menu shortcuts depend on WordPad's display language)
    Send("!bs{TAB}u{ENTER}{ENTER}")
Next

You will also need the _FileListToArrayEx() function, which is included in the .au3 script you can download here: download autoit-script.

This script does its job quite well, but with a directory of 47,000 files (and growing), opening the "Open" and "Save" dialogs takes approx. 3 seconds each .. blame Windows. If you want to save 2 seconds per file, you can skip the Open dialog and use the AutoIt Run() command instead.

Run("write.exe C:\yourpath\1.rtf")

AutoIt is worth a try if you need to convert fewer than 10,000 files (which probably applies to most of you). Once the file count goes beyond that, AutoIt is rather slow to complete the task.
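To put those numbers in perspective, here is a rough back-of-the-envelope estimate in Python. It only counts the dialog overhead quoted above (about 3 seconds per dialog, 2 seconds saved by Run()) and ignores the conversion itself, so treat the result as a lower bound, not a measurement.

```python
# Rough time estimate for the WordPad/AutoIt approach, using the
# approximate figures quoted in the post. Dialog overhead only.
FILES = 47_600            # size of the queue
DIALOG_SECONDS = 3        # approx. cost of one Open or Save dialog
RUN_SAVING_SECONDS = 2    # time saved per file by using Run() instead of Open


def hours(seconds: float) -> float:
    """Convert seconds to hours."""
    return seconds / 3600


# One Open dialog plus one Save dialog for every file
with_dialogs = FILES * 2 * DIALOG_SECONDS
# Same, minus the saving from launching WordPad via Run()
with_run = FILES * (2 * DIALOG_SECONDS - RUN_SAVING_SECONDS)

print(f"Open + Save dialogs: {hours(with_dialogs):.1f} h")
print(f"Run() + Save dialog: {hours(with_run):.1f} h")
```

That works out to roughly 79 hours with both dialogs and about 53 hours with Run(), which is why the GUI route stops being practical at this file count.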

3. Using a COM Object (in VBScript)


I tested a few DLLs:

EasyByte

EasyByte does a good job converting the files, but they ask $399 for their RTF-2-HTML (v8) DLL. I would've paid $10 for it, but $399 is an INSANE price. Their trial DLL is frigging annoying nagware: it displays a message box on every conversion and only allows a limited number of conversions.

Dim XLApp
Set XLApp = CreateObject("EasyByte.RTF2HTMLv8")
XLApp.LicenseKey = "DEMO"
' rtftext holds the raw contents of the RTF file
XLApp.RTF_Text = rtftext
XLApp.CleanRTF = "yes"
Dim converted
converted = XLApp.ConvertRTFPlain()

Make sure that rtftext contains the text of the RTF file you wish to convert. If you want to see a working example that converts a whole directory, download this VBScript.

Microsoft Word

If you've got Microsoft Word (darn, I haven't, I've got OpenOffice), it's fairly easy to create a batch job that converts files from RTF to plain text. Here's an example in VB:

Dim objWord As Word.Application
Dim objDoc As Word.Document
Set objWord = New Word.Application
Set objDoc = objWord.Documents.Open("C:\Temp\Test01.rtf")
' For Unicode output, Word also offers the wdFormatUnicodeText constant
objDoc.SaveAs "C:\Temp\Test01.txt", wdFormatText
objDoc.Close
objWord.Quit

It's probably the best solution of them all. Unfortunately, I couldn't use Word to convert my files.
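For a whole directory, the same Word automation can be driven from a loop. The following is a minimal, untested sketch in Python rather than VB. It assumes Windows, an installed copy of Word and the pywin32 package. The helper names plan_conversions and convert_all are made up for illustration. The value 7 is Word's wdFormatUnicodeText constant.

```python
from pathlib import Path

WD_FORMAT_UNICODE_TEXT = 7  # Word's wdFormatUnicodeText constant


def plan_conversions(folder):
    """Pair every .rtf file in `folder` with a .txt path of the same name."""
    folder = Path(folder).resolve()  # Word needs absolute paths
    return [(p, p.with_suffix(".txt")) for p in sorted(folder.glob("*.rtf"))]


def convert_all(folder):
    """Convert every RTF file in `folder` using a single Word instance."""
    # pywin32 is imported here so that plan_conversions() works without it
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        for src, dst in plan_conversions(folder):
            doc = word.Documents.Open(str(src))
            # Second positional argument of SaveAs is the file format
            doc.SaveAs(str(dst), WD_FORMAT_UNICODE_TEXT)
            doc.Close(False)  # False = do not save changes to the source
    finally:
        word.Quit()  # always shut Word down, even after an error
```

Calling convert_all(r"C:\Temp") would then write a .txt file next to each .rtf file. Starting Word once and reusing it for every file avoids paying the application start-up cost 47,600 times.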

4. Using a PHP script

Since I'm a PHP enthusiast, I wanted to convert my RTF files to plain text using PHP.
You'll need an RTF-to-text class, which can be found on phpclasses.org, or you can download the class here.

require("parser.class.php");
// $rtf holds the raw contents of the RTF file
$r = new rtf(stripslashes($rtf));
$r->output("html");
$r->parse();
// Only print the result if the parser reported no errors
if (count($r->err) == 0) {
    echo strip_tags($r->out);
}

The output is HTML, therefore we use strip_tags() to get the plain text output. Since this method isn't bulletproof, and the class lacks support for some RTF layout features, I ditched the PHP class.
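One reason the strip_tags() route is fragile: it only removes tags, so HTML entities such as &amp; stay encoded and paragraph boundaries vanish. As an illustration, here is a small sketch in Python (not PHP) that decodes entities and turns block-level tags into line breaks. It assumes the HTML uses ordinary tags like <p> and <br>, which I have not verified for the class above. TextExtractor and html_to_text are names invented for this example.

```python
from html.parser import HTMLParser

# Tags that should start a new line in the plain text output
BLOCK_TAGS = {"p", "br", "div", "li", "tr", "h1", "h2", "h3"}


class TextExtractor(HTMLParser):
    """Collect text nodes, turning block-level tags into line breaks."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; and &eacute;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text content of `html`, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


print(html_to_text("<p>Tom &amp; Jerry</p><p>caf&eacute;<br>menu</p>"))
```

This prints three lines (Tom & Jerry, café, menu). It still cannot recover RTF features that the converter never emitted in the first place.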

5. Conclusion

For ****s sake. How difficult can it be! The number of files to convert is the only factor that poses problems in my case. Otherwise the AutoIt script would've been more than sufficient.
Now I'll have to install Microsoft Word so I can use its COM object, which will do the job.
