In the last article, I showed you how to extract images from a Word document. In this article, I’m going to reuse some of that code and expand it to detect text formatting, then turn that formatting into HTML. I recently did a project for Envato called WordPress Auto Publisher, in which a Windows service picks up stored Word documents and uploads them to WordPress. For that process to work, I had to turn each Word document into HTML.
This article will describe how to get the following formatted text from Word:
- Bold
- Underline
- Italics
- Highlighted
- Strike through
- Colored text (if it’s something other than standard black)
Prerequisites
The prerequisites are the same as in the last article, so please take a look at the first article in this series before you read this one if you’re unsure where to start. You need the same using statements, and Microsoft’s Open XML SDK (the DocumentFormat.OpenXml package) must be installed from NuGet for your project.
I’m also using the same button event with the same code that retrieves the file and sends it to a function that does the actual translation from Word to HTML. The button is placed on a WPF window. Here is the button’s event function code again.
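    private void button_Click(object sender, RoutedEventArgs e)
    {
        // Build the path to the test document that sits next to the executable.
        string path = System.IO.Path.Combine(
            System.IO.Path.GetDirectoryName(Process.GetCurrentProcess().MainModule.FileName),
            @"TestFiles\testfilewithformatting.docx");

        // Open read-only; the using blocks close the stream and the document,
        // so no explicit Flush or Close is needed.
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(fs, false))
        {
            MainDocumentPart mainPart = wdDoc.MainDocumentPart;
            Body body = mainPart.Document.Body;
            if (body != null)
            {
                // Keep the result so it can be used (saved, uploaded, and so on).
                string html = ConvertWordToHTML(body, mainPart);
            }
        }
    }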
Notice that I call a “ConvertWordToHTML” method in this article, and this will be used to loop through our Word document.
Creating the Main Loop for Each Paragraph
If you recall from the last article, Word documents from 2007 onward (the .docx format) are made up of XML stored inside a ZIP package. Because the format is plain XML, you can parse these documents without even having Microsoft Office installed on the computer that runs this code.
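To make the "made up of XML" point concrete, here is a minimal sketch that creates a tiny document in memory with the Open XML SDK and then lists the entries of the resulting package with the standard ZipArchive class. The class and method names (DocxPackageDemo, ListParts) are made up for illustration, and the sketch assumes the DocumentFormat.OpenXml NuGet package is referenced.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the article's project.
public static class DocxPackageDemo
{
    // Creates a one-paragraph .docx in memory and returns the names
    // of the entries inside its ZIP package.
    public static string[] ListParts()
    {
        using (var stream = new MemoryStream())
        {
            // Disposing the document writes the package into the stream.
            using (WordprocessingDocument doc =
                WordprocessingDocument.Create(stream, WordprocessingDocumentType.Document))
            {
                MainDocumentPart main = doc.AddMainDocumentPart();
                main.Document = new Document(
                    new Body(new Paragraph(new Run(new Text("Hello")))));
            }

            // Re-read the same bytes as an ordinary ZIP archive.
            stream.Position = 0;
            using (var zip = new ZipArchive(stream, ZipArchiveMode.Read))
            {
                return zip.Entries.Select(entry => entry.FullName).ToArray();
            }
        }
    }

    public static void Main()
    {
        foreach (string name in ListParts())
        {
            Console.WriteLine(name);
        }
    }
}
```

The listing should include the main document part (word/document.xml), which is where the paragraphs and runs discussed in this article live.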
Word documents are made up of paragraphs that are made up of runs. A paragraph could have 20 runs embedded in it. You need a loop that goes through each paragraph and then you need an embedded second loop that goes through each run.
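The paragraph-and-run structure is easier to see with a small in-memory example. The sketch below builds the body Word would store for "Hello world" with only "world" in bold, then walks it with the same two nested loops. The names RunStructureDemo, BuildSampleBody and DescribeRuns are invented for this illustration; only the Open XML SDK types come from the library.

```csharp
using System;
using System.Collections.Generic;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical demo class, not part of the article's project.
public static class RunStructureDemo
{
    // One paragraph with two runs: plain "Hello " and bold "world".
    public static Body BuildSampleBody()
    {
        return new Body(
            new Paragraph(
                new Run(new Text("Hello ")),
                new Run(new RunProperties(new Bold()), new Text("world"))));
    }

    // Returns one description per run, in document order.
    public static List<string> DescribeRuns(Body body)
    {
        var lines = new List<string>();
        int paragraphIndex = 0;
        foreach (Paragraph par in body.Descendants<Paragraph>())
        {
            int runIndex = 0;
            foreach (Run run in par.Descendants<Run>())
            {
                // A run without direct formatting has no RunProperties child.
                bool hasProps = run.RunProperties != null;
                lines.Add($"paragraph {paragraphIndex}, run {runIndex}: " +
                          $"\"{run.InnerText}\" (has RunProperties: {hasProps})");
                runIndex++;
            }
            paragraphIndex++;
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in DescribeRuns(BuildSampleBody()))
        {
            Console.WriteLine(line);
        }
    }
}
```

Word splits the sentence into two runs because the formatting changes partway through, which is why the conversion can treat each run as a unit.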
The code is very similar to the last article’s code except in this one we just need to know the run properties.
    private string ConvertWordToHTML(Body content, MainDocumentPart wDoc)
    {
        string htmlConvertedString = string.Empty;
        foreach (Paragraph par in content.Descendants<Paragraph>())
        {
            foreach (Run run in par.Descendants<Run>())
            {
                // RunProperties holds the formatting applied directly to this run.
                RunProperties props = run.RunProperties;
                htmlConvertedString += ApplyTextFormatting(run.InnerText, props);
            }
        }
        return htmlConvertedString;
    }
Compared to the last article, this method is much smaller, but it calls a third method, “ApplyTextFormatting,” which we’ll get to in a bit. The important step in this method is retrieving the run properties and assigning them to the “props” variable. This object holds the formatting applied directly to the run. Note that RunProperties can be null when a run has no direct formatting, so the receiving method has to allow for that. We send the actual text (contained in the InnerText property) and the run’s properties to the ApplyTextFormatting method.
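The loop above joins every run into one long string, so paragraph boundaries are lost in the output. As a possible variation (not what the article's method does), the sketch below wraps each paragraph in p tags and collects the output in a StringBuilder. The class ParagraphLoopSketch and the formatRun parameter are hypothetical; formatRun simply stands in for ApplyTextFormatting so the example is self-contained.

```csharp
using System;
using System.Text;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical variation on the article's ConvertWordToHTML loop.
public static class ParagraphLoopSketch
{
    // formatRun stands in for ApplyTextFormatting; it receives the run text
    // and the run's properties (which may be null).
    public static string Convert(Body body, Func<string, RunProperties, string> formatRun)
    {
        var html = new StringBuilder();
        foreach (Paragraph par in body.Descendants<Paragraph>())
        {
            html.Append("<p>");
            foreach (Run run in par.Descendants<Run>())
            {
                html.Append(formatRun(run.InnerText, run.RunProperties));
            }
            html.Append("</p>");
        }
        return html.ToString();
    }

    public static void Main()
    {
        var body = new Body(
            new Paragraph(
                new Run(new Text("plain ")),
                new Run(new RunProperties(new Bold()), new Text("bold"))),
            new Paragraph(
                new Run(new Text("second"))));

        // A tiny formatter that only knows about bold, for demonstration.
        string result = Convert(body, (text, props) =>
            props?.Bold != null ? "<b>" + text + "</b>" : text);

        Console.WriteLine(result); // <p>plain <b>bold</b></p><p>second</p>
    }
}
```

Using a StringBuilder also avoids allocating a new string for every run, which starts to matter on long documents.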
Converting Word Formatting to HTML
Now for the method that does the actual conversion.
    private string ApplyTextFormatting(string content, RunProperties property)
    {
        // A run with no direct formatting has no RunProperties; return the text as-is.
        if (property == null)
        {
            return content;
        }

        StringBuilder buildString = new StringBuilder(content);

        // Each check wraps the text built so far in one more pair of tags.
        if (property.Bold != null)
        {
            buildString.Insert(0, "<b>");
            buildString.Append("</b>");
        }
        if (property.Italic != null)
        {
            buildString.Insert(0, "<i>");
            buildString.Append("</i>");
        }
        if (property.Underline != null)
        {
            buildString.Insert(0, "<u>");
            buildString.Append("</u>");
        }
        if (property.Color != null && property.Color.Val != null)
        {
            buildString.Insert(0, "<span style=\"color: #" + property.Color.Val + "\">");
            buildString.Append("</span>");
        }
        if (property.Highlight != null && property.Highlight.Val != null)
        {
            buildString.Insert(0, "<span style=\"background-color: " + property.Highlight.Val + "\">");
            buildString.Append("</span>");
        }
        if (property.Strike != null)
        {
            buildString.Insert(0, "<s>");
            buildString.Append("</s>");
        }
        return buildString.ToString();
    }
When a user formats text, the corresponding format object is populated. For instance, if text is in bold, the Bold property object isn’t null. For typical documents, that’s all we need to know to convert it to HTML. Strictly speaking, an element can be present but switched off (a Bold element whose Val is false, or an Underline whose Val is none), so the null check is a simplification. I used a StringBuilder variable to insert the corresponding HTML tag in front of the text and the closing tag at the end of the run. When you format text in Word, the formatted text makes up the entire run, so you know that the content you pass to this method is either formatted as a whole or not at all. If it has no formatting, the text is returned unchanged. The great thing about this method is that if several formats are applied at once (for instance, bold and underline), the method adds both HTML tags to the content, each new pair wrapping the previous one.
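Two refinements may be worth considering on top of the null checks: treating a toggle element whose Val is false as "off", and HTML-encoding the run text so characters such as < and & survive the conversion. The sketch below shows one way to do both for the simple on/off properties. The names ToggleSketch, IsOn, Format and Wrap are hypothetical and not part of the article's code; it relies on Bold, Italic and Strike deriving from OnOffType in the Open XML SDK, and on WebUtility.HtmlEncode from System.Net.

```csharp
using System;
using System.Net;
using System.Text;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical refinement of the article's ApplyTextFormatting.
public static class ToggleSketch
{
    // True when the element is present and not explicitly switched off.
    // A missing Val attribute means "on".
    public static bool IsOn(OnOffType toggle)
    {
        return toggle != null && (toggle.Val == null || toggle.Val.Value);
    }

    // Encodes the text, then wraps it in tags for each active toggle.
    public static string Format(string text, RunProperties props)
    {
        string encoded = WebUtility.HtmlEncode(text);
        if (props == null)
        {
            return encoded; // no direct formatting on this run
        }

        var html = new StringBuilder(encoded);
        if (IsOn(props.Bold)) Wrap(html, "b");
        if (IsOn(props.Italic)) Wrap(html, "i");
        if (IsOn(props.Strike)) Wrap(html, "s");
        return html.ToString();
    }

    // Surrounds the current content with an opening and closing tag.
    private static void Wrap(StringBuilder html, string tag)
    {
        html.Insert(0, "<" + tag + ">");
        html.Append("</" + tag + ">");
    }

    public static void Main()
    {
        var bold = new RunProperties(new Bold());
        var boldOff = new RunProperties(new Bold { Val = new OnOffValue(false) });

        Console.WriteLine(Format("a < b", bold));     // <b>a &lt; b</b>
        Console.WriteLine(Format("not bold", boldOff)); // not bold
        Console.WriteLine(Format("plain", null));     // plain
    }
}
```

Underline, Color and Highlight are not simple toggles, so they would still need their own handling along the lines of the original method.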
Here is the entire page of code including the button event.
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using System.Windows;
    using System.Windows.Controls;
    using System.Windows.Data;
    using System.Windows.Input;
    using System.Windows.Media;
    using System.Windows.Media.Imaging;
    using System.Windows.Navigation;
    using System.Windows.Shapes;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;
    using System.Diagnostics;

    namespace DetectWordFormatting
    {
        /// <summary>
        /// Interaction logic for MainWindow.xaml
        /// </summary>
        public partial class MainWindow : Window
        {
            public MainWindow()
            {
                InitializeComponent();
            }

            private string ConvertWordToHTML(Body content, MainDocumentPart wDoc)
            {
                string htmlConvertedString = string.Empty;
                foreach (Paragraph par in content.Descendants<Paragraph>())
                {
                    foreach (Run run in par.Descendants<Run>())
                    {
                        // RunProperties holds the formatting applied directly to this run.
                        RunProperties props = run.RunProperties;
                        htmlConvertedString += ApplyTextFormatting(run.InnerText, props);
                    }
                }
                return htmlConvertedString;
            }

            /// <summary>
            /// Apply Word style in HTML and return a string with the HTML tags
            /// </summary>
            /// <param name="content">The text of the run</param>
            /// <param name="property">The run's properties (may be null)</param>
            /// <returns>string</returns>
            private string ApplyTextFormatting(string content, RunProperties property)
            {
                // A run with no direct formatting has no RunProperties; return the text as-is.
                if (property == null)
                {
                    return content;
                }

                StringBuilder buildString = new StringBuilder(content);
                if (property.Bold != null)
                {
                    buildString.Insert(0, "<b>");
                    buildString.Append("</b>");
                }
                if (property.Italic != null)
                {
                    buildString.Insert(0, "<i>");
                    buildString.Append("</i>");
                }
                if (property.Underline != null)
                {
                    buildString.Insert(0, "<u>");
                    buildString.Append("</u>");
                }
                if (property.Color != null && property.Color.Val != null)
                {
                    buildString.Insert(0, "<span style=\"color: #" + property.Color.Val + "\">");
                    buildString.Append("</span>");
                }
                if (property.Highlight != null && property.Highlight.Val != null)
                {
                    buildString.Insert(0, "<span style=\"background-color: " + property.Highlight.Val + "\">");
                    buildString.Append("</span>");
                }
                if (property.Strike != null)
                {
                    buildString.Insert(0, "<s>");
                    buildString.Append("</s>");
                }
                return buildString.ToString();
            }

            private void button_Click(object sender, RoutedEventArgs e)
            {
                // Build the path to the test document that sits next to the executable.
                string path = System.IO.Path.Combine(
                    System.IO.Path.GetDirectoryName(Process.GetCurrentProcess().MainModule.FileName),
                    @"TestFiles\testfilewithformatting.docx");

                // Open read-only; the using blocks close the stream and the document.
                using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
                using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(fs, false))
                {
                    MainDocumentPart mainPart = wdDoc.MainDocumentPart;
                    Body body = mainPart.Document.Body;
                    if (body != null)
                    {
                        // Keep the result so it can be used (saved, uploaded, and so on).
                        string html = ConvertWordToHTML(body, mainPart);
                    }
                }
            }
        }
    }
3 Comments
Engin
htmlConvertedString = ApplyTextFormatting(run.InnerText, props); will only return the last line.
jennifer
Nice catch! Thanks! fixed
Convert Word Document to HTML in C#
This code only works for .docx files; it throws errors for .doc files.