Word to HTML
Javadocx Advanced and Premium licenses can transform DOCX files to HTML using native Java classes.
There are currently two ways to transform Word to HTML with Javadocx:
- With the conversion plugin
- With the TransformDocAdvHTML native Java class
The conversion plugin runs LibreOffice or OpenOffice to perform the conversion. This method has two disadvantages: it is not native Java, so it requires calling external programs, and the output can only be customized by modifying it with the Java DOM after the conversion.
The native Java classes included in Advanced and Premium licenses transform DOCX to HTML using Java exclusively. The main features of this functionality are the following:
- Conversion of contents, styles and properties
- Native Java classes
- Easily customizable
- Transforms DOCX files created from scratch or from templates
The transformation can be done using just three lines of code:
where document.docx can be a DOCX created with Javadocx or from another source (MS Word, LibreOffice, etc.). Premium licenses can also transform in-memory documents.
Javadocx parses contents, styles, properties and other XML contents.
The currently parsed contents and styles include the following (each entry gives the OOXML content or style and its HTML/CSS transformation):
-
document (w:body) : <body>
- background color (w:background) => w:color (background-color)
- background image (v:background) => id (background-image)
- border (w:pgBorders) => w:top (border-top), w:bottom (border-bottom), w:left (border-left), w:right (border-right): w:color (border-color: #HEX), w:sz (border-width), w:val (border-style: nil, none, dashed, dotted, double, solid), w:space (padding)
-
sections (w:sectPr) : <section>
- size (w:pgSz) => w:w (max-width)
- margin (w:pgMar) => w:top (margin-top), w:bottom (margin-bottom), w:left (margin-left), w:right (margin-right)
-
title and metas (cp:coreProperties) : <title>, <meta>
- title (dc:title) => <title>
- author (dc:creator) => <meta> (author)
- description (dc:description) => <meta> (description)
- keywords (cp:keywords) => <meta> (keywords)
-
text strings (w:t) and text styles (w:rPr) : <span>
- text (w:t) => <span>
- bold (w:b) => w:val (font-weight: bold)
- color (w:color) => w:val (color: #HEX)
- double line through (w:dstrike) => w:val (text-decoration-style: double)
- font family (w:rFonts) => w:ascii (font-family)
- font size (w:sz) => w:val (font-size)
- highlight (w:highlight) => w:val (background-color)
- italic (w:i) => w:val (font-style: italic)
- line through (w:strike) => w:on (text-decoration: line-through)
- lower case (w:smallCaps) => w:val (text-transform: lowercase)
- text decoration (w:u) => w:val (text-decoration: none or underline; text-decoration-style: dashed, dotted, double, solid, wavy, none)
- upper case (w:caps) => w:val (text-transform: uppercase)
- vertical align (w:vertAlign) => w:val (vertical-align: sub; vertical-align: super)
-
paragraphs (w:pPr) : <p>
- background color (w:shd) => w:shd (background-color)
- bold (w:b) => w:val (font-weight: bold)
- border (w:pBdr) => w:top (border-top), w:bottom (border-bottom), w:left (border-left), w:right (border-right), w:color (border-color: #HEX), w:sz (border-width), w:val (border-style: nil, none, dashed, dotted, double, solid), w:space (padding)
- color (w:color) => w:val (color: #HEX)
- double line-through (w:dstrike) => w:val (text-decoration-style: double)
- font family (w:rFonts) => w:ascii (font-family)
- font size (w:sz) => w:val (font-size)
- heading (w:outlineLvl) => w:val (h1, h2, h3, h4, h5, h6)
- highlight (w:highlight) => w:val (background-color)
- italic (w:i) => w:val (font-style: italic)
- line height (w:spacing) => w:line (line-height)
- line through (w:strike) => w:on (text-decoration: line-through)
- lower case (w:smallCaps) => w:val (text-transform: lowercase)
- margin (w:ind, w:spacing) => w:left (margin-left), w:start (margin-left), w:right (margin-right), w:end (margin-right), w:after (margin-bottom), w:before (margin-top)
- padding (w:hanging) => w:hanging (padding-left, text-indent)
- page break (w:pageBreakBefore) => w:val (page-break-before: always)
- text align (w:jc) => w:val (text-align: left, justify, center, right)
- text decoration (w:u) => w:val (text-decoration: none or underline; text-decoration-style: dashed, dotted, double, solid, wavy, none)
- text indent (w:firstLine) => w:firstLine (text-indent)
- upper case (w:caps) => w:val (text-transform: uppercase)
- vertical-align (w:vertAlign) => w:val (vertical-align: sub; vertical-align: super)
- word wrap (w:shd) => w:val (word-wrap: break-word)
-
images (w:drawing) : <img>
- border (a:ln) => w (width), a:prstDash (style: dashed, dotted, solid), a:srgbClr (color)
- float (wp:positionH, wp:align) => right (float: right), left (float: left), center (display:block; margin-left: auto; margin-right: auto)
- height (wp:extent) => cy (height)
- link (a:hlinkClick) => r:id (href)
- margin (wp:effectExtent, wp:positionH, wp:positionV) => t (margin-top), r (margin-right), b (margin-bottom), l (margin-left), wp:positionH wp:posOffset (margin-left), wp:positionV wp:posOffset (margin-top)
- text wrapping (wp:inline, wp:anchor) => wp:inline (display: inline), wp:wrapSquare (float: left), wp:wrapNone behindDoc (position: absolute; z-index: -1)
- width (wp:extent) => cx (width)
- src (r:embed, r:link) => embedded and linked images
- saved as files or as base64 (only for embedded images)
-
lists (w:numPr) : <ul>, <ol>, <li>
- type (w:numId) => w:val and w:ilvl (list-style-type: disc, decimal, lower-alpha, lower-roman, upper-alpha, upper-roman)
- see the paragraph elements for other styles
- some styles, such as color or font size, can be inherited by the li content from the li symbol. In that case, the content must have its own style
-
links : <a>
- bookmark (w:bookmarkStart, w:bookmarkEnd) => w:name (<a>)
- cross-reference (w:instrText) => PAGEREF (<a>)
- link (w:instrText) => HYPERLINK (<a>)
-
form elements
- checkbox (w:instrText) => (<input> checkbox)
- date (w:date) => (<input> date)
- input (w:instrText) => (<input> text)
- select (w:instrText, w:comboBox) => (<select>)
-
styles (see the elements on this page for supported styles)
- character/run (w:rPr)
- paragraph (w:pPr)
- list (w:pPr, w:numId, w:ilvl)
- table (w:style)
- styles file (w:styles) => character/run (w:rStyle), paragraph and list (w:pStyle), table
- numbering file => list (w:abstractNum)
- default styles (w:docDefaults) => w:rPr, w:pPr
-
tables (w:tbl)
- align (w:jc) => w:val (margin-left, margin-right)
- border (w:tblBorders) => w:top, w:right, w:bottom, w:left (border-: width style [dashed, dotted, double, none, solid] color)
- layout (w:tblLayout) => w:type fixed (table-layout)
- margin (w:tblInd, w:tblpPr) => w:w (margin-left), w:bottomFromText (margin-bottom), w:topFromText (margin-top)
- width (w:tblW) => w:type pct, dxa w:w (width)
- first col style (w:tblStylePr) => w:type (w:rPr styles)
- first row style (w:tblStylePr) => w:type (w:rPr and w:pPr styles)
- last col style (w:tblStylePr) => w:type (w:rPr styles)
- last row style (w:tblStylePr) => w:type (w:rPr and w:pPr styles)
- row height (w:trPr) => w:trHeight (height)
- rowspan (w:vMerge) => w:val restart, continue (rowspan)
- cell background color (w:shd) => w:fill (background-color)
- cell border (w:tcPr) => w:top, w:right, w:bottom, w:left (border-: width style [dashed, dotted, double, none, solid] color)
- cell padding (w:tblCellMar) => w:top (padding-top), w:right (padding-right), w:bottom (padding-bottom), w:left (padding-left)
- cell vertical align (w:vAlign) => top, bottom, center, both and default w:val (vertical-align)
- cell width (w:tcW) => w:w (width)
- colspan (w:gridSpan) => w:val (colspan)
-
other elements
- break (w:br) => (<br>)
- date (w:instrText) => TIME (<span>)
- endnote (w:endnoteReference, w:endnote) => added to the bottom of the page (<span>)
- external file (w:altChunk) => r:id (<a>)
- footer (w:footerReference, w:ftr) => (<footer>) added to the bottom of its section
- header (w:headerReference, w:hdr) => (<header>) added to the top of its section
- footnote (w:footnoteReference) => added to the bottom of the page (<span>)
- math equations (w:altChunk) => Office MathML
- textbox (v:textbox) => (<div>), style (min-height, float, width), fillcolor (background-color), strokecolor (border-color, border-style), strokeweight (border-width)
WARNING: The fact that a tag is not parsed does not mean its content disappears from the HTML output. It only implies that its associated OOXML properties are not taken directly into account. Its children and text content are still parsed and rendered with their corresponding styles in the HTML output.
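Many of the mappings listed above turn an OOXML measurement into a CSS length. As a rough illustration of the arithmetic involved, the sketch below converts the units OOXML uses for those attributes: twentieths of a point (twips) for page sizes, margins and indents, half-points for w:sz on runs, eighths of a point for w:sz on borders, and EMUs for wp:extent. The OoxmlUnits class and its method names are hypothetical and are not part of Javadocx. The 96 px per inch factor is the CSS reference value, and the factors Javadocx actually applies may differ.

```java
import java.util.Locale;

/** Hypothetical helper (not a Javadocx class): OOXML units to CSS lengths. */
class OoxmlUnits {
    static final double TWIPS_PER_POINT = 20.0;   // 1 twip = 1/20 pt
    static final double POINTS_PER_INCH = 72.0;
    static final double CSS_PX_PER_INCH = 96.0;   // CSS reference pixel (assumed factor)
    static final double EMU_PER_INCH = 914400.0;  // English Metric Units per inch

    /** w:pgSz, w:pgMar, w:ind values (twips) to CSS pixels. */
    static double twipsToPx(double twips) {
        return twips / TWIPS_PER_POINT / POINTS_PER_INCH * CSS_PX_PER_INCH;
    }

    /** w:sz on a run (half-points) to points, e.g. 24 becomes 12pt. */
    static double halfPointsToPoints(double halfPoints) {
        return halfPoints / 2.0;
    }

    /** w:sz on a border (eighths of a point) to points. */
    static double eighthPointsToPoints(double eighths) {
        return eighths / 8.0;
    }

    /** wp:extent cx/cy (EMU) to CSS pixels. */
    static double emuToPx(double emu) {
        return emu / EMU_PER_INCH * CSS_PX_PER_INCH;
    }

    /** Formats a CSS declaration with two decimals, independent of the default locale. */
    static String css(String property, double value, String unit) {
        return String.format(Locale.ROOT, "%s: %.2f%s;", property, value, unit);
    }

    public static void main(String[] args) {
        // Sample values only; they do not come from a real document.
        System.out.println(css("max-width", twipsToPx(11906), "px"));      // page width
        System.out.println(css("margin-left", twipsToPx(1440), "px"));     // one-inch margin
        System.out.println(css("font-size", halfPointsToPoints(24), "pt"));
        System.out.println(css("border-width", eighthPointsToPoints(4), "pt"));
        System.out.println(css("width", emuToPx(914400), "px"));           // one-inch image
    }
}
```

Keeping each unit in its own method makes it easy to swap a single factor, for example when a project prefers point-based sizes over pixels.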
The transformation features included in Javadocx can handle complex DOCX documents generated from scratch or from templates. Let's take a look at some samples and their HTML output.
Nearly all the functionalities available for performing DOCX to HTML transformations can be customized.
The two main classes for transformations are TransformDocAdvHTML and TransformDocAdvHTMLPlugin.
TransformDocAdvHTML is the class that parses DOCX structures and performs the transformation to HTML. Its constructor receives an object of the TransformDocAdvHTMLPlugin type that sets the export options. This class can be extended to customize the transformation of each element, e.g., transformW_BOOKMARKSTART for bookmarks or transformW_SECTPR for sections.
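To illustrate the extension pattern described above, the following self-contained sketch uses stand-in classes rather than the Javadocx API. SketchTransformer, IdBookmarkTransformer and the transformBookmarkStart hook are hypothetical: they only mimic the idea of one overridable method per OOXML element (such as transformW_BOOKMARKSTART). The real method signatures should be taken from the API documentation.

```java
/** Hypothetical stand-in (not a Javadocx class) with one overridable hook per element. */
class SketchTransformer {
    /** Default rendering of a bookmark start as a named anchor. */
    protected String transformBookmarkStart(String name) {
        return "<a name=\"" + escape(name) + "\"></a>";
    }

    /** Default rendering of a run of text. */
    protected String transformText(String text) {
        return "<span>" + escape(text) + "</span>";
    }

    /** Drives the transformation: a bookmark followed by its text. */
    public String transform(String bookmarkName, String text) {
        return transformBookmarkStart(bookmarkName) + transformText(text);
    }

    /** Escapes the characters that are unsafe in HTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}

/** Customization: only the bookmark hook is overridden, everything else is inherited. */
class IdBookmarkTransformer extends SketchTransformer {
    @Override
    protected String transformBookmarkStart(String name) {
        return "<a id=\"" + escape(name) + "\" class=\"bookmark\"></a>";
    }

    public static void main(String[] args) {
        System.out.println(new SketchTransformer().transform("intro", "Hello"));
        System.out.println(new IdBookmarkTransformer().transform("intro", "Hello"));
    }
}
```

Overriding a single hook keeps the rest of the output identical to the default transformation, which is the main appeal of this kind of per-element customization.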
TransformDocAdvHTMLPlugin makes it possible to build transformation plugins tailored to the project requirements, for example: inserting images as base64, ignoring sections, customizing conversion factors, choosing how export sizes are set, and adding CSS, JavaScript and custom HTML. Javadocx includes TransformDocAdvHTMLDefaultPlugin, the default plugin for performing transformations.
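One of the plugin options mentioned above is inserting images as base64. As a minimal sketch of what that means for the HTML output, the following code builds an <img> tag whose src is a data URI, using only the standard java.util.Base64 class. The Base64Image class and its methods are hypothetical names and do not reflect how Javadocx implements the option.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper (not a Javadocx class): embeds image bytes as a data URI. */
class Base64Image {
    /** Builds a data URI such as data:image/png;base64,... from raw bytes. */
    static String toDataUri(byte[] imageBytes, String mimeType) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(imageBytes);
    }

    /** Builds an <img> tag that carries the image inline instead of linking to a file. */
    static String toImgTag(byte[] imageBytes, String mimeType, int widthPx, int heightPx) {
        return "<img src=\"" + toDataUri(imageBytes, mimeType) + "\""
                + " style=\"width: " + widthPx + "px; height: " + heightPx + "px;\">";
    }

    public static void main(String[] args) {
        // Placeholder bytes for the demo; a real call would pass the embedded image data.
        byte[] placeholder = "demo".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toImgTag(placeholder, "image/png", 100, 50));
    }
}
```

Inline images make the HTML self-contained at the cost of a larger file, while saving images as files keeps the markup small but requires shipping the image files alongside it.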
All the available options are thoroughly explained in the API documentation page of the transformDocAdvHTML method.