Doc-Comp Platform
This is the first project I started once I left the world of permanent employment, and easily the most difficult and exciting from a software engineering perspective. The ultimate aim was (and still is!) to create a platform that can read from and write to a variety of different document formats, with the functionality to also modify the documents regardless of the format that they're presented in.
To achieve this feat, I decided to design the system around a native intermediate format (IF); by reading, writing, and modifying the native IF, I would be able to build and expand the feature set over time. To get a handle on the ballooning size of the project, I decided to break off the workflow and scripting feature and concentrate on what I believed to be a minimum viable product - the Solowing SaaS platform.

Current Status
Parsers/Interpreters
Due to the large number of operators/fields, I'm simply going to list them comma-delimited, along with the percentage of completion. The PostScript interpreter was my main focus during development, simply because I would also need it for the PDF parser, the AFP parser (in the form of an IOB field), and for font support where charstring access requires the interpreter (the most prominent case being Type 3 fonts). It's also the most difficult to develop, since it's more than just a document format; it's a Turing-complete programming language.
PostScript (71.01% - 294 / 414 operators implemented) | AFP (32.37% - 56 / 173 fields/codes implemented [of MO:DCA, IOCA, GOCA]) |
---|---|
abs, add, aload, anchorsearch, and, arc, arcn, arct, arcto, array, ashow, astore, atan, awidthshow, begin, bind, bitshift, bytesavailable, cachestatus, ceiling, clear, cleartomark, cleardictstack, clip, clippath, closefile, closepath, concat, concatmatrix, configurationerror, copy, copypage, cos, count, countdictstack, countexecstack, counttomark, cshow, currentcacheparams, currentcmykcolor, currentcolor, currentcolorspace, currentdash, currentdevparams, currentdict, currentfile, currentflat, currentfont, currentgray, currenthsbcolor, currentlinecap, currentlinejoin, currentlinewidth, currentmatrix, currentmiterlimit, currentpoint, currentrgbcolor, currentstrokeadjust, currentsystemparams, currentuserparams, curveto, cvi, cvlit, cvn, cvr, cvrs, cvs, cvx, def, defaultmatrix, definefont, deletefile, dict, dictfull, dictstack, dictstackoverflow, dictstackunderflow, div, dtransform, dup, eexec, end, eoclip, eoviewclip, eq, erasepage, errordict, exch, exec, execstack, execstackoverflow, executeonly, exit, exp, false, file, filenameforall, fileposition, fill, filter, findfont, flattenpath, floor, flush, flushfile, FontDirectory, for, forall, ge, get, getinterval, globaldict, grestore, gsave, gstate, gt, handleerror, identmatrix, idiv, idtransform, if, ifelse, index, initclip, initgraphics, initmatrix, initviewclip, internaldict, interrupt, invalidaccess, invalidexit, invalidfileaccess, invalidfont, invalidrestore, invertmatrix, ioerror, ISOLatin1Encoding, itransform, kshow, known, languagelevel, le, length, limitcheck, lineto, ln, load, log, loop, lt, mark, matrix, maxlength, mod, moveto, mul, ne, neg, newpath, noaccess, nocurrentpoint, not, or, packedarray, pop, print, product, pstack, put, putinterval, quit, rand, rangecheck, rcurveto, read, readhexstring, readline, readonly, readstring, rectclip, renamefile, repeat, resetfile, reversepath, revision, rlineto, rmoveto, roll, rootfont, rotate, round, rrand, run, save, scale, scalefont, search, serialnumber, setcachelimit, setcacheparams, setcmykcolor, setcolorspace, setdash, setdevparams, setfileposition, setflat, setgray, sethsbcolor, setlinecap, setlinejoin, setlinewidth, setmatrix, setmiterlimit, setrgbcolor, setstrokeadjust, setsystemparams, setucacheparams, setuserparams, setvmthreshold, shareddict, show, showpage, sin, sqrt, srand, stack, stackoverflow, stackunderflow, StandardEncoding, start, status, statusdict, stop, stopped, store, string, stringwidth, stroke, sub, syntaxerror, systemdict, timeout, transform, translate, true, truncate, type, typecheck, token, ucachestatus, undef, undefined, undefinedfilename, undefinedresult, undefinedresource, unmatchedmark, unregistered, userdict, usertime, version, viewclip, VMerror, vmreclaim, vmstatus, where, widthshow, write, writehexstring, xor, xshow, xyshow, yshow, clipsave, cliprestore, setsmoothness, currentsmoothness, put, put, aload, forall, get, length, forall, get, length, forall, get, length, copy, copy, copy, getinterval, putinterval, token | BDT, BGR, BIM, BOG, BPG, EDT, EGR, EIM, EOG, EPG, GAD, GDD, OBD, OBP, GOCA(GNOP1), GOCA(GCOMT), GOCA(GSGCH), GOCA(GSPS), GOCA(GSCOL), GOCA(GSMX), GOCA(GSBMX), GOCA(GSFLW), GOCA(GSLT), GOCA(GSLW), GOCA(GSLE), GOCA(GSLJ), GOCA(GSCP), GOCA(GSAP), GOCA(GSECOL), GOCA(GSPT), GOCA(GSMT), GOCA(GSMC), GOCA(GSMP), GOCA(GSMS), GOCA(GEPROL), GOCA(GEAR), GOCA(GBAR), GOCA(GCBOX), GOCA(GCLINE), GOCA(GCMRK), GOCA(GCFLT), GOCA(GCFARC), GOCA(GCBIMG), GOCA(GIMD), GOCA(GEIMG), GOCA(GCRLINE), GOCA(GCCBEZ), GOCA(GSPCOL), GOCA(GBOX), GOCA(GLINE), GOCA(GMRK), GOCA(GFLT), GOCA(GFARC), GOCA(GBIMG), GOCA(GRLINE), GOCA(GCBEZ) |
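To give a feel for what implementing these operators involves, here is a minimal sketch of a stack-based evaluator for a handful of them. It is written in Python purely for illustration and is not the doc-comp implementation; the `run` function and its whitespace tokenizer are assumptions, and it ignores procedures, dictionaries, strings and error handling entirely.

```python
def run(source):
    """Evaluate a tiny, whitespace-delimited subset of PostScript.

    Hypothetical helper for illustration only: numbers are pushed onto the
    operand stack, and a few arithmetic/stack operators are executed.
    """
    stack = []

    def binary(fn):
        # Wrap a two-argument function as an operator that pops b, then a.
        def op():
            b = stack.pop()
            a = stack.pop()
            stack.append(fn(a, b))
        return op

    def exch():
        stack[-1], stack[-2] = stack[-2], stack[-1]

    def roll():
        # "n j roll" rotates the top n operands by j positions
        # (positive j moves elements towards the top of the stack).
        j = stack.pop()
        n = stack.pop()
        if n > 0:
            k = n - (j % n)
            top = stack[-n:]
            stack[-n:] = top[k:] + top[:k]

    ops = {
        "add": binary(lambda a, b: a + b),
        "sub": binary(lambda a, b: a - b),
        "mul": binary(lambda a, b: a * b),
        "neg": lambda: stack.append(-stack.pop()),
        "dup": lambda: stack.append(stack[-1]),
        "pop": lambda: stack.pop(),
        "exch": exch,
        "roll": roll,
    }

    for token in source.split():
        if token in ops:
            ops[token]()
        else:
            stack.append(float(token) if "." in token else int(token))
    return stack


print(run("3 4 add 2 mul"))    # [14]
print(run("1 2 3 3 1 roll"))   # [3, 1, 2]
```

The real difficulty lies in everything this sketch leaves out: executable arrays, the dictionary stack, `save`/`restore` and the error operators are what make the language Turing-complete and the interpreter hard to get right.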
Writers
The only real progress on writers has been made on the image renderer, since it's required for producing rasterized output for all document formats, as well as for proofing output. Using fractals to test the base interpreter functionality was incredibly useful, specifically for exercising the call stack and memory management. Since fractals also rely on most of the basic drawing primitives, they provided the perfect test bed for the graphics state. I have plans to utilise Apache FOP for writing to various formats, and have already conducted various tests writing directly to the Apache FOP IF. I'm utilising IKVM.NET to execute it within the .NET environment; however, this is only a temporary measure until the native output writers have been completed (thankfully, writing to these document formats is a lot easier than reading from them).
Below are some nice fractals and various other images that have been parsed and rendered from PostScript using the doc-comp PostScript interpreter and image renderer. The original PostScript can be found beneath each image, and each image has been cropped from the original A4 output for display purposes.
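Recursive fractals lean heavily on nested `gsave`/`grestore` pairs and on the current transformation matrix (CTM), which is why they stress the graphics state so well. The following Python sketch shows the idea under simplifying assumptions: the `Canvas` and `GraphicsState` names are hypothetical, only the CTM and line width are tracked, and it is not taken from the doc-comp renderer.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GraphicsState:
    # CTM stored PostScript-style as (a, b, c, d, tx, ty), so that
    # x' = a*x + c*y + tx and y' = b*x + d*y + ty.
    ctm: tuple = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
    line_width: float = 1.0


def multiply(m, n):
    """Return m x n, i.e. apply m first and then n."""
    a1, b1, c1, d1, tx1, ty1 = m
    a2, b2, c2, d2, tx2, ty2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        tx1 * a2 + ty1 * c2 + tx2,
        tx1 * b2 + ty1 * d2 + ty2,
    )


class Canvas:
    def __init__(self):
        self.state = GraphicsState()
        self._saved = []  # graphics state stack used by gsave/grestore

    def gsave(self):
        # The state is immutable, so pushing the reference is enough.
        self._saved.append(self.state)

    def grestore(self):
        if self._saved:
            self.state = self._saved.pop()

    def concat(self, m):
        # As in PostScript, the new matrix is pre-multiplied onto the CTM.
        self.state = replace(self.state, ctm=multiply(m, self.state.ctm))

    def translate(self, tx, ty):
        self.concat((1.0, 0.0, 0.0, 1.0, tx, ty))

    def scale(self, sx, sy):
        self.concat((sx, 0.0, 0.0, sy, 0.0, 0.0))

    def rotate(self, degrees):
        r = math.radians(degrees)
        self.concat((math.cos(r), math.sin(r), -math.sin(r), math.cos(r), 0.0, 0.0))

    def transform(self, x, y):
        """Map a user-space point to device space through the CTM."""
        a, b, c, d, tx, ty = self.state.ctm
        return (a * x + c * y + tx, b * x + d * y + ty)


canvas = Canvas()
canvas.translate(10, 20)
canvas.gsave()
canvas.scale(2, 2)
print(canvas.transform(1, 1))  # (12.0, 22.0)
canvas.grestore()
print(canvas.transform(1, 1))  # (11.0, 21.0)
```

A fractal that recurses several levels deep performs this save, transform, draw, restore cycle at every level, so any mistake in matrix ordering or state restoration shows up immediately in the rendered image.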
Font Library
The font library is one of the more complicated aspects of the doc-comp platform, and specific constraints have meant that I've been unable to simply leverage existing FOSS libraries for the typeface rendering. These constraints center primarily around PostScript fonts, specifically the need for access to a PostScript interpreter in order to process certain charstring operators. Thankfully, all the common font types (specifically Type 1/CFF, TrueType and OpenType) are well documented, so implementing a common font library that stores the common glyph information hasn't really been all that complicated (so far, at least). I've left the OpenType format for now, since it's pretty much TrueType and Type 1 smushed together; it will come shortly after the basic prerequisites of those font types have been completed. The Type 1 and TrueType parsers are pretty much complete, minus the hinting functionality.
When a font is parsed by the font library, it's converted into an intermediate format. Since there is a lack of parity when it comes to font support across document formats, this facilitates writing fonts out to a format supported by the target document format. This also provides the facility for writing partial fonts, allowing for improved parsing and lower memory consumption. I knocked up a very basic WinForms application for viewing converted fonts, along with a glyph viewer for detailing metric information, which you can see below the completion table.
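As a rough illustration of the partial-font idea, the sketch below models a format-neutral font and extracts only the glyphs a given piece of text needs. The `Glyph`, `Font` and `subset` names and fields are assumptions made for this example in Python; the actual intermediate font format is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    name: str
    advance_width: float
    outline: tuple = ()  # path commands, e.g. (("moveto", 0, 0), ("lineto", 500, 0))


@dataclass
class Font:
    family: str
    units_per_em: int
    glyphs: dict  # glyph name -> Glyph
    cmap: dict    # Unicode code point -> glyph name


def subset(font, text):
    """Return a partial font holding only the glyphs needed to set `text`.

    The .notdef glyph is always kept so that unmapped characters still
    have something to render.
    """
    cmap = {cp: name for cp, name in font.cmap.items() if chr(cp) in text}
    needed = {".notdef"} | set(cmap.values())
    glyphs = {name: g for name, g in font.glyphs.items() if name in needed}
    return Font(font.family, font.units_per_em, glyphs, cmap)


full = Font(
    family="Example Sans",  # made-up font for the example
    units_per_em=1000,
    glyphs={n: Glyph(n, 600.0) for n in (".notdef", "A", "B", "C")},
    cmap={65: "A", 66: "B", 67: "C"},
)
partial = subset(full, "AB")
print(sorted(partial.glyphs))  # ['.notdef', 'A', 'B']
```

Because the subset is expressed in the neutral model rather than in any one font format, a writer could in principle serialise it to whichever font type the target document format supports.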
Type1 (incl. Type2/CFF charstring) (74.14% - 43 / 58 operators implemented) - All except hinting | TrueType (40% - 10 / 25 total features implemented, 100% - 10 / 10 mandatory features implemented) |
---|---|
abs, add, and, callothersubr, callsubr, closepath, div, drop, dup, endchar, escape, exch, get, hhcurveto, hlineto, hmoveto, hsbw, hvcurveto, ifelse, index, mul, neg, not, or, put, random, rcurveline, return, rlinecurve, rlineto, rmoveto, roll, rrcurveto, sbw, seac, setcurrentpoint, sqrt, sub, vhcurveto, vlineto, vmoveto, vvcurveto, shortint | cmap, glyf, head, hhea, hmtx, loca, maxp, name, post, OS/2, cvt, EBDT, EBLC, EBSC, fpgm, gasp, hdmx, kern, LTSH, prep, PCLT, VDMX, vhea, vmtx |
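Before any of the charstring operators above can be executed, the byte stream has to be split into operands and operators. The Python sketch below follows the operand encoding of the Type 2 charstring format (including the `shortint` prefix byte 28); the `tokenize_type2` name is an assumption, and the sketch deliberately does not handle the mask bytes that follow `hintmask`/`cntrmask`, so it only suits hint-free data.

```python
import struct


def tokenize_type2(data):
    """Split Type 2 charstring bytes into numbers and ("op", code) tuples.

    Illustrative only: hintmask/cntrmask data bytes are not handled.
    """
    tokens = []
    i = 0
    while i < len(data):
        b0 = data[i]
        if 32 <= b0 <= 246:            # single-byte integer, -107..107
            tokens.append(b0 - 139)
            i += 1
        elif 247 <= b0 <= 250:         # two-byte positive integer, 108..1131
            tokens.append((b0 - 247) * 256 + data[i + 1] + 108)
            i += 2
        elif 251 <= b0 <= 254:         # two-byte negative integer, -1131..-108
            tokens.append(-(b0 - 251) * 256 - data[i + 1] - 108)
            i += 2
        elif b0 == 28:                 # shortint: signed 16-bit big-endian
            tokens.append(struct.unpack(">h", data[i + 1:i + 3])[0])
            i += 3
        elif b0 == 255:                # 16.16 fixed-point number
            tokens.append(struct.unpack(">i", data[i + 1:i + 5])[0] / 65536)
            i += 5
        elif b0 == 12:                 # escape: two-byte operator
            tokens.append(("op", (12, data[i + 1])))
            i += 2
        else:                          # one-byte operator
            tokens.append(("op", b0))
            i += 1
    return tokens


# 100 -200 rmoveto endchar (operator codes 21 and 14)
print(tokenize_type2(bytes([239, 28, 0xFF, 0x38, 21, 14])))
# [100, -200, ('op', 21), ('op', 14)]
```

Type 1 charstrings use the same one- and two-byte integer ranges, but there the 255 prefix introduces a 32-bit integer rather than a fixed-point value, so a shared tokenizer would need a flag for that case.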

Intermediate Format
In order to ensure like-for-like reading, writing and conversion of documents, it's important that a document is first converted into a native intermediate format. I decided that rather than simply serializing/deserializing the data, I'd instead create my own document format that leveraged what I felt were the best parts of existing document formats, while also focusing on overcoming some of the shortcomings that I've personally come across while dealing with those formats. I won't go too much into the details; instead, I'll note some of the more prominent features of the format:
- Binary data structure, utilising structured fields for identifying document elements.
- Pointer-based linked-list structure, providing optimised indexing and archive/revision control as standard.
- Standard and compact format - standard format optimised for read/write (editing), and compact format optimised for sequential read/write (output conversion).
- Features such as data extraction, modification and imposition. All document elements are stored with 2D transform information (including text bounding-boxes), allowing for accurate search and extract functionality.
- File-based memory model (in-memory also available) - designed to scale with physical storage, rather than RAM. Optimised to operate in environments that scale horizontally (cloud/multi-tenant).
The intermediate format logic is self-contained within its own library assembly, which allows not only a variety of document parsers/writers to consume/write the IF directly, but also direct document manipulation via a dedicated GUI document editor or a direct scripting interface, both of which are tools currently planned for inclusion within the Solowing SaaS Platform.
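To make the structured-field and linked-list ideas from the feature list a little more concrete, here is a small Python sketch of an append-only binary store in which each field carries a pointer to its previous revision. The header layout (length, type, previous offset) and the function names are invented for this example and should not be read as the actual IF layout.

```python
import io
import struct

# Hypothetical header: payload length (uint32), field type (uint16),
# offset of the previous revision (int64, -1 when there is none).
HEADER = struct.Struct(">IHq")


def append_field(buf, field_type, payload, prev_offset=-1):
    """Append a structured field and return the offset it was written at."""
    buf.seek(0, io.SEEK_END)
    offset = buf.tell()
    buf.write(HEADER.pack(len(payload), field_type, prev_offset))
    buf.write(payload)
    return offset


def read_field(buf, offset):
    """Read the field stored at `offset`."""
    buf.seek(offset)
    length, field_type, prev_offset = HEADER.unpack(buf.read(HEADER.size))
    return field_type, prev_offset, buf.read(length)


def history(buf, offset):
    """Yield payloads from the newest revision back to the oldest."""
    while offset != -1:
        _, prev_offset, payload = read_field(buf, offset)
        yield payload
        offset = prev_offset


TEXT_FIELD = 1  # made-up type identifier
store = io.BytesIO()  # a file opened in binary read/write mode works the same way
v1 = append_field(store, TEXT_FIELD, b"Hello")
v2 = append_field(store, TEXT_FIELD, b"Hello, world", prev_offset=v1)
print(list(history(store, v2)))  # [b'Hello, world', b'Hello']
```

Because every operation is a seek plus a short read or write on a file-like object, the same code runs against an in-memory buffer or a file on disk, which is the property a storage-scaled (rather than RAM-scaled) memory model relies on.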