tesseract 3.04.01

api/pdfrenderer.cpp

00001 // Include automatically generated configuration file if running autoconf.
00002 #ifdef HAVE_CONFIG_H
00003 #include "config_auto.h"
00004 #endif
00005 
00006 #include "baseapi.h"
00007 #include "renderer.h"
00008 #include "math.h"
00009 #include "strngs.h"
00010 #include "tprintf.h"
00011 #include "allheaders.h"
00012 
00013 #ifdef _MSC_VER
00014 #include "mathfix.h"
00015 #endif
00016 
00017 /*
00018 
00019 Design notes from Ken Sharp, with light editing.
00020 
00021 We think one solution is a font with a single glyph (.notdef) and a
00022 CIDToGIDMap which maps all the CIDs to 0. That map would then be
00023 stored as a stream in the PDF file, and when flate compressed should
00024 be pretty small. The font, of course, will be approximately the same
00025 size as the one you currently use.
00026 
00027 I'm working on such a font now, the CIDToGIDMap is trivial, you just
00028 create a stream object which contains 128k bytes (2 bytes per possible
00029 CID and your CIDs range from 0 to 65535) and where you currently have
00030 "/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".
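
As a rough sketch of the map described above (an editorial illustration,
not part of the original notes; the function name is hypothetical), the
uncompressed stream body is 65536 big-endian 16-bit GIDs, one per CID:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the uncompressed CIDToGIDMap stream body
// that sends every 2-byte CID (0..65535) to the same glyph ID.
std::vector<std::uint8_t> MakeConstantCIDToGIDMap(std::uint16_t gid) {
  const std::size_t kNumCIDs = 1 << 16;
  std::vector<std::uint8_t> map(2 * kNumCIDs);
  for (std::size_t cid = 0; cid < kNumCIDs; ++cid) {
    map[2 * cid] = static_cast<std::uint8_t>(gid >> 8);        // high byte
    map[2 * cid + 1] = static_cast<std::uint8_t>(gid & 0xFF);  // low byte
  }
  assert(map.size() == 131072);  // 128k bytes, 2 bytes per possible CID
  return map;
}
```

The result would still need to be flate compressed before being written
as the stream that "/CIDToGIDMap <object> 0 R" points at.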
00031 
00032 Note that if, in future, you were to use a different (ie not 2 byte)
00033 CMap for character codes you could trivially extend the CIDToGIDMap.
00034 
00035 The following is an explanation of how some of the font stuff works,
00036 this may be too simple for you in which case please accept my
00037 apologies, it's hard to know how much knowledge someone has. You can
00038 skip all this anyway, it's just for information.
00039 
00040 The font embedded in a PDF file is usually intended just to be
00041 rendered, but extensions allow for at least some ability to locate (or
00042 copy) text from a document. This isn't something which was an original
00043 goal of the PDF format, but it's been retro-fitted, presumably due to
00044 popular demand.
00045 
00046 To do this reliably the PDF file must contain a ToUnicode CMap, a
00047 device for mapping character codes to Unicode code points. If one of
00048 these is present, then this will be used to convert the character
00049 codes into Unicode values. If it's not present then the reader will
00050 fall back through a series of heuristics to try and guess the
00051 result. This is, as you would expect, prone to failure.
00052 
00053 You always write a ToUnicode CMap, so this fallback doesn't concern
00054 you. And because you write the text in text rendering mode 3, it
00055 would seem that you don't need to worry about fonts at all. But in the
00056 PDF spec you cannot have an isolated ToUnicode CMap, it has to be
00057 attached to a font, so in order to get even copy/paste to work you
00058 need to define a font.
00059 
00060 This is what leads to problems, tools like pdfwrite assume that they
00061 are going to be able to (or even have to) modify the font entries, so
00062 they require that the font being embedded be valid, and to be honest
00063 the font Tesseract embeds isn't valid (for this purpose).
00064 
00065 
00066 To see why, let's look at how text is specified in a PDF file:
00067 
00068 (Test) Tj
00069 
00070 Now that looks like text but actually it isn't. Each of those bytes is
00071 a 'character code'. When it comes to rendering the text a complex
00072 sequence of events takes place, which converts the character code into
00073 'something' which the font understands. It's entirely possible via
00074 character mappings to have that text render as 'Sftu'.
00075 
00076 For simple fonts (PostScript type 1), we use the character code as the
00077 index into an Encoding array (256 elements), each element of which is
00078 a glyph name, so this gives us a glyph name. We then consult the
00079 CharStrings dictionary in the font, that's a complex object which
00080 contains pairs of keys and values, you can use the key to retrieve a
00081 given value. So we have a glyph name, we then use that as the key to
00082 the dictionary and retrieve the associated value. For a type 1 font,
00083 the value is a glyph program that describes how to draw the glyph.
00084 
00085 For CIDFonts, it's a little more complicated. Because CIDFonts can be
00086 large, using a glyph name as the key is unreasonable (it would also
00087 lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
00088 as the key. CIDs are just numbers.
00089 
00090 But.... We don't use the character code as the CID. What we do is use
00091 a CMap to convert the character code into a CID. We then use the CID
00092 to key the CharStrings dictionary and proceed as before. So the 'CMap'
00093 is the equivalent of the Encoding array, but it's a more compact and
00094 flexible representation.
00095 
00096 Note that you have to use the CMap just to find out how many bytes
00097 constitute a character code, and it can be variable. For example you
00098 can say if the first byte is 0x00->0x7f then it's just one byte, if it's
00099 0x80->0xf0 then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
00100 have seen CMaps defining character codes up to 5 bytes wide.
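
To make the first-byte rule concrete, here is a minimal sketch (an
editorial illustration, not part of the original notes). The ranges quoted
above overlap at 0xf0, so this sketch assumes 0x80 to 0xef means two bytes
and 0xf0 to 0xff means three; a real CMap spells its ranges out in its
codespacerange section. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of bytes in a character code, decided
// from its first byte only, under the assumed ranges described above.
int CharCodeLength(std::uint8_t first_byte) {
  if (first_byte <= 0x7f) return 1;  // 0x00..0x7f: single-byte code
  if (first_byte <= 0xef) return 2;  // 0x80..0xef: two-byte code (assumed)
  return 3;                          // 0xf0..0xff: three-byte code
}
```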
00101 
00102 Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
00103 TrueType CIDFonts. The thing is that TrueType fonts are accessed using
00104 a Glyph ID (GID) (and the LOCA table) which may well not be anything
00105 like the CID. So for this case PDF includes a CIDToGIDMap. That maps
00106 the CIDs to GIDs, and we can then use the GID to get the glyph
00107 description from the GLYF table of the font.
00108 
00109 So for a TrueType CIDFont, character-code->CID->GID->glyf-program.
00110 
00111 Looking at the PDF file I was supplied with we see that it contains
00112 text like :
00113 
00114 <0x0075> Tj
00115 
00116 So we start by taking the character code (117) and look it up in the
00117 CMap. Well you don't supply a CMap, you just use the Identity-H one
00118 which is predefined. So character code 117 maps to CID 117. Then we
00119 use the CIDToGIDMap, again you don't supply one, you just use the
00120 predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
00121 were supplied with only contains 116 glyphs.
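
The lookup chain in this example can be sketched as follows (an editorial
illustration, not part of the original notes; all function names are
hypothetical). It assumes valid GIDs run from 0 to the glyph count minus
one, so GID 117 cannot exist in a 116-glyph font.

```cpp
#include <cassert>
#include <cstdint>

// Identity-H CMap: the 2-byte character code is used directly as the CID.
std::uint16_t IdentityHCodeToCID(std::uint16_t code) { return code; }

// "/CIDToGIDMap /Identity": the CID is used directly as the GID.
std::uint16_t IdentityCIDToGID(std::uint16_t cid) { return cid; }

// True if the character code resolves to a glyph the font really has.
bool GlyphExists(std::uint16_t code, int num_glyphs) {
  std::uint16_t gid = IdentityCIDToGID(IdentityHCodeToCID(code));
  return gid < num_glyphs;
}
```

With these definitions, GlyphExists(0x0075, 116) is false, which is the
failure described above.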
00122 
00123 Now for Latin that's not a huge problem, you can just supply a bigger
00124 font. But for more complex languages that *is* going to be more of a
00125 problem. Either you need to supply a font which contains glyphs for
00126 all the possible CID->GID mappings, or we need to think laterally.
00127 
00128 Our solution using a TrueType CIDFont is to intervene at the
00129 CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
00130 font with just one glyph, the .notdef glyph at GID 0. This is what I'm
00131 looking into now.
00132 
00133 It would also be possible to have a 'PostScript' (ie type 1 outlines)
00134 CIDFont which contained 1 glyph, and a CMap which mapped all character
00135 codes to CID 0. The effect would be the same.
00136 
00137 It's possible (I haven't checked) that the PostScript CIDFont and
00138 associated CMap would be smaller than the TrueType font and associated
00139 CIDToGIDMap.
00140 
00141 --- in a followup ---
00142 
00143 OK there is a small problem there, if I use GID 0 then Acrobat gets
00144 upset about it and complains it cannot extract the font. If I set the
00145 CIDToGIDMap so that all the entries are 1 instead, its happy. Totally
00146 mad......
00147 
00148 */
00149 
00150 namespace tesseract {
00151 
00152 // Use for PDF object fragments. Must be large enough
00153 // to hold a colormap with 256 colors in the verbose
00154 // PDF representation.
00155 const int kBasicBufSize = 2048;
00156 
00157 // If the font is 10 pts, nominal character width is 5 pts
00158 const int kCharWidth = 2;
00159 
00160 /**********************************************************************
00161  * PDF Renderer interface implementation
00162  **********************************************************************/
00163 
00164 TessPDFRenderer::TessPDFRenderer(const char* outputbase, const char *datadir)
00165     : TessResultRenderer(outputbase, "pdf") {
00166   obj_  = 0;
00167   datadir_ = datadir;
00168   offsets_.push_back(0);
00169 }
00170 
00171 void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
00172   offsets_.push_back(objectsize + offsets_.back());
00173   obj_++;
00174 }
00175 
00176 void TessPDFRenderer::AppendPDFObject(const char *data) {
00177   AppendPDFObjectDIY(strlen(data));
00178   AppendString((const char *)data);
00179 }
00180 
00181 // Helper function to prevent us from accidentally writing
00182 // scientific notation to an HOCR or PDF file. Besides, three
00183 // decimal places are all you really need.
00184 double prec(double x) {
00185   const double kPrecision = 1000.0;
00186   double a = round(x * kPrecision) / kPrecision;
00187   if (a == 0)  // Also true for -0.0, which we normalize to +0.0
00188     return 0;
00189   return a;
00190 }
00191 
00192 long dist2(int x1, int y1, int x2, int y2) {
00193   return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
00194 }
00195 
00196 // Viewers like evince can get really confused during copy-paste when
00197 // the baseline wanders around. So I've decided to project every word
00198 // onto the (straight) line baseline. All numbers are in the native
00199 // PDF coordinate system, which has the origin in the bottom left and
00200 // the unit is points, which is 1/72 inch. Tesseract reports baselines
00201 // left-to-right no matter what the reading order is. We need the
00202 // word baseline in reading order, so we do that conversion here. Returns
00203 // the word's baseline origin and length.
00204 void GetWordBaseline(int writing_direction, int ppi, int height,
00205                      int word_x1, int word_y1, int word_x2, int word_y2,
00206                      int line_x1, int line_y1, int line_x2, int line_y2,
00207                      double *x0, double *y0, double *length) {
00208   if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
00209     Swap(&word_x1, &word_x2);
00210     Swap(&word_y1, &word_y2);
00211   }
00212   double word_length;
00213   double x, y;
00214   {
00215     int px = word_x1;
00216     int py = word_y1;
00217     double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
00218     if (l2 == 0) {
00219       x = line_x1;
00220       y = line_y1;
00221     } else {
00222       double t = ((px - line_x2) * (line_x2 - line_x1) +
00223                   (py - line_y2) * (line_y2 - line_y1)) / l2;
00224       x = line_x2 + t * (line_x2 - line_x1);
00225       y = line_y2 + t * (line_y2 - line_y1);
00226     }
00227     word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1,
00228                                                  word_x2, word_y2)));
00229     word_length = word_length * 72.0 / ppi;
00230     x = x * 72 / ppi;
00231     y = height - (y * 72.0 / ppi);
00232   }
00233   *x0 = x;
00234   *y0 = y;
00235   *length = word_length;
00236 }
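
// Editorial sketch, not part of the Tesseract source: the projection step
// used by GetWordBaseline, isolated so it can be tried on its own. It
// projects a point orthogonally onto the line through two other points and
// converts pixels to PDF points. The names Point, ProjectOntoLine and
// PixelsToPoints are hypothetical, and the base point of the projection is
// the first line endpoint here rather than the second, which gives the
// same result.

```cpp
#include <cassert>
#include <cmath>

struct Point { double x; double y; };

// Orthogonal projection of p onto the infinite line through a and b.
// If a and b coincide there is no direction, so fall back to a.
Point ProjectOntoLine(Point p, Point a, Point b) {
  double dx = b.x - a.x;
  double dy = b.y - a.y;
  double len2 = dx * dx + dy * dy;
  if (len2 == 0) return a;
  double t = ((p.x - a.x) * dx + (p.y - a.y) * dy) / len2;
  return Point{a.x + t * dx, a.y + t * dy};
}

// Converts a length in pixels to PDF points (1/72 inch).
double PixelsToPoints(double pixels, double ppi) {
  return pixels * 72.0 / ppi;
}
```

// For example, projecting (3, 4) onto the line from (0, 0) to (10, 0)
// drops the point straight down onto the line.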
00237 
00238 // Compute coefficients for an affine matrix describing the rotation
00239 // of the text. If the text is right-to-left such as Arabic or Hebrew,
00240 // we reflect over the Y-axis. This matrix will set the coordinate
00241 // system for placing text in the PDF file.
00242 //
00243 //                           RTL
00244 // [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
00245 // [ y' ]   [ c d ][ y ]   [ 0 1 ] [-sin cos ][ y ]
00246 void AffineMatrix(int writing_direction,
00247                   int line_x1, int line_y1, int line_x2, int line_y2,
00248                   double *a, double *b, double *c, double *d) {
00249   double theta = atan2(static_cast<double>(line_y1 - line_y2),
00250                        static_cast<double>(line_x2 - line_x1));
00251   *a = cos(theta);
00252   *b = sin(theta);
00253   *c = -sin(theta);
00254   *d = cos(theta);
00255   switch(writing_direction) {
00256     case WRITING_DIRECTION_RIGHT_TO_LEFT:
00257       *a = -*a;
00258       *b = -*b;
00259       break;
00260     case WRITING_DIRECTION_TOP_TO_BOTTOM:
00261       // TODO(jbreiden) Consider using the vertical PDF writing mode.
00262       break;
00263     default:
00264       break;
00265   }
00266 }
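
// Editorial sketch, not part of the Tesseract source: a stand-alone
// version of the matrix computation above, useful for checking a few
// baselines by hand. TextMatrix and BaselineMatrix are hypothetical names,
// and the writing direction is reduced to a single bool for brevity.

```cpp
#include <cassert>
#include <cmath>

struct TextMatrix { double a, b, c, d; };

// (x1, y1)-(x2, y2) is the baseline in image coordinates, where y grows
// downward; that is why the angle uses y1 - y2 rather than y2 - y1.
TextMatrix BaselineMatrix(bool right_to_left,
                          int x1, int y1, int x2, int y2) {
  double theta = std::atan2(static_cast<double>(y1 - y2),
                            static_cast<double>(x2 - x1));
  TextMatrix m = {std::cos(theta), std::sin(theta),
                  -std::sin(theta), std::cos(theta)};
  if (right_to_left) {  // Reflect over the Y-axis
    m.a = -m.a;
    m.b = -m.b;
  }
  return m;
}
```

// A perfectly flat left-to-right baseline should give the identity
// matrix, and the same baseline read right-to-left should flip a.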
00267 
00268 // There are some really stupid PDF viewers in the wild, such as
00269 // 'Preview' which ships with the Mac. They do a better job with text
00270 // selection and highlighting when given perfectly flat baseline
00271 // instead of very slightly tilted. We clip small tilts to appease
00272 // these viewers. I chose this threshold large enough to absorb noise,
00273 // but small enough that lines probably won't cross each other if the
00274 // whole page is tilted at almost exactly the clipping threshold.
00275 void ClipBaseline(int ppi, int x1, int y1, int x2, int y2,
00276                   int *line_x1, int *line_y1,
00277                   int *line_x2, int *line_y2) {
00278   *line_x1 = x1;
00279   *line_y1 = y1;
00280   *line_x2 = x2;
00281   *line_y2 = y2;
00282   double rise = abs(y2 - y1) * 72.0 / ppi;
00283   double run = abs(x2 - x1) * 72.0 / ppi;
00284   if (rise < 2.0 && 2.0 < run)
00285     *line_y1 = *line_y2 = (y1 + y2) / 2;
00286 }
00287 
00288 char* TessPDFRenderer::GetPDFTextObjects(TessBaseAPI* api,
00289                                          double width, double height) {
00290   STRING pdf_str("");
00291   double ppi = api->GetSourceYResolution();
00292 
00293   // These initial conditions are all arbitrary and will be overwritten
00294   double old_x = 0.0, old_y = 0.0;
00295   int old_fontsize = 0;
00296   tesseract::WritingDirection old_writing_direction =
00297       WRITING_DIRECTION_LEFT_TO_RIGHT;
00298   bool new_block = true;
00299   int fontsize = 0;
00300   double a = 1;
00301   double b = 0;
00302   double c = 0;
00303   double d = 1;
00304 
00305   // TODO(jbreiden) This marries the text and image together.
00306   // Slightly cleaner from an abstraction standpoint if this were to
00307   // live inside a separate text object.
00308   pdf_str += "q ";
00309   pdf_str.add_str_double("", prec(width));
00310   pdf_str += " 0 0 ";
00311   pdf_str.add_str_double("", prec(height));
00312   pdf_str += " 0 0 cm /Im1 Do Q\n";
00313 
00314   int line_x1 = 0;
00315   int line_y1 = 0;
00316   int line_x2 = 0;
00317   int line_y2 = 0;
00318 
00319   ResultIterator *res_it = api->GetIterator();
00320   while (!res_it->Empty(RIL_BLOCK)) {
00321     if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
00322       pdf_str += "BT\n3 Tr";     // Begin text object, use invisible ink
00323       old_fontsize = 0;          // Every block will declare its fontsize
00324       new_block = true;          // Every block will declare its affine matrix
00325     }
00326 
00327     if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
00328       int x1, y1, x2, y2;
00329       res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
00330       ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
00331     }
00332 
00333     if (res_it->Empty(RIL_WORD)) {
00334       res_it->Next(RIL_WORD);
00335       continue;
00336     }
00337 
00338     // Writing direction changes at a per-word granularity
00339     tesseract::WritingDirection writing_direction;
00340     {
00341       tesseract::Orientation orientation;
00342       tesseract::TextlineOrder textline_order;
00343       float deskew_angle;
00344       res_it->Orientation(&orientation, &writing_direction,
00345                           &textline_order, &deskew_angle);
00346       if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
00347         switch (res_it->WordDirection()) {
00348           case DIR_LEFT_TO_RIGHT:
00349             writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
00350             break;
00351           case DIR_RIGHT_TO_LEFT:
00352             writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
00353             break;
00354           default:
00355             writing_direction = old_writing_direction;
00356         }
00357       }
00358     }
00359 
00360     // Where is word origin and how long is it?
00361     double x, y, word_length;
00362     {
00363       int word_x1, word_y1, word_x2, word_y2;
00364       res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
00365       GetWordBaseline(writing_direction, ppi, height,
00366                       word_x1, word_y1, word_x2, word_y2,
00367                       line_x1, line_y1, line_x2, line_y2,
00368                       &x, &y, &word_length);
00369     }
00370 
00371     if (writing_direction != old_writing_direction || new_block) {
00372       AffineMatrix(writing_direction,
00373                    line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
00374       pdf_str.add_str_double(" ", prec(a));  // . This affine matrix
00375       pdf_str.add_str_double(" ", prec(b));  // . sets the coordinate
00376       pdf_str.add_str_double(" ", prec(c));  // . system for all
00377       pdf_str.add_str_double(" ", prec(d));  // . text that follows.
00378       pdf_str.add_str_double(" ", prec(x));  // .
00379       pdf_str.add_str_double(" ", prec(y));  // .
00380       pdf_str += (" Tm ");                   // Place cursor absolutely
00381       new_block = false;
00382     } else {
00383       double dx = x - old_x;
00384       double dy = y - old_y;
00385       pdf_str.add_str_double(" ", prec(dx * a + dy * b));
00386       pdf_str.add_str_double(" ", prec(dx * c + dy * d));
00387       pdf_str += (" Td ");                   // Relative moveto
00388     }
00389     old_x = x;
00390     old_y = y;
00391     old_writing_direction = writing_direction;
00392 
00393     // Adjust font size on a per word granularity. Pay attention to
00394     // fontsize, old_fontsize, and pdf_str. We've found that for
00395     // Arabic, Tesseract will happily return a fontsize of zero,
00396     // so we make up a default number to protect ourselves.
00397     {
00398       bool bold, italic, underlined, monospace, serif, smallcaps;
00399       int font_id;
00400       res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace,
00401                                  &serif, &smallcaps, &fontsize, &font_id);
00402       const int kDefaultFontsize = 8;
00403       if (fontsize <= 0)
00404         fontsize = kDefaultFontsize;
00405       if (fontsize != old_fontsize) {
00406         char textfont[20];
00407         snprintf(textfont, sizeof(textfont), "/f-0-0 %d Tf ", fontsize);
00408         pdf_str += textfont;
00409         old_fontsize = fontsize;
00410       }
00411     }
00412 
00413     bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
00414     bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
00415     STRING pdf_word("");
00416     int pdf_word_len = 0;
00417     do {
00418       const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
00419       if (grapheme && grapheme[0] != '\0') {
00420         GenericVector<int> unicodes;
00421         UNICHAR::UTF8ToUnicode(grapheme, &unicodes);
00422         char utf16[20];
00423         for (int i = 0; i < unicodes.length(); i++) {
00424           int code = unicodes[i];
00425           // Convert to UTF-16BE https://en.wikipedia.org/wiki/UTF-16
00426           if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
00427             tprintf("Dropping invalid codepoint %d\n", code);
00428             continue;
00429           }
00430           if (code < 0x10000) {
00431             snprintf(utf16, sizeof(utf16), "<%04X>", code);
00432           } else {
00433             int a = code - 0x010000;
00434             int high_surrogate = (0x03FF & (a >> 10)) + 0xD800;
00435             int low_surrogate = (0x03FF & a) + 0xDC00;
00436             snprintf(utf16, sizeof(utf16), "<%04X%04X>",
00437                      high_surrogate, low_surrogate);
00438           }
00439           pdf_word += utf16;
00440           pdf_word_len++;
00441         }
00442       }
00443       delete []grapheme;
00444       res_it->Next(RIL_SYMBOL);
00445     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
00446     if (word_length > 0 && pdf_word_len > 0 && fontsize > 0) {
00447       double h_stretch =
00448           kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
00449       pdf_str.add_str_double("", h_stretch);
00450       pdf_str += " Tz";          // horizontal stretch
00451       pdf_str += " [ ";
00452       pdf_str += pdf_word;       // UTF-16BE representation
00453       pdf_str += " ] TJ";        // show the text
00454     }
00455     if (last_word_in_line) {
00456       pdf_str += " \n";
00457     }
00458     if (last_word_in_block) {
00459       pdf_str += "ET\n";         // end the text object
00460     }
00461   }
00462   char *ret = new char[pdf_str.length() + 1];
00463   strcpy(ret, pdf_str.string());
00464   delete res_it;
00465   return ret;
00466 }
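
// Editorial sketch, not part of the Tesseract source: the UTF-16BE hex
// encoding used inside GetPDFTextObjects, pulled out into a stand-alone
// function. CodePointToPdfHex is a hypothetical name. Unlike the loop
// above, which logs and skips invalid code points, this sketch returns an
// empty string for them.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Returns a PDF hex string such as "<0075>" for a Unicode code point, or
// "" if the code point is a surrogate or outside 0..0x10FFFF.
std::string CodePointToPdfHex(int code) {
  if (code < 0 || (code > 0xD7FF && code < 0xE000) || code > 0x10FFFF)
    return "";
  char buf[16];
  if (code < 0x10000) {
    // Basic Multilingual Plane: one 16-bit unit.
    std::snprintf(buf, sizeof(buf), "<%04X>", static_cast<unsigned>(code));
  } else {
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    int v = code - 0x10000;
    unsigned high_surrogate = 0xD800 + ((v >> 10) & 0x03FF);
    unsigned low_surrogate = 0xDC00 + (v & 0x03FF);
    std::snprintf(buf, sizeof(buf), "<%04X%04X>",
                  high_surrogate, low_surrogate);
  }
  return buf;
}
```

// For example, U+0075 encodes as <0075> and U+1F600 as <D83DDE00>.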
00467 
00468 bool TessPDFRenderer::BeginDocumentHandler() {
00469   char buf[kBasicBufSize];
00470   size_t n;
00471 
00472   n = snprintf(buf, sizeof(buf),
00473                "%%PDF-1.5\n"
00474                "%%%c%c%c%c\n",
00475                0xDE, 0xAD, 0xBE, 0xEB);
00476   if (n >= sizeof(buf)) return false;
00477   AppendPDFObject(buf);
00478 
00479   // CATALOG
00480   n = snprintf(buf, sizeof(buf),
00481                "1 0 obj\n"
00482                "<<\n"
00483                "  /Type /Catalog\n"
00484                "  /Pages %ld 0 R\n"
00485                ">>\n"
00486                "endobj\n",
00487                2L);
00488   if (n >= sizeof(buf)) return false;
00489   AppendPDFObject(buf);
00490 
00491   // We are reserving object #2 for the /Pages
00492   // object, which I am going to create and write
00493   // at the end of the PDF file.
00494   AppendPDFObject("");
00495 
00496   // TYPE0 FONT
00497   n = snprintf(buf, sizeof(buf),
00498                "3 0 obj\n"
00499                "<<\n"
00500                "  /BaseFont /GlyphLessFont\n"
00501                "  /DescendantFonts [ %ld 0 R ]\n"
00502                "  /Encoding /Identity-H\n"
00503                "  /Subtype /Type0\n"
00504                "  /ToUnicode %ld 0 R\n"
00505                "  /Type /Font\n"
00506                ">>\n"
00507                "endobj\n",
00508                4L,         // CIDFontType2 font
00509                6L          // ToUnicode
00510                );
00511   if (n >= sizeof(buf)) return false;
00512   AppendPDFObject(buf);
00513 
00514   // CIDFONTTYPE2
00515   n = snprintf(buf, sizeof(buf),
00516                "4 0 obj\n"
00517                "<<\n"
00518                "  /BaseFont /GlyphLessFont\n"
00519                "  /CIDToGIDMap %ld 0 R\n"
00520                "  /CIDSystemInfo\n"
00521                "  <<\n"
00522                "     /Ordering (Identity)\n"
00523                "     /Registry (Adobe)\n"
00524                "     /Supplement 0\n"
00525                "  >>\n"
00526                "  /FontDescriptor %ld 0 R\n"
00527                "  /Subtype /CIDFontType2\n"
00528                "  /Type /Font\n"
00529                "  /DW %d\n"
00530                ">>\n"
00531                "endobj\n",
00532                5L,         // CIDToGIDMap
00533                7L,         // Font descriptor
00534                1000 / kCharWidth);
00535   if (n >= sizeof(buf)) return false;
00536   AppendPDFObject(buf);
00537 
00538   // CIDTOGIDMAP
00539   const int kCIDToGIDMapSize = 2 * (1 << 16);
00540   unsigned char *cidtogidmap = new unsigned char[kCIDToGIDMapSize];
00541   for (int i = 0; i < kCIDToGIDMapSize; i++) {
00542     cidtogidmap[i] = (i % 2) ? 1 : 0;
00543   }
00544   size_t len;
00545   unsigned char *comp =
00546       zlibCompress(cidtogidmap, kCIDToGIDMapSize, &len);
00547   delete[] cidtogidmap;
00548   n = snprintf(buf, sizeof(buf),
00549                "5 0 obj\n"
00550                "<<\n"
00551                "  /Length %lu /Filter /FlateDecode\n"
00552                ">>\n"
00553                "stream\n", (unsigned long)len);
00554   if (n >= sizeof(buf)) {
00555     lept_free(comp);
00556     return false;
00557   }
00558   AppendString(buf);
00559   long objsize = strlen(buf);
00560   AppendData(reinterpret_cast<char *>(comp), len);
00561   objsize += len;
00562   lept_free(comp);
00563   const char *endstream_endobj =
00564       "endstream\n"
00565       "endobj\n";
00566   AppendString(endstream_endobj);
00567   objsize += strlen(endstream_endobj);
00568   AppendPDFObjectDIY(objsize);
00569 
00570   const char *stream =
00571       "/CIDInit /ProcSet findresource begin\n"
00572       "12 dict begin\n"
00573       "begincmap\n"
00574       "/CIDSystemInfo\n"
00575       "<<\n"
00576       "  /Registry (Adobe)\n"
00577       "  /Ordering (UCS)\n"
00578       "  /Supplement 0\n"
00579       ">> def\n"
00580       "/CMapName /Adobe-Identity-UCS def\n"
00581       "/CMapType 2 def\n"
00582       "1 begincodespacerange\n"
00583       "<0000> <FFFF>\n"
00584       "endcodespacerange\n"
00585       "1 beginbfrange\n"
00586       "<0000> <FFFF> <0000>\n"
00587       "endbfrange\n"
00588       "endcmap\n"
00589       "CMapName currentdict /CMap defineresource pop\n"
00590       "end\n"
00591       "end\n";
00592 
00593   // TOUNICODE
00594   n = snprintf(buf, sizeof(buf),
00595                "6 0 obj\n"
00596                "<< /Length %lu >>\n"
00597                "stream\n"
00598                "%s"
00599                "endstream\n"
00600                "endobj\n", (unsigned long) strlen(stream), stream);
00601   if (n >= sizeof(buf)) return false;
00602   AppendPDFObject(buf);
00603 
00604   // FONT DESCRIPTOR
00605   const int kCharHeight = 2;  // Effect: highlights are half height
00606   n = snprintf(buf, sizeof(buf),
00607                "7 0 obj\n"
00608                "<<\n"
00609                "  /Ascent %d\n"
00610                "  /CapHeight %d\n"
00611                "  /Descent -1\n"       // Spec says must be negative
00612                "  /Flags 5\n"          // FixedPitch + Symbolic
00613                "  /FontBBox  [ 0 0 %d %d ]\n"
00614                "  /FontFile2 %ld 0 R\n"
00615                "  /FontName /GlyphLessFont\n"
00616                "  /ItalicAngle 0\n"
00617                "  /StemV 80\n"
00618                "  /Type /FontDescriptor\n"
00619                ">>\n"
00620                "endobj\n",
00621                1000 / kCharHeight,
00622                1000 / kCharHeight,
00623                1000 / kCharWidth,
00624                1000 / kCharHeight,
00625                8L      // Font data
00626                );
00627   if (n >= sizeof(buf)) return false;
00628   AppendPDFObject(buf);
00629 
00630   n = snprintf(buf, sizeof(buf), "%s/pdf.ttf", datadir_);
00631   if (n >= sizeof(buf)) return false;
00632   FILE *fp = fopen(buf, "rb");
00633   if (!fp) {
00634     tprintf("Cannot open file \"%s\"!\n", buf);
00635     return false;
00636   }
00637   fseek(fp, 0, SEEK_END);
00638   long int size = ftell(fp);
00639   fseek(fp, 0, SEEK_SET);
00640   char *buffer = new char[size];
00641   if (fread(buffer, 1, size, fp) != static_cast<size_t>(size)) {
00642     fclose(fp);
00643     delete[] buffer;
00644     return false;
00645   }
00646   fclose(fp);
00647   // FONTFILE2
00648   n = snprintf(buf, sizeof(buf),
00649                "8 0 obj\n"
00650                "<<\n"
00651                "  /Length %ld\n"
00652                "  /Length1 %ld\n"
00653                ">>\n"
00654                "stream\n", size, size);
00655   if (n >= sizeof(buf)) {
00656     delete[] buffer;
00657     return false;
00658   }
00659   AppendString(buf);
00660   objsize  = strlen(buf);
00661   AppendData(buffer, size);
00662   delete[] buffer;
00663   objsize += size;
00664   AppendString(endstream_endobj);
00665   objsize += strlen(endstream_endobj);
00666   AppendPDFObjectDIY(objsize);
00667   return true;
00668 }
00669 
00670 bool TessPDFRenderer::imageToPDFObj(Pix *pix,
00671                                     char *filename,
00672                                     long int objnum,
00673                                     char **pdf_object,
00674                                     long int *pdf_object_size) {
00675   size_t n;
00676   char b0[kBasicBufSize];
00677   char b1[kBasicBufSize];
00678   char b2[kBasicBufSize];
00679   if (!pdf_object_size || !pdf_object)
00680     return false;
00681   *pdf_object = NULL;
00682   *pdf_object_size = 0;
00683   if (!filename)
00684     return false;
00685 
00686   L_COMP_DATA *cid = NULL;
00687   const int kJpegQuality = 85;
00688 
00689   // TODO(jbreiden) Leptonica 1.71 doesn't correctly handle certain
00690   // types of PNG files, especially if there are 2 samples per pixel.
00691   // We can get rid of this logic after Leptonica 1.72 is released and
00692   // has propagated everywhere. Bug discussion as follows.
00693   // https://code.google.com/p/tesseract-ocr/issues/detail?id=1300
00694   int format, sad;
00695   findFileFormat(filename, &format);
00696   if (pixGetSpp(pix) == 4 && format == IFF_PNG) {
00697     pixSetSpp(pix, 3);
00698     sad = pixGenerateCIData(pix, L_FLATE_ENCODE, 0, 0, &cid);
00699   } else {
00700     sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);
00701   }
00702 
00703   if (sad || !cid) {
00704     l_CIDataDestroy(&cid);
00705     return false;
00706   }
00707 
00708   const char *group4 = "";
00709   const char *filter;
00710   switch(cid->type) {
00711     case L_FLATE_ENCODE:
00712       filter = "/FlateDecode";
00713       break;
00714     case L_JPEG_ENCODE:
00715       filter = "/DCTDecode";
00716       break;
00717     case L_G4_ENCODE:
00718       filter = "/CCITTFaxDecode";
00719       group4 = "    /K -1\n";
00720       break;
00721     case L_JP2K_ENCODE:
00722       filter = "/JPXDecode";
00723       break;
00724     default:
00725       l_CIDataDestroy(&cid);
00726       return false;
00727   }
00728 
00729   // Maybe someday we will accept RGBA but today is not that day.
00730   // It requires creating an /SMask for the alpha channel.
00731   // http://stackoverflow.com/questions/14220221
00732   const char *colorspace;
00733   if (cid->ncolors > 0) {
00734     n = snprintf(b0, sizeof(b0),
00735                  "  /ColorSpace [ /Indexed /DeviceRGB %d %s ]\n",
00736                  cid->ncolors - 1, cid->cmapdatahex);
00737     if (n >= sizeof(b0)) {
00738       l_CIDataDestroy(&cid);
00739       return false;
00740     }
00741     colorspace = b0;
00742   } else {
00743     switch (cid->spp) {
00744       case 1:
00745         colorspace = "  /ColorSpace /DeviceGray\n";
00746         break;
00747       case 3:
00748         colorspace = "  /ColorSpace /DeviceRGB\n";
00749         break;
00750       default:
00751         l_CIDataDestroy(&cid);
00752         return false;
00753     }
00754   }
00755 
00756   int predictor = (cid->predictor) ? 14 : 1;
00757 
00758   // IMAGE
00759   n = snprintf(b1, sizeof(b1),
00760                "%ld 0 obj\n"
00761                "<<\n"
00762                "  /Length %lu\n"
00763                "  /Subtype /Image\n",
00764                objnum, (unsigned long) cid->nbytescomp);
00765   if (n >= sizeof(b1)) {
00766     l_CIDataDestroy(&cid);
00767     return false;
00768   }
00769 
00770   n = snprintf(b2, sizeof(b2),
00771                "  /Width %d\n"
00772                "  /Height %d\n"
00773                "  /BitsPerComponent %d\n"
00774                "  /Filter %s\n"
00775                "  /DecodeParms\n"
00776                "  <<\n"
00777                "    /Predictor %d\n"
00778                "    /Colors %d\n"
00779                "%s"
00780                "    /Columns %d\n"
00781                "    /BitsPerComponent %d\n"
00782                "  >>\n"
00783                ">>\n"
00784                "stream\n",
00785                cid->w, cid->h, cid->bps, filter, predictor, cid->spp,
00786                group4, cid->w, cid->bps);
00787   if (n >= sizeof(b2)) {
00788     l_CIDataDestroy(&cid);
00789     return false;
00790   }
00791 
00792   const char *b3 =
00793       "endstream\n"
00794       "endobj\n";
00795 
00796   size_t b1_len = strlen(b1);
00797   size_t b2_len = strlen(b2);
00798   size_t b3_len = strlen(b3);
00799   size_t colorspace_len = strlen(colorspace);
00800 
00801   *pdf_object_size =
00802       b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
00803   *pdf_object = new char[*pdf_object_size];
00804   if (!*pdf_object) {
00805     l_CIDataDestroy(&cid);
00806     return false;
00807   }
00808 
00809   char *p = *pdf_object;
00810   memcpy(p, b1, b1_len);
00811   p += b1_len;
00812   memcpy(p, colorspace, colorspace_len);
00813   p += colorspace_len;
00814   memcpy(p, b2, b2_len);
00815   p += b2_len;
00816   memcpy(p, cid->datacomp, cid->nbytescomp);
00817   p += cid->nbytescomp;
00818   memcpy(p, b3, b3_len);
00819   l_CIDataDestroy(&cid);
00820   return true;
00821 }
00822 
00823 bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
00824   size_t n;
00825   char buf[kBasicBufSize];
00826   Pix *pix = api->GetInputImage();
00827   char *filename = (char *)api->GetInputName();
00828   int ppi = api->GetSourceYResolution();
00829   if (!pix || ppi <= 0)
00830     return false;
00831   double width = pixGetWidth(pix) * 72.0 / ppi;
00832   double height = pixGetHeight(pix) * 72.0 / ppi;
00833 
00834   // PAGE
00835   n = snprintf(buf, sizeof(buf),
00836                "%ld 0 obj\n"
00837                "<<\n"
00838                "  /Type /Page\n"
00839                "  /Parent %ld 0 R\n"
00840                "  /MediaBox [0 0 %.2f %.2f]\n"
00841                "  /Contents %ld 0 R\n"
00842                "  /Resources\n"
00843                "  <<\n"
00844                "    /XObject << /Im1 %ld 0 R >>\n"
00845                "    /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
00846                "    /Font << /f-0-0 %ld 0 R >>\n"
00847                "  >>\n"
00848                ">>\n"
00849                "endobj\n",
00850                obj_,
00851                2L,            // Pages object
00852                width,
00853                height,
00854                obj_ + 1,      // Contents object
00855                obj_ + 2,      // Image object
00856                3L);           // Type0 Font
00857   if (n >= sizeof(buf)) return false;
00858   pages_.push_back(obj_);
00859   AppendPDFObject(buf);
00860 
00861   // CONTENTS
00862   char* pdftext = GetPDFTextObjects(api, width, height);
00863   long pdftext_len = strlen(pdftext);
00864   unsigned char *pdftext_casted = reinterpret_cast<unsigned char *>(pdftext);
00865   size_t len;
00866   unsigned char *comp_pdftext =
00867       zlibCompress(pdftext_casted, pdftext_len, &len);
00868   long comp_pdftext_len = len;
00869   n = snprintf(buf, sizeof(buf),
00870                "%ld 0 obj\n"
00871                "<<\n"
00872                "  /Length %ld /Filter /FlateDecode\n"
00873                ">>\n"
00874                "stream\n", obj_, comp_pdftext_len);
00875   if (n >= sizeof(buf)) {
00876     delete[] pdftext;
00877     lept_free(comp_pdftext);
00878     return false;
00879   }
00880   AppendString(buf);
00881   long objsize = strlen(buf);
00882   AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
00883   objsize += comp_pdftext_len;
00884   lept_free(comp_pdftext);
00885   delete[] pdftext;
00886   const char *b2 =
00887       "endstream\n"
00888       "endobj\n";
00889   AppendString(b2);
00890   objsize += strlen(b2);
00891   AppendPDFObjectDIY(objsize);
00892 
00893   char *pdf_object;
00894   if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
00895     return false;
00896   }
00897   AppendData(pdf_object, objsize);
00898   AppendPDFObjectDIY(objsize);
00899   delete[] pdf_object;
00900   return true;
00901 }
00902 
00903 
00904 bool TessPDFRenderer::EndDocumentHandler() {
00905   size_t n;
00906   char buf[kBasicBufSize];
00907 
00908   // We reserved the /Pages object number early, so that the /Page
00909   // objects could refer to their parent. We finally have enough
00910   // information to go fill it in. Using lower level calls to manipulate
00911   // the offset record in two spots, because we are placing objects
00912   // out of order in the file.
00913 
00914   // PAGES
00915   const long int kPagesObjectNumber = 2;
00916   offsets_[kPagesObjectNumber] = offsets_.back();  // manipulation #1
00917   n = snprintf(buf, sizeof(buf),
00918                "%ld 0 obj\n"
00919                "<<\n"
00920                "  /Type /Pages\n"
00921                "  /Kids [ ", kPagesObjectNumber);
00922   if (n >= sizeof(buf)) return false;
00923   AppendString(buf);
00924   size_t pages_objsize  = strlen(buf);
00925   for (size_t i = 0; i < pages_.size(); i++) {
00926     n = snprintf(buf, sizeof(buf),
00927                  "%ld 0 R ", pages_[i]);
00928     if (n >= sizeof(buf)) return false;
00929     AppendString(buf);
00930     pages_objsize += strlen(buf);
00931   }
00932   n = snprintf(buf, sizeof(buf),
00933                "]\n"
00934                "  /Count %d\n"
00935                ">>\n"
00936                "endobj\n", pages_.size());
00937   if (n >= sizeof(buf)) return false;
00938   AppendString(buf);
00939   pages_objsize += strlen(buf);
00940   offsets_.back() += pages_objsize;    // manipulation #2
00941 
00942   // INFO
00943   char* datestr = l_getFormattedDate();
00944   n = snprintf(buf, sizeof(buf),
00945                "%ld 0 obj\n"
00946                "<<\n"
00947                "  /Producer (Tesseract %s)\n"
00948                "  /CreationDate (D:%s)\n"
00949                "  /Title (%s)\n"
00950                ">>\n"
00951                "endobj\n", obj_, TESSERACT_VERSION_STR, datestr, title());
00952   lept_free(datestr);
00953   if (n >= sizeof(buf)) return false;
00954   AppendPDFObject(buf);
00955   n = snprintf(buf, sizeof(buf),
00956                "xref\n"
00957                "0 %ld\n"
00958                "0000000000 65535 f \n", obj_);
00959   if (n >= sizeof(buf)) return false;
00960   AppendString(buf);
00961   for (int i = 1; i < obj_; i++) {
00962     n = snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offsets_[i]);
00963     if (n >= sizeof(buf)) return false;
00964     AppendString(buf);
00965   }
00966   n = snprintf(buf, sizeof(buf),
00967                "trailer\n"
00968                "<<\n"
00969                "  /Size %ld\n"
00970                "  /Root %ld 0 R\n"
00971                "  /Info %ld 0 R\n"
00972                ">>\n"
00973                "startxref\n"
00974                "%ld\n"
00975                "%%%%EOF\n",
00976                obj_,
00977                1L,               // catalog
00978                obj_ - 1,         // info
00979                offsets_.back());
00980   if (n >= sizeof(buf)) return false;
00981   AppendString(buf);
00982   return true;
00983 }
00984 }  // namespace tesseract
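
The offset bookkeeping in EndDocumentHandler() (reserving the /Pages object number early, writing the object late, then emitting the xref table and trailer) can be sketched in isolation. The following is a minimal sketch, not Tesseract code: the class XrefSketch and the helpers Obj() and BuildSketchPdf() are hypothetical names introduced for illustration, and the sketch assumes that slot 0 of the offset list stands for the file header and that the last slot always holds the offset at which the next object will start. The objects are structural placeholders, so the result is not meant to be a complete, viewable PDF.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the renderer's offset bookkeeping.
class XrefSketch {
 public:
  XrefSketch() { offsets_.push_back(0); }

  // Writes a complete object and records where the following one starts.
  // Passing "" reserves an object number without writing any bytes.
  void AppendObject(const std::string &s) {
    offsets_.push_back(offsets_.back() + static_cast<long>(s.size()));
    out_ += s;
  }

  // Fills in an object whose number was reserved earlier.
  void AppendReserved(long objnum, const std::string &s) {
    offsets_[objnum] = offsets_.back();              // manipulation #1
    out_ += s;
    offsets_.back() += static_cast<long>(s.size());  // manipulation #2
  }

  // Emits the xref table and trailer and returns the finished bytes.
  std::string Finish() {
    const long n = static_cast<long>(offsets_.size()) - 1;  // next object number
    const long xref_pos = offsets_.back();
    char buf[160];
    std::snprintf(buf, sizeof(buf), "xref\n0 %ld\n0000000000 65535 f \n", n);
    out_ += buf;
    for (long i = 1; i < n; ++i) {
      // Each in-use entry is a 10-digit byte offset plus a generation number.
      std::snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offsets_[i]);
      out_ += buf;
    }
    std::snprintf(buf, sizeof(buf),
                  "trailer\n<<\n  /Size %ld\n  /Root 1 0 R\n>>\n"
                  "startxref\n%ld\n%%%%EOF\n", n, xref_pos);
    out_ += buf;
    return out_;
  }

  const std::vector<long> &offsets() const { return offsets_; }

 private:
  std::string out_;
  std::vector<long> offsets_;
};

// Wraps a dictionary body in "N 0 obj ... endobj".
std::string Obj(long num, const std::string &body) {
  return std::to_string(num) + " 0 obj\n" + body + "\nendobj\n";
}

// Builds a skeleton file in which object 2 is written after object 3.
std::string BuildSketchPdf(std::vector<long> *offsets_out) {
  XrefSketch w;
  w.AppendObject("%PDF-1.5\n");  // slot 0: file header
  w.AppendObject(Obj(1, "<< /Type /Catalog /Pages 2 0 R >>"));
  w.AppendObject("");            // reserve object number 2
  w.AppendObject(Obj(3, "<< /Type /Page /Parent 2 0 R >>"));
  w.AppendReserved(2, Obj(2, "<< /Type /Pages /Kids [ 3 0 R ] /Count 1 >>"));
  std::string pdf = w.Finish();
  if (offsets_out) *offsets_out = w.offsets();
  return pdf;
}
```

Calling BuildSketchPdf() yields a byte string in which every xref entry points at the "N 0 obj" line of its object, even though object 2 physically follows object 3. Keeping the next free offset in the last slot is what lets a late object be patched in with only two assignments.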