tesseract  3.04.01
pdfrenderer.cpp
1 // Include automatically generated configuration file if running autoconf.
2 #ifdef HAVE_CONFIG_H
3 #include "config_auto.h"
4 #endif
5 
6 #include "baseapi.h"
7 #include "renderer.h"
8 #include "math.h"
9 #include "strngs.h"
10 #include "tprintf.h"
11 #include "allheaders.h"
12 
13 #ifdef _MSC_VER
14 #include "mathfix.h"
15 #endif
16 
17 /*
18 
19 Design notes from Ken Sharp, with light editing.
20 
21 We think one solution is a font with a single glyph (.notdef) and a
22 CIDToGIDMap which maps all the CIDs to 0. That map would then be
23 stored as a stream in the PDF file, and when flate compressed should
24 be pretty small. The font, of course, will be approximately the same
25 size as the one you currently use.
26 
27 I'm working on such a font now, the CIDToGIDMap is trivial, you just
28 create a stream object which contains 128k bytes (2 bytes per possible
29 CID and your CIDs range from 0 to 65535) and where you currently have
30 "/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".
31 
32 Note that if, in future, you were to use a different (ie not 2 byte)
33 CMap for character codes you could trivially extend the CIDToGIDMap.
34 
35 The following is an explanation of how some of the font stuff works,
36 this may be too simple for you in which case please accept my
37 apologies, it's hard to know how much knowledge someone has. You can
38 skip all this anyway, it's just for information.
39 
40 The font embedded in a PDF file is usually intended just to be
41 rendered, but extensions allow for at least some ability to locate (or
42 copy) text from a document. This isn't something which was an original
43 goal of the PDF format, but it's been retrofitted, presumably due to
44 popular demand.
45 
46 To do this reliably the PDF file must contain a ToUnicode CMap, a
47 device for mapping character codes to Unicode code points. If one of
48 these is present, then this will be used to convert the character
49 codes into Unicode values. If it's not present then the reader will
50 fall back through a series of heuristics to try and guess the
51 result. This is, as you would expect, prone to failure.
52 
53 This doesn't concern you of course, since you always write a ToUnicode
54 CMap, so because you are writing the text in text rendering mode 3 it
55 would seem that you don't really need to worry about this, but in the
56 PDF spec you cannot have an isolated ToUnicode CMap, it has to be
57 attached to a font, so in order to get even copy/paste to work you
58 need to define a font.
59 
60 This is what leads to problems, tools like pdfwrite assume that they
61 are going to be able to (or even have to) modify the font entries, so
62 they require that the font being embedded be valid, and to be honest
63 the font Tesseract embeds isn't valid (for this purpose).
64 
65 
66 To see why, let's look at how text is specified in a PDF file:
67 
68 (Test) Tj
69 
70 Now that looks like text but actually it isn't. Each of those bytes is
71 a 'character code'. When it comes to rendering the text a complex
72 sequence of events takes place, which converts the character code into
73 'something' which the font understands. It's entirely possible via
74 character mappings to have that text render as 'Sftu'.
75 
76 For simple fonts (PostScript type 1), we use the character code as the
77 index into an Encoding array (256 elements), each element of which is
78 a glyph name, so this gives us a glyph name. We then consult the
79 CharStrings dictionary in the font, that's a complex object which
80 contains pairs of keys and values, you can use the key to retrieve a
81 given value. So we have a glyph name, we then use that as the key to
82 the dictionary and retrieve the associated value. For a type 1 font,
83 the value is a glyph program that describes how to draw the glyph.
84 
85 For CIDFonts, it's a little more complicated. Because CIDFonts can be
86 large, using a glyph name as the key is unreasonable (it would also
87 lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
88 as the key. CIDs are just numbers.
89 
90 But.... We don't use the character code as the CID. What we do is use
91 a CMap to convert the character code into a CID. We then use the CID
92 to key the CharStrings dictionary and proceed as before. So the 'CMap'
93 is the equivalent of the Encoding array, but it's a more compact and
94 flexible representation.
95 
96 Note that you have to use the CMap just to find out how many bytes
97 constitute a character code, and it can be variable. For example you
98 can say if the first byte is 0x00->0x7f then it's just one byte, if it's
99 0x80->0xf0 then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
100 have seen CMaps defining character codes up to 5 bytes wide.
101 
102 Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
103 TrueType CIDFonts. The thing is that TrueType fonts are accessed using
104 a Glyph ID (GID) (and the LOCA table) which may well not be anything
105 like the CID. So for this case PDF includes a CIDToGIDMap. That maps
106 the CIDs to GIDs, and we can then use the GID to get the glyph
107 description from the GLYF table of the font.
108 
109 So for a TrueType CIDFont, character-code->CID->GID->glyf-program.
110 
111 Looking at the PDF file I was supplied with we see that it contains
112 text like :
113 
114 <0x0075> Tj
115 
116 So we start by taking the character code (117) and look it up in the
117 CMap. Well you don't supply a CMap, you just use the Identity-H one
118 which is predefined. So character code 117 maps to CID 117. Then we
119 use the CIDToGIDMap, again you don't supply one, you just use the
120 predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
121 were supplied with only contains 116 glyphs.
122 
123 Now for Latin that's not a huge problem, you can just supply a bigger
124 font. But for more complex languages that *is* going to be more of a
125 problem. Either you need to supply a font which contains glyphs for
126 all the possible CID->GID mappings, or we need to think laterally.
127 
128 Our solution using a TrueType CIDFont is to intervene at the
129 CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
130 font with just one glyph, the .notdef glyph at GID 0. This is what I'm
131 looking into now.
132 
133 It would also be possible to have a 'PostScript' (ie type 1 outlines)
134 CIDFont which contained 1 glyph, and a CMap which mapped all character
135 codes to CID 0. The effect would be the same.
136 
137 It's possible (I haven't checked) that the PostScript CIDFont and
138 associated CMap would be smaller than the TrueType font and associated
139 CIDToGIDMap.
140 
141 --- in a followup ---
142 
143 OK there is a small problem there, if I use GID 0 then Acrobat gets
144 upset about it and complains it cannot extract the font. If I set the
145 CIDToGIDMap so that all the entries are 1 instead, its happy. Totally
146 mad......
147 
148 */
149 
150 namespace tesseract {
151 
152 // Use for PDF object fragments. Must be large enough
153 // to hold a colormap with 256 colors in the verbose
154 // PDF representation.
155 const int kBasicBufSize = 2048;
156 
157 // Glyph width divisor: at a 10 pt font size the nominal character width is 10 / kCharWidth = 5 pts
158 const int kCharWidth = 2;
159 
160 /**********************************************************************
161  * PDF Renderer interface implementation
162  **********************************************************************/
163 
164 TessPDFRenderer::TessPDFRenderer(const char* outputbase, const char *datadir)
165  : TessResultRenderer(outputbase, "pdf") {
166  obj_ = 0;
167  datadir_ = datadir;
168  offsets_.push_back(0);
169 }
170 
171 void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
172  offsets_.push_back(objectsize + offsets_.back());
173  obj_++;
174 }
175 
176 void TessPDFRenderer::AppendPDFObject(const char *data) {
177  AppendPDFObjectDIY(strlen(data));
178  AppendString((const char *)data);
179 }
180 
181 // Helper function to prevent us from accidentally writing
182 // scientific notation to an HOCR or PDF file. Besides, three
183 // decimal places are all you really need.
184 double prec(double x) {
185  double kPrecision = 1000.0;
186  double a = round(x * kPrecision) / kPrecision;
187  if (a == -0)
188  return 0;
189  return a;
190 }
191 
192 long dist2(int x1, int y1, int x2, int y2) {
193  return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
194 }
195 
196 // Viewers like evince can get really confused during copy-paste when
197 // the baseline wanders around. So I've decided to project every word
198 // onto the (straight) line baseline. All numbers are in the native
199 // PDF coordinate system, which has the origin in the bottom left and
200 // the unit is points, which is 1/72 inch. Tesseract reports baselines
201 // left-to-right no matter what the reading order is. We need the
202 // word baseline in reading order, so we do that conversion here. Returns
203 // the word's baseline origin and length.
204 void GetWordBaseline(int writing_direction, int ppi, int height,
205  int word_x1, int word_y1, int word_x2, int word_y2,
206  int line_x1, int line_y1, int line_x2, int line_y2,
207  double *x0, double *y0, double *length) {
208  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
209  Swap(&word_x1, &word_x2);
210  Swap(&word_y1, &word_y2);
211  }
212  double word_length;
213  double x, y;
214  {
215  int px = word_x1;
216  int py = word_y1;
217  double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
218  if (l2 == 0) {
219  x = line_x1;
220  y = line_y1;
221  } else {
222  double t = ((px - line_x2) * (line_x2 - line_x1) +
223  (py - line_y2) * (line_y2 - line_y1)) / l2;
224  x = line_x2 + t * (line_x2 - line_x1);
225  y = line_y2 + t * (line_y2 - line_y1);
226  }
227  word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1,
228  word_x2, word_y2)));
229  word_length = word_length * 72.0 / ppi;
230  x = x * 72 / ppi;
231  y = height - (y * 72.0 / ppi);
232  }
233  *x0 = x;
234  *y0 = y;
235  *length = word_length;
236 }
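// Editor's sketch (not part of the Tesseract source): GetWordBaseline above
// projects the word origin onto the line baseline and then converts pixels
// to PDF points. The two steps are separated below so each can be checked
// on its own. ProjectOntoLine and PixelsToPoints are hypothetical names,
// and this version measures the projection from the first line point rather
// than the second, which yields the same projected point.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: orthogonal projection of (px, py) onto the line
// through (x1, y1) and (x2, y2). A degenerate line returns (x1, y1).
void ProjectOntoLine(double px, double py,
                     double x1, double y1, double x2, double y2,
                     double *x, double *y) {
  assert(x != nullptr && y != nullptr);
  const double dx = x2 - x1;
  const double dy = y2 - y1;
  const double len2 = dx * dx + dy * dy;
  if (len2 == 0) {
    *x = x1;
    *y = y1;
    return;
  }
  // t is the position of the projection along the line, 0 at the first
  // point and 1 at the second.
  const double t = ((px - x1) * dx + (py - y1) * dy) / len2;
  *x = x1 + t * dx;
  *y = y1 + t * dy;
}

// Hypothetical helper: converts a pixel distance to PDF points (1/72 inch).
double PixelsToPoints(double pixels, double ppi) {
  return pixels * 72.0 / ppi;
}
```

// The y flip that GetWordBaseline applies (height minus y) is left out here.
// It is needed because image rows grow downward while PDF's origin is at the
// bottom left.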
237 
238 // Compute coefficients for an affine matrix describing the rotation
239 // of the text. If the text is right-to-left such as Arabic or Hebrew,
240 // we reflect over the Y-axis. This matrix will set the coordinate
241 // system for placing text in the PDF file.
242 //
243 //                           RTL
244 // [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
245 // [ y' ]   [ c d ][ y ]   [ 0 1 ] [-sin cos ][ y ]
246 void AffineMatrix(int writing_direction,
247  int line_x1, int line_y1, int line_x2, int line_y2,
248  double *a, double *b, double *c, double *d) {
249  double theta = atan2(static_cast<double>(line_y1 - line_y2),
250  static_cast<double>(line_x2 - line_x1));
251  *a = cos(theta);
252  *b = sin(theta);
253  *c = -sin(theta);
254  *d = cos(theta);
255 switch(writing_direction) {
256 case WRITING_DIRECTION_RIGHT_TO_LEFT:
257 *a = -*a;
258 *b = -*b;
259 break;
260 case WRITING_DIRECTION_TOP_TO_BOTTOM:
261 // TODO(jbreiden) Consider using the vertical PDF writing mode.
262 break;
263  default:
264  break;
265  }
266 }
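// Editor's sketch (not part of the Tesseract source): a standalone version of
// the matrix computed by AffineMatrix above, handy for checking a few
// baselines by hand. TextMatrix and BaselineRotation are hypothetical names.
// As in the original, y1 - y2 is used because image rows grow downward.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the four rotation coefficients of a PDF Tm.
struct TextMatrix {
  double a, b, c, d;
};

// Hypothetical helper: rotation that aligns text with the baseline running
// from (x1, y1) to (x2, y2) in image coordinates. For right-to-left text
// the first row is negated, which reflects over the Y-axis.
TextMatrix BaselineRotation(int x1, int y1, int x2, int y2,
                            bool right_to_left) {
  assert(x1 != x2 || y1 != y2);  // assumption of this sketch only
  const double theta = std::atan2(static_cast<double>(y1 - y2),
                                  static_cast<double>(x2 - x1));
  TextMatrix m = {std::cos(theta), std::sin(theta),
                  -std::sin(theta), std::cos(theta)};
  if (right_to_left) {
    m.a = -m.a;
    m.b = -m.b;
  }
  return m;
}
```

// A flat left-to-right baseline yields the identity rotation (1 0 0 1). The
// same baseline read right-to-left yields a = -1 with d unchanged.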
267 
268 // There are some really stupid PDF viewers in the wild, such as
269 // 'Preview' which ships with the Mac. They do a better job with text
270 // selection and highlighting when given a perfectly flat baseline
271 // instead of a very slightly tilted one. We clip small tilts to appease
272 // these viewers. I chose this threshold large enough to absorb noise,
273 // but small enough that lines probably won't cross each other if the
274 // whole page is tilted at almost exactly the clipping threshold.
275 void ClipBaseline(int ppi, int x1, int y1, int x2, int y2,
276  int *line_x1, int *line_y1,
277  int *line_x2, int *line_y2) {
278  *line_x1 = x1;
279  *line_y1 = y1;
280  *line_x2 = x2;
281  *line_y2 = y2;
282  double rise = abs(y2 - y1) * 72 / ppi;
283  double run = abs(x2 - x1) * 72 / ppi;
284  if (rise < 2.0 && 2.0 < run)
285  *line_y1 = *line_y2 = (y1 + y2) / 2;
286 }
287 
288 char* TessPDFRenderer::GetPDFTextObjects(TessBaseAPI* api,
289  double width, double height) {
290  STRING pdf_str("");
291  double ppi = api->GetSourceYResolution();
292 
293  // These initial conditions are all arbitrary and will be overwritten
294  double old_x = 0.0, old_y = 0.0;
295  int old_fontsize = 0;
296 tesseract::WritingDirection old_writing_direction =
297 WRITING_DIRECTION_LEFT_TO_RIGHT;
298  bool new_block = true;
299  int fontsize = 0;
300  double a = 1;
301  double b = 0;
302  double c = 0;
303  double d = 1;
304 
305  // TODO(jbreiden) This marries the text and image together.
306  // Slightly cleaner from an abstraction standpoint if this were to
307  // live inside a separate text object.
308  pdf_str += "q ";
309  pdf_str.add_str_double("", prec(width));
310  pdf_str += " 0 0 ";
311  pdf_str.add_str_double("", prec(height));
312  pdf_str += " 0 0 cm /Im1 Do Q\n";
313 
314  int line_x1 = 0;
315  int line_y1 = 0;
316  int line_x2 = 0;
317  int line_y2 = 0;
318 
319  ResultIterator *res_it = api->GetIterator();
320  while (!res_it->Empty(RIL_BLOCK)) {
321  if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
322  pdf_str += "BT\n3 Tr"; // Begin text object, use invisible ink
323  old_fontsize = 0; // Every block will declare its fontsize
324  new_block = true; // Every block will declare its affine matrix
325  }
326 
327  if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
328  int x1, y1, x2, y2;
329  res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
330  ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
331  }
332 
333  if (res_it->Empty(RIL_WORD)) {
334  res_it->Next(RIL_WORD);
335  continue;
336  }
337 
338  // Writing direction changes at a per-word granularity
339  tesseract::WritingDirection writing_direction;
340  {
341  tesseract::Orientation orientation;
342  tesseract::TextlineOrder textline_order;
343  float deskew_angle;
344  res_it->Orientation(&orientation, &writing_direction,
345  &textline_order, &deskew_angle);
346  if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
347  switch (res_it->WordDirection()) {
348  case DIR_LEFT_TO_RIGHT:
349  writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
350  break;
351  case DIR_RIGHT_TO_LEFT:
352  writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
353  break;
354  default:
355  writing_direction = old_writing_direction;
356  }
357  }
358  }
359 
360  // Where is word origin and how long is it?
361  double x, y, word_length;
362  {
363  int word_x1, word_y1, word_x2, word_y2;
364  res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
365  GetWordBaseline(writing_direction, ppi, height,
366  word_x1, word_y1, word_x2, word_y2,
367  line_x1, line_y1, line_x2, line_y2,
368  &x, &y, &word_length);
369  }
370 
371  if (writing_direction != old_writing_direction || new_block) {
372  AffineMatrix(writing_direction,
373  line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
374  pdf_str.add_str_double(" ", prec(a)); // . This affine matrix
375  pdf_str.add_str_double(" ", prec(b)); // . sets the coordinate
376  pdf_str.add_str_double(" ", prec(c)); // . system for all
377  pdf_str.add_str_double(" ", prec(d)); // . text that follows.
378  pdf_str.add_str_double(" ", prec(x)); // .
379  pdf_str.add_str_double(" ", prec(y)); // .
380  pdf_str += (" Tm "); // Place cursor absolutely
381  new_block = false;
382  } else {
383  double dx = x - old_x;
384  double dy = y - old_y;
385  pdf_str.add_str_double(" ", prec(dx * a + dy * b));
386  pdf_str.add_str_double(" ", prec(dx * c + dy * d));
387  pdf_str += (" Td "); // Relative moveto
388  }
389  old_x = x;
390  old_y = y;
391  old_writing_direction = writing_direction;
392 
393  // Adjust font size on a per word granularity. Pay attention to
394 // fontsize, old_fontsize, and pdf_str. We've found that
395 // for Arabic, Tesseract will happily return a fontsize of zero,
396  // so we make up a default number to protect ourselves.
397  {
398  bool bold, italic, underlined, monospace, serif, smallcaps;
399  int font_id;
400  res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace,
401  &serif, &smallcaps, &fontsize, &font_id);
402  const int kDefaultFontsize = 8;
403  if (fontsize <= 0)
404  fontsize = kDefaultFontsize;
405  if (fontsize != old_fontsize) {
406  char textfont[20];
407  snprintf(textfont, sizeof(textfont), "/f-0-0 %d Tf ", fontsize);
408  pdf_str += textfont;
409  old_fontsize = fontsize;
410  }
411  }
412 
413  bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
414  bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
415  STRING pdf_word("");
416  int pdf_word_len = 0;
417  do {
418  const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
419  if (grapheme && grapheme[0] != '\0') {
420  GenericVector<int> unicodes;
421  UNICHAR::UTF8ToUnicode(grapheme, &unicodes);
422  char utf16[20];
423  for (int i = 0; i < unicodes.length(); i++) {
424  int code = unicodes[i];
425  // Convert to UTF-16BE https://en.wikipedia.org/wiki/UTF-16
426  if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
427  tprintf("Dropping invalid codepoint %d\n", code);
428  continue;
429  }
430  if (code < 0x10000) {
431  snprintf(utf16, sizeof(utf16), "<%04X>", code);
432  } else {
433  int a = code - 0x010000;
434  int high_surrogate = (0x03FF & (a >> 10)) + 0xD800;
435  int low_surrogate = (0x03FF & a) + 0xDC00;
436  snprintf(utf16, sizeof(utf16), "<%04X%04X>",
437  high_surrogate, low_surrogate);
438  }
439  pdf_word += utf16;
440  pdf_word_len++;
441  }
442  }
443  delete []grapheme;
444  res_it->Next(RIL_SYMBOL);
445  } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
446  if (word_length > 0 && pdf_word_len > 0 && fontsize > 0) {
447  double h_stretch =
448  kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
449  pdf_str.add_str_double("", h_stretch);
450  pdf_str += " Tz"; // horizontal stretch
451  pdf_str += " [ ";
452  pdf_str += pdf_word; // UTF-16BE representation
453  pdf_str += " ] TJ"; // show the text
454  }
455  if (last_word_in_line) {
456  pdf_str += " \n";
457  }
458  if (last_word_in_block) {
459  pdf_str += "ET\n"; // end the text object
460  }
461  }
462  char *ret = new char[pdf_str.length() + 1];
463  strcpy(ret, pdf_str.string());
464  delete res_it;
465  return ret;
466 }
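// Editor's sketch (not part of the Tesseract source): the inner loop of
// GetPDFTextObjects above writes each code point as a UTF-16BE hex string,
// using a surrogate pair above the Basic Multilingual Plane. The helper
// below isolates that conversion. CodePointToPdfHex is a hypothetical name,
// and returning an empty string for invalid input (including negative
// values) is a choice of this sketch; the original logs and skips instead.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: formats one Unicode code point as a PDF hex string
// in UTF-16BE, e.g. "<0041>". Returns "" for surrogates and out-of-range
// values.
std::string CodePointToPdfHex(int code) {
  if (code < 0 || (code > 0xD7FF && code < 0xE000) || code > 0x10FFFF)
    return std::string();
  char buf[16];
  int n;
  if (code < 0x10000) {
    n = std::snprintf(buf, sizeof(buf), "<%04X>",
                      static_cast<unsigned>(code));
  } else {
    // Split the 20 bits above 0x10000 into two 10-bit halves.
    const unsigned v = static_cast<unsigned>(code) - 0x10000u;
    const unsigned high = 0xD800u + ((v >> 10) & 0x3FFu);
    const unsigned low = 0xDC00u + (v & 0x3FFu);
    n = std::snprintf(buf, sizeof(buf), "<%04X%04X>", high, low);
  }
  assert(n > 0 && n < static_cast<int>(sizeof(buf)));
  return std::string(buf);
}
```

// For example, U+0041 becomes <0041> and U+1F600 becomes <D83DDE00>.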
467 
468 bool TessPDFRenderer::BeginDocumentHandler() {
469  char buf[kBasicBufSize];
470  size_t n;
471 
472  n = snprintf(buf, sizeof(buf),
473  "%%PDF-1.5\n"
474  "%%%c%c%c%c\n",
475  0xDE, 0xAD, 0xBE, 0xEB);
476  if (n >= sizeof(buf)) return false;
477  AppendPDFObject(buf);
478 
479  // CATALOG
480  n = snprintf(buf, sizeof(buf),
481  "1 0 obj\n"
482  "<<\n"
483  " /Type /Catalog\n"
484  " /Pages %ld 0 R\n"
485  ">>\n"
486  "endobj\n",
487  2L);
488  if (n >= sizeof(buf)) return false;
489  AppendPDFObject(buf);
490 
491  // We are reserving object #2 for the /Pages
492  // object, which I am going to create and write
493  // at the end of the PDF file.
494  AppendPDFObject("");
495 
496  // TYPE0 FONT
497  n = snprintf(buf, sizeof(buf),
498  "3 0 obj\n"
499  "<<\n"
500  " /BaseFont /GlyphLessFont\n"
501  " /DescendantFonts [ %ld 0 R ]\n"
502  " /Encoding /Identity-H\n"
503  " /Subtype /Type0\n"
504  " /ToUnicode %ld 0 R\n"
505  " /Type /Font\n"
506  ">>\n"
507  "endobj\n",
508  4L, // CIDFontType2 font
509  6L // ToUnicode
510  );
511  if (n >= sizeof(buf)) return false;
512  AppendPDFObject(buf);
513 
514  // CIDFONTTYPE2
515  n = snprintf(buf, sizeof(buf),
516  "4 0 obj\n"
517  "<<\n"
518  " /BaseFont /GlyphLessFont\n"
519  " /CIDToGIDMap %ld 0 R\n"
520  " /CIDSystemInfo\n"
521  " <<\n"
522  " /Ordering (Identity)\n"
523  " /Registry (Adobe)\n"
524  " /Supplement 0\n"
525  " >>\n"
526  " /FontDescriptor %ld 0 R\n"
527  " /Subtype /CIDFontType2\n"
528  " /Type /Font\n"
529  " /DW %d\n"
530  ">>\n"
531  "endobj\n",
532  5L, // CIDToGIDMap
533  7L, // Font descriptor
534  1000 / kCharWidth);
535  if (n >= sizeof(buf)) return false;
536  AppendPDFObject(buf);
537 
538  // CIDTOGIDMAP
539  const int kCIDToGIDMapSize = 2 * (1 << 16);
540  unsigned char *cidtogidmap = new unsigned char[kCIDToGIDMapSize];
541  for (int i = 0; i < kCIDToGIDMapSize; i++) {
542  cidtogidmap[i] = (i % 2) ? 1 : 0;
543  }
544  size_t len;
545  unsigned char *comp =
546  zlibCompress(cidtogidmap, kCIDToGIDMapSize, &len);
547  delete[] cidtogidmap;
548  n = snprintf(buf, sizeof(buf),
549  "5 0 obj\n"
550  "<<\n"
551  " /Length %lu /Filter /FlateDecode\n"
552  ">>\n"
553  "stream\n", (unsigned long)len);
554  if (n >= sizeof(buf)) {
555  lept_free(comp);
556  return false;
557  }
558  AppendString(buf);
559  long objsize = strlen(buf);
560  AppendData(reinterpret_cast<char *>(comp), len);
561  objsize += len;
562  lept_free(comp);
563  const char *endstream_endobj =
564  "endstream\n"
565  "endobj\n";
566  AppendString(endstream_endobj);
567  objsize += strlen(endstream_endobj);
568  AppendPDFObjectDIY(objsize);
569 
570  const char *stream =
571  "/CIDInit /ProcSet findresource begin\n"
572  "12 dict begin\n"
573  "begincmap\n"
574  "/CIDSystemInfo\n"
575  "<<\n"
576  " /Registry (Adobe)\n"
577  " /Ordering (UCS)\n"
578  " /Supplement 0\n"
579  ">> def\n"
580 "/CMapName /Adobe-Identity-UCS def\n"
581  "/CMapType 2 def\n"
582  "1 begincodespacerange\n"
583  "<0000> <FFFF>\n"
584  "endcodespacerange\n"
585  "1 beginbfrange\n"
586  "<0000> <FFFF> <0000>\n"
587  "endbfrange\n"
588  "endcmap\n"
589  "CMapName currentdict /CMap defineresource pop\n"
590  "end\n"
591  "end\n";
592 
593  // TOUNICODE
594  n = snprintf(buf, sizeof(buf),
595  "6 0 obj\n"
596  "<< /Length %lu >>\n"
597  "stream\n"
598  "%s"
599  "endstream\n"
600  "endobj\n", (unsigned long) strlen(stream), stream);
601  if (n >= sizeof(buf)) return false;
602  AppendPDFObject(buf);
603 
604  // FONT DESCRIPTOR
605  const int kCharHeight = 2; // Effect: highlights are half height
606  n = snprintf(buf, sizeof(buf),
607  "7 0 obj\n"
608  "<<\n"
609  " /Ascent %d\n"
610  " /CapHeight %d\n"
611  " /Descent -1\n" // Spec says must be negative
612  " /Flags 5\n" // FixedPitch + Symbolic
613  " /FontBBox [ 0 0 %d %d ]\n"
614  " /FontFile2 %ld 0 R\n"
615  " /FontName /GlyphLessFont\n"
616  " /ItalicAngle 0\n"
617  " /StemV 80\n"
618  " /Type /FontDescriptor\n"
619  ">>\n"
620  "endobj\n",
621  1000 / kCharHeight,
622  1000 / kCharHeight,
623  1000 / kCharWidth,
624  1000 / kCharHeight,
625  8L // Font data
626  );
627  if (n >= sizeof(buf)) return false;
628  AppendPDFObject(buf);
629 
630  n = snprintf(buf, sizeof(buf), "%s/pdf.ttf", datadir_);
631  if (n >= sizeof(buf)) return false;
632  FILE *fp = fopen(buf, "rb");
633  if (!fp) {
634  tprintf("Can not open file \"%s\"!\n", buf);
635  return false;
636  }
637  fseek(fp, 0, SEEK_END);
638  long int size = ftell(fp);
639  fseek(fp, 0, SEEK_SET);
640  char *buffer = new char[size];
641  if (fread(buffer, 1, size, fp) != size) {
642  fclose(fp);
643  delete[] buffer;
644  return false;
645  }
646  fclose(fp);
647  // FONTFILE2
648  n = snprintf(buf, sizeof(buf),
649  "8 0 obj\n"
650  "<<\n"
651  " /Length %ld\n"
652  " /Length1 %ld\n"
653  ">>\n"
654  "stream\n", size, size);
655  if (n >= sizeof(buf)) {
656  delete[] buffer;
657  return false;
658  }
659  AppendString(buf);
660  objsize = strlen(buf);
661  AppendData(buffer, size);
662  delete[] buffer;
663  objsize += size;
664  AppendString(endstream_endobj);
665  objsize += strlen(endstream_endobj);
666  AppendPDFObjectDIY(objsize);
667  return true;
668 }
669 
670 bool TessPDFRenderer::imageToPDFObj(Pix *pix,
671  char *filename,
672  long int objnum,
673  char **pdf_object,
674  long int *pdf_object_size) {
675  size_t n;
676  char b0[kBasicBufSize];
677  char b1[kBasicBufSize];
678  char b2[kBasicBufSize];
679  if (!pdf_object_size || !pdf_object)
680  return false;
681  *pdf_object = NULL;
682  *pdf_object_size = 0;
683  if (!filename)
684  return false;
685 
686  L_COMP_DATA *cid = NULL;
687  const int kJpegQuality = 85;
688 
689  // TODO(jbreiden) Leptonica 1.71 doesn't correctly handle certain
690  // types of PNG files, especially if there are 2 samples per pixel.
691  // We can get rid of this logic after Leptonica 1.72 is released and
692  // has propagated everywhere. Bug discussion as follows.
693  // https://code.google.com/p/tesseract-ocr/issues/detail?id=1300
694  int format, sad;
695  findFileFormat(filename, &format);
696  if (pixGetSpp(pix) == 4 && format == IFF_PNG) {
697  pixSetSpp(pix, 3);
698  sad = pixGenerateCIData(pix, L_FLATE_ENCODE, 0, 0, &cid);
699  } else {
700  sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);
701  }
702 
703  if (sad || !cid) {
704  l_CIDataDestroy(&cid);
705  return false;
706  }
707 
708  const char *group4 = "";
709  const char *filter;
710  switch(cid->type) {
711  case L_FLATE_ENCODE:
712  filter = "/FlateDecode";
713  break;
714  case L_JPEG_ENCODE:
715  filter = "/DCTDecode";
716  break;
717  case L_G4_ENCODE:
718  filter = "/CCITTFaxDecode";
719  group4 = " /K -1\n";
720  break;
721  case L_JP2K_ENCODE:
722  filter = "/JPXDecode";
723  break;
724  default:
725  l_CIDataDestroy(&cid);
726  return false;
727  }
728 
729  // Maybe someday we will accept RGBA but today is not that day.
730  // It requires creating an /SMask for the alpha channel.
731  // http://stackoverflow.com/questions/14220221
732  const char *colorspace;
733  if (cid->ncolors > 0) {
734  n = snprintf(b0, sizeof(b0),
735  " /ColorSpace [ /Indexed /DeviceRGB %d %s ]\n",
736  cid->ncolors - 1, cid->cmapdatahex);
737  if (n >= sizeof(b0)) {
738  l_CIDataDestroy(&cid);
739  return false;
740  }
741  colorspace = b0;
742  } else {
743  switch (cid->spp) {
744  case 1:
745  colorspace = " /ColorSpace /DeviceGray\n";
746  break;
747  case 3:
748  colorspace = " /ColorSpace /DeviceRGB\n";
749  break;
750  default:
751  l_CIDataDestroy(&cid);
752  return false;
753  }
754  }
755 
756  int predictor = (cid->predictor) ? 14 : 1;
757 
758  // IMAGE
759  n = snprintf(b1, sizeof(b1),
760  "%ld 0 obj\n"
761  "<<\n"
762 " /Length %lu\n"
763  " /Subtype /Image\n",
764  objnum, (unsigned long) cid->nbytescomp);
765  if (n >= sizeof(b1)) {
766  l_CIDataDestroy(&cid);
767  return false;
768  }
769 
770  n = snprintf(b2, sizeof(b2),
771  " /Width %d\n"
772  " /Height %d\n"
773  " /BitsPerComponent %d\n"
774  " /Filter %s\n"
775  " /DecodeParms\n"
776  " <<\n"
777  " /Predictor %d\n"
778  " /Colors %d\n"
779  "%s"
780  " /Columns %d\n"
781  " /BitsPerComponent %d\n"
782  " >>\n"
783  ">>\n"
784  "stream\n",
785  cid->w, cid->h, cid->bps, filter, predictor, cid->spp,
786  group4, cid->w, cid->bps);
787  if (n >= sizeof(b2)) {
788  l_CIDataDestroy(&cid);
789  return false;
790  }
791 
792  const char *b3 =
793  "endstream\n"
794  "endobj\n";
795 
796  size_t b1_len = strlen(b1);
797  size_t b2_len = strlen(b2);
798  size_t b3_len = strlen(b3);
799  size_t colorspace_len = strlen(colorspace);
800 
801  *pdf_object_size =
802  b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
803  *pdf_object = new char[*pdf_object_size];
804 if (!*pdf_object) {
805  l_CIDataDestroy(&cid);
806  return false;
807  }
808 
809  char *p = *pdf_object;
810  memcpy(p, b1, b1_len);
811  p += b1_len;
812  memcpy(p, colorspace, colorspace_len);
813  p += colorspace_len;
814  memcpy(p, b2, b2_len);
815  p += b2_len;
816  memcpy(p, cid->datacomp, cid->nbytescomp);
817  p += cid->nbytescomp;
818  memcpy(p, b3, b3_len);
819  l_CIDataDestroy(&cid);
820  return true;
821 }
822 
823 bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
824  size_t n;
825  char buf[kBasicBufSize];
826  Pix *pix = api->GetInputImage();
827  char *filename = (char *)api->GetInputName();
828  int ppi = api->GetSourceYResolution();
829  if (!pix || ppi <= 0)
830  return false;
831  double width = pixGetWidth(pix) * 72.0 / ppi;
832  double height = pixGetHeight(pix) * 72.0 / ppi;
833 
834  // PAGE
835  n = snprintf(buf, sizeof(buf),
836  "%ld 0 obj\n"
837  "<<\n"
838  " /Type /Page\n"
839  " /Parent %ld 0 R\n"
840  " /MediaBox [0 0 %.2f %.2f]\n"
841  " /Contents %ld 0 R\n"
842  " /Resources\n"
843  " <<\n"
844  " /XObject << /Im1 %ld 0 R >>\n"
845  " /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
846  " /Font << /f-0-0 %ld 0 R >>\n"
847  " >>\n"
848  ">>\n"
849  "endobj\n",
850  obj_,
851  2L, // Pages object
852  width,
853  height,
854  obj_ + 1, // Contents object
855  obj_ + 2, // Image object
856  3L); // Type0 Font
857  if (n >= sizeof(buf)) return false;
858  pages_.push_back(obj_);
859  AppendPDFObject(buf);
860 
861  // CONTENTS
862  char* pdftext = GetPDFTextObjects(api, width, height);
863  long pdftext_len = strlen(pdftext);
864  unsigned char *pdftext_casted = reinterpret_cast<unsigned char *>(pdftext);
865  size_t len;
866  unsigned char *comp_pdftext =
867  zlibCompress(pdftext_casted, pdftext_len, &len);
868  long comp_pdftext_len = len;
869  n = snprintf(buf, sizeof(buf),
870  "%ld 0 obj\n"
871  "<<\n"
872  " /Length %ld /Filter /FlateDecode\n"
873  ">>\n"
874  "stream\n", obj_, comp_pdftext_len);
875  if (n >= sizeof(buf)) {
876  delete[] pdftext;
877  lept_free(comp_pdftext);
878  return false;
879  }
880  AppendString(buf);
881  long objsize = strlen(buf);
882  AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
883  objsize += comp_pdftext_len;
884  lept_free(comp_pdftext);
885  delete[] pdftext;
886  const char *b2 =
887  "endstream\n"
888  "endobj\n";
889  AppendString(b2);
890  objsize += strlen(b2);
891  AppendPDFObjectDIY(objsize);
892 
893  char *pdf_object;
894  if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
895  return false;
896  }
897  AppendData(pdf_object, objsize);
898  AppendPDFObjectDIY(objsize);
899  delete[] pdf_object;
900  return true;
901 }
902 
903 
904 bool TessPDFRenderer::EndDocumentHandler() {
905  size_t n;
906  char buf[kBasicBufSize];
907 
908  // We reserved the /Pages object number early, so that the /Page
909  // objects could refer to their parent. We finally have enough
910  // information to go fill it in. Using lower level calls to manipulate
911  // the offset record in two spots, because we are placing objects
912  // out of order in the file.
913 
914  // PAGES
915  const long int kPagesObjectNumber = 2;
916  offsets_[kPagesObjectNumber] = offsets_.back(); // manipulation #1
917  n = snprintf(buf, sizeof(buf),
918  "%ld 0 obj\n"
919  "<<\n"
920  " /Type /Pages\n"
921  " /Kids [ ", kPagesObjectNumber);
922  if (n >= sizeof(buf)) return false;
923  AppendString(buf);
924  size_t pages_objsize = strlen(buf);
925  for (size_t i = 0; i < pages_.size(); i++) {
926  n = snprintf(buf, sizeof(buf),
927  "%ld 0 R ", pages_[i]);
928  if (n >= sizeof(buf)) return false;
929  AppendString(buf);
930  pages_objsize += strlen(buf);
931  }
932  n = snprintf(buf, sizeof(buf),
933  "]\n"
934  " /Count %d\n"
935  ">>\n"
936  "endobj\n", pages_.size());
937  if (n >= sizeof(buf)) return false;
938  AppendString(buf);
939  pages_objsize += strlen(buf);
940  offsets_.back() += pages_objsize; // manipulation #2
941 
942  // INFO
943  char* datestr = l_getFormattedDate();
944  n = snprintf(buf, sizeof(buf),
945  "%ld 0 obj\n"
946  "<<\n"
947  " /Producer (Tesseract %s)\n"
948  " /CreationDate (D:%s)\n"
949 " /Title (%s)\n"
950  ">>\n"
951  "endobj\n", obj_, TESSERACT_VERSION_STR, datestr, title());
952  lept_free(datestr);
953  if (n >= sizeof(buf)) return false;
954  AppendPDFObject(buf);
955  n = snprintf(buf, sizeof(buf),
956  "xref\n"
957  "0 %ld\n"
958  "0000000000 65535 f \n", obj_);
959  if (n >= sizeof(buf)) return false;
960  AppendString(buf);
961  for (int i = 1; i < obj_; i++) {
962  n = snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offsets_[i]);
963  if (n >= sizeof(buf)) return false;
964  AppendString(buf);
965  }
966  n = snprintf(buf, sizeof(buf),
967  "trailer\n"
968  "<<\n"
969  " /Size %ld\n"
970  " /Root %ld 0 R\n"
971  " /Info %ld 0 R\n"
972  ">>\n"
973  "startxref\n"
974  "%ld\n"
975  "%%%%EOF\n",
976  obj_,
977  1L, // catalog
978  obj_ - 1, // info
979  offsets_.back());
980  if (n >= sizeof(buf)) return false;
981  AppendString(buf);
982  return true;
983 }
984 } // namespace tesseract
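// Editor's sketch (not part of the Tesseract source): EndDocumentHandler
// above writes a classic cross-reference table in which every entry is
// exactly 20 bytes: a 10-digit offset, a 5-digit generation number, a
// one-letter type and a two-character line ending. The helper below formats
// such a table from a list of object offsets. FormatXrefTable is a
// hypothetical name, and taking the offsets as a plain vector is an
// assumption of this sketch; the original tracks them in offsets_.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: offsets[i] is the byte offset of object i + 1.
// Object 0 is always the head of the free list.
std::string FormatXrefTable(const std::vector<long> &offsets) {
  char line[32];
  std::string out = "xref\n";
  std::snprintf(line, sizeof(line), "0 %ld\n",
                static_cast<long>(offsets.size() + 1));
  out += line;
  out += "0000000000 65535 f \n";
  for (std::size_t i = 0; i < offsets.size(); ++i) {
    assert(offsets[i] >= 0);  // offsets are positions within the file
    std::snprintf(line, sizeof(line), "%010ld 00000 n \n", offsets[i]);
    out += line;
  }
  return out;
}
```

// The trailer that follows the table then needs /Size set to the same
// count as the subsection header, and startxref set to the byte offset of
// the "xref" keyword.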