tesseract 3.04.01
// Include automatically generated configuration file if running autoconf.
#ifdef HAVE_CONFIG_H
#include "config_auto.h"
#endif

#include "baseapi.h"
#include "renderer.h"
#include "math.h"
#include "strngs.h"
#include "tprintf.h"
#include "allheaders.h"

#ifdef _MSC_VER
#include "mathfix.h"
#endif

/*

Design notes from Ken Sharp, with light editing.

We think one solution is a font with a single glyph (.notdef) and a
CIDToGIDMap which maps all the CIDs to 0. That map would then be
stored as a stream in the PDF file, and when flate compressed should
be pretty small. The font, of course, will be approximately the same
size as the one you currently use.

I'm working on such a font now, the CIDToGIDMap is trivial, you just
create a stream object which contains 128k bytes (2 bytes per possible
CID and your CIDs range from 0 to 65535) and where you currently have
"/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".

Note that if, in future, you were to use a different (i.e. not 2 byte)
CMap for character codes you could trivially extend the CIDToGIDMap.

The following is an explanation of how some of the font stuff works,
this may be too simple for you in which case please accept my
apologies, it's hard to know how much knowledge someone has. You can
skip all this anyway, it's just for information.

The font embedded in a PDF file is usually intended just to be
rendered, but extensions allow for at least some ability to locate (or
copy) text from a document. This isn't something which was an original
goal of the PDF format, but it's been retro-fitted, presumably due to
popular demand.

To do this reliably the PDF file must contain a ToUnicode CMap, a
device for mapping character codes to Unicode code points. If one of
these is present, then this will be used to convert the character
codes into Unicode values. If it's not present then the reader will
fall back through a series of heuristics to try and guess the
result. This is, as you would expect, prone to failure.

This doesn't concern you of course, since you always write a ToUnicode
CMap, so because you are writing the text in text rendering mode 3 it
would seem that you don't really need to worry about this, but in the
PDF spec you cannot have an isolated ToUnicode CMap, it has to be
attached to a font, so in order to get even copy/paste to work you
need to define a font.

This is what leads to problems, tools like pdfwrite assume that they
are going to be able to (or even have to) modify the font entries, so
they require that the font being embedded be valid, and to be honest
the font Tesseract embeds isn't valid (for this purpose).


To see why let's look at how text is specified in a PDF file:

  (Test) Tj

Now that looks like text but actually it isn't. Each of those bytes is
a 'character code'. When it comes to rendering the text a complex
sequence of events takes place, which converts the character code into
'something' which the font understands. It's entirely possible via
character mappings to have that text render as 'Sftu'.

For simple fonts (PostScript type 1), we use the character code as the
index into an Encoding array (256 elements), each element of which is
a glyph name, so this gives us a glyph name. We then consult the
CharStrings dictionary in the font, that's a complex object which
contains pairs of keys and values, you can use the key to retrieve a
given value.
So we have a glyph name, we then use that as the key to
the dictionary and retrieve the associated value. For a type 1 font,
the value is a glyph program that describes how to draw the glyph.

For CIDFonts, it's a little more complicated. Because CIDFonts can be
large, using a glyph name as the key is unreasonable (it would also
lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
as the key. CIDs are just numbers.

But.... We don't use the character code as the CID. What we do is use
a CMap to convert the character code into a CID. We then use the CID
to key the CharStrings dictionary and proceed as before. So the 'CMap'
is the equivalent of the Encoding array, but it's a more compact and
flexible representation.

Note that you have to use the CMap just to find out how many bytes
constitute a character code, and it can be variable. For example you
can say if the first byte is 0x00->0x7f then it's just one byte, if it's
0x80->0xf0 then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
have seen CMaps defining character codes up to 5 bytes wide.

Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
TrueType CIDFonts. The thing is that TrueType fonts are accessed using
a Glyph ID (GID) (and the LOCA table) which may well not be anything
like the CID. So for this case PDF includes a CIDToGIDMap. That maps
the CIDs to GIDs, and we can then use the GID to get the glyph
description from the GLYF table of the font.

So for a TrueType CIDFont, character-code->CID->GID->glyf-program.

Looking at the PDF file I was supplied with we see that it contains
text like :

  <0x0075> Tj

So we start by taking the character code (117) and look it up in the
CMap. Well you don't supply a CMap, you just use the Identity-H one
which is predefined.
So character code 117 maps to CID 117. Then we
use the CIDToGIDMap, again you don't supply one, you just use the
predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
were supplied with only contains 116 glyphs.

Now for Latin that's not a huge problem, you can just supply a bigger
font. But for more complex languages that *is* going to be more of a
problem. Either you need to supply a font which contains glyphs for
all the possible CID->GID mappings, or we need to think laterally.

Our solution using a TrueType CIDFont is to intervene at the
CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
font with just one glyph, the .notdef glyph at GID 0. This is what I'm
looking into now.

It would also be possible to have a 'PostScript' (i.e. type 1 outlines)
CIDFont which contained 1 glyph, and a CMap which mapped all character
codes to CID 0. The effect would be the same.

It's possible (I haven't checked) that the PostScript CIDFont and
associated CMap would be smaller than the TrueType font and associated
CIDToGIDMap.

--- in a followup ---

OK there is a small problem there, if I use GID 0 then Acrobat gets
upset about it and complains it cannot extract the font. If I set the
CIDToGIDMap so that all the entries are 1 instead, it's happy. Totally
mad......

*/

namespace tesseract {

// Use for PDF object fragments. Must be large enough
// to hold a colormap with 256 colors in the verbose
// PDF representation.
const int kBasicBufSize = 2048;

// If the font is 10 pts, nominal character width is 5 pts
const int kCharWidth = 2;

/**********************************************************************
 * PDF Renderer interface implementation
 **********************************************************************/

TessPDFRenderer::TessPDFRenderer(const char* outputbase, const char *datadir)
    : TessResultRenderer(outputbase, "pdf") {
  obj_ = 0;
  datadir_ = datadir;
  offsets_.push_back(0);
}

void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
  offsets_.push_back(objectsize + offsets_.back());
  obj_++;
}

void TessPDFRenderer::AppendPDFObject(const char *data) {
  AppendPDFObjectDIY(strlen(data));
  AppendString((const char *)data);
}

// Helper function to prevent us from accidentally writing
// scientific notation to an HOCR or PDF file. Besides, three
// decimal places are all you really need.
double prec(double x) {
  double kPrecision = 1000.0;
  double a = round(x * kPrecision) / kPrecision;
  if (a == -0)
    return 0;
  return a;
}

long dist2(int x1, int y1, int x2, int y2) {
  return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
}

// Viewers like evince can get really confused during copy-paste when
// the baseline wanders around. So I've decided to project every word
// onto the (straight) line baseline. All numbers are in the native
// PDF coordinate system, which has the origin in the bottom left and
// the unit is points, which is 1/72 inch. Tesseract reports baselines
// left-to-right no matter what the reading order is. We need the
// word baseline in reading order, so we do that conversion here. Returns
// the word's baseline origin and length.
void GetWordBaseline(int writing_direction, int ppi, int height,
                     int word_x1, int word_y1, int word_x2, int word_y2,
                     int line_x1, int line_y1, int line_x2, int line_y2,
                     double *x0, double *y0, double *length) {
  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
    Swap(&word_x1, &word_x2);
    Swap(&word_y1, &word_y2);
  }
  double word_length;
  double x, y;
  {
    int px = word_x1;
    int py = word_y1;
    double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
    if (l2 == 0) {
      x = line_x1;
      y = line_y1;
    } else {
      double t = ((px - line_x2) * (line_x2 - line_x1) +
                  (py - line_y2) * (line_y2 - line_y1)) / l2;
      x = line_x2 + t * (line_x2 - line_x1);
      y = line_y2 + t * (line_y2 - line_y1);
    }
    word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1,
                                                 word_x2, word_y2)));
    word_length = word_length * 72.0 / ppi;
    x = x * 72 / ppi;
    y = height - (y * 72.0 / ppi);
  }
  *x0 = x;
  *y0 = y;
  *length = word_length;
}

// Compute coefficients for an affine matrix describing the rotation
// of the text. If the text is right-to-left such as Arabic or Hebrew,
// we reflect over the Y-axis. This matrix will set the coordinate
// system for placing text in the PDF file.
//
//                           RTL
// [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
// [ y' ]   [ c d ][ y ]   [ 0 1 ] [-sin cos ][ y ]
void AffineMatrix(int writing_direction,
                  int line_x1, int line_y1, int line_x2, int line_y2,
                  double *a, double *b, double *c, double *d) {
  double theta = atan2(static_cast<double>(line_y1 - line_y2),
                       static_cast<double>(line_x2 - line_x1));
  *a = cos(theta);
  *b = sin(theta);
  *c = -sin(theta);
  *d = cos(theta);
  switch(writing_direction) {
    case WRITING_DIRECTION_RIGHT_TO_LEFT:
      *a = -*a;
      *b = -*b;
      break;
    case WRITING_DIRECTION_TOP_TO_BOTTOM:
      // TODO(jbreiden) Consider using the vertical PDF writing mode.
      break;
    default:
      break;
  }
}

// There are some really stupid PDF viewers in the wild, such as
// 'Preview' which ships with the Mac. They do a better job with text
// selection and highlighting when given perfectly flat baseline
// instead of very slightly tilted. We clip small tilts to appease
// these viewers. I chose this threshold large enough to absorb noise,
// but small enough that lines probably won't cross each other if the
// whole page is tilted at almost exactly the clipping threshold.
void ClipBaseline(int ppi, int x1, int y1, int x2, int y2,
                  int *line_x1, int *line_y1,
                  int *line_x2, int *line_y2) {
  *line_x1 = x1;
  *line_y1 = y1;
  *line_x2 = x2;
  *line_y2 = y2;
  // Use 72.0 so the conversion to points is done in floating point;
  // integer division would truncate before the threshold comparison.
  double rise = abs(y2 - y1) * 72.0 / ppi;
  double run = abs(x2 - x1) * 72.0 / ppi;
  if (rise < 2.0 && 2.0 < run)
    *line_y1 = *line_y2 = (y1 + y2) / 2;
}

char* TessPDFRenderer::GetPDFTextObjects(TessBaseAPI* api,
                                         double width, double height) {
  STRING pdf_str("");
  double ppi = api->GetSourceYResolution();

  // These initial conditions are all arbitrary and will be overwritten
  double old_x = 0.0, old_y = 0.0;
  int old_fontsize = 0;
  tesseract::WritingDirection old_writing_direction =
      WRITING_DIRECTION_LEFT_TO_RIGHT;
  bool new_block = true;
  int fontsize = 0;
  double a = 1;
  double b = 0;
  double c = 0;
  double d = 1;

  // TODO(jbreiden) This marries the text and image together.
  // Slightly cleaner from an abstraction standpoint if this were to
  // live inside a separate text object.
  pdf_str += "q ";
  pdf_str.add_str_double("", prec(width));
  pdf_str += " 0 0 ";
  pdf_str.add_str_double("", prec(height));
  pdf_str += " 0 0 cm /Im1 Do Q\n";

  int line_x1 = 0;
  int line_y1 = 0;
  int line_x2 = 0;
  int line_y2 = 0;

  ResultIterator *res_it = api->GetIterator();
  while (!res_it->Empty(RIL_BLOCK)) {
    if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
      pdf_str += "BT\n3 Tr";     // Begin text object, use invisible ink
      old_fontsize = 0;          // Every block will declare its fontsize
      new_block = true;          // Every block will declare its affine matrix
    }

    if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
      int x1, y1, x2, y2;
      res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
      ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
    }

    if (res_it->Empty(RIL_WORD)) {
      res_it->Next(RIL_WORD);
      continue;
    }

    // Writing direction changes at a per-word granularity
    tesseract::WritingDirection writing_direction;
    {
      tesseract::Orientation orientation;
      tesseract::TextlineOrder textline_order;
      float deskew_angle;
      res_it->Orientation(&orientation, &writing_direction,
                          &textline_order, &deskew_angle);
      if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
        switch (res_it->WordDirection()) {
          case DIR_LEFT_TO_RIGHT:
            writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
            break;
          case DIR_RIGHT_TO_LEFT:
            writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
            break;
          default:
            writing_direction = old_writing_direction;
        }
      }
    }

    // Where is word origin and how long is it?
    double x, y, word_length;
    {
      int word_x1, word_y1, word_x2, word_y2;
      res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
      GetWordBaseline(writing_direction, ppi, height,
                      word_x1, word_y1, word_x2, word_y2,
                      line_x1, line_y1, line_x2, line_y2,
                      &x, &y, &word_length);
    }

    if (writing_direction != old_writing_direction || new_block) {
      AffineMatrix(writing_direction,
                   line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
      pdf_str.add_str_double(" ", prec(a));  // . This affine matrix
      pdf_str.add_str_double(" ", prec(b));  // . sets the coordinate
      pdf_str.add_str_double(" ", prec(c));  // . system for all
      pdf_str.add_str_double(" ", prec(d));  // . text that follows.
      pdf_str.add_str_double(" ", prec(x));  // .
      pdf_str.add_str_double(" ", prec(y));  // .
      pdf_str += (" Tm ");                   // Place cursor absolutely
      new_block = false;
    } else {
      double dx = x - old_x;
      double dy = y - old_y;
      pdf_str.add_str_double(" ", prec(dx * a + dy * b));
      pdf_str.add_str_double(" ", prec(dx * c + dy * d));
      pdf_str += (" Td ");                   // Relative moveto
    }
    old_x = x;
    old_y = y;
    old_writing_direction = writing_direction;

    // Adjust font size on a per word granularity. Pay attention to
    // fontsize, old_fontsize, and pdf_str. We've found that for
    // Arabic, Tesseract will happily return a fontsize of zero,
    // so we make up a default number to protect ourselves.
    {
      bool bold, italic, underlined, monospace, serif, smallcaps;
      int font_id;
      res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace,
                                 &serif, &smallcaps, &fontsize, &font_id);
      const int kDefaultFontsize = 8;
      if (fontsize <= 0)
        fontsize = kDefaultFontsize;
      if (fontsize != old_fontsize) {
        char textfont[20];
        snprintf(textfont, sizeof(textfont), "/f-0-0 %d Tf ", fontsize);
        pdf_str += textfont;
        old_fontsize = fontsize;
      }
    }

    bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
    bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
    STRING pdf_word("");
    int pdf_word_len = 0;
    do {
      const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
      if (grapheme && grapheme[0] != '\0') {
        GenericVector<int> unicodes;
        UNICHAR::UTF8ToUnicode(grapheme, &unicodes);
        char utf16[20];
        for (int i = 0; i < unicodes.length(); i++) {
          int code = unicodes[i];
          // Convert to UTF-16BE https://en.wikipedia.org/wiki/UTF-16
          if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
            tprintf("Dropping invalid codepoint %d\n", code);
            continue;
          }
          if (code < 0x10000) {
            snprintf(utf16, sizeof(utf16), "<%04X>", code);
          } else {
            int a = code - 0x010000;
            int high_surrogate = (0x03FF & (a >> 10)) + 0xD800;
            int low_surrogate = (0x03FF & a) + 0xDC00;
            snprintf(utf16, sizeof(utf16), "<%04X%04X>",
                     high_surrogate, low_surrogate);
          }
          pdf_word += utf16;
          pdf_word_len++;
        }
      }
      delete []grapheme;
      res_it->Next(RIL_SYMBOL);
    } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
    if (word_length > 0 && pdf_word_len > 0 && fontsize > 0) {
      double h_stretch =
          kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
      pdf_str.add_str_double("",
                             h_stretch);
      pdf_str += " Tz";          // horizontal stretch
      pdf_str += " [ ";
      pdf_str += pdf_word;       // UTF-16BE representation
      pdf_str += " ] TJ";        // show the text
    }
    if (last_word_in_line) {
      pdf_str += " \n";
    }
    if (last_word_in_block) {
      pdf_str += "ET\n";         // end the text object
    }
  }
  char *ret = new char[pdf_str.length() + 1];
  strcpy(ret, pdf_str.string());
  delete res_it;
  return ret;
}

bool TessPDFRenderer::BeginDocumentHandler() {
  char buf[kBasicBufSize];
  size_t n;

  n = snprintf(buf, sizeof(buf),
               "%%PDF-1.5\n"
               "%%%c%c%c%c\n",
               0xDE, 0xAD, 0xBE, 0xEB);
  if (n >= sizeof(buf)) return false;
  AppendPDFObject(buf);

  // CATALOG
  n = snprintf(buf, sizeof(buf),
               "1 0 obj\n"
               "<<\n"
               " /Type /Catalog\n"
               " /Pages %ld 0 R\n"
               ">>\n"
               "endobj\n",
               2L);
  if (n >= sizeof(buf)) return false;
  AppendPDFObject(buf);

  // We are reserving object #2 for the /Pages
  // object, which I am going to create and write
  // at the end of the PDF file.
  AppendPDFObject("");

  // TYPE0 FONT
  n = snprintf(buf, sizeof(buf),
               "3 0 obj\n"
               "<<\n"
               " /BaseFont /GlyphLessFont\n"
               " /DescendantFonts [ %ld 0 R ]\n"
               " /Encoding /Identity-H\n"
               " /Subtype /Type0\n"
               " /ToUnicode %ld 0 R\n"
               " /Type /Font\n"
               ">>\n"
               "endobj\n",
               4L,          // CIDFontType2 font
               6L           // ToUnicode
               );
  if (n >= sizeof(buf)) return false;
  AppendPDFObject(buf);

  // CIDFONTTYPE2
  n = snprintf(buf, sizeof(buf),
               "4 0 obj\n"
               "<<\n"
               " /BaseFont /GlyphLessFont\n"
               " /CIDToGIDMap %ld 0 R\n"
               " /CIDSystemInfo\n"
               " <<\n"
               " /Ordering (Identity)\n"
               " /Registry (Adobe)\n"
               " /Supplement 0\n"
               " >>\n"
               " /FontDescriptor %ld 0 R\n"
               " /Subtype /CIDFontType2\n"
               " /Type /Font\n"
               " /DW %d\n"
               ">>\n"
               "endobj\n",
               5L,         // CIDToGIDMap
               7L,         // Font descriptor
               1000 / kCharWidth);
  if (n >= sizeof(buf)) return false;
  AppendPDFObject(buf);

  // CIDTOGIDMAP
  const int kCIDToGIDMapSize = 2 * (1 << 16);
  unsigned char *cidtogidmap = new unsigned char[kCIDToGIDMapSize];
  for (int i = 0; i < kCIDToGIDMapSize; i++) {
    cidtogidmap[i] = (i % 2) ?
        1 : 0;
  }
  size_t len;
  unsigned char *comp =
      zlibCompress(cidtogidmap, kCIDToGIDMapSize, &len);
  delete[] cidtogidmap;
  n = snprintf(buf, sizeof(buf),
               "5 0 obj\n"
               "<<\n"
               " /Length %lu /Filter /FlateDecode\n"
               ">>\n"
               "stream\n", (unsigned long)len);
  if (n >= sizeof(buf)) {
    lept_free(comp);
    return false;
  }
  AppendString(buf);
  long objsize = strlen(buf);
  AppendData(reinterpret_cast<char *>(comp), len);
  objsize += len;
  lept_free(comp);
  const char *endstream_endobj =
      "endstream\n"
      "endobj\n";
  AppendString(endstream_endobj);
  objsize += strlen(endstream_endobj);
  AppendPDFObjectDIY(objsize);

  const char *stream =
      "/CIDInit /ProcSet findresource begin\n"
      "12 dict begin\n"
      "begincmap\n"
      "/CIDSystemInfo\n"
      "<<\n"
      " /Registry (Adobe)\n"
      " /Ordering (UCS)\n"
      " /Supplement 0\n"
      ">> def\n"
      "/CMapName /Adobe-Identify-UCS def\n"
      "/CMapType 2 def\n"
      "1 begincodespacerange\n"
      "<0000> <FFFF>\n"
      "endcodespacerange\n"
      "1 beginbfrange\n"
      "<0000> <FFFF> <0000>\n"
      "endbfrange\n"
      "endcmap\n"
      "CMapName currentdict /CMap defineresource pop\n"
      "end\n"
      "end\n";

  // TOUNICODE
  n = snprintf(buf, sizeof(buf),
               "6 0 obj\n"
               "<< /Length %lu >>\n"
               "stream\n"
               "%s"
               "endstream\n"
               "endobj\n", (unsigned long) strlen(stream), stream);
  if (n >= sizeof(buf)) return false;
  AppendPDFObject(buf);

  // FONT DESCRIPTOR
  const int kCharHeight = 2;  // Effect: highlights are half height
  n = snprintf(buf, sizeof(buf),
               "7 0 obj\n"
               "<<\n"
               " /Ascent %d\n"
               " /CapHeight %d\n"
               " /Descent -1\n"       // Spec says must be negative
               " /Flags 5\n"          // FixedPitch + Symbolic
               " /FontBBox [ 0 0 %d %d ]\n"
               " /FontFile2 %ld 0 R\n"
               " /FontName /GlyphLessFont\n"
               " /ItalicAngle 0\n"
               " /StemV 80\n"
               " /Type /FontDescriptor\n"
               ">>\n"
               "endobj\n",
               1000 / kCharHeight,
               1000 / kCharHeight,
               1000 / kCharWidth,
               1000 / kCharHeight,
               8L      // Font data
               );
  if (n >= sizeof(buf)) return false;
  AppendPDFObject(buf);

  n = snprintf(buf, sizeof(buf), "%s/pdf.ttf", datadir_);
  if (n >= sizeof(buf)) return false;
  FILE *fp = fopen(buf, "rb");
  if (!fp) {
    tprintf("Can not open file \"%s\"!\n", buf);
    return false;
  }
  fseek(fp, 0, SEEK_END);
  long int size = ftell(fp);
  fseek(fp, 0, SEEK_SET);
  char *buffer = new char[size];
  if (fread(buffer, 1, size, fp) != size) {
    fclose(fp);
    delete[] buffer;
    return false;
  }
  fclose(fp);
  // FONTFILE2
  n = snprintf(buf, sizeof(buf),
               "8 0 obj\n"
               "<<\n"
               " /Length %ld\n"
               " /Length1 %ld\n"
               ">>\n"
               "stream\n", size, size);
  if (n >= sizeof(buf)) {
    delete[] buffer;
    return false;
  }
  AppendString(buf);
  objsize = strlen(buf);
  AppendData(buffer, size);
  delete[] buffer;
  objsize += size;
  AppendString(endstream_endobj);
  objsize += strlen(endstream_endobj);
  AppendPDFObjectDIY(objsize);
  return true;
}

bool TessPDFRenderer::imageToPDFObj(Pix *pix,
                                    char *filename,
                                    long int objnum,
                                    char **pdf_object,
                                    long int *pdf_object_size) {
  size_t n;
  char b0[kBasicBufSize];
  char b1[kBasicBufSize];
  char b2[kBasicBufSize];
  if (!pdf_object_size || !pdf_object)
    return false;
  *pdf_object = NULL;
  *pdf_object_size = 0;
  if (!filename)
    return false;

  L_COMP_DATA *cid = NULL;
  const int kJpegQuality = 85;

  // TODO(jbreiden) Leptonica 1.71 doesn't correctly handle certain
  // types of PNG files,
  // especially if there are 2 samples per pixel.
  // We can get rid of this logic after Leptonica 1.72 is released and
  // has propagated everywhere. Bug discussion as follows.
  // https://code.google.com/p/tesseract-ocr/issues/detail?id=1300
  int format, sad;
  findFileFormat(filename, &format);
  if (pixGetSpp(pix) == 4 && format == IFF_PNG) {
    pixSetSpp(pix, 3);
    sad = pixGenerateCIData(pix, L_FLATE_ENCODE, 0, 0, &cid);
  } else {
    sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);
  }

  if (sad || !cid) {
    l_CIDataDestroy(&cid);
    return false;
  }

  const char *group4 = "";
  const char *filter;
  switch(cid->type) {
    case L_FLATE_ENCODE:
      filter = "/FlateDecode";
      break;
    case L_JPEG_ENCODE:
      filter = "/DCTDecode";
      break;
    case L_G4_ENCODE:
      filter = "/CCITTFaxDecode";
      group4 = " /K -1\n";
      break;
    case L_JP2K_ENCODE:
      filter = "/JPXDecode";
      break;
    default:
      l_CIDataDestroy(&cid);
      return false;
  }

  // Maybe someday we will accept RGBA but today is not that day.
  // It requires creating an /SMask for the alpha channel.
  // http://stackoverflow.com/questions/14220221
  const char *colorspace;
  if (cid->ncolors > 0) {
    n = snprintf(b0, sizeof(b0),
                 " /ColorSpace [ /Indexed /DeviceRGB %d %s ]\n",
                 cid->ncolors - 1, cid->cmapdatahex);
    if (n >= sizeof(b0)) {
      l_CIDataDestroy(&cid);
      return false;
    }
    colorspace = b0;
  } else {
    switch (cid->spp) {
      case 1:
        colorspace = " /ColorSpace /DeviceGray\n";
        break;
      case 3:
        colorspace = " /ColorSpace /DeviceRGB\n";
        break;
      default:
        l_CIDataDestroy(&cid);
        return false;
    }
  }

  int predictor = (cid->predictor) ?
      14 : 1;

  // IMAGE
  // The length is passed as unsigned long, so it is formatted with %lu.
  n = snprintf(b1, sizeof(b1),
               "%ld 0 obj\n"
               "<<\n"
               " /Length %lu\n"
               " /Subtype /Image\n",
               objnum, (unsigned long) cid->nbytescomp);
  if (n >= sizeof(b1)) {
    l_CIDataDestroy(&cid);
    return false;
  }

  n = snprintf(b2, sizeof(b2),
               " /Width %d\n"
               " /Height %d\n"
               " /BitsPerComponent %d\n"
               " /Filter %s\n"
               " /DecodeParms\n"
               " <<\n"
               " /Predictor %d\n"
               " /Colors %d\n"
               "%s"
               " /Columns %d\n"
               " /BitsPerComponent %d\n"
               " >>\n"
               ">>\n"
               "stream\n",
               cid->w, cid->h, cid->bps, filter, predictor, cid->spp,
               group4, cid->w, cid->bps);
  if (n >= sizeof(b2)) {
    l_CIDataDestroy(&cid);
    return false;
  }

  const char *b3 =
      "endstream\n"
      "endobj\n";

  size_t b1_len = strlen(b1);
  size_t b2_len = strlen(b2);
  size_t b3_len = strlen(b3);
  size_t colorspace_len = strlen(colorspace);

  *pdf_object_size =
      b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
  *pdf_object = new char[*pdf_object_size];
  // Check the newly allocated buffer, not the out-parameter itself
  // (which was already validated at the top of the function).
  if (!*pdf_object) {
    l_CIDataDestroy(&cid);
    return false;
  }

  char *p = *pdf_object;
  memcpy(p, b1, b1_len);
  p += b1_len;
  memcpy(p, colorspace, colorspace_len);
  p += colorspace_len;
  memcpy(p, b2, b2_len);
  p += b2_len;
  memcpy(p, cid->datacomp, cid->nbytescomp);
  p += cid->nbytescomp;
  memcpy(p, b3, b3_len);
  l_CIDataDestroy(&cid);
  return true;
}

bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
  size_t n;
  char buf[kBasicBufSize];
  Pix *pix = api->GetInputImage();
  char *filename = (char *)api->GetInputName();
  int ppi = api->GetSourceYResolution();
  if (!pix || ppi <= 0)
    return false;
  double width = pixGetWidth(pix) * 72.0 / ppi;
  double height =
      pixGetHeight(pix) * 72.0 / ppi;

  // PAGE
  n = snprintf(buf, sizeof(buf),
               "%ld 0 obj\n"
               "<<\n"
               " /Type /Page\n"
               " /Parent %ld 0 R\n"
               " /MediaBox [0 0 %.2f %.2f]\n"
               " /Contents %ld 0 R\n"
               " /Resources\n"
               " <<\n"
               " /XObject << /Im1 %ld 0 R >>\n"
               " /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
               " /Font << /f-0-0 %ld 0 R >>\n"
               " >>\n"
               ">>\n"
               "endobj\n",
               obj_,
               2L,            // Pages object
               width,
               height,
               obj_ + 1,      // Contents object
               obj_ + 2,      // Image object
               3L);           // Type0 Font
  if (n >= sizeof(buf)) return false;
  pages_.push_back(obj_);
  AppendPDFObject(buf);

  // CONTENTS
  char* pdftext = GetPDFTextObjects(api, width, height);
  long pdftext_len = strlen(pdftext);
  unsigned char *pdftext_casted = reinterpret_cast<unsigned char *>(pdftext);
  size_t len;
  unsigned char *comp_pdftext =
      zlibCompress(pdftext_casted, pdftext_len, &len);
  long comp_pdftext_len = len;
  n = snprintf(buf, sizeof(buf),
               "%ld 0 obj\n"
               "<<\n"
               " /Length %ld /Filter /FlateDecode\n"
               ">>\n"
               "stream\n", obj_, comp_pdftext_len);
  if (n >= sizeof(buf)) {
    delete[] pdftext;
    lept_free(comp_pdftext);
    return false;
  }
  AppendString(buf);
  long objsize = strlen(buf);
  AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
  objsize += comp_pdftext_len;
  lept_free(comp_pdftext);
  delete[] pdftext;
  const char *b2 =
      "endstream\n"
      "endobj\n";
  AppendString(b2);
  objsize += strlen(b2);
  AppendPDFObjectDIY(objsize);

  char *pdf_object;
  if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
    return false;
  }
  AppendData(pdf_object, objsize);
  AppendPDFObjectDIY(objsize);
  delete[] pdf_object;
  return true;
}

bool TessPDFRenderer::EndDocumentHandler() {
  size_t n;
  char buf[kBasicBufSize];

  // We reserved the /Pages object number early, so that the /Page
  // objects could refer to their parent. We finally have enough
  // information to go fill it in. Using lower level calls to manipulate
  // the offset record in two spots, because we are placing objects
  // out of order in the file.

  // PAGES
  const long int kPagesObjectNumber = 2;
  offsets_[kPagesObjectNumber] = offsets_.back();  // manipulation #1
  n = snprintf(buf, sizeof(buf),
               "%ld 0 obj\n"
               "<<\n"
               " /Type /Pages\n"
               " /Kids [ ", kPagesObjectNumber);
  if (n >= sizeof(buf)) return false;
  AppendString(buf);
  size_t pages_objsize = strlen(buf);
  for (size_t i = 0; i < pages_.size(); i++) {
    n = snprintf(buf, sizeof(buf),
                 "%ld 0 R ", pages_[i]);
    if (n >= sizeof(buf)) return false;
    AppendString(buf);
    pages_objsize += strlen(buf);
  }
  n = snprintf(buf, sizeof(buf),
               "]\n"
               " /Count %d\n"
               ">>\n"
               "endobj\n", pages_.size());
  if (n >= sizeof(buf)) return false;
  AppendString(buf);
  pages_objsize += strlen(buf);
  offsets_.back() += pages_objsize;    // manipulation #2

  // INFO
  char* datestr = l_getFormattedDate();
  n = snprintf(buf, sizeof(buf),
               "%ld 0 obj\n"
               "<<\n"
               " /Producer (Tesseract %s)\n"
               " /CreationDate (D:%s)\n"
               " /Title (%s)\n"
               ">>\n"
               "endobj\n", obj_, TESSERACT_VERSION_STR, datestr, title());
  lept_free(datestr);
  if (n >= sizeof(buf)) return false;
  AppendPDFObject(buf);
  n = snprintf(buf, sizeof(buf),
               "xref\n"
               "0 %ld\n"
               "0000000000 65535 f \n", obj_);
  if (n >= sizeof(buf)) return false;
  AppendString(buf);
  for (int i = 1; i < obj_; i++) {
    n = snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offsets_[i]);
    if (n >= sizeof(buf)) return false;
    AppendString(buf);
  }
  n = snprintf(buf, sizeof(buf),
               "trailer\n"
               "<<\n"
               " /Size %ld\n"
               " /Root %ld 0 R\n"
               " /Info %ld 0 R\n"
               ">>\n"
               "startxref\n"
               "%ld\n"
               "%%%%EOF\n",
               obj_,
               1L,               // catalog
               obj_ - 1,         // info
               offsets_.back());
  if (n >= sizeof(buf)) return false;
  AppendString(buf);
  return true;
}
}  // namespace tesseract
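
The renderer above writes each recognized code point into the content stream as a hex string holding its UTF-16BE encoding, splitting anything above U+FFFF into a surrogate pair. The standalone sketch below isolates that one step so it can be tried outside Tesseract. It is an illustration under stated assumptions, not the renderer's own code: the helper name `EncodeUtf16BeHex` is hypothetical, it uses `std::string` in place of Tesseract's `STRING`, and it returns an empty string for invalid code points where the renderer logs and skips them.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper, not part of Tesseract: formats one Unicode code
// point as a PDF hex string containing its UTF-16BE encoding.
// Returns an empty string for surrogates and out-of-range values.
std::string EncodeUtf16BeHex(int code) {
  if (code < 0 || (code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
    return std::string();
  }
  char buf[16];
  if (code < 0x10000) {
    // Basic Multilingual Plane: a single 16-bit unit.
    std::snprintf(buf, sizeof(buf), "<%04X>", static_cast<unsigned>(code));
  } else {
    // Supplementary planes: split the 20-bit offset into two 10-bit halves.
    int offset = code - 0x10000;
    unsigned high_surrogate = 0xD800u + ((offset >> 10) & 0x03FF);
    unsigned low_surrogate = 0xDC00u + (offset & 0x03FF);
    std::snprintf(buf, sizeof(buf), "<%04X%04X>",
                  high_surrogate, low_surrogate);
  }
  return std::string(buf);
}
```

With this sketch, `EncodeUtf16BeHex(0x0075)` gives `<0075>` (the character code 117 discussed in the design notes), and `EncodeUtf16BeHex(0x1F600)` gives `<D83DDE00>`. A lone surrogate such as `0xD800` yields an empty string.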