pdfmm 0.9.20
mm::PdfTokenizer Class Reference

#include <PdfTokenizer.h>

Inherited by mm::PdfPostScriptTokenizer.

Public Member Functions

bool TryReadNextToken (PdfInputDevice &device, std::string_view &token)
 
bool IsNextToken (PdfInputDevice &device, const std::string_view &token)
 
int64_t ReadNextNumber (PdfInputDevice &device)
 
void ReadNextVariant (PdfInputDevice &device, PdfVariant &variant, PdfEncrypt *encrypt=nullptr)
 

Static Public Member Functions

static bool IsWhitespace (unsigned char ch)
 
static bool IsDelimiter (unsigned char ch)
 
static bool IsTokenDelimiter (unsigned char ch, PdfTokenType &tokenType)
 
static bool IsRegular (unsigned char ch)
 
static bool IsPrintable (unsigned char ch)
 
static int GetHexValue (unsigned char ch)
 

Static Public Attributes

static constexpr unsigned HEX_NOT_FOUND = std::numeric_limits<unsigned>::max()
 

Protected Member Functions

void ReadNextVariant (PdfInputDevice &device, const std::string_view &token, PdfTokenType tokenType, PdfVariant &variant, PdfEncrypt *encrypt)
 
void EnqueueToken (const std::string_view &token, PdfTokenType type)
 
void ReadDictionary (PdfInputDevice &device, PdfVariant &variant, PdfEncrypt *encrypt)
 
void ReadArray (PdfInputDevice &device, PdfVariant &variant, PdfEncrypt *encrypt)
 
void ReadString (PdfInputDevice &device, PdfVariant &variant, PdfEncrypt *encrypt)
 
void ReadHexString (PdfInputDevice &device, PdfVariant &variant, PdfEncrypt *encrypt)
 
void ReadName (PdfInputDevice &device, PdfVariant &variant)
 
PdfLiteralDataType DetermineDataType (PdfInputDevice &device, const std::string_view &token, PdfTokenType tokenType, PdfVariant &variant)
 

Detailed Description

A simple tokenizer for PDF files and PDF content streams.

Member Function Documentation

◆ DetermineDataType()

PdfTokenizer::PdfLiteralDataType PdfTokenizer::DetermineDataType ( PdfInputDevice &  device,
const std::string_view &  token,
PdfTokenType  tokenType,
PdfVariant &  variant
)
protected

Determine the possible data type of a token. Integers, reals, booleans and null values are parsed directly by this function and stored in the variant.

Returns
the expected datatype
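
As an illustration of the classification described above, the following is a minimal standalone sketch, not pdfmm's implementation. `LiteralKind`, `LiteralValue` and `ClassifyLiteral` are hypothetical names standing in for `PdfLiteralDataType`, `PdfVariant` and `DetermineDataType`, and the number grammar assumed here is the PDF one (optional sign, digits, at most one decimal point, no exponent).

```cpp
#include <charconv>
#include <cstdint>
#include <string_view>
#include <system_error>

// Hypothetical stand-ins; these are not pdfmm types.
enum class LiteralKind { Unknown, Bool, Null, Integer, Real };

struct LiteralValue
{
    bool boolean = false;
    std::int64_t integer = 0;
    double real = 0.0;
};

// Classify a bare token and store the parsed value for the simple kinds.
// Tokens that are not bool/null/number are reported as Unknown.
inline LiteralKind ClassifyLiteral(std::string_view token, LiteralValue& out)
{
    if (token == "null")
        return LiteralKind::Null;
    if (token == "true" || token == "false")
    {
        out.boolean = (token == "true");
        return LiteralKind::Bool;
    }
    if (token.empty())
        return LiteralKind::Unknown;

    const bool negative = token[0] == '-';
    std::string_view digits = token;
    if (token[0] == '+' || token[0] == '-')
        digits.remove_prefix(1);

    // Validate the shape and accumulate a mantissa/divisor pair for reals.
    bool sawDigit = false;
    bool sawDot = false;
    double mantissa = 0.0;
    double divisor = 1.0;
    for (char c : digits)
    {
        if (c >= '0' && c <= '9')
        {
            sawDigit = true;
            mantissa = mantissa * 10.0 + (c - '0');
            if (sawDot)
                divisor *= 10.0;
        }
        else if (c == '.' && !sawDot)
            sawDot = true;
        else
            return LiteralKind::Unknown;
    }
    if (!sawDigit)
        return LiteralKind::Unknown;

    if (!sawDot)
    {
        // std::from_chars accepts '-' but not '+', so strip a leading '+'.
        std::string_view text = (token[0] == '+') ? token.substr(1) : token;
        std::int64_t value = 0;
        auto res = std::from_chars(text.data(), text.data() + text.size(), value);
        if (res.ec == std::errc{})
        {
            out.integer = value;
            return LiteralKind::Integer;
        }
        // On overflow, fall through and keep the value as a real.
    }
    out.real = (negative ? -1.0 : 1.0) * mantissa / divisor;
    return LiteralKind::Real;
}
```

In this sketch every other token, such as a name, a string or an indirect reference, comes back as `Unknown` and would be handled by the dedicated `Read...` functions.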

◆ EnqueueToken()

void PdfTokenizer::EnqueueToken ( const std::string_view &  token,
PdfTokenType  type 
)
protected

Add a token to the queue of tokens. TryReadNextToken() will return all enqueued tokens first before reading new tokens from the input device.

Parameters
token    string of the token
type    type of the token
See also
TryReadNextToken
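
The queue-first behaviour can be pictured with a small standalone sketch. `QueuedTokenSource` and its `TokenType` enum are hypothetical and only model the ordering guarantee; a `std::deque` of strings stands in for the input device.

```cpp
#include <deque>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical token type; not pdfmm's PdfTokenType.
enum class TokenType { Unknown, Literal, Delimiter };

class QueuedTokenSource
{
public:
    // 'fresh' stands in for tokens that would be read from the input device.
    explicit QueuedTokenSource(std::deque<std::string> fresh)
        : m_fresh(std::move(fresh)) {}

    // Push a token back so that it is returned before any fresh token.
    void Enqueue(std::string_view token, TokenType type)
    {
        m_queue.emplace_back(std::string(token), type);
    }

    // Drain the queue first, then fall back to the "device".
    bool TryReadNext(std::string& token, TokenType& type)
    {
        if (!m_queue.empty())
        {
            token = std::move(m_queue.front().first);
            type = m_queue.front().second;
            m_queue.pop_front();
            return true;
        }
        if (m_fresh.empty())
            return false;
        token = std::move(m_fresh.front());
        m_fresh.pop_front();
        type = TokenType::Literal;
        return true;
    }

private:
    std::deque<std::pair<std::string, TokenType>> m_queue;
    std::deque<std::string> m_fresh;
};
```

A typical use of such a queue is lookahead: a parser reads a few tokens to decide what it is looking at, then enqueues them again so that the next reader sees them in the original order.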

◆ GetHexValue()

int PdfTokenizer::GetHexValue ( unsigned char  ch)
static

Look up the numeric value of a hex digit character (0-9, A-F, a-f) in a static map.

Parameters
ch    hex character
Returns
the hex value, or HEX_NOT_FOUND if ch is not a hex digit
See also
HEX_NOT_FOUND
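
A static lookup table of this kind can be sketched as follows. This is an assumption about the general technique, not a copy of pdfmm's table; `kHexNotFound` and `HexValue` are hypothetical names, and the sketch returns `unsigned` so that the sentinel compares cleanly.

```cpp
#include <array>
#include <limits>

// Sentinel mirroring the documented HEX_NOT_FOUND value.
constexpr unsigned kHexNotFound = std::numeric_limits<unsigned>::max();

// Map a character to its hex digit value using a 256-entry table that is
// built once, on first use.
inline unsigned HexValue(unsigned char ch)
{
    static const std::array<unsigned, 256> table = [] {
        std::array<unsigned, 256> t{};
        t.fill(kHexNotFound);
        for (unsigned i = 0; i < 10; i++)
            t['0' + i] = i;
        for (unsigned i = 0; i < 6; i++)
        {
            t['A' + i] = 10 + i;
            t['a' + i] = 10 + i;
        }
        return t;
    }();
    return table[ch];
}
```

Indexing by `unsigned char` keeps the lookup branch-free and makes every byte value a valid index.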

◆ IsDelimiter()

bool PdfTokenizer::IsDelimiter ( unsigned char  ch)
static

Returns true if the given character is a delimiter according to the PDF reference.

◆ IsNextToken()

bool PdfTokenizer::IsNextToken ( PdfInputDevice &  device,
const std::string_view &  token
)

Reads the next token from the current file position, ignoring all comments, and compares it to the passed token.

If there is no next token available, throws UnexpectedEOF.

Parameters
token    a token that is compared to the read token
Returns
true if the read token equals the passed token.

◆ IsPrintable()

bool PdfTokenizer::IsPrintable ( unsigned char  ch)
static

True if the passed character is within the generally accepted "printable" ASCII range.

◆ IsRegular()

bool PdfTokenizer::IsRegular ( unsigned char  ch)
static

True if the passed character is a regular character according to the PDF reference (Section 3.1.1, Character Set); i.e. it is neither a white-space nor a delimiter character.
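
The three character classes of the PDF reference partition the byte range, which the following standalone sketch makes explicit. The function names are hypothetical; the character sets are the ones listed in the PDF reference (white-space: NUL, HT, LF, FF, CR, SP; delimiters: `( ) < > [ ] { } / %`).

```cpp
// White-space characters according to the PDF reference.
constexpr bool IsPdfWhitespace(unsigned char ch)
{
    return ch == 0x00 || ch == 0x09 || ch == 0x0A
        || ch == 0x0C || ch == 0x0D || ch == 0x20;
}

// Delimiter characters according to the PDF reference.
constexpr bool IsPdfDelimiter(unsigned char ch)
{
    return ch == '(' || ch == ')' || ch == '<' || ch == '>' || ch == '['
        || ch == ']' || ch == '{' || ch == '}' || ch == '/' || ch == '%';
}

// Everything else is a regular character.
constexpr bool IsPdfRegular(unsigned char ch)
{
    return !IsPdfWhitespace(ch) && !IsPdfDelimiter(ch);
}
```

Note that vertical tab (0x0B) is not in the white-space set, so under these definitions it counts as a regular character.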

◆ IsTokenDelimiter()

bool PdfTokenizer::IsTokenDelimiter ( unsigned char  ch,
PdfTokenType &  tokenType 
)
static

Returns true if the given character is a token delimiter.

◆ IsWhitespace()

bool PdfTokenizer::IsWhitespace ( unsigned char  ch)
static

Returns true if the given character is a whitespace character according to the PDF reference.

Returns
true if it is a whitespace character, otherwise false

◆ ReadArray()

void PdfTokenizer::ReadArray ( PdfInputDevice &  device,
PdfVariant &  variant,
PdfEncrypt *  encrypt
)
protected

Read an array from the input device and store it into a variant.

Parameters
variant    store the array into this variable
encrypt    an encryption object which is used to decrypt strings during parsing

◆ ReadDictionary()

void PdfTokenizer::ReadDictionary ( PdfInputDevice &  device,
PdfVariant &  variant,
PdfEncrypt *  encrypt
)
protected

Read a dictionary from the input device and store it into a variant.

Parameters
variant    store the dictionary into this variable
encrypt    an encryption object which is used to decrypt strings during parsing

◆ ReadHexString()

void PdfTokenizer::ReadHexString ( PdfInputDevice &  device,
PdfVariant &  variant,
PdfEncrypt *  encrypt
)
protected

Read a hex string from the input device and store it into a variant.

Parameters
variant    store the hex string into this variable
encrypt    an encryption object which is used to decrypt strings during parsing
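
For readers unfamiliar with PDF hex strings, here is a minimal standalone decoding sketch. It is not pdfmm's code: `DecodeHexString` is a hypothetical helper that takes the text between `<` and `>`, ignores decryption entirely, and follows the PDF rules that white-space inside the string is skipped and a missing final digit is treated as 0.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Decode the body of a PDF hex string (the part between '<' and '>').
// Returns std::nullopt if a non-hex, non-white-space character is found.
inline std::optional<std::string> DecodeHexString(std::string_view body)
{
    std::string out;
    int pending = -1; // high nibble waiting for its partner, or -1
    for (unsigned char ch : body)
    {
        if (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r'
            || ch == '\f' || ch == '\0')
            continue; // white-space is ignored inside hex strings

        int nibble;
        if (ch >= '0' && ch <= '9')
            nibble = ch - '0';
        else if (ch >= 'A' && ch <= 'F')
            nibble = ch - 'A' + 10;
        else if (ch >= 'a' && ch <= 'f')
            nibble = ch - 'a' + 10;
        else
            return std::nullopt;

        if (pending < 0)
            pending = nibble;
        else
        {
            out.push_back(static_cast<char>((pending << 4) | nibble));
            pending = -1;
        }
    }
    // An odd number of digits: the final digit is assumed to be 0.
    if (pending >= 0)
        out.push_back(static_cast<char>(pending << 4));
    return out;
}
```

With this helper the body `901FA` decodes to the three bytes 0x90, 0x1F and 0xA0.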

◆ ReadName()

void PdfTokenizer::ReadName ( PdfInputDevice &  device,
PdfVariant &  variant
)
protected

Read a name from the input device and store it into a variant.

Throws UnexpectedEOF if there is nothing to read.

Parameters
variant    store the name into this variable
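
PDF names may contain `#xx` escapes, where `xx` is a two-digit hex code for a byte. The following standalone sketch shows how such a name could be unescaped; `DecodeName` is a hypothetical helper operating on the characters after the leading `/`, and rejecting malformed escapes is a choice made for this sketch rather than documented pdfmm behaviour.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Unescape '#xx' sequences in the characters that follow the '/' of a name.
// Returns std::nullopt on a truncated or non-hex escape.
inline std::optional<std::string> DecodeName(std::string_view raw)
{
    auto hex = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    };

    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i)
    {
        if (raw[i] != '#')
        {
            out.push_back(raw[i]);
            continue;
        }
        if (i + 2 >= raw.size())
            return std::nullopt; // fewer than two characters after '#'
        const int hi = hex(raw[i + 1]);
        const int lo = hex(raw[i + 2]);
        if (hi < 0 || lo < 0)
            return std::nullopt;
        out.push_back(static_cast<char>((hi << 4) | lo));
        i += 2; // skip the two hex digits
    }
    return out;
}
```

For example, `Lime#20Green` unescapes to `Lime Green`.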

◆ ReadNextNumber()

int64_t PdfTokenizer::ReadNextNumber ( PdfInputDevice &  device)

Read the next number from the current file position ignoring all comments.

Raises a NoNumber exception if the next token is not a number, and UnexpectedEOF if no token could be read. No token is consumed if NoNumber is thrown.

Returns
a number read from the input device.
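
The "no token is consumed" guarantee can be illustrated with a standalone sketch that only advances its read position on success. `Cursor`, `NumberStatus` and `ReadNumber` are hypothetical; the sketch reports the two failure cases through a status code instead of exceptions purely to stay small, and it reads from an in-memory buffer rather than a `PdfInputDevice`.

```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <system_error>

// Hypothetical status codes mirroring the two documented failure cases.
enum class NumberStatus { Ok, NoNumber, UnexpectedEOF };

// Hypothetical in-memory stand-in for an input device.
struct Cursor
{
    std::string_view text;
    std::size_t pos = 0;
};

inline NumberStatus ReadNumber(Cursor& cur, std::int64_t& value)
{
    const std::string_view text = cur.text;
    auto isSpace = [](char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r'
            || c == '\f' || c == '\0';
    };
    auto isDelim = [](char c) {
        return std::string_view("()<>[]{}/%").find(c) != std::string_view::npos;
    };

    // Work on a local index so that cur.pos stays untouched on failure.
    std::size_t i = cur.pos;
    while (i < text.size())
    {
        if (isSpace(text[i]))
            ++i;
        else if (text[i] == '%') // comment: skip to end of line
            while (i < text.size() && text[i] != '\n' && text[i] != '\r')
                ++i;
        else
            break;
    }
    if (i == text.size())
        return NumberStatus::UnexpectedEOF;

    std::size_t end = i;
    while (end < text.size() && !isSpace(text[end]) && !isDelim(text[end]))
        ++end;

    // The whole token must be an integer. A leading '+' is not handled here.
    std::int64_t parsed = 0;
    const char* first = text.data() + i;
    const char* last = text.data() + end;
    auto res = std::from_chars(first, last, parsed);
    if (res.ec != std::errc{} || res.ptr != last)
        return NumberStatus::NoNumber;

    value = parsed;
    cur.pos = end; // consume the token only on success
    return NumberStatus::Ok;
}
```

Because the position is committed only after a successful parse, a caller that receives `NoNumber` can read the same token again with a different reader.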

◆ ReadNextVariant() [1/2]

void mm::PdfTokenizer::ReadNextVariant ( PdfInputDevice &  device,
const std::string_view &  token,
PdfTokenType  tokenType,
PdfVariant &  variant,
PdfEncrypt *  encrypt
)
protected

Read the next variant from the current file position ignoring all comments.

Raises an exception if there is no variant left in the file.

Parameters
token    a token that has already been read
tokenType    type of the passed token
variant    write the read variant to this value
encrypt    an encryption object which is used to decrypt strings during parsing

◆ ReadNextVariant() [2/2]

void PdfTokenizer::ReadNextVariant ( PdfInputDevice &  device,
PdfVariant &  variant,
PdfEncrypt *  encrypt = nullptr
)

Read the next variant from the current file position ignoring all comments.

Raises an UnexpectedEOF exception if there is no variant left in the file.

Parameters
variant    write the read variant to this value
encrypt    an encryption object which is used to decrypt strings during parsing

◆ ReadString()

void PdfTokenizer::ReadString ( PdfInputDevice &  device,
PdfVariant &  variant,
PdfEncrypt *  encrypt
)
protected

Read a string from the input device and store it into a variant.

Parameters
variant    store the string into this variable
encrypt    an encryption object which is used to decrypt strings during parsing
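
Literal strings in PDF are enclosed in parentheses, may contain balanced nested parentheses, and support backslash escapes. The following standalone sketch decodes one under those rules; `DecodeLiteralString` is a hypothetical helper that expects the text just after the opening `(` and does no decryption, so it illustrates the syntax rather than pdfmm's implementation.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Decode a PDF literal string. 'in' starts just after the opening '('.
// Returns std::nullopt if the closing ')' is never found.
inline std::optional<std::string> DecodeLiteralString(std::string_view in)
{
    std::string out;
    int depth = 1; // the opening '(' has already been consumed
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        const char c = in[i];
        if (c == '\\')
        {
            if (++i == in.size())
                return std::nullopt;
            const char e = in[i];
            switch (e)
            {
                case 'n': out += '\n'; break;
                case 'r': out += '\r'; break;
                case 't': out += '\t'; break;
                case 'b': out += '\b'; break;
                case 'f': out += '\f'; break;
                case '\n': break; // backslash + newline: line continuation
                case '\r':        // also swallow the LF of a CRLF pair
                    if (i + 1 < in.size() && in[i + 1] == '\n')
                        ++i;
                    break;
                default:
                    if (e >= '0' && e <= '7')
                    {
                        // Octal escape with one to three digits.
                        int v = e - '0';
                        for (int k = 0; k < 2 && i + 1 < in.size()
                             && in[i + 1] >= '0' && in[i + 1] <= '7'; ++k)
                        {
                            ++i;
                            v = v * 8 + (in[i] - '0');
                        }
                        out += static_cast<char>(v & 0xFF);
                    }
                    else
                        out += e; // covers \( \) \\ and unknown escapes
            }
        }
        else if (c == '(')
        {
            ++depth;
            out += c;
        }
        else if (c == ')')
        {
            if (--depth == 0)
                return out; // matching close of the outermost '('
            out += c;
        }
        else
            out += c;
    }
    return std::nullopt;
}
```

Tracking the nesting depth is what lets unescaped but balanced parentheses appear inside the string.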

◆ TryReadNextToken()

bool mm::PdfTokenizer::TryReadNextToken ( PdfInputDevice &  device,
std::string_view &  token 
)

Reads the next token from the current file position ignoring all comments.

Parameters
[out] token    On true return, set to a view of the read token. The view refers to memory owned by PdfTokenizer and must NOT be freed. Its contents are invalidated on the next call to TryReadNextToken() and by the destruction of the PdfTokenizer. Undefined on false return.
[out] tokenType    On true return, if not nullptr the type of the read token will be stored into this parameter. Undefined on false return.
Returns
True if a token was read, false if there are no more tokens to read.
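
To make the token-reading contract concrete, here is a minimal standalone sketch of a comment-skipping tokenizer over an in-memory buffer. `TryReadToken` is a hypothetical free function, not pdfmm's method: it treats each delimiter as its own token (with `<<` and `>>` kept together), and the returned `std::string_view` points into the caller's buffer rather than into tokenizer-owned memory.

```cpp
#include <cstddef>
#include <string_view>

// Read the next token from 'input' starting at 'pos', skipping white-space
// and '%' comments. Returns false when no token is left.
inline bool TryReadToken(std::string_view input, std::size_t& pos,
                         std::string_view& token)
{
    auto isSpace = [](char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r'
            || c == '\f' || c == '\0';
    };
    auto isDelim = [](char c) {
        return std::string_view("()<>[]{}/%").find(c) != std::string_view::npos;
    };

    while (pos < input.size())
    {
        if (isSpace(input[pos]))
            ++pos;
        else if (input[pos] == '%') // comment runs to the end of the line
            while (pos < input.size() && input[pos] != '\n' && input[pos] != '\r')
                ++pos;
        else
            break;
    }
    if (pos == input.size())
        return false;

    const std::size_t start = pos;
    if (isDelim(input[pos]))
    {
        ++pos;
        // Keep "<<" and ">>" together as dictionary delimiters.
        if ((input[start] == '<' || input[start] == '>')
            && pos < input.size() && input[pos] == input[start])
            ++pos;
    }
    else
    {
        while (pos < input.size() && !isSpace(input[pos]) && !isDelim(input[pos]))
            ++pos;
    }
    token = input.substr(start, pos - start);
    return true;
}
```

Calling it in a loop over `<< /Type % c` followed by a newline and `/Page >> 12` yields the tokens `<<`, `/`, `Type`, `/`, `Page`, `>>` and `12` before returning false.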

Member Data Documentation

◆ HEX_NOT_FOUND

constexpr unsigned mm::PdfTokenizer::HEX_NOT_FOUND = std::numeric_limits<unsigned>::max()
static constexpr

Constant which is returned for invalid hex values.