pinxue (at) hotmail.com

Enable CJK Charset for PDFBox

PDFBox is a Java class library for working with PDF files. It has rich features, but I only care about using it to extract text from PDF files. That is simple, because the author anticipated this use and provided dedicated support for it. This is why I chose PDFBox instead of iText; of course, the size of the library was another important factor.

But PDFBox cannot handle Chinese text in PDF files well. I googled it and found a lot of complaints but no solution. OK, let me dive into the code. Fortunately, it is not that complex. After a whole day in a little black room, the problem was resolved.

PDF Box Overview

1. PDF Box usage in mysoo touches three levels of abstraction:

PDFTextStripper ==> PD Model ==> COS Model

The COS Model is a direct mapping of the PDF file format, and the PD Model is an object-oriented wrapper around the COS Model. The relation is shown in the following diagram:


The upper half is the PD Model: every object in it is a wrapper for the related COS Model object. The lower half is the COS Model: every object in it maps directly to a segment of text in the PDF file.

PDFTextStripper is a utility class. It encapsulates the work of walking all pages of the document through the PD Model and retrieving characters from each object that may contain text.

2. How PDF Box extracts text

  PDFTextStripper.writeText()
    For each PDPage of PDDocument
      ProcessPage()
        PDFStreamEngine.processStream() on the page
          Parse cosStream to get each operator name and its parameters
          Delegate to the operator retrieved from the stream
            => SetGraphicsStateParameters operator parses the graphics state
            => ShowText operator handles the COSString
              Retrieve byte[] from the COSString
              PDFStreamEngine.showString(bytes[])

The real text handling happens in PDFStreamEngine.showString(). Before it runs, SetGraphicsStateParameters has parsed and saved the current drawing parameters from the PDF; the one that matters for text handling is the PDFont. showString() uses this PDFont to decode the byte stream from the PDF file and build the proper string.
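The delegation step described above can be sketched as a table from operator name to handler. This is only an illustration, not PDFBox's code: the class OperatorDispatchSketch, its fields and the handler bodies are hypothetical. "gs" and "Tj" are the PDF operator names for setting graphics state parameters and showing text; here "gs" is simplified to receive a font name directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of "look up the operator by name, then delegate to it".
final class OperatorDispatchSketch {
    // The only piece of graphics state this sketch tracks.
    String currentFont = "none";
    // Records every string shown, tagged with the font that was current.
    final List<String> shown = new ArrayList<>();

    private final Map<String, Consumer<List<String>>> operators = new HashMap<>();

    OperatorDispatchSketch() {
        // Simplified stand-in for SetGraphicsStateParameters: remember the font.
        operators.put("gs", args -> currentFont = args.get(0));
        // Simplified stand-in for ShowText: the saved font decides the decoding.
        operators.put("Tj", args -> shown.add(currentFont + ":" + args.get(0)));
    }

    // Unknown operators are ignored in this sketch.
    void process(String operator, List<String> args) {
        Consumer<List<String>> handler = operators.get(operator);
        if (handler != null) {
            handler.accept(args);
        }
    }

    public static void main(String[] args) {
        OperatorDispatchSketch engine = new OperatorDispatchSketch();
        engine.process("gs", Arrays.asList("F1"));
        engine.process("Tj", Arrays.asList("hello"));
        System.out.println(engine.shown); // prints [F1:hello]
    }
}
```

The point of the sketch is the ordering: the text operator can only decode correctly if the state operator ran first and left the right font behind.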

3. How PDFStreamEngine decodes a string

PDFStreamEngine.showString(byte[] string)
    for each byte in the input string
        first try PDFont.encode( string[offset], 1 byte )
        if it returns null, then try PDFont.encode( string[offset], 2 bytes )

            PDFont.encode( bytes, count ) converts the bytes into a single-char string
                PDFont tries to get the CMap
                    if the font has an embedded TO_UNICODE CMap, parse it
                    else retrieve the CMap based on the ENCODING attribute of the font
                        // in my test, cn.pdf uses a Type0 font, encodingName is GBK-EUC-H
                        // GBK-EUC-H stands for MS CP936 (lfCharSet 0x86), GBK charset, GBK encoding
                        // at the first block of text, cmapObjects is empty
                        PDFont performs CMap name substitution
                        PDFont parses the CMap from Resource/cmap/cmapName (GBK-EUC-H)
                        CmapParser.parse(resStream) analyzes the stream and constructs a CMap instance
                        register the CMap instance in cmapObjects {encodingName, cmap}
                    (if null is returned when decoding a single-byte char, two-byte decoding is tried)
                cmap.lookup( bytes, count )
                if not found in the char map
                    PDFont.getEncoding()
                        get the encoding from the font's ENCODING attribute (GBK-EUC-H)
                            EncodingManager.getEncoding( encoding )
                                the manager searches its internal ENCODINGS table and returns the Encoding object with that name
                                the table is initialized in the static block of the manager class
                                    ENCODINGS.put( COSName.GBK_EUC_H_ENCODING, new GbkEucHEncoding() );
                    PDFont.getCodeFromArray(bytes, count)
                        example: for '前', one byte is 0xC7, two bytes are 0xC7B0
                    Encoding.getCharacter(code)
                    if the Encoding cannot handle it, call getStringFromArray()

Faint! Look at PDFont:

protected int getCodeFromArray( byte[] data, int offset, int length )
{
    int code = 0;
    for( int i=0; i<length; i++ )
    {
        code <<= 8;
        code = (data[offset+i]+256)%256; // BUG: '=' throws away the shifted high byte; it should be '|='
    }
    return code;
}
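For comparison, here is a sketch of what the method presumably should do: shift the accumulated code left by one byte and OR in the next byte. The wrapper class CodeFromArray exists only to make the example runnable, and this is my reading of the intent, not the code of the official fix.

```java
// Hypothetical wrapper class; only the method body matters.
final class CodeFromArray {
    // Packs `length` bytes starting at `offset` into one big-endian code.
    static int getCodeFromArray(byte[] data, int offset, int length) {
        int code = 0;
        for (int i = 0; i < length; i++) {
            code <<= 8;
            // '& 0xFF' makes the byte unsigned; '|=' keeps the earlier bytes.
            code |= data[offset + i] & 0xFF;
        }
        return code;
    }

    public static void main(String[] args) {
        byte[] gbk = {(byte) 0xC7, (byte) 0xB0};
        System.out.printf("0x%X%n", getCodeFromArray(gbk, 0, 1)); // prints 0xC7
        System.out.printf("0x%X%n", getCodeFromArray(gbk, 0, 2)); // prints 0xC7B0
    }
}
```

With the plain '=' of the original, the two-byte call would return 0xB0, so no two-byte code could ever be looked up correctly.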


// note: PDFont.java lines 395~401; the if branch seems useless,
// because it only repeats the lookup that was just done
// cmap = (CMap)cmapObjects.get( encodingName );
// if( cmap != null )
// {
//     cmap = (CMap)cmapObjects.get( encodingName );
// }
// else
// { ... }


Font segment in PDF file:

0020de0: 626a 0d3c 3c20 0d2f 5479 7065 202f 466f bj.<< ./Type /Fo

0020df0: 6e74 200d 2f53 7562 7479 7065 202f 5479 nt ./Subtype /Ty

0020e00: 7065 3020 0d2f 4e61 6d65 202f 4631 200d pe0 ./Name /F1 .

0020e10: 2f42 6173 6546 6f6e 7420 2f23 4241 2344 /BaseFont /#BA#D

0020e20: 4123 4343 2345 3520 0d2f 4465 7363 656e A#CC#E5 ./Descen

0020e30: 6461 6e74 466f 6e74 7320 5b20 3230 3720 dantFonts [ 207

0020e40: 3020 5220 5d20 0d2f 456e 636f 6469 6e67 0 R ] ./Encoding

0020e50: 202f 4742 4b2d 4555 432d 4820 0d3e 3e20 /GBK-EUC-H .>>

Type0 compound font: the base font name is BADA CCE5, which is SimHei; the encoding is GBK-EUC-H; the CID descendant font is 207 0 R (PDF object ID)

0020e60: 0d65 6e64 6f62 6a0d 3230 3720 3020 6f62 .endobj.207 0 ob

0020e70: 6a0d 3c3c 200d 2f54 7970 6520 2f46 6f6e j.<< ./Type /Fon

0020e80: 7420 0d2f 5375 6274 7970 6520 2f43 4944 t ./Subtype /CID

CIDFontType2 font: base font SimHei, WinCharSet 0x86 (134), descriptor 208 0 R (PDF object ID)

0020e90: 466f 6e74 5479 7065 3220 0d2f 4261 7365 FontType2 ./Base

0020ea0: 466f 6e74 202f 2342 4123 4441 2343 4323 Font /#BA#DA#CC#

0020eb0: 4535 200d 2f57 696e 4368 6172 5365 7420 E5 ./WinCharSet

0020ec0: 3133 3420 0d2f 466f 6e74 4465 7363 7269 134 ./FontDescri

0020ed0: 7074 6f72 2032 3038 2030 2052 200d 2f43 ptor 208 0 R ./C

0020ee0: 4944 5379 7374 656d 496e 666f 203c 3c20 IDSystemInfo <<

0020ef0: 2f52 6567 6973 7472 7920 284b 77b0 6789 /Registry (Kw.g.

0020f00: 292f 4f72 6465 7269 6e67 2028 4d51 ee29 )/Ordering (MQ.)

0020f10: 2f53 7570 706c 656d 656e 7420 3220 3e3e /Supplement 2 >>

0020f20: 200d 2f44 5720 3130 3030 200d 2f57 205b ./DW 1000 ./W [

0020f30: 2038 3134 2039 3037 2035 3030 2037 3731 814 907 500 771

0020f40: 3620 3737 3136 2035 3030 205d 200d 3e3e 6 7716 500 ] .>>

0020f50: 200d 656e 646f 626a 0d32 3038 2030 206f .endobj.208 0 o
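The BaseFont value in the dump, /#BA#DA#CC#E5, uses the PDF name syntax in which '#' followed by two hex digits stands for one raw byte. A small sketch of undoing that escape shows where the bytes BA DA CC E5 come from. PdfNameUnescape is a hypothetical helper written for this note, not a PDFBox class.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper: turns "#XX" escapes in a PDF name into raw bytes.
final class PdfNameUnescape {
    static byte[] unescape(String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '#' && i + 2 < name.length()) {
                int hi = Character.digit(name.charAt(i + 1), 16);
                int lo = Character.digit(name.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2; // skip the two hex digits just consumed
                    continue;
                }
            }
            out.write(c); // assumes unescaped characters are plain ASCII
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = unescape("#BA#DA#CC#E5");
        StringBuilder hex = new StringBuilder();
        for (byte b : raw) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // prints BA DA CC E5
    }
}
```

Decoding those four bytes with a GBK charset, where the runtime provides one, should give the Chinese name of the font.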


Resolving the problem

  1. No encoding was found for GBK-EUC-H
    1. Parameterized StandardEncoding to create one
    2. Text cannot be processed without it
  2. GbkEucHEncoding does not provide a correct getCharacter method
    1. Tried to write one that checks whether the byte value is in the GBK range
    2. That did not resolve the problem
  3. Found a logic error: font.encode() never returns null, so the two-byte branch is unreachable
  4. Found that after cmap/GBK-EUC-H is parsed, there is no doubleByteMap
  5. CMapParser doesn't handle begincidrange; it only processes beginbfrange/char/space
    1. Added logic for begincidrange
    2. Not resolved: cidrange only affects how glyphs are fetched from the font
  6. Returned to the encoding
    1. Fixed PDFont.getCodeFromArray()
    2. GbkEucHEncoding.getCharacter() now analyzes the lead byte and converts the bytes to a single-char string
    3. Found that encode() always returns a one-byte char at the lead byte; patched it
    4. It works now.
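The lead-byte analysis in step 6 can be sketched as follows. In GBK, a byte in the range 0x81 to 0xFE starts a two-byte character and anything below that stands alone. GbkSplitter is a hypothetical class written for this note; it is not the patched GbkEucHEncoding, and it does not validate the trail byte.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a GBK byte stream into 1-byte and 2-byte codes.
final class GbkSplitter {
    // GBK lead bytes are 0x81..0xFE.
    static boolean isLeadByte(int b) {
        return b >= 0x81 && b <= 0xFE;
    }

    static List<Integer> splitCodes(byte[] data) {
        List<Integer> codes = new ArrayList<>();
        int i = 0;
        while (i < data.length) {
            int first = data[i] & 0xFF;
            if (isLeadByte(first) && i + 1 < data.length) {
                // Lead byte: combine it with the following trail byte.
                codes.add((first << 8) | (data[i + 1] & 0xFF));
                i += 2;
            } else {
                // Single-byte character, or a dangling lead byte at the end.
                codes.add(first);
                i += 1;
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        byte[] data = {0x41, (byte) 0xC7, (byte) 0xB0, 0x42};
        for (int code : splitCodes(data)) {
            System.out.printf("0x%X%n", code); // prints 0x41, 0xC7B0, 0x42
        }
    }
}
```

Deciding the character width from the lead byte avoids relying on a failed one-byte lookup to trigger the two-byte path.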

Holly (http://www.jsfsoft.com:8080/beyond-pebble/lee) thought we could resolve it more gracefully and created a patch. The patch has been submitted to pdfbox.org. Until the official version is updated, the patch is available at http://sourceforge.net/tracker/index.php?func=detail&aid=1640071&group_id=78314&atid=552834.

update log:
2007-1-20
published
2006-8-21
created