PDFBox is a Java class library for working with PDF files. It has rich features, but I only care about using it to extract text from a PDF file. That is simple, because the author anticipated this use and provided dedicated support for it. This is why I chose PDFBox instead of iText; of course, the size of the library was another important factor.
But PDFBox cannot handle Chinese text in PDF files well. I googled it and found a lot of complaints but no solution. OK, let me dive into the code. Fortunately, it is not that complex. After a whole day in a little black room, the problem was resolved.
PDF Box Overview
1. The use of PDF Box in mysoo touches three levels of abstraction:
PDFTextStripper ==> PD Model ==> COS Model
The COS Model is a direct mapping of the PDF format, and the PD Model is an OO wrapper around the COS Model. The relation is shown in the following diagram:

The upper half is the PD Model; every object in it is a wrapper for the related COS Model object. The lower half is the COS Model; every object in it maps directly to a segment of text in the PDF file.
PDFTextStripper is a utility. It encapsulates the operations that get all pages of the document through the PD Model and retrieve characters from each object that may contain text.
2. The process of PDF Box extracting text
PDFTextStripper.writeText()
The real code for text handling is in PDFStreamEngine.showString(). Before it is called, SetGraphicsStateParameters has already parsed and saved the current drawing parameters from the PDF; the parameter relevant to text handling is the PDFont. showString() uses this PDFont to decode the byte stream from the PDF file and construct a proper string.
3. The process of decoding a string in PDFStreamEngine
PDFStreamEngine.showString(byte[] string)
  for each byte in the input string
    first try PDFont.encode( string[offset], 1 byte )
    if it returns null, then try PDFont.encode( string[offset], 2 bytes )
  PDFont.encode( bytes, count ) converts the bytes into a single-character string
    PDFont tries to get the CMap
      if the font has an embedded TO_UNICODE CMap, parse it
      else retrieve the CMap based on the ENCODING attribute of the font
        // in my test, cn.pdf uses a Type0 font, encodingName is GBK-EUC-H
        // GBK-EUC-H stands for MS CP936 (lfCharSet 0x86), GBK charset, GBK encoding
        // at the first block of text, cmapObjects is empty.
        PDFont performs cmap name substitution
        PDFont parses the CMap from Resource/cmap/cmapName (GBK-EUC-H)
        CmapParser.parse(resStream) analyzes the stream and constructs a cmap instance
        register the cmap instance in cmapObjects {encodingName, cmap}
    if null is returned for single-byte decoding, try two-byte decoding
    cmap.lookup( bytes, count )
    if not found in the char map
      PDFont.getEncoding()
        get the encoding from the font ENCODING attribute (GBK-EUC-H)
        EncodingManager.getEncoding( encoding )
          the manager searches its internal ENCODINGS table and returns the Encoding object with that name
          the table is initialized in the static block of the manager class
          ENCODINGS.put( COSName.GBK_EUC_H_ENCODING, new GbkEucHEncoding() );
      PDFont.getCodeFromArray(bytes, count)
        example: for '?', the one-byte code is 0xC7 and the two-byte code is 0xC7B0
      Encoding.getCharacter(code)
      if the Encoding cannot handle it, call getStringFromArray()
faint!
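The one-byte-then-two-byte lookup order described above can be sketched in plain Java. This is only a minimal sketch and not PDFBox's code: the class ShowStringSketch, the CODE_TO_TEXT table and both of its entries are hypothetical stand-ins for the parsed CMap, and unmapped bytes are simply skipped here.

```java
import java.util.HashMap;
import java.util.Map;

public class ShowStringSketch {
    // Hypothetical stand-in for a parsed CMap: character code -> text.
    static final Map<Integer, String> CODE_TO_TEXT = new HashMap<>();
    static {
        CODE_TO_TEXT.put(0x41, "A");                 // a one-byte code
        CODE_TO_TEXT.put(0xC7B0, "[two-byte char]"); // a two-byte code (placeholder text)
    }

    // Combine `length` bytes starting at `offset` into one big-endian code.
    static int codeOf(byte[] data, int offset, int length) {
        int code = 0;
        for (int i = 0; i < length; i++) {
            code = (code << 8) | (data[offset + i] & 0xFF);
        }
        return code;
    }

    // Returns the mapped text, or null if the code is unknown or the bytes run out.
    static String encode(byte[] data, int offset, int length) {
        if (offset + length > data.length) {
            return null;
        }
        return CODE_TO_TEXT.get(codeOf(data, offset, length));
    }

    // Try one byte first; fall back to two bytes when the one-byte lookup fails.
    static String showString(byte[] string) {
        StringBuilder out = new StringBuilder();
        int offset = 0;
        while (offset < string.length) {
            int used = 1;
            String s = encode(string, offset, 1);
            if (s == null) {
                s = encode(string, offset, 2);
                if (s != null) {
                    used = 2;
                }
            }
            if (s != null) {
                out.append(s);
            }
            offset += used;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] input = { 0x41, (byte) 0xC7, (byte) 0xB0 };
        System.out.println(showString(input));
    }
}
```

The point of the sketch is that the two-byte attempt only works if the two bytes are combined into one code correctly, which is exactly where getCodeFromArray() matters.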
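PDFont:
protected int getCodeFromArray( byte[] data, int offset, int length )
{
    int code = 0;
    for( int i=0; i<length; i++ )
    {
        code <<= 8;
        // Bug: '=' overwrites the bytes that were just shifted up,
        // so only the last byte survives. It should be '|='.
        code = (data[offset+i]+256)%256;
    }
    return code;
}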
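To make the effect of that one character concrete, here is a small standalone comparison. It is a sketch for illustration only: the class name CodeFromArray and the method names buggy and fixed are mine, and the loop bodies are copied from the method quoted above with only the operator changed.

```java
public class CodeFromArray {
    // The loop as quoted above: '=' discards the previously shifted bytes.
    static int buggy(byte[] data, int offset, int length) {
        int code = 0;
        for (int i = 0; i < length; i++) {
            code <<= 8;
            code = (data[offset + i] + 256) % 256;
        }
        return code;
    }

    // The same loop with '|=', so each byte is merged below the shifted value.
    static int fixed(byte[] data, int offset, int length) {
        int code = 0;
        for (int i = 0; i < length; i++) {
            code <<= 8;
            code |= (data[offset + i] + 256) % 256;
        }
        return code;
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xC7, (byte) 0xB0 };
        System.out.println(Integer.toHexString(buggy(bytes, 0, 2))); // prints b0
        System.out.println(Integer.toHexString(fixed(bytes, 0, 2))); // prints c7b0
    }
}
```

With the original operator, the two-byte code 0xC7B0 from the example above collapses to 0xB0, so the lookup in the encoding table can never find a two-byte character.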
// Note on PDFont.java, lines 395~401: the if branch seems to be useless,
// because it only repeats the lookup that was just done.
// cmap = (CMap)cmapObjects.get( encodingName );
// if( cmap != null )
// {
//     cmap = (CMap)cmapObjects.get( encodingName );
// }
// else
// { ... }
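For comparison, a lookup-then-parse cache usually needs only the "not found" branch. The following is a hedged sketch of that pattern and not a patch to PDFont.java: the class CMapCacheSketch, the parseCmap helper and the use of String in place of a real CMap object are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class CMapCacheSketch {
    // Hypothetical cache keyed by encoding name; a String stands in for a CMap.
    static final Map<String, String> cmapObjects = new HashMap<>();
    static int parseCount = 0;

    // Hypothetical stand-in for parsing a CMap resource.
    static String parseCmap(String encodingName) {
        parseCount++;
        return "cmap:" + encodingName;
    }

    // Look the CMap up once; parse and register it only when it is missing.
    static String getCmap(String encodingName) {
        String cmap = cmapObjects.get(encodingName);
        if (cmap == null) {
            cmap = parseCmap(encodingName);
            cmapObjects.put(encodingName, cmap);
        }
        return cmap;
    }

    public static void main(String[] args) {
        getCmap("GBK-EUC-H");
        getCmap("GBK-EUC-H");
        System.out.println(parseCount); // prints 1: the second call hits the cache
    }
}
```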
Font segment in PDF file:
0020de0: 626a 0d3c 3c20 0d2f 5479 7065 202f 466f bj.<< ./Type /Fo
0020df0: 6e74 200d 2f53 7562 7479 7065 202f 5479 nt ./Subtype /Ty
0020e00: 7065 3020 0d2f 4e61 6d65 202f 4631 200d pe0 ./Name /F1 .
0020e10: 2f42 6173 6546 6f6e 7420 2f23 4241 2344 /BaseFont /#BA#D
0020e20: 4123 4343 2345 3520 0d2f 4465 7363 656e A#CC#E5 ./Descen
0020e30: 6461 6e74 466f 6e74 7320 5b20 3230 3720 dantFonts [ 207
0020e40: 3020 5220 5d20 0d2f 456e 636f 6469 6e67 0 R ] ./Encoding
0020e50: 202f 4742 4b2d 4555 432d 4820 0d3e 3e20 /GBK-EUC-H .>>
Type0 compound font: the base font name is #BA#DA#CC#E5 (the bytes BADA CCE5, which is SimHei), the encoding is GBK-EUC-H, and the CID descendant font is 207 0 R (PDF object ID)
0020e60: 0d65 6e64 6f62 6a0d 3230 3720 3020 6f62 .endobj.207 0 ob
0020e70: 6a0d 3c3c 200d 2f54 7970 6520 2f46 6f6e j.<< ./Type /Fon
0020e80: 7420 0d2f 5375 6274 7970 6520 2f43 4944 t ./Subtype /CID
CIDFontType2 font: the base font is SimHei, WinCharSet is 134 (0x86), and the font descriptor is 208 0 R (PDF object ID)
0020e90: 466f 6e74 5479 7065 3220 0d2f 4261 7365 FontType2 ./Base
0020ea0: 466f 6e74 202f 2342 4123 4441 2343 4323 Font /#BA#DA#CC#
0020eb0: 4535 200d 2f57 696e 4368 6172 5365 7420 E5 ./WinCharSet
0020ec0: 3133 3420 0d2f 466f 6e74 4465 7363 7269 134 ./FontDescri
0020ed0: 7074 6f72 2032 3038 2030 2052 200d 2f43 ptor 208 0 R ./C
0020ee0: 4944 5379 7374 656d 496e 666f 203c 3c20 IDSystemInfo <<
0020ef0: 2f52 6567 6973 7472 7920 284b 77b0 6789 /Registry (Kw.g.
0020f00: 292f 4f72 6465 7269 6e67 2028 4d51 ee29 )/Ordering (MQ.)
0020f10: 2f53 7570 706c 656d 656e 7420 3220 3e3e /Supplement 2 >>
0020f20: 200d 2f44 5720 3130 3030 200d 2f57 205b ./DW 1000 ./W [
0020f30: 2038 3134 2039 3037 2035 3030 2037 3731 814 907 500 771
0020f40: 3620 3737 3136 2035 3030 205d 200d 3e3e 6 7716 500 ] .>>
0020f50: 200d 656e 646f 626a 0d32 3038 2030 206f .endobj.208 0 o
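The /BaseFont value in the dump above is written with '#XX' hex escapes, which is how a PDF name carries bytes outside plain ASCII. As a rough illustration of how to read it, the sketch below turns such a name back into raw bytes. The class PdfNameSketch and its methods are hypothetical helpers written for this post, not part of PDFBox.

```java
import java.io.ByteArrayOutputStream;

public class PdfNameSketch {
    // Decode '#XX' hex escapes in a PDF name (without the leading '/') into raw bytes.
    static byte[] decodeName(String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '#' && i + 2 < name.length()) {
                // The two characters after '#' are one byte in hexadecimal.
                out.write(Integer.parseInt(name.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Render bytes as upper-case hex for display.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(decodeName("#BA#DA#CC#E5"))); // prints BADACCE5
    }
}
```

The four bytes are then interpreted in the font's own encoding (GBK here) to get the font name.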
Resolve the problem
Holly (http://www.jsfsoft.com:8080/beyond-pebble/lee) thinks we can resolve it more gracefully and has created a patch. The patch has been submitted to pdfbox.org. Until the official version is updated, the patch is available at (http://sourceforge.net/tracker/index.php?func=detail&aid=1640071&group_id=78314&atid=552834).