Open Access Paper
28 December 2022
Forensic authenticity examination of PDF documents
Jinhua Zeng, Xiulian Qiu
Proceedings Volume 12506, Third International Conference on Computer Science and Communication Technology (ICCSCT 2022); 125064Y (2022) https://doi.org/10.1117/12.2662214
Event: International Conference on Computer Science and Communication Technology (ICCSCT 2022), 2022, Beijing, China
Abstract
With the rapid development and widespread adoption of information technology, Portable Document Format (PDF) documents have become a type of digital file that is easily accessible and closely tied to everyday public life. In digital forensics, digital files in the form of PDF documents are often encountered, and their authenticity needs to be determined. Research on the key forensic techniques for PDF documents therefore has both theoretical significance and practical value. However, a search of the literature shows that systematic research on the forensic authentication of PDF documents is still lacking. This paper first studies the file structure and digital composition of PDF documents. It then studies the digital data characteristics and the examination contents of PDF files produced in mainstream scenarios, including PDF files generated by scanners, converted directly from images, and converted from DOCX documents. Finally, a case study explains in detail the key technologies and examination contents for authenticity examination of PDF documents. Our study provides a theoretical basis and practical guidance for forensic authenticity examination of PDF documents.

1. INTRODUCTION

With the rapid development and widespread adoption of information technology, electronic office documents have become digital files that are easily accessible and closely tied to everyday public life. Among them, the PDF document is a type of digital file that is independent of the operating system platform and is a formal international standard. These advantages make it an ideal carrier for electronic document distribution and digital information dissemination. PDF was first proposed by Adobe and can effectively integrate multimedia information such as text, sound, and images, together with hypertext links, typesetting styles, and other content. Because PDF files are so widely used in daily life, we often have to determine the authenticity of PDF files that are submitted in court as digital evidence.

Many works [1-7] have focused on the forensics of PDF documents. For example, Adhatarao and Lauradoux [8] used the coding styles of PDF documents to identify the tools that produced them. A literature search shows that more attention has been paid to the data recovery of PDF files [9, 10], while systematic study of the forensic authenticity examination of PDF documents is still lacking.

Based on the above situation, we first discuss the PDF document format. We then study the forensic authenticity examination of PDF documents generated in common ways, including PDF files generated by scanner devices, PDF files converted directly from images, and PDF files derived from word processing software such as WPS Office and Microsoft Office. Finally, a case study demonstrates the key methods and contents for effective forensic authenticity examination of PDF documents.

2. PDF FILE FORMAT

2.1 PDF file structure

To study the file structure of PDF documents, we use as an example a “.docx” file that was created and then converted into a PDF document with WPS Office (version 11.1.0.11294).

Our results show that the PDF file mainly consists of four parts: the file header, the file body, the cross-reference table, and the file trailer. The file header is the string “%PDF-1.x”, in which “1.x” indicates the version number of the PDF file. The highest version of PDF released by Adobe is PDF 1.7; subsequent versions are maintained and released by the International Organization for Standardization (ISO). The file body consists of several “obj” objects, each introduced by a line such as “7 0 obj”, in which the first number, “7”, is the object number that uniquely identifies the object, and the second number, “0”, is the object generation number. The generation number is incremented when an object number is reused after its earlier object has been freed, so the objects of a newly created PDF file have generation number 0. The information of each “obj” object is contained between the characters “<<” and “>>”, and the object ends with the keyword “endobj”. The cross-reference table contains the location index of each “obj” object and starts with the keyword “xref”. The numbers “0 15” in its second line indicate that the object numbers start from 0 and that there are 15 entries in total. Generally, the third line of the cross-reference table is the fixed entry “0000000000 65535 f”, which describes object 0. In each entry, the first field gives the starting position (byte offset) of an in-use object, the second field is the generation number, for which “65535” is the maximum possible value, and the third field is “f” for a free object or “n” for an object that is in use. The file trailer starts with the keyword “trailer”. In it, “/Size 15” indicates the total number of entries in the cross-reference table, “/Root 1 0 R” indicates that the root object is object 1 with generation number 0, and the subsequent “startxref 22705” indicates that the byte offset of the cross-reference table is 22705. The marker “%%EOF” indicates the end of the file.
Each object is accessed by looking up its starting position in the cross-reference table.
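
The four-part layout described above can be illustrated with a minimal Python sketch. It is not the procedure used in this paper: the sample file is a tiny hypothetical PDF assembled in code, the function names `build_sample_pdf` and `parse_structure` are illustrative, and the parser assumes a classic, uncompressed cross-reference table with a single subsection (no cross-reference streams or incremental updates).

```python
import re


def build_sample_pdf() -> bytes:
    """Assemble a tiny, hypothetical PDF so that the xref offsets are exact."""
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for obj in objects:
        offsets.append(len(out))          # byte offset where this object starts
        out += obj
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"       # entry for the free object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # entries for in-use objects
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def parse_structure(data: bytes) -> dict:
    """Read the header, cross-reference table, and trailer of a simple PDF."""
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    startxref = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if header is None or startxref is None:
        raise ValueError("header or startxref not found")
    xref_pos = int(startxref.group(1))
    lines = data[xref_pos:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("no classic cross-reference table at startxref")
    first, count = (int(n) for n in lines[1].split())
    entries = {}
    for i, line in enumerate(lines[2:2 + count]):
        field, generation, kind = line.split()
        entries[first + i] = (int(field), int(generation), kind.decode())
    trailer = re.search(rb"trailer\s*<<(.*?)>>", data[xref_pos:], re.S).group(1)
    size = int(re.search(rb"/Size\s+(\d+)", trailer).group(1))
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", trailer)
    return {
        "version": header.group(1).decode(),
        "xref_offset": xref_pos,
        "entries": entries,
        "size": size,
        "root": (int(root.group(1)), int(root.group(2))),
    }


sample = build_sample_pdf()
info = parse_structure(sample)
print("version:", info["version"], "root:", info["root"], "size:", info["size"])
for number, (field, generation, kind) in info["entries"].items():
    print(number, field, generation, kind)
```

For every in-use entry, the bytes at the recorded offset begin with the matching “N G obj” line, which is the lookup that the cross-reference table is meant to support.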

2.2 The digital composition of PDF files

As an example, we use a PDF file converted directly from an image with Adobe Acrobat X (version 10.1.16). The image file is named “img_5820-psresave.jpg” and is 707,352 bytes in size. The generated PDF file is 716,945 bytes in size. X-Ways Forensics (version 20.0 SR-5 X64) is used to analyze its digital components, with file signatures used to identify the different types of data embedded in the PDF file. The digital components of the PDF file are shown in Figure 1. As Figure 1 shows, the objects of the PDF file mainly consist of XML data and JPG images. In addition, a thumbnail image is embedded in the JPG image.

Figure 1. The digital composition of the PDF file in the sample.
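
Signature-based identification of embedded data can be sketched in a few lines of Python. This is only an illustration of the idea, not the search performed by X-Ways Forensics: the signature table is a small assumed subset, the sample bytes are hypothetical, and a real JPEG stream may contain further matches that need to be validated.

```python
# Assumed, deliberately small signature table (name -> magic bytes).
SIGNATURES = {
    "pdf_header": b"%PDF-",
    "xmp_packet": b"<?xpacket begin",
    "jpeg_start": b"\xff\xd8\xff",
}


def find_signatures(data: bytes) -> list:
    """Return sorted (offset, name) pairs for every signature occurrence."""
    hits = []
    for name, magic in SIGNATURES.items():
        pos = data.find(magic)
        while pos != -1:
            hits.append((pos, name))
            pos = data.find(magic, pos + 1)
    return sorted(hits)


# Hypothetical sample: a JPEG whose header carries an embedded thumbnail.
thumbnail = b"\xff\xd8\xff\xdb" + b"thumb" + b"\xff\xd9"
jpeg = b"\xff\xd8\xff\xe1" + b"EXIF" + thumbnail + b"main" + b"\xff\xd9"
sample = (
    b"%PDF-1.6\n"
    + b"<?xpacket begin='' ?><x/><?xpacket end='w'?>\n"
    + b"stream\n"
    + jpeg
)

hits = find_signatures(sample)
for offset, name in hits:
    print(f"{offset:6d}  {name}")
```

The second `jpeg_start` hit inside the first JPEG corresponds to the kind of nested thumbnail noted above.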

The XML data viewed in Notepad++ (version 8.1.4) is shown in Figure 2. The XML data mainly contains metadata of the PDF file, such as the program that produced the PDF, the creation time, and the modification time, where the modification time is the time at which the PDF file was saved.

Figure 2. The content of XML data parsed in the PDF file.
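
As a rough sketch of how such XML metadata could be read programmatically, the following Python code locates an XMP packet in raw bytes and extracts a few common fields. The packet content and all field values are hypothetical, the selected field list is an assumption, and the code assumes the metadata stream is stored uncompressed and unencrypted.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xmp": "http://ns.adobe.com/xap/1.0/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
}
FIELDS = ("xmp:CreateDate", "xmp:ModifyDate", "xmp:CreatorTool", "pdf:Producer")


def read_xmp_fields(data: bytes) -> dict:
    """Extract selected XMP properties stored as elements or as attributes."""
    match = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", data, re.S)
    if match is None:
        return {}
    root = ET.fromstring(match.group(0))
    found = {}
    for desc in root.iter("{%s}Description" % NS["rdf"]):
        for name in FIELDS:
            prefix, local = name.split(":")
            key = "{%s}%s" % (NS[prefix], local)
            child = desc.find(key)
            if child is not None and child.text:
                found[name] = child.text.strip()
            elif key in desc.attrib:
                found[name] = desc.attrib[key]
    return found


# Hypothetical XMP packet surrounded by other PDF bytes.
sample = b"""%PDF-1.6
<?xpacket begin='' ?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:xmp="http://ns.adobe.com/xap/1.0/"
    xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
    pdf:Producer="Hypothetical Producer 1.0">
   <xmp:CreateDate>2021-01-02T03:04:05+08:00</xmp:CreateDate>
   <xmp:ModifyDate>2021-01-02T03:05:00+08:00</xmp:ModifyDate>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>
"""

fields = read_xmp_fields(sample)
for name, value in sorted(fields.items()):
    print(name, "=", value)
```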

Comparing the digital data of the original image with that of the JPG image parsed from the PDF file, using Beyond Compare 4 (version 4.1.2), we find that the metadata in the header of the original image is basically preserved, but the digital data of the main body differs greatly, which suggests that the image data may have been re-encoded.
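
A comparison of this kind can be approximated with Python's standard `difflib` module. The sketch below is not how Beyond Compare works internally: the two byte strings are hypothetical stand-ins for the original and extracted images, and `header_len` is an assumed split point that a real examination would derive from the JPEG segment structure.

```python
from difflib import SequenceMatcher


def region_similarity(a: bytes, b: bytes) -> float:
    """Similarity ratio in [0, 1] between two byte regions."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def compare_header_and_body(original: bytes, extracted: bytes, header_len: int) -> dict:
    """Compare the header region and the remaining body region separately."""
    return {
        "header": region_similarity(original[:header_len], extracted[:header_len]),
        "body": region_similarity(original[header_len:], extracted[header_len:]),
    }


# Hypothetical data: identical header, completely different body bytes.
HEADER = b"\xff\xd8\xff\xe1" + b"Exif\x00\x00hypothetical camera metadata"
original = HEADER + bytes(range(0, 100))
extracted = HEADER + bytes(range(100, 200))

result = compare_header_and_body(original, extracted, len(HEADER))
print(result)
```

A high header ratio combined with a low body ratio is the pattern described above: preserved metadata with re-encoded image data.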

3. FORENSIC AUTHENTICATION OF PDF FILES

3.1 Forensic authenticity examination of PDF files generated by scanners

Because scanning is an optical reproduction process, the original digital data of images cannot be retained in PDF files generated by scanners. The effective contents for forensic authenticity examination of scanner-produced PDF files are therefore mainly the metadata of the PDF files. In this paper, the EPSON Chops V330 Photo scanner and the Fuji Xerox DocuCentre-V C2265 MFP are taken as examples to study the relevant points for examination. The results show that most metadata fields of PDF files generated by the EPSON Chops V330 Photo scanner are blank, and only the PDF production program and the PDF version are saved. The metadata of PDF files produced by the Fuji Xerox DocuCentre-V C2265 MFP is relatively rich, as shown in Figure 3. In addition to basic information about the PDF producer, it also contains XMP-related fields, such as the creation time and modification time.

Figure 3. Metadata information of PDF files produced by Fuji Xerox DocuCentre-V C2265 MFP.
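
To illustrate how blank and populated metadata fields might be listed, the following Python sketch reads a few document information entries from raw PDF bytes and parses a PDF date string of the form `D:YYYYMMDDHHmmSS+HH'mm'`. The sample dictionary and its values are hypothetical, and the regular expressions assume unencrypted literal strings without nested parentheses or hexadecimal encoding.

```python
import re
from datetime import datetime, timedelta, timezone

INFO_KEYS = ("Title", "Author", "Creator", "Producer", "CreationDate", "ModDate")
DATE_RE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?"
)


def read_info_fields(data: bytes) -> dict:
    """Return each information key with its literal string value, or None."""
    fields = {}
    for key in INFO_KEYS:
        match = re.search(rb"/" + key.encode() + rb"\s*\(([^)]*)\)", data)
        fields[key] = match.group(1).decode("latin-1") if match else None
    return fields


def parse_pdf_date(text: str) -> datetime:
    """Convert a PDF date string into a datetime (time zone aware if given)."""
    m = DATE_RE.match(text)
    if m is None:
        raise ValueError("not a PDF date string: %r" % text)
    year, month, day = int(m.group(1)), int(m.group(2) or 1), int(m.group(3) or 1)
    hour, minute, second = (int(m.group(i) or 0) for i in (4, 5, 6))
    sign = m.group(7)
    if sign is None:
        tz = None
    elif sign == "Z":
        tz = timezone.utc
    else:
        delta = timedelta(hours=int(m.group(8) or 0), minutes=int(m.group(9) or 0))
        tz = timezone(delta if sign == "+" else -delta)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


# Hypothetical information dictionary with only two populated fields.
sample = (
    b"9 0 obj\n<< /Producer (Hypothetical Scanner Firmware 1.0) "
    b"/CreationDate (D:20210102030405+08'00') >>\nendobj\n"
)

fields = read_info_fields(sample)
blank = [key for key, value in fields.items() if not value]
created = parse_pdf_date(fields["CreationDate"])
print("blank fields:", blank)
print("created:", created.isoformat())
```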

3.2 Forensic authenticity examination of PDF files converted from images

As mentioned in Section 2.2, most of the file header data of the original image is retained when an image is converted into a PDF document with Adobe Acrobat. Therefore, in addition to examining the metadata of the PDF file, the metadata of the digital image can be effectively examined according to the technical specification for forensic examination of digital image metadata [11]. Moreover, the image data in a PDF file can be extracted and resaved as an independent image file, and its imaging features and processing traces can then be examined according to the related technical specifications for forensic image authenticity examination [12, 13].
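
Extracting an embedded image can be sketched as follows in Python. In PDF, JPEG data is stored in stream objects whose dictionary names the `/DCTDecode` filter, so the sketch looks for such streams and copies their bytes unchanged. The sample is a hypothetical fragment with a fake payload, the function name `extract_jpeg_streams` is illustrative, and the code assumes unencrypted objects that are not packed inside object streams.

```python
import re


def extract_jpeg_streams(data: bytes) -> list:
    """Return the raw bytes of every stream whose dictionary uses /DCTDecode."""
    images = []
    # Match the "stream" keyword but not the tail of "endstream".
    for match in re.finditer(rb"(?<!end)stream\r?\n", data):
        obj_start = data.rfind(b" obj", 0, match.start())
        if obj_start == -1:
            continue
        dictionary = data[obj_start:match.start()]
        if b"/DCTDecode" not in dictionary:
            continue
        # Use a direct /Length value; skip indirect references like "12 0 R".
        length = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)", dictionary)
        if length:
            body = data[match.end():match.end() + int(length.group(1))]
        else:
            end = data.find(b"endstream", match.end())
            body = data[match.end():end].rstrip(b"\r\n")
        images.append(body)
    return images


# Hypothetical PDF fragment: one compressed text stream and one JPEG stream.
payload = b"\xff\xd8\xff\xe0" + b"demo" + b"\xff\xd9"
sample = b"".join([
    b"%PDF-1.6\n",
    b"4 0 obj\n<< /Length 5 /Filter /FlateDecode >>\nstream\nabcde\nendstream\nendobj\n",
    b"5 0 obj\n<< /Type /XObject /Subtype /Image /Filter /DCTDecode /Length ",
    str(len(payload)).encode(),
    b" >>\nstream\n",
    payload,
    b"\nendstream\nendobj\n",
])

images = extract_jpeg_streams(sample)
print(len(images), "JPEG stream(s) found")
```

Because the stream bytes are copied without decoding, the extracted file keeps whatever header metadata the converter retained.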

3.3 Forensic authenticity examination of PDF files transformed from DOCX documents

In the test, DOCX documents created with WPS Office (version 11.1.0.11294) and with Microsoft Office 2019 are both studied. One image is embedded in each DOCX document, which is then converted into a PDF document with the built-in PDF converter of WPS Office or Microsoft Office. The results show that when the image is embedded in the DOCX document, it is re-compressed and re-encoded, and the original data in the file header and at the end of the file is basically erased. Forensic authenticity examination of PDF files in this case therefore mainly focuses on the metadata of the PDF files. The metadata of a PDF file produced by WPS Office (version 11.1.0.11294) is shown in Figure 4; it includes the PDF generation tool, creation time, modification time, author information, and other fields.

Figure 4. Metadata information of a typical PDF file produced by WPS Office (version 11.1.0.11294).

4. CASE STUDY

In a contract dispute case, it is necessary to verify the authenticity of the content of a contract that exists in the form of a PDF document. One party claims that the PDF document was created by scanning the original paper contract, while the other party claims that the original paper contract does not bear that party's signature and that the signature in the PDF document was synthesized.

The examination follows “Specification for forensic authentication of digital documents (standard No. SF/Z JD0402004-2018)” and “Specification for forensic examination of digital image metadata (No. SF/T 0078-2020)”. The tools used include X-Ways Forensics (version 20.0 SR-5 X64), Adobe Acrobat (version 10.1.16), and ExifTool (version 8.02).

First, the metadata of the PDF file is examined. As shown in Figure 5, the PDF file was created with “Adobe Acrobat 10.1.16 Image Conversion Plug-in”. Its creation time is “2020:02:09 14:43:05”, and its modification time is “2020:02:09 14:43:41”.

Figure 5. The metadata of the PDF file.

Further digital analysis shows that the PDF file contains an image, which is extracted and saved as a “.JPG” file. The metadata of the image file is shown in Figure 6. Its production software is “Adobe Photoshop CC 2019 (Windows)”, and its modification time is “2020:02:09 14:42:47”. In addition, a “Photoshop” field is embedded in the metadata.

Figure 6. The metadata of the parsed image file in the PDF.
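
One way to see where such fields live is to walk the marker segments of the extracted JPEG. The Python sketch below is an illustration under stated assumptions rather than the ExifTool procedure used in the case: the sample image is a hypothetical byte string, and the walker stops at the start-of-scan marker and ignores padding bytes. Exif data conventionally sits in an APP1 segment, while Photoshop resource blocks sit in an APP13 segment that begins with “Photoshop 3.0”.

```python
import struct

MARKER_NAMES = {0xE0: "APP0", 0xE1: "APP1", 0xED: "APP13", 0xDB: "DQT",
                0xC0: "SOF0", 0xC4: "DHT", 0xDA: "SOS"}


def list_segments(jpeg: bytes) -> list:
    """Return (marker name, segment length, identifier) up to start of scan."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("missing JPEG start-of-image marker")
    segments = []
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = jpeg[pos + 1]
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        payload = jpeg[pos + 4:pos + 2 + length]
        ident = ""
        if 0xE0 <= marker <= 0xEF:  # application segments start with an ID string
            ident = payload.split(b"\x00", 1)[0][:32].decode("latin-1")
        segments.append((MARKER_NAMES.get(marker, "0x%02X" % marker), length, ident))
        if marker == 0xDA:          # compressed image data follows; stop here
            break
        pos += 2 + length
    return segments


def make_segment(marker: int, payload: bytes) -> bytes:
    """Build one marker segment; the length field counts itself (2 bytes)."""
    return b"\xff" + bytes([marker]) + struct.pack(">H", len(payload) + 2) + payload


# Hypothetical JPEG with an Exif segment and a Photoshop segment.
sample = (
    b"\xff\xd8"
    + make_segment(0xE1, b"Exif\x00\x00" + b"hypothetical tags")
    + make_segment(0xED, b"Photoshop 3.0\x00" + b"8BIM")
    + make_segment(0xDA, b"\x00")
    + b"scan data"
    + b"\xff\xd9"
)

segments = list_segments(sample)
for name, length, ident in segments:
    print(name, length, ident)
```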

Based on the above results, it can be inferred that the file was first produced by “EPSON Scan” and then resaved as an image with “Adobe Photoshop CC 2019 (Windows)”. After that, “Adobe Acrobat 10.1.16” was used to convert the image file into the final PDF document. Therefore, the statement that the PDF document was created directly by the scanner cannot be upheld.
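
The ordering of the three timestamps reported in this case can be checked with a short Python sketch. The timestamp values are taken from the text above; the function name `build_timeline` is illustrative, and the sketch assumes that all three values were recorded by the same clock in the same time zone, which the metadata shown here does not by itself establish.

```python
from datetime import datetime

STAMP_FORMAT = "%Y:%m:%d %H:%M:%S"  # the format in which the times are quoted


def build_timeline(events: dict) -> list:
    """Sort events by time and report the gap in seconds to the previous one."""
    parsed = sorted(
        (datetime.strptime(stamp, STAMP_FORMAT), label)
        for label, stamp in events.items()
    )
    timeline = []
    previous = None
    for moment, label in parsed:
        gap = None if previous is None else (moment - previous).total_seconds()
        timeline.append((label, moment, gap))
        previous = moment
    return timeline


case_events = {
    "PDF creation": "2020:02:09 14:43:05",
    "PDF modification": "2020:02:09 14:43:41",
    "image modification (Photoshop)": "2020:02:09 14:42:47",
}

timeline = build_timeline(case_events)
for label, moment, gap in timeline:
    print(moment.isoformat(), label, "" if gap is None else "(+%d s)" % gap)
```

Under that assumption, the image was last modified 18 seconds before the PDF was created, which is consistent with the inferred sequence of image editing followed by conversion to PDF.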

5. CONCLUSION

In this paper, we focus on the forensic authenticity examination of PDF documents. Their file structure and digital composition are studied. The key points for forensic authenticity examination of PDF documents created by scanning, by conversion from images, and by conversion from DOCX documents are discussed. A case study then illustrates these points in detail. As mentioned above, the wide use of PDF documents means that they are increasingly submitted as digital evidence in court, so their authenticity has to be determined. Our results offer some useful ideas for forensic authenticity examination of PDF documents and can serve as practical guidance for case examination.

ACKNOWLEDGMENTS

This work was supported by the Shanghai Science and Technology Commission Project (grant number 21DZ2200100), the Shanghai Forensic Service Platform (16DZ2290900), and the Ministry of Finance, PR China (grant numbers GY2021G-3 and GY2020G-8).

REFERENCES

[1] Castiglione, A., De Santis, A. and Soriente, C., “Security and privacy issues in the portable document format,” Journal of Systems and Software, 83 (10), 1813–1822 (2010). https://doi.org/10.1016/j.jss.2010.04.062

[2] Aminnezhad, A., Dehghantanha, A. and Abdullah, M. T., “A survey on privacy issues in digital forensics,” International Journal of Cyber-Security and Digital Forensics, 1 (4), 311–324 (2012).

[3] Khitan, S. J., Hadi, A. and Atoum, J., “PDF forensic analysis system using YARA,” International Journal of Computer Science and Network Security, 17 (5), 77–85 (2017).

[4] Chung, H., Park, J. and Lee, S., “Forensic analysis of residual information in Adobe PDF files,” in Future Information Technology, 100–109, Springer, Berlin, Heidelberg (2011).

[5] Alanazi, F. and Jones, A., “The value of metadata in digital forensics,” in Proc. IEEE 2015 European Intelligence and Security Informatics Conference, 182–182 (2015).

[6] Maiorca, D. and Biggio, B., “Digital investigation of PDF files: Unveiling traces of embedded malware,” IEEE Security & Privacy, 17 (1), 63–71 (2019). https://doi.org/10.1109/MSEC.2018.2875879

[7] Stevens, D., “Malicious PDF documents explained,” IEEE Security & Privacy, 9 (1), 80–82 (2011). https://doi.org/10.1109/MSP.2011.14

[8] Adhatarao, S. and Lauradoux, C., “Robust PDF files forensics using coding style,” in IFIP International Conference on ICT Systems Security and Privacy Protection, 179–195 (2022).

[9] Povar, D. and Bhadran, V. K., “Forensic data carving,” in Int. Conf. on Digital Forensics and Cyber Crime, 137–148 (2010).

[10] Sitompul, O. S., Handoko, A. and Rahmat, R. F., “File reconstruction in digital forensic,” Telkomnika, 16 (2), 776–794 (2018). https://doi.org/10.12928/telkomnika.v16i2

[11] The Information Center of Ministry of Justice PRC, “Technical specification for metadata examination of digital images,” (2020).

[12] Photographic Inspection Sub-Technical Committee of National Technical Committee on Criminal Technology of Standardization Administration, “Technical specification of digital image authenticity identification-Image authenticity judge,” (2010).

[13] Bureau of Forensic Expertise, Ministry of Justice PRC, “Technical specification for forensic authentication of images,” (2015).
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
KEYWORDS
Forensic science

Scanners

Image processing

Digital forensics

Data conversion

Focus stacking software

Information technology
