Posted 5 December, 2022 by Dan Sharp
Configuration > Spider > Extraction > PDF
Store PDF
This option saves PDFs to disk during a crawl. The stored PDFs can be bulk exported via ‘Bulk Export > Web > All PDF Documents’, or their text content alone can be exported as .txt files via ‘Bulk Export > Web > All PDF Content’.
When PDFs are stored, each PDF can be viewed in the ‘Rendered Page’ tab, and its text content can be viewed in the ‘View Source’ tab with the ‘Visible Content’ filter.
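Once the text content has been exported, the .txt files can be inspected with a short script. The following is only a minimal sketch, assuming the export was saved to a single folder of UTF-8 .txt files; the function names and the folder layout are illustrative and are not part of the SEO Spider.

```python
from pathlib import Path


def count_words(text: str) -> int:
    """Count whitespace-separated tokens in a block of text."""
    return len(text.split())


def summarise_pdf_text(folder: str) -> dict[str, int]:
    """Map each .txt file name found directly in `folder` to its word count."""
    counts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" avoids a crash on unexpected byte sequences.
        text = path.read_text(encoding="utf-8", errors="replace")
        counts[path.name] = count_words(text)
    return counts
```

Calling `summarise_pdf_text()` with the path of the export folder returns a dictionary of file names and word counts, which can help spot PDFs with little or no extractable text. The counts come from simple whitespace splitting, so they may differ from any word count reported by the SEO Spider.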
Extract PDF Properties
By default, the PDF title and keywords are extracted. They appear in the ‘Title’ and ‘Meta Keywords’ columns in the Internal tab of the SEO Spider.
Google converts the PDF to HTML, using the PDF title as the title element and the keywords as meta keywords, although it doesn’t use meta keywords in scoring.
Enabling ‘Extract PDF Properties’ also extracts the following additional properties:
- Subject
- Author
- Creation Date
- Modification Date
- Page Count
- Word Count
These properties are displayed as additional columns in the Internal tab.
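If the Internal tab is exported to CSV, the extra columns can be checked in bulk, for example to find PDFs with empty properties. The sketch below is illustrative only: the property column names follow the list above, while the ‘Address’ and ‘Content Type’ column names and the function name are assumptions about the export and should be adjusted to match the actual file header.

```python
import csv

# Property columns as named in the list above.
PDF_PROPERTY_COLUMNS = [
    "Subject",
    "Author",
    "Creation Date",
    "Modification Date",
    "Page Count",
    "Word Count",
]


def missing_properties(lines):
    """Return {address: [empty property columns]} for PDF rows with gaps.

    `lines` is any iterable of CSV lines, such as an open file object.
    """
    report = {}
    for row in csv.DictReader(lines):
        # Assumed column: "Content Type" holds the MIME type of each URL.
        if "pdf" not in (row.get("Content Type") or "").lower():
            continue
        empty = [
            col for col in PDF_PROPERTY_COLUMNS
            if not (row.get(col) or "").strip()
        ]
        if empty:
            # Assumed column: "Address" holds the URL.
            report[row.get("Address", "")] = empty
    return report
```

To run it against a real export, open the CSV with `open(path, newline="", encoding="utf-8")` and pass the file object to `missing_properties()`.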