Posted 5 December, 2022 by Dan Sharp
Configuration > Spider > Extraction > PDF
Store PDF
This allows you to save PDFs to disk during a crawl. The stored PDFs can be bulk exported via ‘Bulk Export > Web > All PDF Documents’, or their text content alone can be exported as .txt files via ‘Bulk Export > Web > All PDF Content’.
When PDFs are stored, each PDF can be viewed in the ‘Rendered Page’ tab, and its text content can be viewed in the ‘View Source’ tab and ‘Visible Content’ filter.
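As an illustration of working with the ‘All PDF Content’ export, the following Python sketch walks a folder of exported .txt files and reports a rough word count for each. The folder layout, the function name `summarise_pdf_text_exports` and the whitespace-based counting are assumptions for the example, and the counts will not necessarily match the SEO Spider's own figures.

```python
from pathlib import Path


def summarise_pdf_text_exports(export_dir):
    """Return {file name: word count} for every .txt file in export_dir.

    export_dir is a hypothetical folder that holds the .txt files
    produced by the 'All PDF Content' bulk export.
    """
    counts = {}
    for txt_path in sorted(Path(export_dir).glob("*.txt")):
        # errors="replace" keeps the loop going if a file contains
        # bytes that are not valid UTF-8.
        text = txt_path.read_text(encoding="utf-8", errors="replace")
        # Whitespace-separated tokens: an approximation only.
        counts[txt_path.name] = len(text.split())
    return counts
```

Calling `summarise_pdf_text_exports("pdf_content")` on such a folder returns a dictionary that can be sorted to spot PDFs with little or no extractable text.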
Extract PDF Properties
By default, the PDF title and keywords are extracted. They appear in the ‘Title’ and ‘Meta Keywords’ columns in the Internal tab of the SEO Spider.
Google will convert the PDF to HTML and use the PDF title as the title element and the keywords as meta keywords, although it doesn’t use meta keywords in scoring.
By enabling ‘Extract PDF properties’, the following additional properties will be extracted:
- Subject
- Author
- Creation Date
- Modification Date
- Page Count
- Word Count
These new columns are displayed in the Internal tab.
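If the Internal tab is exported to CSV, these columns can be inspected outside the tool. The sketch below is a minimal example under stated assumptions: the export is a UTF-8 CSV, URLs sit in an ‘Address’ column, the MIME type sits in a ‘Content Type’ column, and the PDF property columns use the names listed above. The function name `pdf_rows` is hypothetical; check the header names against an actual export.

```python
import csv

# Property names taken from the list above; the exact CSV headers
# are an assumption and may differ in a real export.
PDF_COLUMNS = [
    "Subject",
    "Author",
    "Creation Date",
    "Modification Date",
    "Page Count",
    "Word Count",
]


def pdf_rows(csv_path):
    """Return one dict per PDF row found in an exported Internal tab CSV."""
    rows = []
    # utf-8-sig tolerates a byte order mark at the start of the file.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            # Assumed column: 'Content Type' holding the MIME type.
            content_type = (row.get("Content Type") or "").lower()
            if "pdf" not in content_type:
                continue
            # Assumed column: 'Address' holding the URL.
            record = {"Address": row.get("Address") or ""}
            for column in PDF_COLUMNS:
                # Missing columns or empty cells become empty strings.
                record[column] = row.get(column) or ""
            rows.append(record)
    return rows
```

Values are returned as strings exactly as they appear in the file, so numeric fields such as ‘Page Count’ need converting before any arithmetic.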
Extract Link Text
When this setting is enabled, the SEO Spider will attempt to locate the text associated with links within PDFs. When it is disabled, the anchor text columns for those links will be blank.
The anchor text can be viewed in the lower Outlinks (and Inlinks) tabs for each link.
Depending on the format of the PDF, link text extraction can be inaccurate, slow and memory intensive.