http://www.basicsbehind.com/extract-text-webpage/
http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
https://jsoup.org/
https://uhack-guide.readthedocs.io/en/latest/technical/scraping/
https://blog.ouseful.info/2015/02/09/getting-text-of-anything-docs-pdfs-images-using-apache-tika/
https://tika.apache.org/1.20/examples.html