RUN:
    export HERITRIX_HOME=/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT
    $HERITRIX_HOME/bin/heritrix -a admin:admin
Visit: https://localhost:8443/engine

MORE READING:
http://open-s.com/en/content/heritrix-configuration
http://open-s.com/en/content/wget
https://superuser.com/questions/655346/wget-execute-script-after-download
https://www.gnu.org/software/wget/manual/wget.html
    ‘--execute command’
    Execute command as if it were a part of .wgetrc (see Startup File). A command thus invoked will be executed after the commands in .wgetrc, thus taking precedence over them. If you need to specify more than one wgetrc command, use multiple instances of ‘-e’.
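A minimal wget sketch of the "multiple instances of ‘-e’" point above; the particular wgetrc commands (robots=off, adjust_extension=on) and the example.org URL are just illustrative assumptions, not anything taken from the pages above:

    # Each -e/--execute passes one wgetrc command; repeat -e for each additional command.
    # robots=off ignores robots.txt (use politely); adjust_extension=on appends .html to saved pages where needed.
    wget -e robots=off -e adjust_extension=on \
         --recursive --level=2 --wait=1 \
         http://example.org/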
"crawler TRAPS":
https://www.contentkingapp.com/academy/crawler-traps/
https://www.billhartzer.com/internet-marketing/crawl-thousands-urls/

-----------------------------------------
Scope DecideReject Configuration rules
-----------------------------------------
Issues in H3.docx
DOCX: https://sbforge.org/download/attachments/21856421/Issues%20in%20H3.docx?version=1&modificationDate=1465912557659&api=v2&usg=AOvVaw03EWO6Xy0XiMRITvFwR2v0
[ https://webcache.googleusercontent.com/search?q=cache:bEM1AdjQR2cJ:https://sbforge.org/download/attachments/21856421/Issues%2520in%2520H3.docx%3Fversion%3D1%26modificationDate%3D1465912557659%26api%3Dv2+&cd=10&hl=en&ct=clnk&gl=nz&client=ubuntu ]

Speculative Hops
No real effect from using maxTransHops or maxSpeculativeHops:
"By using maxTransHops and maxSpeculativeHops we thought that we could manage how long our discovery path ‘X’ should be, but we see different results and we still harvest several ‘XX’ or more. To accomplish this we use HopsPathMatchesDecideRule, but we haven't found a specific path that we can say with certainty we want excluded from our harvest. We tried to make a regex with R, E and X. Does anyone have experience with this?"

Path seeds
Harvesting specific URI paths that don't end with slashes:
"When adding path seeds to only harvest from a specific place on a domain, we sometimes have problems with redirections. We always end path seeds with a slash, but sometimes we are redirected, e.g. HTTP 301 http://www.bt.dk/plus/ → http://www.bt.dk/plus. If we have http://www.bt.dk/plus as a seed, will we harvest the whole site, because H3 harvests from the last slash in the seed? Any experience with path seeds?"

--------------------------------------------------------------------------------
TODO:
https://www.stat.auckland.ac.nz/~paul/Reports/maori/maori.html
The macron is the only accent required for written Māori and the accent can only be applied to vowels, so the full set of accented characters are:
    lower case a, with macron  ā        upper case A, with macron  Ā
    lower case e, with macron  ē        upper case E, with macron  Ē
    lower case i, with macron  ī        upper case I, with macron  Ī
    lower case o, with macron  ō        upper case O, with macron  Ō
    lower case u, with macron  ū        upper case U, with macron  Ū

http://emacs.1067599.n8.nabble.com/Entering-vowels-with-macrons-td72136.html
    Ctrl+\ then type rfc1345 (and Enter)
    Type &a- to get a-macron.
    Then Ctrl+\ to toggle back to the default input.
    (Can thereafter toggle with Ctrl+\ to get back to the rfc1345 input method.)

https://sachachua.com/blog/2011/04/writing-macrons-linux-latin-pronunciation/
    To add macrons: Ctrl-\ "latin-alt-postfix".
    But it doesn't have all the macronised vowels used in te reo. Then Ctrl-\ to get the default input method back.
https://www.gnu.org/software/emacs/manual/html_node/emacs/Select-Input-Method.html

DOWNLOAD HERITRIX:
BINARY: http://builds.archive.org/maven2/org/archive/heritrix/heritrix/3.4.0-SNAPSHOT/
CODE, STATIC: https://github.com/internetarchive/heritrix3
https://github.com/internetarchive/heritrix3/wiki/How%20To%20Crawl

WEB CURATOR TOOL:
http://dia-nz.github.io/webcurator/
https://webcuratortool.readthedocs.io/en/latest/guides/quick-start-guide.html
https://webcuratortool.readthedocs.io/en/latest/guides/overview-history.html

PLAN: Crawl as much of the nz domains as we can, run the language detection for Māori on the pages that come through, and only save those pages. Configure it (somehow) so that it knows it doesn't need to redownload pages already *inspected* for language (not just that it detects it doesn't need to redownload the *stored* pages, since we only store mri-language pages and not all pages inspected). Then break up the pages into sentences using our SentenceDetector model.

SURT urls: http://crawler.archive.org/apidocs/org/archive/util/SURT.html

WARC
https://en.wikipedia.org/wiki/Web_ARChive
Q: "The harvested material is captured in ARC/WARC format which has strong storage and archiving characteristics." at https://webcuratortool.readthedocs.io/en/latest/guides/user-manual.html

https://blogs.loc.gov/thesignal/2013/11/anatomy-of-a-web-archive/
Nicholas Taylor, November 13, 2013 at 2:49 pm:
"Hi Ross, thanks for the comment. The tools for personal archiving of web pages and websites to WARC format are getting better, with the capture side further along than the replay side. Archive Ready (http://archiveready.com/) and WARCreate (http://warcreate.com/) can both be used to create a WARC containing all of the objects that make up an individual web page. GNU Wget 1.14+ (http://www.archiveteam.org/index.php?title=Wget_with_WARC_output) and WAIL (http://matkelly.com/wail/) can both be used to capture entire websites to WARC. WAIL also bundles a standalone Wayback Machine that runs locally, which is the easiest way I know of for users to view the content they’ve collected in WARC format."
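A quick sketch of the GNU Wget 1.14+ WARC output mentioned in the comment above; the filename prefix and the example.org URL are placeholders:

    # Mirror one level of a site and also record the raw HTTP traffic into a WARC.
    wget --recursive --level=1 --wait=1 \
         --warc-file=example-crawl --warc-cdx \
         http://example.org/
    # Produces example-crawl.warc.gz plus a CDX index (example-crawl.cdx),
    # which a locally-running Wayback Machine (e.g. the one bundled with WAIL) can replay.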
https://webcuratortool.readthedocs.io/en/latest/guides/user-manual.html
Doesn't mention https.
"How targets work
Targets consist of several important elements, including a name and description for internal use; a set of Seed URLs, ******a web harvester profile that controls the behaviour of the web crawler during the harvest******, one or more schedules that specify when the Target will be harvested, and (optionally) a set of descriptive metadata for the Target."

Harvester Configuration section:
"The remaining tabs Pre-fetchers, Fetchers, Extractors, Writers, and Post-Processors are a series of processors that a URI passes through when it is crawled."

http://crawler.archive.org/articles/user_manual/config.html
Look for post-process*. Found under: 6.1.3. Processing Chains

6.1.2. Frontier
The Frontier is a pluggable module that maintains the internal state of the crawl: what URIs have been discovered, crawled, etc. As such, its selection greatly affects, for instance, the order in which discovered URIs are crawled. There is only one Frontier per crawl job. Multiple Frontiers are provided with Heritrix, each of a particular character.

6.1.2.1. BdbFrontier
The default Frontier in Heritrix as of 1.4.0 and later is the BdbFrontier (previously, the default was the Section 6.1.2.2, “HostQueuesFrontier”). The BdbFrontier visits URIs and sites discovered in a generally breadth-first manner; it offers configuration options controlling how it throttles its activity against particular hosts, and whether it has a bias towards finishing hosts in progress ('site-first' crawling) or cycling among all hosts with pending URIs. Discovered URIs are only crawled once, except that robots.txt and DNS information can be configured so that it is refreshed at specified intervals for each host. The main difference between the BdbFrontier and its precursor, Section 6.1.2.2, “HostQueuesFrontier”, is that BdbFrontier uses BerkeleyDB Java Edition to shift more running Frontier state to disk.

6.1.2.2. HostQueuesFrontier
The forerunner of the Section 6.1.2.1, “BdbFrontier”. Now deprecated, mostly because its custom disk-based data structures could not move as much Frontier state out of main memory as the BerkeleyDB Java Edition approach. Has the same general characteristics as the Section 6.1.2.1, “BdbFrontier”.

https://webcuratortool.readthedocs.io/en/latest/guides/quick-start-guide.html
You can use OpenWayback to view harvests from within WCT; see the wiki on the WCT Github page:
https://github.com/DIA-NZ/webcurator/wiki/Wayback-Integration

https://webarchive.jira.com/wiki/spaces/Heritrix/overview

https://github.com/internetarchive/heritrix3/wiki/Heritrix3
"Unlike with previous releases, the web control interface is only made available via secure-socket HTTPS, and corresponding to this change the default port has changed to 8443. Additionally, unless you supply a compatible keystore via the new optional '-s' command-line switch, an 'ad-hoc' keystore with a new locally-generated SSL-capable certificate will be created (and then reused on future launches). To then contact the web interface from a browser running on the same machine, visit the URL: https://localhost:8443/ "

LOCALHOST WITH HTTPS is possible??? vs https://letsencrypt.org/docs/certificates-for-localhost/
"For local development
If you’re developing a web app, it’s useful to run a local web server like Apache or Nginx, and access it via http://localhost:8000/ in your web browser. However, web browsers behave in subtly different ways on HTTP vs HTTPS pages. The main difference: On an HTTPS page, any requests to load JavaScript from an HTTP URL will be blocked. So if you’re developing locally using HTTP, you might add a script tag that works fine on your development machine, but breaks when you deploy to your HTTPS production site. To catch this kind of problem, it’s useful to set up HTTPS on your local web server. However, you don’t want to see certificate warnings all the time. How do you get the green lock locally? The best option: Generate your own certificate, either self-signed or signed by a local root, and trust it in your operating system’s trust store. Then use that certificate in your local web server. See below for details."
[Googled: https certificate localhost
https://www.freecodecamp.org/news/how-to-get-https-working-on-your-local-development-environment-in-5-minutes-7af615770eec/ ]
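Relating to the "generate your own certificate" option above, a minimal self-signed localhost certificate sketch; the file names and 365-day lifetime are arbitrary choices, and browsers will still warn unless the certificate is added to the OS/browser trust store:

    # Self-signed cert + key for https://localhost on a local web server.
    # Not needed for Heritrix itself, which creates its own ad-hoc keystore on first launch
    # (or accepts one via the '-s' switch mentioned in the wiki quote above).
    openssl req -x509 -newkey rsa:2048 -sha256 -days 365 -nodes \
        -subj "/CN=localhost" \
        -keyout localhost.key -out localhost.crt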
https://github.com/internetarchive/heritrix3/wiki/Heritrix%20Output#HeritrixOutput-WARCfiles
* source-report.txt
This report contains a line item for each host, which includes the seed from which the host was reached.
Note: the sourceTagSeeds property of the TextSeedModule bean must be set to true for this report to be generated.

* WARC files
Assuming you are using the WARC writer that comes with Heritrix, a number of WARC files will be generated containing crawled content. You can specify the storage location of WARC files by setting the directory value of the WARCWriterProcessor bean.

https://github.com/internetarchive/heritrix3/wiki/Archiving%20Rich-Media%20Content
Large File Sizes
Rich-media content, such as Flash and video, is usually much larger than standard text/html pages. Crawling such content requires large investments in storage and bandwidth. To mitigate these issues, deduplication is recommended for rich-media crawls. Deduplication detects previously collected content that is redundant and skips the download of such content. Pointers to the duplicate content allow it to appear in subsequent crawls. For details see Configuring Heritrix for Deduplication.

Excessive Memory and CPU Usage
Downloading rich-media content can often cause excessive load to be placed on the crawling computer's memory and CPU. For example, extracting links from Flash and other rich-media resources requires extensive data parsing, which is CPU intensive. Atypical input patterns can also cause excessive CPU usage when regular expressions used by Heritrix are run. It is therefore recommended that rich-media crawls be allocated more memory and CPU than "normal" crawls. The memory allocated to Heritrix is set from the command line. The following example shows the command line option to allocate 1 GB of memory to Heritrix, which should be sufficient for most rich-media crawls.
    export JAVA_OPTS=-Xmx1024M
Multi-core processors are also recommended for rich-media crawls.

Streaming media and Social Networking Sites
Many social networking sites make use of rich-media to enhance their user-experience. For specific guidelines on archiving social media sites see Archiving Social Networking Sites with Archive-It. These instructions apply to the Archive-It application, which is built on top of Heritrix.
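Putting the memory advice above together with the RUN commands at the top of these notes; the 2 GB figure is just an illustrative bump over the wiki's 1 GB example, not a recommendation from the wiki:

    # Give Heritrix more heap before launching, e.g. for a rich-media crawl.
    export HERITRIX_HOME=/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT
    export JAVA_OPTS=-Xmx2048M
    $HERITRIX_HOME/bin/heritrix -a admin:admin
    # then visit https://localhost:8443/engine as usual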
Q: https://github.com/internetarchive/heritrix3/wiki/Avoiding%20Too%20Much%20Dynamic%20Content
"To allow both foo.org and www.foo.org to be captured, you could add two seeds: http://www.foo.org/ and http://foo.org/. To allow every subdomain of foo.org to be crawled, you can add the seed http://foo.org. Note the absence of a trailing slash."
(Does the latter encompass both of the former?)

Delete the TransclusionDecideRule, since this rule has the potential to lead Heritrix onto another host. For example, if a URI returns a 301 (moved permanently) or 302 (found) response code as well as a URI that contains a different host name than the seeds, Heritrix would accept this URI using the TransclusionDecideRule. Removing this rule will keep Heritrix from straying off of our www.foo.org host.
...
Alternately, you can add the MatchesFilePatternDecideRule. Set usePresetPattern to CUSTOM and set the regexp to something like:
    .*foo\.org(?!/calendar).*|.*foo\.org/calendar?year=200[56].*

https://github.com/internetarchive/heritrix3/wiki/Mirroring%20HTML%20Files%20Only
Mirroring HTML Files Only
Suppose you only want to crawl URIs that match http://foo.org/bar/*.html. Also, you would like to save the crawled files in a file/directory format instead of saving them in WARC files. Also, assume the web server is case-sensitive. For example, http://foo.org/bar/abc.html and http://foo.org/bar/ABC.HTML point to two different resources. !!
[If Heritrix needs to] be configured to differentiate between abc.html and ABC.HTML, do this by removing the LowercaseRule from the canonicalizationPolicy bean.

https://github.com/internetarchive/heritrix3/wiki/Only%20Store%20Successful%20HTML%20Pages

https://github.com/internetarchive/heritrix3/wiki/Jobs
[Multiple URLs to crawl can be specified. But what is the separator?]

Look up: Spring framework, Spring beans
https://localhost:8443/engine/job/pinky/jobdir/crawler-beans.cxml?format=textedit
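A sketch for pulling the pinky job's crawler-beans.cxml down on the command line rather than through the textedit view above. It assumes the engine is running with the admin:admin login from the RUN section, that the self-signed certificate is skipped with -k, that the REST API expects digest authentication (as the Heritrix API docs describe), and that requesting the jobdir path without the ?format=textedit query returns the raw file:

    # Fetch the Spring configuration of the 'pinky' job for inspection or backup.
    curl -k --digest -u admin:admin \
         -o crawler-beans.cxml \
         "https://localhost:8443/engine/job/pinky/jobdir/crawler-beans.cxml"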
https://github.com/internetarchive/heritrix3/wiki/Fetch%20Chain%20Processors
fetchHttp
This processor fetches HTTP URIs. As of Heritrix 3.1, the crawler will now properly decode 'chunked' Transfer-Encoding -- even if encountered when it should not be used, as in a response to an HTTP/1.0 request. Additionally, the fetchHttp processor now includes the parameter 'useHTTP11', which if true, will cause Heritrix to report its requests as 'HTTP/1.1'. This allows sites to use the 'chunked' Transfer-Encoding. (The default for this parameter is false for now, and Heritrix still does not reuse a persistent connection for more than one request to a site.) fetchHttp also includes the parameter 'acceptCompression', which if true, will cause Heritrix requests to include an "Accept-Encoding: gzip,deflate" header, which offers to receive compressed responses. (The default for this parameter is false for now.)

extractorHttp
This processor extracts outlinks from HTTP headers. As of Heritrix 3.1, the extractorHttp processor now considers any URI on a hostname to imply that the '/favicon.ico' from the same host should be fetched. Also, as of Heritrix 3.1, the "inferRootPage" property has been added to the extractorHttp bean. If this property is "true", Heritrix infers the '/' root page from any other URI on the same hostname. The default for this setting is "false", which means the pre-3.1 behavior of only fetching the root page if it is a seed or otherwise discovered and in-scope remains in effect. Discovery via these new heuristics is considered to be a new 'I' (inferred) hop-type, and is treated the same in scoping/transclusion decisions as an 'E' (embed).

https://github.com/internetarchive/heritrix3/wiki/Processor%20Settings
fetchHttp:
    timeoutSeconds - This setting determines how long an HTTP request will wait for a resource to respond. This setting should be set to a high value.
    defaultEncoding - The character encoding to use for files that do not have one specified in the HTTP response headers. The default is ISO-8859-1.
    soTimeoutMs - If the socket is unresponsive for this number of milliseconds, the request is cancelled. Setting the value to zero (no timeout) is not recommended as it could hang a thread on an unresponsive server. This timeout is used to time out socket opens and socket reads. Make sure this value is less than timeoutSeconds for optimal configuration. This ensures at least one retry read.
    sendIfModifiedSince - Send the If-Modified-Since header, if previous Last-Modified fetch history information is available in URI history.
    sendIfNoneMatch - Send the If-None-Match header, if previous Etag fetch history information is available in URI history.
    sendConnectionClose - Send a Connection: close header with every request. (See the w3.org Connection header documentation.)
    sendRange - Send the Range header when there is a limit on the retrieved document size. This is for politeness purposes. The Range header states that only the first n bytes are of interest. It is only pertinent if maxLengthBytes is greater than zero. Sending the Range header results in a 206 Partial Content status response, which is better than cutting the response mid-download. On rare occasions, sending the Range header will generate a 416 Requested Range Not Satisfiable response.
    acceptHeaders - Accept headers to include in each request. Each must be the complete header, e.g., Accept-Language: en.
    (mi is the 2-letter code for Māori, see https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
    https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept

ExtractorHtml:
    extractJavascript - If true, in-page Javascript is scanned for strings that appear to be URIs. This typically finds both valid and invalid URIs. Attempts to fetch the invalid URIs can generate webmaster concern over odd crawler behavior. Default is true.
    extractValueAttributes - If true, strings that look like URIs found in unusual places (such as form VALUE attributes) will be extracted. This typically finds both valid and invalid URIs. Attempts to fetch the invalid URIs may generate webmaster concerns over odd crawler behavior. Default is true.
    ignoreFormActionUrls - If true, URIs appearing as the ACTION attribute in HTML FORMs are ignored. Default is false.
    extractOnlyFormGets - If true, only ACTION URIs with a METHOD of GET (explicit or implied) are extracted. Default is true.

candidates:
    seedsRedirectNewSeeds - If enabled, any URI found because a seed redirected to it (original seed returned 301 or 302) will also be treated as a seed.

https://github.com/internetarchive/heritrix3/wiki/Statistics%20Tracking
Statistics Tracking
Any number of statistics tracking modules can be attached to a crawl. Currently only one is provided with Heritrix. The statisticsTracker Spring bean that comes with Heritrix creates the progress-statistics.log file and provides the WUI with data to display progress information about the crawl. It is strongly recommended that any crawl run through the WUI use this bean.

https://github.com/internetarchive/heritrix3/wiki/Configuring-Crawl-Scope-Using-DecideRules

------------
REST API: https://heritrix.readthedocs.io/en/latest/api.html
Execute Script in Job
POST https://(heritrixhost):8443/engine/job/(jobname)/script
Executes a script. The script can be written as Beanshell, ECMAScript, Groovy, or AppleScript.
https://github.com/beanshell/beanshell
https://github.com/internetarchive/heritrix3/wiki/BeanShell%20Script%20For%20Downloading%20Video
https://github.com/internetarchive/heritrix3/wiki/Heritrix3-Useful-Scripts
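A sketch of calling the script endpoint above with curl; it assumes digest auth with the admin:admin login from the RUN section, the 'pinky' job name used earlier, 'engine' and 'script' as the form parameter names (as in the REST API docs), and that rawOut is one of the variables bound in the scripting console (as the Useful Scripts wiki examples suggest):

    # POST a one-line Groovy script to the 'pinky' job's scripting console.
    curl -k --digest -u admin:admin \
         -d engine=groovy \
         --data-urlencode 'script=rawOut.println("hello from the scripting console")' \
         "https://localhost:8443/engine/job/pinky/script"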
-----------
LOGGING
https://github.com/internetarchive/heritrix3/wiki/Configuring%20Crawl%20Scope%20Using%20DecideRules
"DecideRuleSequence Logging
Enable FINEST logging on the class org.archive.crawler.deciderules.DecideRuleSequence to watch each DecideRule's evaluation of the processed URI. This can be done in the logging.properties file

logging.properties
    org.archive.modules.deciderules.DecideRuleSequence.level = FINEST

in conjunction with the -Dsysprop VM argument
    -Djava.util.logging.config.file=/path/to/heritrix3/dist/src/main/conf/logging.properties "

I couldn't get the above logging instructions to work as given, but here's what I did.

a. I modified conf/logging.properties by adding:

    # PINKY
    # DecideRuleSequence Logging
    # https://github.com/internetarchive/heritrix3/wiki/Configuring-Crawl-Scope-Using-DecideRules
    org.archive.modules.deciderules.DecideRuleSequence.level = FINEST

b. I opened the bin/heritrix script and edited the following -D flag into the two places that invoke $JAVACMD:

    CLASSPATH=${CP} nohup $JAVACMD -Djava.util.logging.config.file=/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT/conf/logging.properties -Dheritrix.home=${HERITRIX_HOME}
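For reference, a possibly less invasive variant of step b (untested): since bin/heritrix already picks up JAVA_OPTS (that is how the -Xmx memory setting quoted earlier is passed in), the same -D argument can presumably be supplied that way instead of editing the launch script:

    # Point java.util.logging at the edited conf/logging.properties without touching bin/heritrix.
    export HERITRIX_HOME=/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT
    export JAVA_OPTS="-Djava.util.logging.config.file=$HERITRIX_HOME/conf/logging.properties"
    $HERITRIX_HOME/bin/heritrix -a admin:admin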