cache

This class implements a simple disk cache. It downloads requested files and stores them in a folder specified by the user. If a file is requested a second time, the class serves it directly from the file system. The cache path is derived from the URL of the file. For example, the file with the URL “https://www.sec.gov/Archives/edgar/data/320193/000032019318000100/aapl-20180630.xml” is stored in the disk cache at “D:/cache/www.sec.gov/Archives/edgar/data/320193/000032019318000100/aapl-20180630.xml”, where “D:/cache” is the caching directory specified by the user.
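The mapping from URL to cache path described above can be illustrated with a minimal sketch. The function name `sketch_url_to_path` is hypothetical and this is not the library's implementation; it only assumes that the scheme is dropped and that the host and URL path are appended to the cache directory.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def sketch_url_to_path(cache_dir: str, url: str) -> str:
    """Hypothetical helper: join host and URL path under cache_dir."""
    parsed = urlparse(url)
    # Drop the scheme ("https://"); keep the host and the path segments.
    relative = parsed.netloc + parsed.path
    # PurePosixPath keeps forward slashes on every platform.
    return str(PurePosixPath(cache_dir) / relative)


url = "https://www.sec.gov/Archives/edgar/data/320193/000032019318000100/aapl-20180630.xml"
print(sketch_url_to_path("D:/cache", url))
```

For an actual cache instance, the documented method url_to_path(url) is the way to obtain this path.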

The HTTP cache can also delay requests. This is highly recommended if you download XBRL submissions in batch! For that use case, this class also provides the method xbrl.cache.HttpCache.cache_edgar_enclosure().

The SEC also emphasizes that you should keep the load you place on the EDGAR servers as small as possible! See https://www.sec.gov/privacy.htm#security
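To make the idea of a minimum delay between requests concrete, here is a small sketch of a rate limiter. The class `MinDelay` is hypothetical and only illustrates the technique; how the cache enforces its delay internally is not described here.

```python
import time


class MinDelay:
    """Hypothetical helper: enforce a minimum gap (in ms) between calls."""

    def __init__(self, delay_ms: int = 500) -> None:
        self.delay_s = delay_ms / 1000
        self._last = None  # monotonic timestamp of the previous request

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        now = time.monotonic()
        if self._last is not None:
            remaining = self.delay_s - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


limiter = MinDelay(delay_ms=50)
first = limiter.wait()   # first request: nothing to wait for
second = limiter.wait()  # second request: sleeps for the rest of the 50 ms
```

Calling `wait()` before every download keeps consecutive requests at least `delay_ms` apart, which is the behaviour the delay parameter is meant to provide.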

Short note on enclosures:

The SEC provides zip folders that contain all XBRL-related files for a given submission, e.g. the Instance Document, the Extension Taxonomy and the Linkbases. Because zip compression is very effective on XBRL submissions, which naturally contain a lot of repeating text, it is far more efficient to download the zip folder and extract it. So if you want to do the SEC servers and your download time a favour, use this method for downloading the submission :). One way to get the zip enclosure URL is through the Structured Disclosure RSS Feeds provided by the SEC: https://www.sec.gov/structureddata/rss-feeds-submitted-filings
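The following sketch illustrates why downloading an enclosure pays off. It builds a small stand-in archive in memory instead of downloading a real one; the function `extract_enclosure` and the file names are made up for illustration and are not part of the library.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_enclosure(zip_bytes: bytes, target_dir: str) -> list:
    """Hypothetical helper: extract an in-memory zip and return its file names."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())


# Build a stand-in archive with highly repetitive XML-like content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("aapl-20180630.xml", "<xbrl>" + "<fact/>" * 1000 + "</xbrl>")
    archive.writestr("aapl-20180630.xsd", "<schema/>")

with tempfile.TemporaryDirectory() as tmp:
    names = extract_enclosure(buffer.getvalue(), tmp)
    extracted_size = sum((Path(tmp) / n).stat().st_size for n in names)

compressed_size = len(buffer.getvalue())
print(names, compressed_size, extracted_size)
```

The archive is much smaller than the extracted files because of the repeated markup, and all files of the submission arrive in a single request.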

class xbrl.cache.HttpCache(cache_dir, delay=500, verify_https=True)

Simple persistent HTTP cache. Requests files over HTTP and stores them in the cache. If the same file is requested again, it just returns the path of the cached file. Failed requests are retried automatically.

__init__(cache_dir, delay=500, verify_https=True)

Parameters

cache_dir (str) – Root directory of the disk cache (all requested files will be cached in this directory)
delay (int) – Minimum time in milliseconds between two requests
verify_https (bool) – Whether to validate SSL certificates. Set to False to disable validation, which can speed up requests (see https://github.com/manusimidt/py-xbrl/pull/57)

cache_edgar_enclosure(enclosure_url)

Downloads the ZIP folder, extracts it and stores the files in the cache.

Parameters: enclosure_url (str) – url to the zip folder.
Return type: str
Returns: relative path to extracted zip’s content

cache_file(file_url)

Caches a file in the http cache.

Parameters: file_url (str) – url (https link) to the file to be cached.
Return type: str
Returns: absolute path to the cached file

purge_file(file_url)

Removes a file from the cache

Parameters: file_url (str) – url (https link) to the file to be deleted.
Return type: bool
Returns: true if the file was deleted, false if it could not be found

set_connection_params(delay=500, retries=5, backoff_factor=0.8, logs=True)

Sets the connection parameters for all following requests

Parameters

delay (int) – Minimum time in milliseconds between two requests
retries (int) – Number of times a request is attempted before it is considered failed.
backoff_factor (float) – Used to calculate the time to sleep between failed requests. The formula used is: {backoff factor} * (2 ** ({number of total retries} - 1))
logs (bool) – enables or disables download logs

Return type

None
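As a worked example of the backoff formula documented for backoff_factor, the sketch below evaluates it for the default value 0.8 and the first five retries. The function name `backoff_sleep` is hypothetical, and the unit of the result is assumed to be seconds, which is not stated above.

```python
def backoff_sleep(backoff_factor: float, retry_number: int) -> float:
    """Evaluate backoff_factor * (2 ** (retry_number - 1)) for a 1-based retry."""
    return backoff_factor * (2 ** (retry_number - 1))


# Sleep times for retries 1 to 5 with the default backoff_factor of 0.8.
schedule = [backoff_sleep(0.8, n) for n in range(1, 6)]
print(schedule)
```

The sleep time doubles with every retry: 0.8, 1.6, 3.2, 6.4 and 12.8, so five failed attempts add up to 24.8 in total waiting time.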

set_headers(headers)

Sets the headers for all following requests

Parameters: headers (dict) – python dictionary with string key and value

Example header:

{
    "From": "pete.smith@example.com",
    "User-Agent" : "ExampleBot/1.0 (https.example.com/exampleBot)"
}

Return type: None
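A possible way to combine set_headers and set_connection_params before a batch download is sketched below. The helper `configure_cache` and the stand-in class `RecordingCache` are hypothetical; the stand-in only mirrors the two documented method signatures so that the snippet runs without network access. In real use, an xbrl.cache.HttpCache instance would be passed instead.

```python
def configure_cache(cache, contact_email: str, user_agent: str) -> None:
    """Hypothetical helper: identify the client and slow down requests."""
    cache.set_headers({"From": contact_email, "User-Agent": user_agent})
    # A delay of 1000 ms is an arbitrary, conservative choice for batch jobs.
    cache.set_connection_params(delay=1000, retries=5, backoff_factor=0.8, logs=True)


class RecordingCache:
    """Stand-in that records the calls; replace with a real HttpCache."""

    def __init__(self) -> None:
        self.headers = None
        self.params = None

    def set_headers(self, headers: dict) -> None:
        self.headers = headers

    def set_connection_params(self, delay=500, retries=5,
                              backoff_factor=0.8, logs=True) -> None:
        self.params = {"delay": delay, "retries": retries,
                       "backoff_factor": backoff_factor, "logs": logs}


demo = RecordingCache()
configure_cache(demo, "pete.smith@example.com", "ExampleBot/1.0")
print(demo.headers, demo.params)
```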

url_to_path(url)

Takes a url and converts it to the absolute local cache path

Parameters: url (str) – URL of the file whose cache path you want to know
Return type: str
Returns: absolute local cache path