cache
This class implements a simple disk cache. It downloads requested files and stores them in a folder specified by the user. If a file is requested a second time, the class serves it directly from the file system. The cache path is derived from the URL of the file. For example, the file with the URL "https://www.sec.gov/Archives/edgar/data/320193/000032019318000100/aapl-20180630.xml" will be stored in the disk cache at "D:/cache/www.sec.gov/Archives/edgar/data/320193/000032019318000100/aapl-20180630.xml", where "D:/cache" is the caching directory specified by the user.
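The URL-to-path mapping described above can be sketched with the standard library alone. This is a minimal illustration, not the library's own implementation: the helper name `sketch_url_to_path` is hypothetical, and it assumes the cache path is simply the cache directory followed by the URL's host and path.

```python
import posixpath
from urllib.parse import urlparse


def sketch_url_to_path(cache_dir: str, url: str) -> str:
    """Hypothetical helper: map a URL to a path below cache_dir.

    Assumption: the scheme ("https://") is dropped and the host name
    and URL path are appended to the cache directory unchanged.
    """
    parsed = urlparse(url)
    # netloc is the host ("www.sec.gov"); path starts with a slash.
    return posixpath.join(cache_dir.rstrip("/"), parsed.netloc, parsed.path.lstrip("/"))


example_url = (
    "https://www.sec.gov/Archives/edgar/data/320193/"
    "000032019318000100/aapl-20180630.xml"
)
print(sketch_url_to_path("D:/cache", example_url))
```

For the real mapping, use the `url_to_path(url)` method documented further down.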
The HTTP cache can also delay requests. This is highly recommended if you download XBRL submissions in batch!
The class also provides a method for this purpose: xbrl.cache.HttpCache.cache_edgar_enclosure().
The SEC also emphasizes that you should keep the server load you place on the EDGAR system as small as possible! https://www.sec.gov/privacy.htm#security
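To illustrate what a minimum delay between requests means in practice, here is a small sketch of a throttle. It is an assumption about the general technique, not the code used by HttpCache; the class name `MinDelayThrottle` is hypothetical, and the delay is given in milliseconds to match the `delay` parameter described on this page.

```python
import time


class MinDelayThrottle:
    """Hypothetical helper that enforces a minimum gap between calls."""

    def __init__(self, delay_ms: int = 500) -> None:
        self.delay_s = delay_ms / 1000
        self._last = None  # monotonic timestamp of the previous call

    def wait(self) -> float:
        """Sleep only as long as needed; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.delay_s - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


throttle = MinDelayThrottle(delay_ms=50)
first_wait = throttle.wait()   # nothing to wait for on the first call
second_wait = throttle.wait()  # sleeps for the rest of the 50 ms window
```

Calling `wait()` before each download would keep consecutive requests at least `delay_ms` apart, without adding delay when the previous request already took longer than that.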
Short note on enclosures:
The SEC provides ZIP folders that contain all XBRL-related files for a given submission, such as the instance document, the extension taxonomy, and the linkbases. Because ZIP compression is very effective on XBRL submissions, which naturally contain a lot of repeating text, it is far more efficient to download the ZIP folder and extract it. So if you want to do the SEC servers and your download time a favour, use this method for downloading the submission :). One way to get the ZIP enclosure URL is through the Structured Disclosure RSS Feeds provided by the SEC: https://www.sec.gov/structureddata/rss-feeds-submitted-filings
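The effect of ZIP compression on repetitive markup can be demonstrated with a small in-memory experiment. The XML line below is made up for illustration, and real filings are less uniform, so the ratio shown here is an upper-end example rather than a measurement of actual EDGAR enclosures.

```python
import io
import zipfile


def zipped_size(name: str, payload: bytes) -> int:
    """Return the size in bytes of a ZIP archive holding one file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, payload)
    return len(buffer.getvalue())


# Made-up, highly repetitive XBRL-like content (illustration only).
fact = b'<us-gaap:Assets contextRef="c1" unitRef="usd" decimals="-6">1000</us-gaap:Assets>\n'
payload = fact * 2000

raw_size = len(payload)
compressed_size = zipped_size("instance.xml", payload)
print(f"raw: {raw_size} bytes, zipped: {compressed_size} bytes")
```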
Parameters
- class xbrl.cache.HttpCache(cache_dir, delay=500, verify_https=True)
Simple persistent HTTP cache. Requests files over HTTP and stores them in the cache. If the same file is requested again, it just returns the path of the cached file. Also automatically retries when a request fails.
- __init__(cache_dir, delay=500, verify_https=True)
- Parameters
cache_dir (str) – Root directory of the disk cache (all requested files will be cached in this directory)
delay (int) – Minimum time in milliseconds between two requests
verify_https (bool) – Whether to validate SSL certificates. Set to False to disable certificate validation for a speed-up (see https://github.com/manusimidt/py-xbrl/pull/57)
- cache_edgar_enclosure(enclosure_url)
Downloads the ZIP folder, extracts it and stores the files in the cache.
- Parameters
enclosure_url (str) – URL of the ZIP folder.
- Return type
str
- Returns
relative path to the extracted ZIP's content
- cache_file(file_url)
Caches a file in the http cache.
- Parameters
file_url (str) – URL (https link) of the file to be cached.
- Return type
str
- Returns
absolute path to the cached file
- purge_file(file_url)
Removes a file from the cache
- Parameters
file_url (str) – URL (https link) of the file to be deleted.
- Return type
bool
- Returns
True if the file was deleted, False if it could not be found
- set_connection_params(delay=500, retries=5, backoff_factor=0.8, logs=True)
Sets the connection parameters for all following requests
- Parameters
delay (int) – Minimum time in milliseconds between two requests
retries (int) – Number of times a request is tried before it is considered failed.
backoff_factor (float) – Used to compute the time to sleep between failed requests. The formula used is: {backoff factor} * (2 ** ({number of total retries} - 1))
logs (bool) – Enables or disables download logs
- Return type
None
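As a worked example of the backoff formula given for set_connection_params, the sketch below applies it literally to the default values (backoff_factor=0.8, retries=5). The function name `backoff_sleep_times` is hypothetical, and the underlying HTTP library may treat the first retry differently or cap the sleep time, so read these numbers as the formula's output rather than guaranteed behaviour.

```python
def backoff_sleep_times(backoff_factor: float = 0.8, retries: int = 5) -> list:
    """Apply {backoff factor} * (2 ** ({number of total retries} - 1)).

    Returns the sleep time in seconds before retry 1, 2, ..., retries.
    """
    return [backoff_factor * (2 ** (attempt - 1)) for attempt in range(1, retries + 1)]


sleep_times = backoff_sleep_times()
print(sleep_times)  # [0.8, 1.6, 3.2, 6.4, 12.8]
```

Each additional retry doubles the wait, so five retries with the defaults add up to roughly 25 seconds of sleeping before the request is given up.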
- set_headers(headers)
Sets the headers for all following requests
- Parameters
headers (dict) – Python dictionary with string keys and values
Example header:
{ "From": "pete.smith@example.com", "User-Agent" : "ExampleBot/1.0 (https.example.com/exampleBot)" }
- Return type
None
- url_to_path(url)
Takes a url and converts it to the absolute local cache path
- Parameters
url (str) – URL of the file whose cache path you want to know
- Return type
str
- Returns
absolute local cache path