Module

class Module(base: str | Path, ensure_exists: bool = True)[source]

Bases: object

The class wrapping the directory lookup implementation.

Initialize the module.

Parameters:
  • base – The base directory for the module

  • ensure_exists – Should the base directory be created automatically? Defaults to true.
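As a rough illustration of what wrapping a directory lookup involves, the sketch below joins subkeys onto a base directory and creates it when asked. `MiniModule` and its `join` method are hypothetical stand-ins written with only the standard library; they are an assumption about the general idea, not the implementation of `Module`.

```python
import tempfile
from pathlib import Path


class MiniModule:
    """Hypothetical stand-in for a directory-wrapping module (not pystow's code)."""

    def __init__(self, base, ensure_exists=True):
        self.base = Path(base)
        if ensure_exists:
            # Create the base directory (and any parents) if it is missing.
            self.base.mkdir(parents=True, exist_ok=True)

    def join(self, *subkeys, name=None):
        """Join subkeys (and optionally a file name) onto the base directory."""
        directory = self.base.joinpath(*subkeys)
        directory.mkdir(parents=True, exist_ok=True)
        return directory / name if name else directory


root = Path(tempfile.mkdtemp())
module = MiniModule(root / "demo")
path = module.join("v1", "raw", name="data.tsv")
```

Only directories are created here; the file at `path` does not exist until something writes it.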

Methods Summary

dump_df(*subkeys, name, obj[, sep, index, ...])

Dump a dataframe to a TSV file with pandas.

dump_json(*subkeys, name, obj[, ...])

Dump an object to a file with json.

dump_pickle(*subkeys, name, obj[, mode, ...])

Dump an object to a file with pickle.

dump_rdf(*subkeys, name, obj[, format, ...])

Dump an RDF graph to a file with rdflib.

dump_xml(*subkeys, name, obj[, open_kwargs, ...])

Dump an XML element tree to a file with lxml.

ensure(*subkeys, url[, name, version, ...])

Ensure a file is downloaded.

ensure_csv(*subkeys, url[, name, force, ...])

Download a CSV and open as a dataframe with pandas.

ensure_custom(*subkeys, name[, force])

Ensure a file is present, and run a custom create function otherwise.

ensure_excel(*subkeys, url[, name, force, ...])

Download an Excel file and open as a dataframe with pandas.

ensure_from_google(*subkeys, name, file_id)

Ensure a file is downloaded from Google Drive.

ensure_from_s3(*subkeys, s3_bucket, s3_key)

Ensure a file is downloaded from S3.

ensure_gunzip(*subkeys, url[, name, force, ...])

Ensure a tar.gz file is downloaded and unarchived.

ensure_json(*subkeys, url[, name, force, ...])

Download JSON and open with json.

ensure_json_bz2(*subkeys, url[, name, ...])

Download BZ2-compressed JSON and open with json.

ensure_open(-> Generator[StringIO, None, None])

Ensure a file is downloaded and open it.

ensure_open_bz2(*subkeys, url[, name, ...])

Ensure a BZ2-compressed file is downloaded and open a file inside it.

ensure_open_gz(-> Generator[StringIO, None, ...)

Ensure a gzipped file is downloaded and open a file inside it.

ensure_open_lzma(...)

Ensure an LZMA-compressed file is downloaded and open a file inside it.

ensure_open_sqlite(*subkeys, url[, name, ...])

Ensure and connect to a SQLite database.

ensure_open_sqlite_gz(*subkeys, url[, name, ...])

Ensure and connect to a SQLite database that's gzipped.

ensure_open_tarfile(...)

Ensure a tar file is downloaded and open a file inside it.

ensure_open_zip(...)

Ensure a file is downloaded then open it with zipfile.

ensure_pickle(*subkeys, url[, name, force, ...])

Download a pickle file and open with pickle.

ensure_pickle_gz(*subkeys, url[, name, ...])

Download a gzipped pickle file and open with pickle.

ensure_rdf(*subkeys, url[, name, force, ...])

Download an RDF file and open with rdflib.

ensure_soup(*subkeys, url[, name, version, ...])

Ensure a webpage is downloaded and parsed with BeautifulSoup.

ensure_tar_df(*subkeys, url, inner_path[, ...])

Download a tar file and open an inner file as a dataframe with pandas.

ensure_tar_xml(*subkeys, url, inner_path[, ...])

Download a tar file and open an inner file as an XML with lxml.

ensure_untar(*subkeys, url[, name, ...])

Ensure a tar file is downloaded and unarchived.

ensure_xml(*subkeys, url[, name, force, ...])

Download an XML file and open it with lxml.

ensure_yaml(*subkeys, url[, name, force, ...])

Download YAML and open with yaml.

ensure_zip_df(*subkeys, url, inner_path[, ...])

Download a zip file and open an inner file as a dataframe with pandas.

ensure_zip_np(*subkeys, url, inner_path[, ...])

Download a zip file and open an inner file as an array-like with numpy.

from_key(key, *subkeys[, ensure_exists])

Get a module for the given directory or one of its subdirectories.

join(-> tuple[~pathlib.Path, ...)

Get a subdirectory of the current module.

joinpath_sqlite(*subkeys, name)

Get an SQLite database connection string.

load_df(*subkeys, name[, read_csv_kwargs])

Open a pre-existing CSV as a dataframe with pandas.

load_json(*subkeys, name[, open_kwargs, ...])

Open a JSON file with json.

load_pickle(*subkeys, name[, mode, ...])

Open a pickle file with pickle.

load_pickle_gz(*subkeys, name[, mode, ...])

Open a gzipped pickle file with pickle.

load_rdf(*subkeys[, name, parse_kwargs])

Open an RDF file with rdflib.

load_xml(*subkeys, name[, parse_kwargs])

Load an XML file with lxml.

load_yaml(*subkeys, name[, open_kwargs, ...])

Open a YAML file with yaml.

module(*subkeys[, ensure_exists])

Get a module for a subdirectory of the current module.

open(-> Generator[StringIO, None, None])

Open a file.

open_gz(-> Generator[StringIO, None, None])

Open a gzipped file that exists already.

Methods Documentation

dump_df(*subkeys: str, name: str, obj: pd.DataFrame, sep: str = '\t', index: bool = False, to_csv_kwargs: Mapping[str, Any] | None = None) None[source]

Dump a dataframe to a TSV file with pandas.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to write to

  • obj – The dataframe to dump

  • sep – The separator to use, defaults to a tab

  • index – Should the index be dumped? Defaults to false.

  • to_csv_kwargs – Keyword arguments to pass through to pandas.DataFrame.to_csv().

dump_json(*subkeys: str, name: str, obj: Any, open_kwargs: Mapping[str, Any] | None = None, json_dump_kwargs: Mapping[str, Any] | None = None) None[source]

Dump an object to a file with json.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to open

  • obj – The object to dump

  • open_kwargs – Additional keyword arguments passed to open()

  • json_dump_kwargs – Keyword arguments to pass through to json.dump().
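The two keyword mappings target different calls: `open_kwargs` goes to `open()` and `json_dump_kwargs` goes to `json.dump()`. The sketch below mimics that split with the standard library only. `dump_json_sketch` is a hypothetical helper that takes a ready-made path, and the exact way the real method opens the file is an assumption.

```python
import json
import tempfile
from pathlib import Path


def dump_json_sketch(path, obj, open_kwargs=None, json_dump_kwargs=None):
    """Write obj as JSON, forwarding each mapping to its own call (illustrative)."""
    with open(path, "w", **(open_kwargs or {})) as file:
        json.dump(obj, file, **(json_dump_kwargs or {}))


path = Path(tempfile.mkdtemp()) / "config.json"
dump_json_sketch(
    path,
    {"b": 1, "a": "é"},
    open_kwargs={"encoding": "utf-8"},  # consumed by open()
    json_dump_kwargs={"indent": 2, "sort_keys": True, "ensure_ascii": False},
)
text = path.read_text(encoding="utf-8")
```

Passing `ensure_ascii=False` together with an explicit encoding keeps non-ASCII characters readable in the output file.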

dump_pickle(*subkeys: str, name: str, obj: Any, mode: Literal['wb'] = 'wb', open_kwargs: Mapping[str, Any] | None = None, pickle_dump_kwargs: Mapping[str, Any] | None = None) None[source]

Dump an object to a file with pickle.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to open

  • obj – The object to dump

  • mode – The write mode, passed to open()

  • open_kwargs – Additional keyword arguments passed to open()

  • pickle_dump_kwargs – Keyword arguments to pass through to pickle.dump().

dump_rdf(*subkeys: str, name: str, obj: rdflib.Graph, format: str = 'turtle', serialize_kwargs: Mapping[str, Any] | None = None) None[source]

Dump an RDF graph to a file with rdflib.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to open

  • obj – The object to dump

  • format – The format to dump in

  • serialize_kwargs – Keyword arguments to pass through to rdflib.Graph.serialize().

dump_xml(*subkeys: str, name: str, obj: lxml.etree.ElementTree, open_kwargs: Mapping[str, Any] | None = None, write_kwargs: Mapping[str, Any] | None = None) None[source]

Dump an XML element tree to a file with lxml.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to open

  • obj – The object to dump

  • open_kwargs – Additional keyword arguments passed to open()

  • write_kwargs – Keyword arguments to pass through to lxml.etree.ElementTree.write().

ensure(*subkeys: str, url: str, name: str | None = None, version: None | str | Callable[[], str | None] = None, force: bool = False, download_kwargs: DownloadKwargs | None = None) Path[source]

Ensure a file is downloaded.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • version

    The optional version, or no-argument callable that returns an optional version. This is prepended before the subkeys.

    The following example describes how to store the versioned data from the Rhea database for biologically relevant chemical reactions.

    import pystow
    import requests
    
    def get_rhea_version() -> str:
        res = requests.get("https://ftp.expasy.org/databases/rhea/rhea-release.properties")
        _, _, version = res.text.splitlines()[0].partition("=")
        return version
    
    module = pystow.module("rhea")
    path = module.ensure(
        url="ftp://ftp.expasy.org/databases/rhea/rdf/rhea.rdf.gz",
        version=get_rhea_version,
    )
    

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

Returns:

The path of the file that has been downloaded (or already exists)
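The interplay of `version`, `subkeys`, `name`, and `force` can be sketched without any network access. In the sketch below, `ensure_sketch` and `fake_fetch` are hypothetical names, the download is replaced by a stub, and deriving the file name from the last URL segment is an assumption made for illustration.

```python
import tempfile
from pathlib import Path


def ensure_sketch(base, *subkeys, url, fetch, name=None, version=None, force=False):
    """Return a cached path, calling fetch(url, path) only when needed."""
    if callable(version):
        version = version()  # resolve a no-argument callable to a string or None
    # The version, when present, is prepended before the subkeys.
    parts = ([version] if version else []) + list(subkeys)
    directory = Path(base).joinpath(*parts)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / (name or url.rstrip("/").rsplit("/", 1)[-1])
    if force or not path.exists():
        fetch(url, path)
    return path


calls = []


def fake_fetch(url, path):
    """Stand-in for a real download so the sketch runs offline."""
    calls.append(url)
    Path(path).write_text("payload")


base = Path(tempfile.mkdtemp())
url = "https://example.org/files/data.txt"
first = ensure_sketch(base, "raw", url=url, fetch=fake_fetch, version=lambda: "1.0")
second = ensure_sketch(base, "raw", url=url, fetch=fake_fetch, version="1.0")
third = ensure_sketch(base, "raw", url=url, fetch=fake_fetch, version="1.0", force=True)
```

The second call finds the file already in place and skips the fetch; the third call repeats it because `force=True`.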

ensure_csv(*subkeys: str, url: str, name: str | None = None, force: bool = False, version: VersionHint = None, download_kwargs: DownloadKwargs | None = None, read_csv_kwargs: Mapping[str, Any] | None = None) pd.DataFrame[source]

Download a CSV and open as a dataframe with pandas.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • version – The optional version, or no-argument callable that returns an optional version. This is prepended before the subkeys.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • read_csv_kwargs – Keyword arguments to pass through to pandas.read_csv().

Returns:

A pandas DataFrame

ensure_custom(*subkeys: str, name: str, force: bool = False, provider: Callable[[...], None], **kwargs: Any) Path[source]

Ensure a file is present, and run a custom create function otherwise.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The file name.

  • force – Should the file be re-created, even if the path already exists?

  • provider – The file provider. Will be run with the path as the first positional argument, if the file needs to be generated.

  • kwargs – Additional keyword-based parameters passed to the provider.

Returns:

The path of the file that has been created (or already exists)

Raises:

ValueError – If the provider was called but the file was not created by it.
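The provider contract described above (path as first positional argument, extra keyword arguments forwarded, `ValueError` when nothing is written) might look like the following. `ensure_custom_sketch`, `write_greeting`, and `lazy_provider` are hypothetical names, and the sketch takes a plain directory instead of subkeys.

```python
import tempfile
from pathlib import Path


def ensure_custom_sketch(directory, name, provider, force=False, **kwargs):
    """Run provider(path, **kwargs) if the file is missing or force is set."""
    path = Path(directory) / name
    if force or not path.exists():
        provider(path, **kwargs)
        if not path.exists():
            raise ValueError(f"provider did not create {path}")
    return path


def write_greeting(path, greeting="hello"):
    """Example provider: receives the path first, then keyword arguments."""
    path.write_text(greeting)


def lazy_provider(path):
    """Example of a faulty provider that never writes the file."""


directory = Path(tempfile.mkdtemp())
created = ensure_custom_sketch(directory, "greeting.txt", write_greeting, greeting="hi")
# The file already exists, so the provider is not run a second time.
cached = ensure_custom_sketch(directory, "greeting.txt", write_greeting, greeting="bye")

try:
    ensure_custom_sketch(directory, "missing.txt", lazy_provider)
except ValueError:
    failed = True
else:
    failed = False
```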

ensure_excel(*subkeys: str, url: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, read_excel_kwargs: Mapping[str, Any] | None = None) pd.DataFrame[source]

Download an Excel file and open as a dataframe with pandas.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • read_excel_kwargs – Keyword arguments to pass through to pandas.read_excel().

Returns:

A pandas DataFrame

ensure_from_google(*subkeys: str, name: str, file_id: str, force: bool = False, download_kwargs: Mapping[str, Any] | None = None) Path[source]

Ensure a file is downloaded from Google Drive.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file

  • file_id – The file identifier of the Google file. If your share link is https://drive.google.com/file/d/1AsPPU4ka1Rc9u-XYMGWtvV65hF3egi0z/view, then your file ID is 1AsPPU4ka1Rc9u-XYMGWtvV65hF3egi0z.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download_from_google().

Returns:

The path of the file that has been downloaded (or already exists)
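If only a share link is at hand, the file ID can be pulled out of it before calling this method. The helper below, `file_id_from_share_link`, is hypothetical and assumes links shaped like the example in the `file_id` description; other Google link formats are not handled.

```python
import re


def file_id_from_share_link(link):
    """Extract the ID from a link shaped like https://drive.google.com/file/d/<id>/view."""
    match = re.search(r"/file/d/([^/?#]+)", link)
    if match is None:
        raise ValueError(f"could not find a file ID in {link!r}")
    return match.group(1)


file_id = file_id_from_share_link(
    "https://drive.google.com/file/d/1AsPPU4ka1Rc9u-XYMGWtvV65hF3egi0z/view"
)
```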

ensure_from_s3(*subkeys: str, s3_bucket: str, s3_key: str | Sequence[str], name: str | None = None, client: botocore.client.BaseClient | None = None, client_kwargs: Mapping[str, Any] | None = None, download_file_kwargs: Mapping[str, Any] | None = None, force: bool = False) Path[source]

Ensure a file is downloaded from S3.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • s3_bucket – The S3 bucket name

  • s3_key – The S3 key name

  • name – Overrides the name of the file at the end of the S3 key, if given.

  • client – A botocore client. If none given, one will be created automatically

  • client_kwargs – Keyword arguments to be passed to the client on instantiation.

  • download_file_kwargs – Keyword arguments to be passed to boto3.s3.transfer.S3Transfer.download_file()

  • force – Should the download be done again, even if the path already exists? Defaults to false.

Returns:

The path of the file that has been downloaded (or already exists)

ensure_gunzip(*subkeys: str, url: str, name: str | None = None, force: bool = False, autoclean: bool = True, download_kwargs: DownloadKwargs | None = None) Path[source]

Ensure a tar.gz file is downloaded and unarchived.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • autoclean – Should the gzipped file be deleted after it is decompressed?

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

Returns:

The path of the directory where the file that has been downloaded gets extracted to

ensure_json(*subkeys: str, url: str, name: str | None = None, force: bool = False, version: None | str | Callable[[], str | None] = None, download_kwargs: DownloadKwargs | None = None, open_kwargs: Mapping[str, Any] | None = None, json_load_kwargs: Mapping[str, Any] | None = None) Any[source]

Download JSON and open with json.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • version – The optional version, or no-argument callable that returns an optional version. This is prepended before the subkeys.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • open_kwargs – Additional keyword arguments passed to open()

  • json_load_kwargs – Keyword arguments to pass through to json.load().

Returns:

A JSON object (list, dict, etc.)

ensure_json_bz2(*subkeys: str, url: str, name: str | None = None, force: bool = False, version: None | str | Callable[[], str | None] = None, download_kwargs: DownloadKwargs | None = None, open_kwargs: Mapping[str, Any] | None = None, json_load_kwargs: Mapping[str, Any] | None = None) Any[source]

Download BZ2-compressed JSON and open with json.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • version – The optional version, or no-argument callable that returns an optional version. This is prepended before the subkeys.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • open_kwargs – Additional keyword arguments passed to bz2.open()

  • json_load_kwargs – Keyword arguments to pass through to json.load().

Returns:

A JSON object (list, dict, etc.)
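Once the file is on disk, reading it comes down to `bz2.open()` followed by `json.load()`. The sketch below shows that step in isolation with a locally created file. `load_json_bz2_sketch` is a hypothetical helper, and opening in text mode (`"rt"`) is an assumption rather than a statement about the real method.

```python
import bz2
import json
import tempfile
from pathlib import Path

path = Path(tempfile.mkdtemp()) / "records.json.bz2"

# Create a small BZ2-compressed JSON file to stand in for a download.
with bz2.open(path, "wt", encoding="utf-8") as file:
    json.dump([{"id": 1}, {"id": 2}], file)


def load_json_bz2_sketch(path, open_kwargs=None, json_load_kwargs=None):
    """Open with bz2.open() and parse with json.load(), forwarding both mappings."""
    with bz2.open(path, mode="rt", **(open_kwargs or {})) as file:
        return json.load(file, **(json_load_kwargs or {}))


records = load_json_bz2_sketch(path, open_kwargs={"encoding": "utf-8"})
```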

ensure_open(*subkeys: str, url: str, name: str | None, version: VersionHint = None, force: bool, download_kwargs: DownloadKwargs | None, mode: Literal['r', 'rt', 'w', 'wt'] = 'r', open_kwargs: Mapping[str, Any] | None) Generator[StringIO, None, None][source]
ensure_open(*subkeys: str, url: str, name: str | None, version: VersionHint = None, force: bool, download_kwargs: DownloadKwargs | None, mode: Literal['rb', 'wb'] = 'r', open_kwargs: Mapping[str, Any] | None) Generator[BytesIO, None, None]

Ensure a file is downloaded and open it.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • version – The optional version, or no-argument callable that returns an optional version. This is prepended before the subkeys.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • mode – The read mode, passed to open()

  • open_kwargs – Additional keyword arguments passed to open()

Yields:

An open file object
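The two signatures differ only in `mode`: text modes yield a text stream and binary modes yield a bytes stream, and the `Generator[..., None, None]` return type suggests use in a `with` statement. A minimal sketch of that pattern follows; `open_sketch` is a hypothetical helper that skips the download step entirely.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def open_sketch(path, mode="r", open_kwargs=None):
    """Yield an open file object and close it when the with-block exits."""
    with open(path, mode, **(open_kwargs or {})) as file:
        yield file


path = Path(tempfile.mkdtemp()) / "notes.txt"
path.write_bytes(b"first line\nsecond line\n")

with open_sketch(path, mode="r", open_kwargs={"encoding": "utf-8"}) as file:
    text_line = file.readline()  # str in text mode

with open_sketch(path, mode="rb") as file:
    bytes_line = file.readline()  # bytes in binary mode
```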

ensure_open_bz2(*subkeys: str, url: str, name: str | None = None, force: bool = False, version: None | str | Callable[[], str | None] = None, download_kwargs: DownloadKwargs | None = None, mode: Literal['rb'] = 'rb', open_kwargs: Mapping[str, Any] | None = None) Generator[BZ2File, None, None][source]

Ensure a BZ2-compressed file is downloaded and open a file inside it.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • version – The optional version, or no-argument callable that returns an optional version. This is prepended before the subkeys.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • mode – The read mode, passed to bz2.open()

  • open_kwargs – Additional keyword arguments passed to bz2.open()

Yields:

An open file object

ensure_open_gz(*subkeys: str, url: str, name: str | None, force: bool, download_kwargs: DownloadKwargs | None, mode: Literal['r', 'w', 'rt', 'wt'] = 'rb', open_kwargs: Mapping[str, Any] | None) Generator[StringIO, None, None][source]
ensure_open_gz(*subkeys: str, url: str, name: str | None, force: bool, download_kwargs: DownloadKwargs | None, mode: Literal['rb', 'wb'] = 'rb', open_kwargs: Mapping[str, Any] | None) Generator[BytesIO, None, None]

Ensure a gzipped file is downloaded and open a file inside it.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • version – The optional version, or no-argument callable that returns an optional version. This is prepended before the subkeys.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • mode – The read mode, passed to gzip.open()

  • open_kwargs – Additional keyword arguments passed to gzip.open()

Yields:

An open file object

ensure_open_lzma(*subkeys: str, url: str, name: str | None, force: bool, download_kwargs: DownloadKwargs | None, mode: Literal['r', 'w', 'rt', 'wt'] = 'rt', open_kwargs: Mapping[str, Any] | None) Generator[io.TextIOWrapper[lzma.LZMAFile], None, None][source]
ensure_open_lzma(*subkeys: str, url: str, name: str | None, force: bool, download_kwargs: DownloadKwargs | None, mode: Literal['rb', 'wb'] = 'rt', open_kwargs: Mapping[str, Any] | None) Generator[lzma.LZMAFile, None, None]

Ensure an LZMA-compressed file is downloaded and open a file inside it.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • mode – The read mode, passed to lzma.open()

  • open_kwargs – Additional keyword arguments passed to lzma.open()

Yields:

An open file object

ensure_open_sqlite(*subkeys: str, url: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None) Generator[Connection, None, None][source]

Ensure and connect to a SQLite database.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

Yields:

An instance of sqlite3.Connection from sqlite3.connect()

Example usage:

    >>> import pystow
    >>> import pandas as pd
    >>> url = "https://s3.amazonaws.com/bbop-sqlite/obi.db"
    >>> sql = "SELECT * FROM entailed_edge LIMIT 10"
    >>> module = pystow.module("test")
    >>> with module.ensure_open_sqlite(url=url) as conn:
    ...     df = pd.read_sql(sql, conn)

ensure_open_sqlite_gz(*subkeys: str, url: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None) Generator[Connection, None, None][source]

Ensure and connect to a SQLite database that’s gzipped.

Unfortunately, reading gzipped SQLite files directly is a paid feature, so this method automatically gunzips the file first.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

Yields:

An instance of sqlite3.Connection from sqlite3.connect()

Example usage:

    >>> import pystow
    >>> import pandas as pd
    >>> url = "https://s3.amazonaws.com/bbop-sqlite/hp.db.gz"
    >>> module = pystow.module("test")
    >>> sql = "SELECT * FROM entailed_edge LIMIT 10"
    >>> with module.ensure_open_sqlite_gz(url=url) as conn:
    ...     df = pd.read_sql(sql, conn)
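The gunzip-then-connect step can be tried offline with a locally built database. The sketch below uses only the standard library; the file names and the `edge` table are made up for illustration, and where the real method stores the decompressed copy is not shown here.

```python
import gzip
import shutil
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

directory = Path(tempfile.mkdtemp())
original = directory / "source.db"
gz_path = directory / "example.db.gz"
db_path = directory / "example.db"

# Build a tiny database and gzip it to stand in for a downloaded .db.gz file.
with closing(sqlite3.connect(str(original))) as conn:
    conn.execute("CREATE TABLE edge (subject TEXT, object TEXT)")
    conn.execute("INSERT INTO edge VALUES ('a', 'b')")
    conn.commit()
with open(original, "rb") as src, gzip.open(gz_path, "wb") as dst:
    shutil.copyfileobj(src, dst)

# Decompress to a regular file first, then connect to that file.
with gzip.open(gz_path, "rb") as src, open(db_path, "wb") as dst:
    shutil.copyfileobj(src, dst)
with closing(sqlite3.connect(str(db_path))) as conn:
    rows = conn.execute("SELECT subject, object FROM edge").fetchall()
```

`contextlib.closing` is used because a `sqlite3.Connection` used directly as a context manager manages transactions but does not close the connection.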

ensure_open_tarfile(*subkeys: str, url: str, inner_path: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, mode: Literal['rt'] = 'r', open_kwargs: Mapping[str, Any] | None = None) Generator[TextIO, None, None][source]
ensure_open_tarfile(*subkeys: str, url: str, inner_path: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, mode: Literal['r', 'rb'] = 'r', open_kwargs: Mapping[str, Any] | None = None) Generator[IO[bytes], None, None]

Ensure a tar file is downloaded and open a file inside it.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • inner_path – The relative path to the file inside the archive

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • mode – The read mode, passed to tarfile.open()

  • open_kwargs – Additional keyword arguments passed to tarfile.open()

Yields:

An open file object

ensure_open_zip(*subkeys: str, url: str, inner_path: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, mode: Literal['r', 'rb'] = 'r', open_kwargs: Mapping[str, Any] | None = None) Generator[BinaryIO, None, None][source]
ensure_open_zip(*subkeys: str, url: str, inner_path: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, mode: Literal['rt'] = 'r', open_kwargs: Mapping[str, Any] | None = None) Generator[TextIO, None, None]

Ensure a file is downloaded then open it with zipfile.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • inner_path – The relative path to the file inside the archive

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • mode – The read mode, passed to zipfile.open(). Defaults to bytes mode for r and w.

  • zipfile_kwargs – Additional keyword arguments passed to zipfile.ZipFile

  • open_kwargs – Additional keyword arguments passed to zipfile.open()

Yields:

An open file object
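The note that `r` defaults to bytes mode mirrors the standard library: `zipfile.ZipFile.open()` returns a binary stream, so text access needs a decoding wrapper. The sketch below demonstrates this on a locally built archive; the archive and member names are invented for illustration, and how the real method implements its `rt` mode is an assumption.

```python
import io
import tempfile
import zipfile
from pathlib import Path

archive = Path(tempfile.mkdtemp()) / "bundle.zip"

# Build a small archive to stand in for a downloaded zip file.
with zipfile.ZipFile(archive, "w") as zip_file:
    zip_file.writestr("data/readme.txt", "hello from the archive\n")

with zipfile.ZipFile(archive) as zip_file:
    # ZipFile.open() yields bytes.
    with zip_file.open("data/readme.txt", "r") as binary_file:
        raw = binary_file.read()
    # Wrapping the binary stream gives decoded text.
    with zip_file.open("data/readme.txt", "r") as binary_file:
        with io.TextIOWrapper(binary_file, encoding="utf-8") as text_file:
            text = text_file.read()
```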

ensure_pickle(*subkeys: str, url: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, mode: Literal['rb'] = 'rb', open_kwargs: Mapping[str, Any] | None = None, pickle_load_kwargs: Mapping[str, Any] | None = None) Any[source]

Download a pickle file and open with pickle.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • mode – The read mode, passed to open()

  • open_kwargs – Additional keyword arguments passed to open()

  • pickle_load_kwargs – Keyword arguments to pass through to pickle.load().

Returns:

Any object

ensure_pickle_gz(*subkeys: str, url: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, mode: Literal['rb'] = 'rb', open_kwargs: Mapping[str, Any] | None = None, pickle_load_kwargs: Mapping[str, Any] | None = None) Any[source]

Download a gzipped pickle file and open with pickle.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • mode – The read mode, passed to gzip.open()

  • open_kwargs – Additional keyword arguments passed to gzip.open()

  • pickle_load_kwargs – Keyword arguments to pass through to pickle.load().

Returns:

Any object
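The loading half of this method amounts to `gzip.open()` followed by `pickle.load()`. A minimal offline sketch is shown below; `load_pickle_gz_sketch` is a hypothetical helper and the sample file is created locally. Because unpickling can execute arbitrary code, this pattern is only appropriate for files from a trusted source.

```python
import gzip
import pickle
import tempfile
from pathlib import Path

path = Path(tempfile.mkdtemp()) / "model.pkl.gz"

# Create a small gzipped pickle to stand in for a download.
with gzip.open(path, "wb") as file:
    pickle.dump({"weights": [0.1, 0.2]}, file)


def load_pickle_gz_sketch(path, mode="rb", open_kwargs=None, pickle_load_kwargs=None):
    """Open with gzip.open() and load with pickle.load(), forwarding both mappings."""
    with gzip.open(path, mode=mode, **(open_kwargs or {})) as file:
        return pickle.load(file, **(pickle_load_kwargs or {}))


obj = load_pickle_gz_sketch(path)
```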

ensure_rdf(*subkeys: str, url: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, precache: bool = True, parse_kwargs: Mapping[str, Any] | None = None) rdflib.Graph[source]

Download an RDF file and open with rdflib.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • precache – Should the parsed rdflib.Graph be stored as a pickle for fast loading?

  • parse_kwargs – Keyword arguments to pass through to pystow.utils.read_rdf() and transitively to rdflib.Graph.parse().

Returns:

An RDF graph
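The `precache` option trades disk space for load time: the parsed object is pickled once and reloaded on later calls instead of being parsed again. The sketch below shows the general pattern with a stub parser so it runs without rdflib. `load_with_precache`, `slow_parse`, and the cache file location are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def load_with_precache(path, parse, precache=True):
    """Parse path, caching the parsed object as a pickle next to the source file."""
    cache_path = path.with_suffix(path.suffix + ".pickle")
    if precache and cache_path.exists():
        with open(cache_path, "rb") as file:
            return pickle.load(file)
    obj = parse(path)
    if precache:
        with open(cache_path, "wb") as file:
            pickle.dump(obj, file)
    return obj


parse_calls = []


def slow_parse(path):
    """Stand-in for an expensive parser; here it just splits lines into triples."""
    parse_calls.append(path)
    return [tuple(line.split()) for line in path.read_text().splitlines()]


source = Path(tempfile.mkdtemp()) / "graph.nt"
source.write_text("s1 p1 o1\ns2 p2 o2\n")
first = load_with_precache(source, slow_parse)
second = load_with_precache(source, slow_parse)  # served from the pickle
```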

ensure_soup(*subkeys: str, url: str, name: str | None = None, version: VersionHint = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, mode: Literal['r', 'rt', 'w', 'wt'] | Literal['rb', 'wb'] = 'r', open_kwargs: Mapping[str, Any] | None = None, beautiful_soup_kwargs: Mapping[str, Any] | None = None) bs4.BeautifulSoup[source]

Ensure a webpage is downloaded and parsed with BeautifulSoup.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • version – The optional version, or no-argument callable that returns an optional version. This is prepended before the subkeys.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • mode – The read mode, passed to open()

  • open_kwargs – Additional keyword arguments passed to open()

  • beautiful_soup_kwargs – Additional keyword arguments passed to BeautifulSoup

Returns:

A BeautifulSoup object

Note

If you don’t need to cache, consider using pystow.utils.get_soup() instead.

ensure_tar_df(*subkeys: str, url: str, inner_path: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, read_csv_kwargs: Mapping[str, Any] | None = None) pd.DataFrame[source]

Download a tar file and open an inner file as a dataframe with pandas.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • inner_path – The relative path to the file inside the archive

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • read_csv_kwargs – Keyword arguments to pass through to pandas.read_csv().

Returns:

A dataframe

Warning

If you have lots of files to read in the same archive, it’s better to extract the archive first.

ensure_tar_xml(*subkeys: str, url: str, inner_path: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, parse_kwargs: Mapping[str, Any] | None = None) lxml.etree.ElementTree[source]

Download a tar file and open an inner file as an XML with lxml.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • inner_path – The relative path to the file inside the archive

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • parse_kwargs – Keyword arguments to pass through to lxml.etree.parse().

Returns:

An ElementTree object

Warning

If you have lots of files to read in the same archive, it’s better to extract the archive first (for example, with ensure_untar()).

ensure_untar(*subkeys: str, url: str, name: str | None = None, directory: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, extract_kwargs: Mapping[str, Any] | None = None) Path[source]

Ensure a tar file is downloaded and unarchived.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • directory – Overrides the name of the directory into which the tar archive is extracted. If none given, will use the stem of the file name that gets downloaded.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • extract_kwargs – Keyword arguments to pass to tarfile.TarFile.extractall().

Returns:

The path of the directory into which the downloaded archive is extracted
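The "extract once, then reuse" behavior described above can be sketched with the standard library alone. This is an illustration and not pystow's code: untar_once is a hypothetical helper, the download step is left out, and the default directory name is assumed to be the file name up to its first dot.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import Optional


def untar_once(archive: Path, directory: Optional[str] = None, force: bool = False) -> Path:
    """Extract ``archive`` next to itself, skipping the work if already done."""
    # Assumption: the default directory name is the file name up to the first dot
    target = archive.parent / (directory or archive.name.split(".")[0])
    if target.is_dir() and not force:
        return target  # already extracted, nothing to do
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar_file:
        tar_file.extractall(target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "readme.txt").write_text("hello", encoding="utf-8")
    demo = root / "bundle.tar.gz"
    with tarfile.open(demo, "w:gz") as tar_file:
        tar_file.add(root / "readme.txt", arcname="readme.txt")

    first = untar_once(demo)
    second = untar_once(demo)  # returns early because the directory exists
    extracted_name = first.name
    extracted_text = (first / "readme.txt").read_text(encoding="utf-8")
    same_directory = first == second
```

Passing force=True in this sketch re-extracts into the same directory, mirroring the role of the force parameter documented above.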

ensure_xml(*subkeys: str, url: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, parse_kwargs: Mapping[str, Any] | None = None) lxml.etree.ElementTree[source]

Download an XML file and open it with lxml.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • parse_kwargs – Keyword arguments to pass through to lxml.etree.parse().

Returns:

An ElementTree object

Warning

If you have lots of files to read in the same archive, it’s better just to unzip first.

ensure_yaml(*subkeys: str, url: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, open_kwargs: Mapping[str, Any] | None = None, yaml_load_kwargs: Mapping[str, Any] | None = None) Any[source]

Download YAML and open with yaml.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • open_kwargs – Additional keyword arguments passed to open()

  • yaml_load_kwargs – Keyword arguments to pass through to yaml.safe_load().

Returns:

The parsed YAML content (list, dict, etc.)

ensure_zip_df(*subkeys: str, url: str, inner_path: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, read_csv_kwargs: Mapping[str, Any] | None = None) pd.DataFrame[source]

Download a zip file and open an inner file as a dataframe with pandas.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • inner_path – The relative path to the file inside the archive

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • read_csv_kwargs – Keyword arguments to pass through to pandas.read_csv().

Returns:

A pandas DataFrame
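As a rough picture of how an inner file is read from a zip archive, the following sketch uses only the standard library. The helper read_zip_csv is hypothetical, the archive is created locally rather than downloaded, and csv stands in for pandas.read_csv(), so this is not pystow's implementation.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path


def read_zip_csv(archive: Path, inner_path: str, **reader_kwargs) -> list:
    """Read one CSV member of a zip archive as a list of rows."""
    with zipfile.ZipFile(archive) as zip_file:
        # ZipFile.open() yields a binary stream for the named member
        with zip_file.open(inner_path) as member:
            text = io.TextIOWrapper(member, encoding="utf-8", newline="")
            return list(csv.reader(text, **reader_kwargs))


with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zip_file:
        zip_file.writestr("tables/scores.csv", "name,score\nada,3\n")

    # inner_path is the member name relative to the archive root
    zip_rows = read_zip_csv(demo_zip, "tables/scores.csv")
```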

ensure_zip_np(*subkeys: str, url: str, inner_path: str, name: str | None = None, force: bool = False, download_kwargs: DownloadKwargs | None = None, load_kwargs: Mapping[str, Any] | None = None) numpy.typing.ArrayLike[source]

Download a zip file and open an inner file as an array-like with numpy.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • url – The URL to download.

  • inner_path – The relative path to the file inside the archive

  • name – Overrides the name of the file at the end of the URL, if given. Also useful for URLs that don’t have proper filenames with extensions.

  • force – Should the download be done again, even if the path already exists? Defaults to false.

  • download_kwargs – Keyword arguments to pass through to pystow.utils.download().

  • load_kwargs – Additional keyword arguments that are passed through to read_zip_np() and transitively to numpy.load().

Returns:

An array-like object

classmethod from_key(key: str, *subkeys: str, ensure_exists: bool = True) Module[source]

Get a module for the given directory or one of its subdirectories.

Parameters:
  • key – The name of the module. No funny characters. The envvar <key>_HOME where key is uppercased is checked first before using the default home directory.

  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • ensure_exists – Should all directories be created automatically? Defaults to true.

Returns:

A module
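The environment-variable lookup described for key can be sketched as follows. The function resolve_base is hypothetical, and the fallback location ~/.data/<key> is an assumption made for this sketch; unlike from_key() with ensure_exists=True, it creates no directories.

```python
import os
from pathlib import Path


def resolve_base(key: str, *subkeys: str) -> Path:
    """Pick the base directory for ``key``, preferring the <KEY>_HOME envvar."""
    custom = os.environ.get(f"{key.upper()}_HOME")
    # Assumption: without the envvar, data lives under ~/.data/<key>
    base = Path(custom) if custom else Path.home() / ".data" / key
    return base.joinpath(*subkeys)


os.environ["DEMO_HOME"] = "/tmp/demo-cache"
try:
    overridden = resolve_base("demo", "raw")
finally:
    del os.environ["DEMO_HOME"]

# With the variable unset, the fallback location is used
fallback = resolve_base("demo", "raw")
```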

join(*subkeys: str, name: str | None = None, ensure_exists: bool = True, version: None | str | Callable[[], str | None] = None, return_version: Literal[True]) tuple[Path, str | None][source]
join(*subkeys: str, name: str | None = None, ensure_exists: bool = True, version: None | str | Callable[[], str | None] = None, return_version: Literal[False]) Path
join(*subkeys: str, name: str | None = None, ensure_exists: bool = True, version: None | str | Callable[[], str | None] = None) Path

Get a subdirectory of the current module.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • ensure_exists – Should all directories be created automatically? Defaults to true.

  • name – The name of the file (optional) inside the folder

  • version

    The optional version, or no-argument callable that returns an optional version. This is prepended before the subkeys.

    The following example describes how to store the versioned data from the Rhea database for biologically relevant chemical reactions.

    import pystow
    import requests
    
    def get_rhea_version() -> str:
        res = requests.get("https://ftp.expasy.org/databases/rhea/rhea-release.properties")
        _, _, version = res.text.splitlines()[0].partition("=")
        return version
    
    # Assume you want to download the data from
    # ftp://ftp.expasy.org/databases/rhea/rdf/rhea.rdf.gz, make a path
    # with the same name
    module = pystow.module("rhea")
    path = module.join(name="rhea.rdf.gz", version=get_rhea_version)
    

  • return_version – If true, returns the processed version

Returns:

The path of the directory or subdirectory for the given module.
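The way version, subkeys, and name combine into a path can be sketched without pystow. The helper build_path and the version string "135" are hypothetical; the sketch only shows the documented ordering (version first, then subkeys, then the file name) and leaves out directory creation.

```python
from pathlib import Path
from typing import Callable, Optional, Union

Version = Union[None, str, Callable[[], Optional[str]]]


def build_path(base: Path, *subkeys: str, name: Optional[str] = None, version: Version = None) -> Path:
    """Assemble base / version / subkeys / name, resolving a callable version."""
    resolved = version() if callable(version) else version
    # The version, when present, is placed before the subkeys
    parts = ([resolved] if resolved else []) + list(subkeys)
    directory = base.joinpath(*parts)
    return directory / name if name else directory


versioned = build_path(
    Path("cache", "rhea"), "rdf", name="rhea.rdf.gz", version=lambda: "135"
)
unversioned = build_path(Path("cache", "rhea"), "rdf")
```

Accepting a callable lets the version lookup, which may need a network request, run only when a path is actually built.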

joinpath_sqlite(*subkeys: str, name: str) str[source]

Get an SQLite database connection string.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the database file.

Returns:

An SQLite connection string.

load_df(*subkeys: str, name: str, read_csv_kwargs: Mapping[str, Any] | None = None) pd.DataFrame[source]

Open a pre-existing CSV as a dataframe with pandas.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to open

  • read_csv_kwargs – Keyword arguments to pass through to pandas.read_csv().

Returns:

A pandas DataFrame

load_json(*subkeys: str, name: str, open_kwargs: Mapping[str, Any] | None = None, json_load_kwargs: Mapping[str, Any] | None = None) Any[source]

Open a JSON file with json.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to open

  • open_kwargs – Additional keyword arguments passed to open()

  • json_load_kwargs – Keyword arguments to pass through to json.load().

Returns:

A JSON object (list, dict, etc.)
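The two keyword-argument mappings are forwarded to different calls. A minimal sketch of that pass-through pattern, using a hypothetical helper load_json_file on a temporary file rather than a pystow module:

```python
import json
import tempfile
from decimal import Decimal
from pathlib import Path
from typing import Any, Mapping, Optional


def load_json_file(
    path: Path,
    open_kwargs: Optional[Mapping[str, Any]] = None,
    json_load_kwargs: Optional[Mapping[str, Any]] = None,
) -> Any:
    """Open ``path`` and parse it, forwarding each mapping to its own call."""
    with open(path, **(open_kwargs or {})) as file:
        return json.load(file, **(json_load_kwargs or {}))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "prices.json"
    demo_path.write_text('{"price": 1.10}', encoding="utf-8")
    loaded = load_json_file(
        demo_path,
        open_kwargs={"encoding": "utf-8"},  # goes to open()
        json_load_kwargs={"parse_float": Decimal},  # goes to json.load()
    )
```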

load_pickle(*subkeys: str, name: str, mode: Literal['rb'] = 'rb', open_kwargs: Mapping[str, Any] | None = None, pickle_load_kwargs: Mapping[str, Any] | None = None) Any[source]

Open a pickle file with pickle.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to open

  • mode – The read mode, passed to open()

  • open_kwargs – Additional keyword arguments passed to open()

  • pickle_load_kwargs – Keyword arguments to pass through to pickle.load().

Returns:

Any object

load_pickle_gz(*subkeys: str, name: str, mode: Literal['rb'] = 'rb', open_kwargs: Mapping[str, Any] | None = None, pickle_load_kwargs: Mapping[str, Any] | None = None) Any[source]

Open a gzipped pickle file with pickle.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to open

  • mode – The read mode, passed to open()

  • open_kwargs – Additional keyword arguments passed to gzip.open()

  • pickle_load_kwargs – Keyword arguments to pass through to pickle.load().

Returns:

Any object
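For orientation, reading a gzipped pickle amounts to combining gzip.open() with pickle.load(). The helpers write_pickle_gz and read_pickle_gz below are hypothetical stand-ins operating on a plain path, not pystow's own functions.

```python
import gzip
import pickle
import tempfile
from pathlib import Path
from typing import Any, Mapping, Optional


def write_pickle_gz(path: Path, obj: Any) -> None:
    """Pickle ``obj`` into a gzip-compressed file."""
    with gzip.open(path, "wb") as file:
        pickle.dump(obj, file)


def read_pickle_gz(
    path: Path,
    mode: str = "rb",
    open_kwargs: Optional[Mapping[str, Any]] = None,
    pickle_load_kwargs: Optional[Mapping[str, Any]] = None,
) -> Any:
    """Decompress and unpickle, forwarding each mapping to its own call."""
    with gzip.open(path, mode, **(open_kwargs or {})) as file:
        return pickle.load(file, **(pickle_load_kwargs or {}))


with tempfile.TemporaryDirectory() as tmp:
    pickle_path = Path(tmp) / "cache.pkl.gz"
    original = {"ids": [1, 2, 3], "label": "demo"}
    write_pickle_gz(pickle_path, original)
    restored = read_pickle_gz(pickle_path)
```

As with any use of pickle, only load files from sources you trust, since unpickling can execute arbitrary code.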

load_rdf(*subkeys: str, name: str | None = None, parse_kwargs: Mapping[str, Any] | None = None) rdflib.Graph[source]

Open an RDF file with rdflib.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to open

  • parse_kwargs – Keyword arguments to pass through to pystow.utils.read_rdf() and transitively to rdflib.Graph.parse().

Returns:

An RDF graph

load_xml(*subkeys: str, name: str, parse_kwargs: Mapping[str, Any] | None = None) lxml.etree.ElementTree[source]

Load an XML file with lxml.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to open

  • parse_kwargs – Keyword arguments to pass through to lxml.etree.parse().

Returns:

An ElementTree object

Warning

If you have lots of files to read in the same archive, it’s better just to unzip first.

load_yaml(*subkeys: str, name: str, open_kwargs: Mapping[str, Any] | None = None, yaml_load_kwargs: Mapping[str, Any] | None = None) Any[source]

Open a YAML file with yaml.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to open

  • open_kwargs – Additional keyword arguments passed to open()

  • yaml_load_kwargs – Keyword arguments to pass through to yaml.safe_load().

Returns:

The parsed YAML content (list, dict, etc.)

module(*subkeys: str, ensure_exists: bool = True) Module[source]

Get a module for a subdirectory of the current module.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • ensure_exists – Should all directories be created automatically? Defaults to true.

Returns:

A module representing the subdirectory based on the given subkeys.

open(*subkeys: str, name: str, mode: Literal['r', 'rt', 'w', 'wt'] = 'r', open_kwargs: Mapping[str, Any] | None = None, ensure_exists: bool) Generator[StringIO, None, None][source]
open(*subkeys: str, name: str, mode: Literal['rb', 'wb'] = 'r', open_kwargs: Mapping[str, Any] | None = None, ensure_exists: bool) Generator[BytesIO, None, None]

Open a file.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to open

  • mode – The file mode (read or write), passed to open()

  • open_kwargs – Additional keyword arguments passed to open()

  • ensure_exists – Should the directory the file is in be made? Set to true on write operations.

Yields:

An open file object.

This function should be used as a context manager, as in the following example:

import pystow

with pystow.module("test").open(name="test.tsv", mode="w") as file:
    print("Test text!", file=file)

open_gz(*subkeys: str, name: str, mode: Literal['r', 'w', 'rt', 'wt'] = 'rb', open_kwargs: Mapping[str, Any] | None, ensure_exists: bool) Generator[StringIO, None, None][source]
open_gz(*subkeys: str, name: str, mode: Literal['rb', 'wb'] = 'rb', open_kwargs: Mapping[str, Any] | None, ensure_exists: bool) Generator[BytesIO, None, None]

Open a gzipped file that exists already.

Parameters:
  • subkeys – A sequence of additional strings to join. If none are given, returns the directory for this module.

  • name – The name of the file to open

  • mode – The file mode (read or write), passed to gzip.open()

  • open_kwargs – Additional keyword arguments passed to gzip.open()

  • ensure_exists – Should the directory the file is in be made? Set to true on write operations.

Yields:

An open file object
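The difference between the text and binary modes can be seen directly with gzip.open(), to which the mode is passed. This sketch works on a temporary file instead of a pystow module, and the file name is made up for the demonstration.

```python
import gzip
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "notes.txt.gz"

    # Text mode for writing: strings in, compressed bytes on disk
    with gzip.open(gz_path, "wt", encoding="utf-8") as file:
        file.write("cached text")

    # Text mode for reading yields str
    with gzip.open(gz_path, "rt", encoding="utf-8") as file:
        as_text = file.read()

    # Binary mode yields the decompressed bytes
    with gzip.open(gz_path, "rb") as file:
        as_bytes = file.read()
```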