app.components.api_tools package

Submodules

app.components.api_tools.apitools module

app.components.api_tools.apitools.get_files_newer_than(directory, date_str, days, namefilter=None)[source]

Gets files in a directory that are newer than a given date, going back only a given number of days. Uses os.path.getmtime to establish file dates.

Parameters:

directory (str) – where to search
date_str – base date for the search, e.g. today
days (int) – how many days before date_str the cutoff date lies
namefilter (str) – only return files that have this string in their name

Return type:

list

Returns:

list of files newer than date-days, that optionally contain namefilter in their name.
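
The cutoff logic described above can be sketched with the standard library. This is a hypothetical re-implementation rather than the module's code: the function name files_newer_than, the fmt parameter, the "%Y-%m-%d" date format, and the choice to return bare file names are all assumptions.

```python
import os
from datetime import datetime, timedelta


def files_newer_than(directory, date_str, days, namefilter=None, fmt="%Y-%m-%d"):
    """Hypothetical sketch; the date format `fmt` is an assumption."""
    # The cutoff is the base date moved `days` days into the past.
    cutoff = datetime.strptime(date_str, fmt) - timedelta(days=days)
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        if namefilter is not None and namefilter not in name:
            continue
        # File dates come from the modification time, as the text states.
        if datetime.fromtimestamp(os.path.getmtime(path)) > cutoff:
            result.append(name)
    return result
```

Calling files_newer_than(some_dir, "2024-06-30", 7) would, under these assumptions, list files modified after 2024-06-23.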

app.components.api_tools.apitools.get_nbibfile(databasename)[source]

Fetches the nbib reference file for the given database. Does not check whether the file exists.

Parameters:: databasename (str) – name of the database
Return type:: str
Returns:: path to the nbib file for that database.

app.components.api_tools.apitools.get_newest_file(directory, namefilter=None)[source]

Finds the latest modified file in a directory

Parameters:

directory (str) – where to search
namefilter (str) – only consider files with this in their name

Return type:

str

Returns:

name of the newest file, not the full path. If no files are found, returns a randomized string.
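
A minimal sketch of this behaviour, assuming that "latest modified" means the largest os.path.getmtime value and that a UUID is an acceptable stand-in for the randomized string. The name newest_file is hypothetical.

```python
import os
import uuid


def newest_file(directory, namefilter=None):
    """Hypothetical sketch of a newest-file lookup."""
    candidates = [
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
        and (namefilter is None or namefilter in name)
    ]
    if not candidates:
        # Assumption: a random hex string plays the role of the
        # "randomized string" returned when nothing matches.
        return uuid.uuid4().hex
    # Pick the candidate with the most recent modification time.
    return max(candidates, key=lambda n: os.path.getmtime(os.path.join(directory, n)))
```

Because the fallback is a string rather than None, a caller that needs to distinguish "no file" from a real name would have to check that the returned name exists in the directory.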

app.components.api_tools.apitools.get_pub_ref(databasename)[source]

Fetches reference information for the given database

Parameters:: databasename (str) – which database to get information for
Return type:: list
Returns:: list of reference information containing: - short description - long description - PMID

app.components.api_tools.apitools.get_save_location(databasename)[source]

Finds the appropriate location to save a database file

Parameters:: databasename – Name of the database
Return type:: str
Returns:: path to the database directory, where files should be written.

app.components.api_tools.apitools.get_timestamp()[source]

Returns standardized timestamp.

Return type:: str
Returns:: standardized timestamp for current day as a string.

app.components.api_tools.apitools.is_newer(reference, new_date)[source]

Checks whether new_date is newer than reference

Parameters:

reference (str) – reference timestamp string
new_date (str) – new_date timestamp string

Return type:

bool

Returns:

True, if new_date is more recent than reference.

app.components.api_tools.apitools.parse_timestamp_from_str(stamp_text)[source]

Parses standardized time stamp from string.

Parameters:: stamp_text (str) – str with the timestamp
Return type:: date
Returns:: Datetime for the timestamp
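
The three timestamp helpers (get_timestamp, is_newer, parse_timestamp_from_str) fit together roughly as sketched below. The actual timestamp format is not documented here, so STAMP_FORMAT is an assumption, and the function names are hypothetical stand-ins.

```python
from datetime import date, datetime

# Assumption: the real "standardized" format is not stated in this reference.
STAMP_FORMAT = "%Y-%m-%d"


def make_timestamp(day=None):
    """Format a day (today by default) as a timestamp string."""
    day = day or date.today()
    return day.strftime(STAMP_FORMAT)


def parse_timestamp(stamp_text):
    """Parse a timestamp string back into a date."""
    return datetime.strptime(stamp_text, STAMP_FORMAT).date()


def is_newer(reference, new_date):
    """True if new_date is strictly more recent than reference."""
    return parse_timestamp(new_date) > parse_timestamp(reference)
```

Comparing parsed dates instead of raw strings keeps the check correct even if the chosen format does not sort lexicographically.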

app.components.api_tools.biogrid module

BioGRID interaction data utilities.

This module provides helpers to download, parse and manage BioGRID protein-protein interaction data and expose them as normalized tab-separated files and pandas DataFrames.

Main entry points

update: checks remote and updates the local BioGRID cache if needed
get_latest: loads the latest cached data as a DataFrame
methods_text: returns a ready-to-use methods description

Notes

The processing pipelines operate on large datasets and therefore use chunked reading and on-disk sharding (by interaction prefix) to reduce memory pressure.

app.components.api_tools.biogrid.deduplicate_str_dataframe_by_index(df)[source]

Deduplicate string values per index by merging rows.

For each index value, string-like columns are split by ;, deduplicated, sorted, and then re-joined with ;.

Parameters:: df – Input DataFrame with potentially duplicated index rows.
Return type:: DataFrame
Returns:: DataFrame where rows sharing an index are merged.
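
The split, deduplicate, sort and re-join step can be illustrated without pandas. The sketch below uses plain dictionaries in place of a DataFrame, so it shows the merging rule only, not the module's implementation. Both function names are hypothetical.

```python
def merge_semicolon_values(values):
    """Split each value on ';', deduplicate, sort, and re-join with ';'."""
    parts = set()
    for value in values:
        parts.update(part for part in value.split(";") if part)
    return ";".join(sorted(parts))


def merge_rows_by_key(rows):
    """Merge (key, {column: value}) pairs that share the same key."""
    grouped = {}
    for key, row in rows:
        for column, value in row.items():
            grouped.setdefault(key, {}).setdefault(column, []).append(value)
    return {
        key: {column: merge_semicolon_values(vals) for column, vals in cols.items()}
        for key, cols in grouped.items()
    }
```

Sorting the merged parts makes the output deterministic, so two runs over the same input produce identical strings.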

app.components.api_tools.biogrid.do_update(save_dir, save_zipname, latest_zip_url, uniprots_to_get, organisms=None)[source]

Download, extract and convert the latest BioGRID archive.

Parameters:

save_dir (str) – Directory where the data files will be placed.
save_zipname (str) – Zip filename to write under save_dir.
latest_zip_url (str) – URL for the BioGRID zip resource.
uniprots_to_get (set | None) – Optional set of UniProt accessions to retain.
organisms (set | None) – Optional set of NCBI TaxIDs (as strings) to retain.

Return type:

None

Returns:

None

app.components.api_tools.biogrid.filter_chunk(df, uniprots_to_get, organisms=None)[source]

Filter and normalize a BioGRID chunk.

Keeps only physical interactions with valid UniProt accessions, applies optional organism and UniProt filters, derives helper columns, and prepares processed variants for downstream merging.

Parameters:

df (DataFrame) – Raw chunk as read from BioGRID TAB3 file.
uniprots_to_get (set | None) – Optional set of UniProt accessions to retain (isoforms allowed; matching is applied on base accession).
organisms (set | None) – Optional set of NCBI TaxIDs (as strings) to retain.

Return type:

tuple[DataFrame, list[str], list[str]]

Returns:

Tuple of (filtered DataFrame, normalized column names list, interactor-specific column names list).

app.components.api_tools.biogrid.generate_pandas(file_path, uniprots_to_get, organisms=None)[source]

Convert a BioGRID TAB3 file into normalized TSV shards on disk.

The file is read in chunks, filtered and normalized, and then expanded into symmetric interaction rows which are sharded by interaction prefix for deduplication and memory efficiency.

Parameters:

file_path (str) – Path to the downloaded .tab3.txt file.
uniprots_to_get (set | None) – Optional set of UniProt accessions to retain.
organisms (set | None) – Optional set of NCBI TaxIDs (as strings) to retain.

Return type:

None

Returns:

None

app.components.api_tools.biogrid.get_available()[source]

List available interaction shard names for the latest version.

Return type:: list[str]
Returns:: List of shard basenames without the .tsv suffix.

app.components.api_tools.biogrid.get_final_df(chunk_df, time_format='%Y/%m/%d')[source]

Aggregate sharded chunk into a final deduplicated DataFrame.

Combines semicolon-delimited values per interaction key, and manages creation/update dates by carrying forward dates from a previous version when values are unchanged.

Parameters:

chunk_df (DataFrame) – DataFrame assembled from shard files for a prefix.
time_format (str) – Output date format string.

Return type:

DataFrame

Returns:

Final deduplicated DataFrame indexed by interaction.

app.components.api_tools.biogrid.get_latest(organisms=None, subset_letter=None, name_only=False, since_date=None)[source]

Fetch the latest BioGRID data from the local cache.

Parameters:

organisms (set | None) – Optional set of NCBI TaxIDs (as strings) to retain.
subset_letter (str | None) – Restrict to shards whose filename starts with this letter.
name_only (bool) – If True, return the latest folder path instead of loading data.
since_date (str | None) – Not implemented yet; reserved for future use.

Return type:

DataFrame | str

Returns:

DataFrame of the latest BioGRID data or a path string when name_only=True.

app.components.api_tools.biogrid.get_method_annotation()[source]

Return annotations for BioGRID interaction identification methods.

Return type:: dict
Returns:: Mapping from method name to description text.

app.components.api_tools.biogrid.get_upid(upcol)[source]

Normalize UniProt IDs contained in a Series.

Removes placeholder/null tokens and converts empty results to NaN.

Parameters:: upcol (Series) – Series containing UniProt accessions as strings.
Return type:: Series
Returns:: Cleaned Series with empty values converted to NaN.

app.components.api_tools.biogrid.handle_and_split_save(df, folder_path, normcols, swcols)[source]

Expand interactor-specific rows and shard to disk by prefix.

Produces a symmetric representation (A->B and B->A) and writes records into prefix-based TSV shards in folder_path.

Parameters:

df (DataFrame) – Filtered and normalized chunk DataFrame.
folder_path (str) – Output directory for shard files.
normcols (list[str]) – List of normalized, non-interactor-specific columns.
swcols (list[str]) – List of interactor-specific columns (will be expanded for both interactors).

Return type:

None

Returns:

None
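
The symmetric representation (A->B and B->A) amounts to emitting each record twice, once with the interactor-specific columns swapped. A minimal sketch on a single dictionary record follows. The function name, the swap_pairs argument and the column names in the usage note are illustrative assumptions, not the module's actual schema.

```python
def expand_symmetric(record, swap_pairs):
    """Return the record plus a copy with interactor-specific columns swapped.

    swap_pairs is a list of (column_for_a, column_for_b) tuples (hypothetical).
    Columns not listed there are shared and copied unchanged.
    """
    forward = dict(record)
    reverse = dict(record)
    for col_a, col_b in swap_pairs:
        reverse[col_a], reverse[col_b] = record[col_b], record[col_a]
    return [forward, reverse]
```

For example, a record with uniprot_a="P12345" and uniprot_b="Q9Y2X3" yields a second row where the two values trade places, so a lookup by either interactor finds the interaction.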

app.components.api_tools.biogrid.methods_text()[source]

Generate a plain-text description of the BioGRID data used.

Return type:: str
Returns:: Multi-line string including source, version and citation.

app.components.api_tools.biogrid.read_file_chunks(filepath, organisms=None, since_date=None)[source]

Read a TSV shard file in chunks and optionally filter by organism.

Parameters:

filepath (str) – Path to a TSV shard.
organisms (set | None) – Optional set of NCBI TaxIDs (as strings) to retain.
since_date (str | None) – Not implemented yet; reserved for future use.

Return type:

DataFrame

Returns:

Concatenated DataFrame of the shard file.

app.components.api_tools.biogrid.read_folder_chunks(folderpath, organisms=None, subset_letter=None, since_date=None)[source]

Load all TSV shards in a folder, optionally restricting by prefix.

Parameters:

folderpath (str) – Directory containing shard files.
organisms (set | None) – Optional set of NCBI TaxIDs (as strings) to retain.
subset_letter (str | None) – If provided, only files starting with this letter are read.
since_date (str | None) – Not implemented yet; reserved for future use.

Return type:

DataFrame

Returns:

Concatenated DataFrame of all matching shards.

app.components.api_tools.biogrid.split_and_save_by_prefix(df, column, num_chars, output_dir, index=False, sep='\\t')[source]

Shard a DataFrame by a column prefix and append to TSV files.

Parameters:

df (DataFrame) – Input DataFrame.
column (str) – Column whose prefix is used for grouping.
num_chars (int) – Number of prefix characters to use.
output_dir (str) – Target directory for TSV shards.
index (bool) – Whether to include the index in the written TSV files.
sep (str) – Field separator for the written files.

Return type:

None

Returns:

None
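
Prefix sharding with append semantics can be sketched using only the csv module. Here rows are plain dictionaries instead of a DataFrame, and the shard file name (prefix plus .tsv) is an assumption. The function name is hypothetical.

```python
import csv
import os
from collections import defaultdict


def shard_rows_by_prefix(rows, column, num_chars, output_dir, sep="\t"):
    """Append rows (dicts with identical keys) to one file per prefix."""
    groups = defaultdict(list)
    for row in rows:
        # Group by the first `num_chars` characters of the chosen column.
        groups[row[column][:num_chars]].append(row)
    os.makedirs(output_dir, exist_ok=True)
    for prefix, members in groups.items():
        path = os.path.join(output_dir, f"{prefix}.tsv")
        # Write the header only when the shard is created.
        write_header = not os.path.exists(path)
        with open(path, "a", newline="") as handle:
            writer = csv.DictWriter(
                handle, fieldnames=list(members[0]), delimiter=sep
            )
            if write_header:
                writer.writeheader()
            writer.writerows(members)
```

Appending lets successive chunks of a large input land in the same shard without holding earlier chunks in memory. Duplicates across chunks remain in the file and need a later deduplication pass.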

app.components.api_tools.biogrid.update(version, uniprots_to_get=None, organisms=None)[source]

Update the local BioGRID cache if a newer release is available.

Parameters:

version – Current version(s) of the BioGRID database.
uniprots_to_get (set | None) – Optional set of UniProt accessions to retain.
organisms (set | None) – Optional set of NCBI TaxIDs (as strings) to retain.

Return type:

list[str]

Returns:

New version string.

app.components.api_tools.intact module

IntAct database interaction module for protein-protein interaction data.

This module provides functionality to download, parse, and manage protein interaction data from the IntAct database (https://www.ebi.ac.uk/intact/). It handles:

Automated updates from IntAct’s FTP server
Parsing of PSI-MITAB formatted files
Conversion to pandas DataFrames with standardized column names
Version tracking and data freshness checks
Methods text generation for citations

Main entry points

update: check for and download new IntAct releases
get_latest: retrieve the most recent downloaded data as a DataFrame
methods_text: generate citation text for the data source

app.components.api_tools.intact.do_update(save_file, uniprots_to_get, organisms)[source]

Download (if needed) and convert the IntAct archive to TSV shards.

Parameters:

save_file – Destination zip path (also used to derive output folder).
uniprots_to_get (set | None) – Optional set of UniProt accessions to retain.
organisms (set | None) – Optional set of NCBI TaxIDs (as strings) to retain.

Return type:

None

Returns:

None

app.components.api_tools.intact.download_intact_ftp(save_file, max_retries=10, retry_delay=30)[source]

Download the IntAct archive over FTP with retries.

Parameters:

save_file (str) – Destination file path for the downloaded zip.
max_retries (int) – Maximum number of retry attempts.
retry_delay (int) – Delay between retries in seconds.

Return type:

Optional[str]

Returns:

Path to the saved file on success, None otherwise.
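
The retry contract (return the path on success, None once attempts are exhausted) might look like the sketch below. The transfer itself is abstracted behind a caller-supplied fetch callable, which is hypothetical. Treating max_retries as the total number of attempts and retrying only on OSError are also assumptions.

```python
import time


def download_with_retries(fetch, save_file, max_retries=10, retry_delay=30,
                          sleep=time.sleep):
    """Call fetch(save_file) until it succeeds or attempts run out.

    Assumption: max_retries counts total attempts, and only OSError
    (which covers connection and timeout errors) triggers a retry.
    """
    for attempt in range(1, max_retries + 1):
        try:
            fetch(save_file)
            return save_file
        except OSError:
            # Wait before the next attempt, but not after the last one.
            if attempt < max_retries:
                sleep(retry_delay)
    return None
```

Passing sleep as a parameter keeps the function testable without real delays.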

app.components.api_tools.intact.filter_chunk(chunk, uniprots_to_get, organisms)[source]

Filter IntAct PSI-MITAB chunk and derive helper columns.

Keeps rows involving UniProt IDs, removes negative interactions, builds UniProt ID lists and isoform/base variants, derives organism fields, and optionally filters by organisms.

Parameters:

chunk (DataFrame) – Raw PSI-MITAB chunk as DataFrame.
uniprots_to_get (set | None) – Optional set of UniProt base accessions to retain.
organisms (set | None) – Optional set of NCBI TaxIDs (as strings) to retain.

Return type:

DataFrame

Returns:

Filtered and minimally normalized DataFrame chunk.

app.components.api_tools.intact.generate_pandas(file_path, output_name, uniprots_to_get, organisms=None)[source]

Inefficiently generates a pandas dataframe from a given intact zip file (downloaded by update()) and writes it to a .tsv file with the same name as input file path.

Parameters:

file_path (str) – path to the downloaded zip file
output_name (str) – path for the output file
uniprots_to_get (set | None) – set of which uniprots should be included in the written .tsv file. If None, all uniprots will be included.
organisms (set | None) – organisms to filter the data by. This set should contain the organism IDs as strings. If None, all data will be included.

Return type:

None

app.components.api_tools.intact.get_available()[source]

List available interaction shard names for the latest IntAct version.

Return type:: list[str]
Returns:: List of shard basenames without the .tsv suffix.

app.components.api_tools.intact.get_final_df(df_filename)[source]

Deduplicate and finalize a shard file into a consistent DataFrame.

Parameters:: df_filename – Path to a shard TSV produced by the parser.
Return type:: DataFrame
Returns:: Final DataFrame indexed by interaction with dates reduced.

app.components.api_tools.intact.get_ids(df, col1, col2, uniprots_to_get)[source]

Extract UniProt accessions from PSI-MITAB identifier columns.

For each row, takes col1 and col2 (pipe-delimited alternates), keeps only uniprotkb: identifiers, and optionally filters by a provided set of UniProt base accessions.

Parameters:

df – Input DataFrame chunk.
col1 – Primary ID column name.
col2 – Alternate ID column name (pipe-delimited).
uniprots_to_get (set | None) – Optional set of accessions to retain.

Return type:

list[str]

Returns:

List of ;-joined UniProt IDs per row, or |DROP| when empty.
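
For a single row, the extraction described above could be sketched as follows. The function name is hypothetical, and matching the filter on the base accession (the part before any "-" isoform suffix) is an assumption carried over from the description of the BioGRID filter.

```python
def extract_uniprot_ids(primary, alternates, uniprots_to_get=None):
    """Collect uniprotkb: accessions from two pipe-delimited ID fields."""
    kept = []
    for token in f"{primary}|{alternates}".split("|"):
        if not token.startswith("uniprotkb:"):
            continue
        accession = token[len("uniprotkb:"):]
        # Assumption: the optional filter is applied to the base accession.
        base = accession.split("-")[0]
        if uniprots_to_get is not None and base not in uniprots_to_get:
            continue
        if accession not in kept:
            kept.append(accession)
    # Rows without any usable accession get the drop marker.
    return ";".join(kept) if kept else "|DROP|"
```

A caller can then discard every row whose result equals |DROP| in one vectorized comparison.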

app.components.api_tools.intact.get_iso(ser)[source]

Convert UniProt isoforms to base accessions for a Series.

Parameters:: ser (Series) – Series of ;-joined UniProt IDs.
Return type:: list[str]
Returns:: List where each element is a ;-joined string with isoform suffixes removed.
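
Isoform accessions carry a numeric suffix after a hyphen (for example P12345-2), so the conversion reduces to cutting each ID at the first hyphen. A sketch on plain lists, with hypothetical function names:

```python
def strip_isoforms(joined_ids):
    """Remove isoform suffixes from a ';'-joined string of accessions."""
    return ";".join(acc.split("-")[0] for acc in joined_ids.split(";"))


def strip_isoforms_all(values):
    """Apply strip_isoforms to every element of an iterable."""
    return [strip_isoforms(value) for value in values]
```

This sketch keeps order and does not deduplicate, so two isoforms of the same protein produce the base accession twice. Whether the module deduplicates is not stated here.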

app.components.api_tools.intact.get_latest(organisms=None, subset_letter=None, name_only=False, since_date=None)[source]

Fetch the latest IntAct data from the local cache.

Parameters:

organisms (set | None) – Optional set of NCBI TaxIDs (as strings) to retain.
subset_letter (str | None) – Restrict to shards whose filename starts with this letter.
name_only (bool) – If True, return the latest folder path instead of loading data.
since_date (datetime | None) – Keep rows with creation or update date on/after this date.

Return type:

DataFrame | str

Returns:

DataFrame of the latest IntAct data or a path string when name_only=True.

app.components.api_tools.intact.get_org(ser)[source]

Extract and normalize organism TaxIDs from PSI-MITAB Taxid fields.

Parameters:: ser (Series) – Series with pipe-delimited Taxid entries (e.g., taxid:9606(human)).
Return type:: list
Returns:: List of comma-joined unique TaxIDs per element.
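
One way to pull TaxIDs out of such a field is a regular expression over the taxid: prefix. This is a sketch for one element. The function name is hypothetical, and keeping first-seen order is an assumption.

```python
import re

# Matches the numeric part of entries such as taxid:9606(human).
# An optional minus sign is allowed in the number.
TAXID = re.compile(r"taxid:(-?\d+)")


def normalize_taxids(field):
    """Return the unique TaxIDs of one field, comma-joined in first-seen order."""
    seen = []
    for taxid in TAXID.findall(field):
        if taxid not in seen:
            seen.append(taxid)
    return ",".join(seen)
```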

app.components.api_tools.intact.handle_and_split_save(df, temp_dir, sep='\\t')[source]

Expand interactor pairs into symmetric rows and shard to disk.

Normalizes columns, pre-processes multi-value fields, expands all interactor combinations A->B and B->A with isoform/base variants, and writes sharded TSVs by interaction prefix.

Parameters:

df (DataFrame) – Filtered and minimally normalized chunk.
temp_dir (str) – Output directory for TSV shards.
sep (str) – TSV separator.

Return type:

None

Returns:

None

app.components.api_tools.intact.methods_text()[source]

Generate a plain-text description of the IntAct data used.

Return type:: str
Returns:: Multi-line string including source, version and citation.

app.components.api_tools.intact.only_latest_date(ser, time_format='%Y/%m/%d')[source]

Reduce semicolon-joined dates to their latest value.

Parameters:

ser (Series) – Series of date strings, possibly semicolon-joined.
time_format (str) – Output date format string.

Return type:

Series

Returns:

Series with only the latest date per element.
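
For a single element, the reduction could be sketched as below. It assumes the input dates use the same format as time_format, which this reference does not confirm. The function name is hypothetical.

```python
from datetime import datetime


def latest_date(joined, time_format="%Y/%m/%d"):
    """Return the most recent date of a ';'-joined string, or '' if empty."""
    # Assumption: input dates are written in the same format as the output.
    dates = [datetime.strptime(part, time_format) for part in joined.split(";") if part]
    return max(dates).strftime(time_format) if dates else ""
```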

app.components.api_tools.intact.read_file_chunks(filepath, organisms=None, subset_letter=None, since_date=None)[source]

Read a TSV shard in chunks and optionally filter by organism/date.

Parameters:

filepath (str) – Path to shard TSV file.
organisms (set | None) – Optional set of NCBI TaxIDs (as strings) to retain.
subset_letter (str | None) – If provided, only rows whose interaction starts with this letter are retained.
since_date (datetime | None) – Keep rows with creation or update date on/after this date.

Return type:

DataFrame

Returns:

Concatenated DataFrame of filtered shard content.

app.components.api_tools.intact.read_folder_chunks(folderpath, organisms=None, subset_letter=None, since_date=None)[source]

Load all TSV shards in a folder, optionally restricting by prefix/date.

Parameters:

folderpath (str) – Directory containing shard files.
organisms (set | None) – Optional set of NCBI TaxIDs (as strings) to retain.
subset_letter (str | None) – Restrict to files whose names start with this letter.
since_date (datetime | None) – Keep rows with creation or update date on/after this date.

Return type:

DataFrame

Returns:

Concatenated DataFrame of matching shard contents.

app.components.api_tools.intact.split_and_save_by_prefix(df, column, num_chars, output_dir, index=False, sep='\\t')[source]

Shard a DataFrame by a column prefix and append to TSV files.

Parameters:

df (DataFrame) – Input DataFrame.
column (str) – Column whose prefix is used for grouping.
num_chars (int) – Number of prefix characters to use.
output_dir (str) – Target directory for TSV shards.
index (bool) – Whether to include the index in the written TSV files.
sep (str) – Field separator for the written files.

Return type:

None

Returns:

None

app.components.api_tools.intact.update(version, uniprots_to_get=None, organisms=None)[source]

Update the local IntAct cache if a newer release is available.

Parameters:

version (str) – Current version string.
uniprots_to_get (set | None) – Optional set of UniProt base accessions to retain.
organisms (set | None) – Optional set of NCBI TaxIDs (as strings) to retain.

Return type:

list[str]

Returns:

New version string.

app.components.api_tools.uniprot module

UniProt data access and utilities.

This module provides helpers to download and process UniProtKB data via the public REST API, including pagination handling, column mapping, organism filtering, and convenience wrappers for common datasets used in ProteoGyver.

Main entry points

download_uniprot_chunks: batched download for an organism/field set
download_full_uniprot_for_organism: comprehensive download builder
retrieve_uniprot: cache-aware retrieval to a local TSV
update: check and refresh local UniProt dataset for an organism

Notes

Large result sets are fetched using server-side pagination with retry policies and optional progress reporting.

app.components.api_tools.uniprot.download_full_uniprot_for_organism(organism=None, columns=None, progress=False, overall_progress=False, reviewed_only=True)[source]

Downloads the full uniprot database EXCLUDING isoforms in a .tsv format for a given organism.

Parameters:

organism (int) – integer ID, or a list of integer IDs, of the desired organism(s), e.g. human is 9606. If None, defaults to human.
columns (set) – a set of uniprot columns to get
progress (bool) – Print progress reports of how each batch download is going
overall_progress (bool) – Print progress reports when each batch is finished
reviewed_only (bool) – True if only reviewed entries should be retrieved

Return type:

DataFrame

Returns:

pandas DataFrame of the downloaded UniProt data.

app.components.api_tools.uniprot.download_uniprot_chunks(progress=False, organism=-1, fields=None, reviewed_only=True)[source]

Downloads whole uniprot for a given organism using pagination.

The Entry column will always be the first column and is used as the index in the output dataframe.

Parameters:

progress (bool) – Progress report printing
organism (int) – organism ID to download (e.g. human is 9606). If -1 (the signature default), all organisms will be downloaded.
fields (list | None) – uniprot field labels for fields to retrieve. Refer to https://www.uniprot.org/help/return_fields for help with field Labels (from the label column). If None, download a default selection.
reviewed_only (bool) – if True, only reviewed entries will be returned. if False, all entries will be returned (probably not what you want).

Return type:

DataFrame

Returns:

the requested uniprot data in a pandas dataframe.

app.components.api_tools.uniprot.download_uniprot_for_database(versions, organisms)[source]

Checks whether an update for UniProt is available and downloads it if necessary for database use.

Parameters:

versions (list[str]) – List of current versions of the UniProt database.
organisms (set | None) – Optional set of organism IDs to include.
progress – True, if progress should be printed.

Return type:

tuple[DataFrame, list[str]]

Returns:

Tuple of (DataFrame of UniProt data, List of new versions).

app.components.api_tools.uniprot.download_uniprot_pagination_url(pag_url, headers, progress)[source]

Handles downloading of UniProt data from a paginated URL.

Parameters:

pag_url (str) – url for uniprot
headers (list) – list of headers to get
progress (bool) – True, if progress should be printed

Return type:

DataFrame

Returns:

Pandas dataframe of the uniprot specified by url and headers.
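
The UniProt REST API signals further pages through a Link response header with rel="next". Following it might look like the sketch below. The HTTP call is abstracted behind a caller-supplied fetch_page callable (hypothetical), so the loop is shown without assuming any particular HTTP library or the module's retry policy.

```python
import re

# Extracts the URL from a header such as: <https://...>; rel="next"
NEXT_LINK = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(headers):
    """Return the next-page URL from response headers, or None on the last page."""
    match = NEXT_LINK.search(headers.get("Link", ""))
    return match.group(1) if match else None


def fetch_all_pages(first_url, fetch_page):
    """Collect page bodies by following rel="next" links.

    fetch_page(url) is a hypothetical callable returning (text, headers).
    """
    url, pages = first_url, []
    while url:
        text, headers = fetch_page(url)
        pages.append(text)
        url = next_page_url(headers)
    return pages
```

Each page body would then be parsed as TSV and concatenated. Keeping the link-following separate from parsing makes it straightforward to wrap fetch_page with retries.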

app.components.api_tools.uniprot.get_default_uniprot_column_map()[source]

Returns a sensible, yet extensive default map of uniprot columns in a dict.

Return type:: dict

app.components.api_tools.uniprot.get_uniprot_column_map()[source]

Returns currently understood map of uniprot columns in a dict.

Return type:: dict

app.components.api_tools.uniprot.is_newer_available(current_version, organism=9606)[source]

Checks whether newer uniprot version is available

Parameters:

current_version (str) – Current version of the uniprot database
organism (int) – which organism uniprot to check.

Return type:

tuple[bool, str]

Returns:

Tuple whose first element is True if a newer UniProt version is available.

app.components.api_tools.uniprot.methods_text(organism=9606)[source]

Generates a methods text for used uniprot data

Return type:: tuple
Returns:: a tuple of (readable reference information (str), PMID (str), uniprot description (str))

app.components.api_tools.uniprot.retrieve_protein_group(name, query_col='protein_name', reviewed=True)[source]

Utility function to quickly download tsvs describing each common protein class used in ProteoGyver.

Parameters:

name (str) – protein group name to search for
query_col (str) – column to search. Protein name by default.

Return type:

DataFrame

Returns:

pandas dataframe of the given protein group

app.components.api_tools.uniprot.retrieve_uniprot(uniprotfile='Full human uniprot.tsv', **kwargs)[source]

Downloads full uniprot (reviewed entries only) to a file and returns the dataframe.

Parameters:

uniprotfile (str) – path of the output file
kwargs – kwargs to pass down to download_full_uniprot_for_organism, e.g. to specify which organism or if progress should be reported.

Return type:

DataFrame

Returns:

full uniprot as a pandas dataframe.

app.components.api_tools.uniprot.update(version, organism=9606, progress=False)[source]

Checks whether an update for UniProt is available and downloads it if necessary.

Parameters:

organism – which organism to download
progress – True, if progress should be printed.

Return type:

None

Module contents

API tools subpackage for external data sources.