app.components.api_tools package
Submodules
app.components.api_tools.apitools module
- app.components.api_tools.apitools.get_files_newer_than(directory, date_str, days, namefilter=None)[source]
Gets files in a directory that are newer than a cutoff date, going back only a given number of days from a base date. Uses os.path.getmtime to establish file dates.
- Parameters:
directory (str) – where to search
date_str – base date for the search, e.g. today
days (int) – how many days to move the cutoff date into the past from the base date
namefilter (str) – only return files that have this string in their name
- Return type:
list
- Returns:
list of files newer than date-days, that optionally contain namefilter in their name.
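The cutoff logic described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: the function name files_newer_than, the fmt parameter, and the assumed "%Y-%m-%d" date format are all hypothetical.

```python
import os
from datetime import datetime, timedelta


def files_newer_than(directory, date_str, days, namefilter=None, fmt="%Y-%m-%d"):
    """Hypothetical sketch: list file names modified after (date_str - days)."""
    # The date format is an assumption; the real module may parse dates differently.
    cutoff = datetime.strptime(date_str, fmt) - timedelta(days=days)
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        if namefilter is not None and namefilter not in name:
            continue  # optional substring filter on the file name
        modified = datetime.fromtimestamp(os.path.getmtime(path))
        if modified > cutoff:
            result.append(name)
    return result
```

Calling files_newer_than(".", "2024-06-30", 7, namefilter="biogrid") would, under these assumptions, list files in the current directory modified after 2024-06-23 whose names contain "biogrid".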
- app.components.api_tools.apitools.get_nbibfile(databasename)[source]
Fetches the nbib reference file for the given database. Does not check whether the file exists.
- Parameters:
databasename (str) – name of the database
- Return type:
str
- Returns:
path to the nbib file for that database.
- app.components.api_tools.apitools.get_newest_file(directory, namefilter=None)[source]
Finds the latest modified file in a directory
- Parameters:
directory (str) – where to search
namefilter (str) – only consider files with this in their name
- Return type:
str
- Returns:
name of the newest file, not the full path. If no files are found, returns a randomized string.
- app.components.api_tools.apitools.get_pub_ref(databasename)[source]
Fetches reference information for the given database
- Parameters:
databasename (str) – which database to get information for
- Return type:
list
- Returns:
list of reference information containing, in order: short description, long description, PMID.
- app.components.api_tools.apitools.get_save_location(databasename)[source]
Finds the appropriate location to save a database file
- Parameters:
databasename – Name of the database
- Return type:
str
- Returns:
path to the database directory, where files should be written.
- app.components.api_tools.apitools.get_timestamp()[source]
Returns standardized timestamp.
- Return type:
str
- Returns:
standardized timestamp for current day as a string.
app.components.api_tools.biogrid module
BioGRID interaction data utilities.
This module provides helpers to download, parse and manage BioGRID protein-protein interaction data and expose them as normalized tab-separated files and pandas DataFrames.
Main entry points:
- update: checks remote and updates the local BioGRID cache if needed
- get_latest: loads the latest cached data as a DataFrame
- methods_text: returns a ready-to-use methods description
Notes
The processing pipelines operate on large datasets and therefore use chunked reading and on-disk sharding (by interaction prefix) to reduce memory pressure.
- app.components.api_tools.biogrid.deduplicate_str_dataframe_by_index(df)[source]
Deduplicate string values per index by merging rows.
For each index value, string-like columns are split by ";", deduplicated, sorted, and then re-joined with ";".
- Parameters:
df – Input DataFrame with potentially duplicated index rows.
- Return type:
DataFrame
- Returns:
DataFrame where rows sharing an index are merged.
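The per-cell merge rule (split, deduplicate, sort, re-join) can be shown without pandas. The helper below is a hypothetical sketch of that rule only; how the real function applies it across rows that share an index is not shown in this documentation.

```python
def merge_semicolon_values(values):
    """Hypothetical sketch: merge several ';'-joined strings into one."""
    tokens = set()
    for value in values:
        # Split each cell on ';' and drop empty fragments.
        tokens.update(token for token in value.split(";") if token)
    # Sort for a deterministic result, then re-join with ';'.
    return ";".join(sorted(tokens))
```

For two rows sharing an index with cell values "b;a" and "a;c", this sketch yields "a;b;c".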
- app.components.api_tools.biogrid.do_update(save_dir, save_zipname, latest_zip_url, uniprots_to_get, organisms=None)[source]
Download, extract and convert the latest BioGRID archive.
- Parameters:
save_dir (str) – Directory where the data files will be placed.
save_zipname (str) – Zip filename to write under save_dir.
latest_zip_url (str) – URL for the BioGRID zip resource.
uniprots_to_get (set|None) – Optional set of UniProt accessions to retain.
organisms (set|None) – Optional set of NCBI TaxIDs (as strings) to retain.
- Return type:
None
- Returns:
None
- app.components.api_tools.biogrid.filter_chunk(df, uniprots_to_get, organisms=None)[source]
Filter and normalize a BioGRID chunk.
Keeps only physical interactions with valid UniProt accessions, applies optional organism and UniProt filters, derives helper columns, and prepares processed variants for downstream merging.
- Parameters:
df (DataFrame) – Raw chunk as read from BioGRID TAB3 file.
uniprots_to_get (set|None) – Optional set of UniProt accessions to retain (isoforms allowed; matching is applied on base accession).
organisms (set|None) – Optional set of NCBI TaxIDs (as strings) to retain.
- Return type:
tuple[DataFrame,list[str],list[str]]
- Returns:
Tuple of (filtered DataFrame, normalized column names list, interactor-specific column names list).
- app.components.api_tools.biogrid.generate_pandas(file_path, uniprots_to_get, organisms=None)[source]
Convert a BioGRID TAB3 file into normalized TSV shards on disk.
The file is read in chunks, filtered and normalized, and then expanded into symmetric interaction rows which are sharded by interaction prefix for deduplication and memory efficiency.
- Parameters:
file_path (str) – Path to the downloaded .tab3.txt file.
uniprots_to_get (set|None) – Optional set of UniProt accessions to retain.
organisms (set|None) – Optional set of NCBI TaxIDs (as strings) to retain.
- Return type:
None
- Returns:
None
- app.components.api_tools.biogrid.get_available()[source]
List available interaction shard names for the latest version.
- Return type:
list[str]
- Returns:
List of shard basenames without the .tsv suffix.
- app.components.api_tools.biogrid.get_final_df(chunk_df, time_format='%Y/%m/%d')[source]
Aggregate sharded chunk into a final deduplicated DataFrame.
Combines semicolon-delimited values per interaction key, and manages creation/update dates by carrying forward dates from a previous version when values are unchanged.
- Parameters:
chunk_df (DataFrame) – DataFrame assembled from shard files for a prefix.
time_format (str) – Output date format string.
- Return type:
DataFrame
- Returns:
Final deduplicated DataFrame indexed by interaction.
- app.components.api_tools.biogrid.get_latest(organisms=None, subset_letter=None, name_only=False, since_date=None)[source]
Fetch the latest BioGRID data from the local cache.
- Parameters:
organisms (set|None) – Optional set of NCBI TaxIDs (as strings) to retain.
subset_letter (str|None) – Restrict to shards whose filename starts with this letter.
name_only (bool) – If True, return the latest folder path instead of loading data.
since_date (str|None) – Not implemented yet; reserved for future use.
- Return type:
DataFrame|str
- Returns:
DataFrame of the latest BioGRID data or a path string when name_only=True.
- app.components.api_tools.biogrid.get_method_annotation()[source]
Return annotations for BioGRID interaction identification methods.
- Return type:
dict
- Returns:
Mapping from method name to description text.
- app.components.api_tools.biogrid.get_upid(upcol)[source]
Normalize UniProt IDs contained in a Series.
Removes placeholder/null tokens and converts empty results to NaN.
- Parameters:
upcol (Series) – Series containing UniProt accessions as strings.
- Return type:
Series
- Returns:
Cleaned Series with empty values converted to NaN.
- app.components.api_tools.biogrid.handle_and_split_save(df, folder_path, normcols, swcols)[source]
Expand interactor-specific rows and shard to disk by prefix.
Produces a symmetric representation (A->B and B->A) and writes records into prefix-based TSV shards in folder_path.
- Parameters:
df (DataFrame) – Filtered and normalized chunk DataFrame.
folder_path (str) – Output directory for shard files.
normcols (list[str]) – List of normalized, non-interactor-specific columns.
swcols (list[str]) – List of interactor-specific columns (will be expanded for both interactors).
- Return type:
None
- Returns:
None
- app.components.api_tools.biogrid.methods_text()[source]
Generate a plain-text description of the BioGRID data used.
- Return type:
str
- Returns:
Multi-line string including source, version and citation.
- app.components.api_tools.biogrid.read_file_chunks(filepath, organisms=None, since_date=None)[source]
Read a TSV shard file in chunks and optionally filter by organism.
- Parameters:
filepath (str) – Path to a TSV shard.
organisms (set|None) – Optional set of NCBI TaxIDs (as strings) to retain.
since_date (str|None) – Not implemented yet; reserved for future use.
- Return type:
DataFrame
- Returns:
Concatenated DataFrame of the shard file.
- app.components.api_tools.biogrid.read_folder_chunks(folderpath, organisms=None, subset_letter=None, since_date=None)[source]
Load all TSV shards in a folder, optionally restricting by prefix.
- Parameters:
folderpath (str) – Directory containing shard files.
organisms (set|None) – Optional set of NCBI TaxIDs (as strings) to retain.
subset_letter (str|None) – If provided, only files starting with this letter are read.
since_date (str|None) – Not implemented yet; reserved for future use.
- Return type:
DataFrame
- Returns:
Concatenated DataFrame of all matching shards.
- app.components.api_tools.biogrid.split_and_save_by_prefix(df, column, num_chars, output_dir, index=False, sep='\t')[source]
Shard a DataFrame by a column prefix and append to TSV files.
- Parameters:
df (DataFrame) – Input DataFrame.
column (str) – Column whose prefix is used for grouping.
num_chars (int) – Number of prefix characters to use.
output_dir (str) – Target directory for TSV shards.
index (bool) – Whether to include the index in CSV output.
sep (str) – Field separator for CSV output.
- Return type:
None
- Returns:
None
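Prefix-based sharding with append semantics can be sketched using only the csv module. This is an illustrative stand-in operating on a list of dicts rather than a DataFrame; the function name shard_rows_by_prefix and the "<prefix>.tsv" file naming are assumptions, not taken from the module.

```python
import csv
import os
from collections import defaultdict


def shard_rows_by_prefix(rows, column, num_chars, output_dir, sep="\t"):
    """Hypothetical sketch: append rows to one TSV file per key prefix."""
    groups = defaultdict(list)
    for row in rows:
        # Group rows by the first num_chars characters of the chosen column.
        groups[row[column][:num_chars]].append(row)

    os.makedirs(output_dir, exist_ok=True)
    for prefix, group in groups.items():
        # File naming is an assumption made for this sketch.
        path = os.path.join(output_dir, f"{prefix}.tsv")
        write_header = not os.path.exists(path)
        with open(path, "a", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(
                handle, fieldnames=list(group[0].keys()), delimiter=sep
            )
            if write_header:
                writer.writeheader()  # header only when the shard is new
            writer.writerows(group)
```

Appending rather than overwriting lets successive chunks of a large input land in the same shard files, which is what makes chunked processing possible without holding the whole dataset in memory.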
- app.components.api_tools.biogrid.update(version, uniprots_to_get=None, organisms=None)[source]
Update the local BioGRID cache if a newer release is available.
- Parameters:
versions – List of current versions of the BioGRID database.
uniprots_to_get (set|None) – Optional set of UniProt accessions to retain.
organisms (set|None) – Optional set of NCBI TaxIDs (as strings) to retain.
- Return type:
list[str]
- Returns:
New version string.
app.components.api_tools.intact module
IntAct database interaction module for protein-protein interaction data.
This module provides functionality to download, parse, and manage protein interaction data from the IntAct database (https://www.ebi.ac.uk/intact/). It handles:
- Automated updates from IntAct’s FTP server
- Parsing of PSI-MITAB formatted files
- Conversion to pandas DataFrames with standardized column names
- Version tracking and data freshness checks
- Methods text generation for citations
Main entry points:
- update: check for and download new IntAct releases
- get_latest: retrieve the most recent downloaded data as a DataFrame
- methods_text: generate citation text for the data source
- app.components.api_tools.intact.do_update(save_file, uniprots_to_get, organisms)[source]
Download (if needed) and convert the IntAct archive to TSV shards.
- Parameters:
save_file – Destination zip path (also used to derive output folder).
uniprots_to_get (set|None) – Optional set of UniProt accessions to retain.
organisms (set|None) – Optional set of NCBI TaxIDs (as strings) to retain.
- Return type:
None
- Returns:
None
- app.components.api_tools.intact.download_intact_ftp(save_file, max_retries=10, retry_delay=30)[source]
Download the IntAct archive over FTP with retries.
- Parameters:
save_file (str) – Destination file path for the downloaded zip.
max_retries (int) – Maximum number of retry attempts.
retry_delay (int) – Delay between retries in seconds.
- Return type:
Optional[str]
- Returns:
Path to the saved file on success, None otherwise.
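The retry contract (a bounded number of attempts, a fixed delay between them, None on final failure) can be sketched generically. The helper with_retries is hypothetical, and treating max_retries as the total number of attempts is an assumption; the actual FTP transfer is passed in as a callable so the sketch stays network-free.

```python
import time


def with_retries(action, max_retries=10, retry_delay=30, sleep=time.sleep):
    """Hypothetical sketch: call action() until it succeeds or attempts run out."""
    for attempt in range(1, max_retries + 1):
        try:
            return action()
        except (OSError, EOFError):
            # Network-style failures; wait before the next attempt, if any remain.
            if attempt < max_retries:
                sleep(retry_delay)
    return None  # every attempt failed
```

Injecting the sleep function keeps the helper easy to exercise in tests without waiting for real delays.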
- app.components.api_tools.intact.filter_chunk(chunk, uniprots_to_get, organisms)[source]
Filter IntAct PSI-MITAB chunk and derive helper columns.
Keeps rows involving UniProt IDs, removes negative interactions, builds UniProt ID lists and isoform/base variants, derives organism fields, and optionally filters by organisms.
- Parameters:
chunk (DataFrame) – Raw PSI-MITAB chunk as DataFrame.
uniprots_to_get (set|None) – Optional set of UniProt base accessions to retain.
organisms (set|None) – Optional set of NCBI TaxIDs (as strings) to retain.
- Return type:
DataFrame
- Returns:
Filtered and minimally normalized DataFrame chunk.
- app.components.api_tools.intact.generate_pandas(file_path, output_name, uniprots_to_get, organisms=None)[source]
Inefficiently generates a pandas dataframe from a given intact zip file (downloaded by update()) and writes it to a .tsv file with the same name as input file path.
- Parameters:
file_path (str) – path to the downloaded zip file
output_name (str) – path for the output file
uniprots_to_get (set|None) – set of which uniprots should be included in the written .tsv file. If None, all uniprots will be included.
organisms (set|None) – organisms to filter the data by. This set should contain the organism IDs as strings. If None, all data will be included.
- Return type:
None
- app.components.api_tools.intact.get_available()[source]
List available interaction shard names for the latest IntAct version.
- Return type:
list[str]
- Returns:
List of shard basenames without the .tsv suffix.
- app.components.api_tools.intact.get_final_df(df_filename)[source]
Deduplicate and finalize a shard file into a consistent DataFrame.
- Parameters:
df_filename – Path to a shard TSV produced by the parser.
- Return type:
DataFrame
- Returns:
Final DataFrame indexed by interaction with dates reduced.
- app.components.api_tools.intact.get_ids(df, col1, col2, uniprots_to_get)[source]
Extract UniProt accessions from PSI-MITAB identifier columns.
For each row, takes col1 and col2 (pipe-delimited alternates), keeps only uniprotkb: identifiers, and optionally filters by a provided set of UniProt base accessions.
- Parameters:
df – Input DataFrame chunk.
col1 – Primary ID column name.
col2 – Alternate ID column name (pipe-delimited).
uniprots_to_get (set|None) – Optional set of accessions to retain.
- Return type:
list[str]
- Returns:
List of ";"-joined UniProt IDs per row, or |DROP| when empty.
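For a single row, the extraction rule can be sketched as follows. The function extract_uniprot_ids is a hypothetical per-row helper, and matching the filter on the part before the first "-" is an assumption about how base accessions are derived.

```python
def extract_uniprot_ids(primary, alternates, uniprots_to_get=None):
    """Hypothetical sketch: collect uniprotkb accessions from two ID fields."""
    ids = []
    for field in (primary, alternates):
        for token in field.split("|"):  # alternates are pipe-delimited
            if not token.startswith("uniprotkb:"):
                continue  # ignore identifiers from other databases
            accession = token[len("uniprotkb:"):]
            base = accession.split("-")[0]  # assumed isoform convention
            if uniprots_to_get is not None and base not in uniprots_to_get:
                continue
            if accession not in ids:
                ids.append(accession)
    # Rows without any usable accession are marked for removal.
    return ";".join(ids) if ids else "|DROP|"
```

The |DROP| marker makes it possible to discard unusable rows in a single vectorized filter afterwards instead of raising during extraction.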
- app.components.api_tools.intact.get_iso(ser)[source]
Convert UniProt isoforms to base accessions for a Series.
- Parameters:
ser (Series) – Series of ";"-joined UniProt IDs.
- Return type:
list[str]
- Returns:
List where each element is a ";"-joined string with isoform suffixes removed.
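As a small illustration of the isoform-to-base conversion for one element, assuming isoforms are written as a base accession followed by "-" and a number (for example P12345-2); the helper name strip_isoforms is hypothetical.

```python
def strip_isoforms(joined_ids):
    """Hypothetical sketch: drop isoform suffixes from ';'-joined accessions."""
    # Everything after the first '-' is treated as the isoform suffix.
    return ";".join(accession.split("-")[0] for accession in joined_ids.split(";"))
```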
- app.components.api_tools.intact.get_latest(organisms=None, subset_letter=None, name_only=False, since_date=None)[source]
Fetch the latest IntAct data from the local cache.
- Parameters:
organisms (set|None) – Optional set of NCBI TaxIDs (as strings) to retain.
subset_letter (str|None) – Restrict to shards whose filename starts with this letter.
name_only (bool) – If True, return the latest folder path instead of loading data.
since_date (datetime|None) – Keep rows with creation or update date on/after this date.
- Return type:
DataFrame|str
- Returns:
DataFrame of the latest IntAct data or a path string when name_only=True.
- app.components.api_tools.intact.get_org(ser)[source]
Extract and normalize organism TaxIDs from PSI-MITAB Taxid fields.
- Parameters:
ser (Series) – Series with pipe-delimited Taxid entries (e.g., taxid:9606(human)).
- Return type:
list
- Returns:
List of comma-joined unique TaxIDs per element.
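The TaxID extraction for one field can be sketched with a regular expression. The helper extract_taxids and the choice to keep negative IDs (PSI-MITAB uses values such as taxid:-1 for non-organism contexts) are assumptions for illustration.

```python
import re

# Matches the numeric part of entries such as "taxid:9606(human)".
TAXID_PATTERN = re.compile(r"taxid:(-?\d+)")


def extract_taxids(field):
    """Hypothetical sketch: unique TaxIDs from one pipe-delimited field."""
    seen = []
    for taxid in TAXID_PATTERN.findall(field):
        if taxid not in seen:  # keep first-seen order, drop repeats
            seen.append(taxid)
    return ",".join(seen)
```

A field like "taxid:9606(human)|taxid:9606(Homo sapiens)" lists the same organism under two names, which is why deduplication is needed.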
- app.components.api_tools.intact.handle_and_split_save(df, temp_dir, sep='\t')[source]
Expand interactor pairs into symmetric rows and shard to disk.
Normalizes columns, pre-processes multi-value fields, expands all interactor combinations A->B and B->A with isoform/base variants, and writes sharded TSVs by interaction prefix.
- Parameters:
df (DataFrame) – Filtered and minimally normalized chunk.
temp_dir (str) – Output directory for TSV shards.
sep (str) – TSV separator.
- Return type:
None
- Returns:
None
- app.components.api_tools.intact.methods_text()[source]
Generate a plain-text description of the IntAct data used.
- Return type:
str
- Returns:
Multi-line string including source, version and citation.
- app.components.api_tools.intact.only_latest_date(ser, time_format='%Y/%m/%d')[source]
Reduce semicolon-joined dates to their latest value.
- Parameters:
ser (Series) – Series of date strings, possibly semicolon-joined.
time_format (str) – Output date format string.
- Return type:
Series
- Returns:
Series with only the latest date per element.
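For a single element, the reduction can be sketched as below. The helper latest_date is hypothetical, and it assumes the input dates use the same format as the output, which the documentation does not state.

```python
from datetime import datetime


def latest_date(joined, time_format="%Y/%m/%d"):
    """Hypothetical sketch: keep the latest of several ';'-joined dates."""
    # Assumption: input and output share time_format.
    dates = [datetime.strptime(part, time_format) for part in joined.split(";") if part]
    return max(dates).strftime(time_format) if dates else ""
```

Parsing into datetime objects before comparing avoids relying on string ordering, which would break for formats that do not sort lexicographically.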
- app.components.api_tools.intact.read_file_chunks(filepath, organisms=None, subset_letter=None, since_date=None)[source]
Read a TSV shard in chunks and optionally filter by organism/date.
- Parameters:
filepath (str) – Path to shard TSV file.
organisms (set|None) – Optional set of NCBI TaxIDs (as strings) to retain.
subset_letter (str|None) – If provided, only rows whose interaction starts with this letter are retained.
since_date (datetime|None) – Keep rows with creation or update date on/after this date.
- Return type:
DataFrame
- Returns:
Concatenated DataFrame of filtered shard content.
- app.components.api_tools.intact.read_folder_chunks(folderpath, organisms=None, subset_letter=None, since_date=None)[source]
Load all TSV shards in a folder, optionally restricting by prefix/date.
- Parameters:
folderpath (str) – Directory containing shard files.
organisms (set|None) – Optional set of NCBI TaxIDs (as strings) to retain.
subset_letter (str|None) – Restrict to files whose names start with this letter.
since_date (datetime|None) – Keep rows with creation or update date on/after this date.
- Return type:
DataFrame
- Returns:
Concatenated DataFrame of matching shard contents.
- app.components.api_tools.intact.split_and_save_by_prefix(df, column, num_chars, output_dir, index=False, sep='\t')[source]
Shard a DataFrame by a column prefix and append to TSV files.
- Parameters:
df (DataFrame) – Input DataFrame.
column (str) – Column whose prefix is used for grouping.
num_chars (int) – Number of prefix characters to use.
output_dir (str) – Target directory for TSV shards.
index (bool) – Whether to include the index in CSV output.
sep (str) – Field separator for CSV output.
- Return type:
None
- Returns:
None
- app.components.api_tools.intact.update(version, uniprots_to_get=None, organisms=None)[source]
Update the local IntAct cache if a newer release is available.
- Parameters:
version (str) – Current version string.
uniprots_to_get (set|None) – Optional set of UniProt base accessions to retain.
organisms (set|None) – Optional set of NCBI TaxIDs (as strings) to retain.
- Return type:
list[str]
- Returns:
New version string.
app.components.api_tools.uniprot module
UniProt data access and utilities.
This module provides helpers to download and process UniProtKB data via the public REST API, including pagination handling, column mapping, organism filtering, and convenience wrappers for common datasets used in ProteoGyver.
Main entry points:
- download_uniprot_chunks: batched download for an organism/field set
- download_full_uniprot_for_organism: comprehensive download builder
- retrieve_uniprot: cache-aware retrieval to a local TSV
- update: check and refresh local UniProt dataset for an organism
Notes
Large result sets are fetched using server-side pagination with retry policies and optional progress reporting.
- app.components.api_tools.uniprot.download_full_uniprot_for_organism(organism=None, columns=None, progress=False, overall_progress=False, reviewed_only=True)[source]
Downloads the full uniprot database EXCLUDING isoforms in a .tsv format for a given organism.
- Parameters:
organism (int) – integer ID or a list of integer IDs of the desired organism, e.g. human is 9606. If None, defaults to human.
columns (set) – a set of uniprot columns to get
progress (bool) – Print progress reports of how each batch download is going
overall_progress (bool) – Print progress reports when each batch is finished
reviewed_only – True, if only reviewed entries should be retrieved
- Return type:
DataFrame
- Returns:
pandas dataframe of the uniprot.
- app.components.api_tools.uniprot.download_uniprot_chunks(progress=False, organism=-1, fields=None, reviewed_only=True)[source]
Downloads whole uniprot for a given organism using pagination.
The Entry column will always be the first column and is used as the index in the output dataframe.
- Parameters:
progress (bool) – Progress report printing
organism (int) – human by default, otherwise specify organism ID (e.g. human is 9606). If -1, all organisms will be downloaded
fields (list|None) – uniprot field labels for fields to retrieve. Refer to https://www.uniprot.org/help/return_fields for help with field labels (from the label column). If None, download a default selection.
reviewed_only (bool) – if True, only reviewed entries will be returned. If False, all entries will be returned (probably not what you want).
- Return type:
DataFrame
- Returns:
the requested uniprot data in a pandas dataframe.
- app.components.api_tools.uniprot.download_uniprot_for_database(versions, organisms)[source]
Checks whether an update for uniprot is available and downloads it if necessary for database use.
- Parameters:
versions (list[str]) – List of current versions of the UniProt database.
organisms (set|None) – Optional set of organism IDs to include.
progress – True, if progress should be printed.
- Return type:
tuple[DataFrame,list[str]]
- Returns:
Tuple of (DataFrame of UniProt data, List of new versions).
- app.components.api_tools.uniprot.download_uniprot_pagination_url(pag_url, headers, progress)[source]
Handles downloading uniprot data from a paginated URL.
- Parameters:
pag_url (str) – url for uniprot
headers (list) – list of headers to get
progress (bool) – True, if progress should be printed
- Return type:
DataFrame
- Returns:
Pandas dataframe of the uniprot specified by url and headers.
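The UniProt REST API signals further result pages through an HTTP Link header carrying rel="next". Whether this module follows that header in exactly this way is an assumption; the helper next_page_url below is a hypothetical sketch of the parsing step only, with no network access.

```python
import re

# Captures the URL from a header such as: <https://...>; rel="next"
NEXT_LINK = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(link_header):
    """Hypothetical sketch: return the next-page URL, or None on the last page."""
    if not link_header:
        return None  # no Link header means there are no more pages
    match = NEXT_LINK.search(link_header)
    return match.group(1) if match else None
```

A download loop would request a page, collect its rows, then repeat with the URL returned here until it gets None.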
- app.components.api_tools.uniprot.get_default_uniprot_column_map()[source]
Returns a sensible, yet extensive default map of uniprot columns in a dict.
- Return type:
dict
- app.components.api_tools.uniprot.get_uniprot_column_map()[source]
Returns currently understood map of uniprot columns in a dict.
- Return type:
dict
- app.components.api_tools.uniprot.is_newer_available(current_version, organism=9606)[source]
Checks whether newer uniprot version is available
- Parameters:
current_version (str) – Current version of the uniprot database
organism (int) – which organism uniprot to check.
- Return type:
tuple[bool,str]
- Returns:
True, if newer uniprot version is available.
- app.components.api_tools.uniprot.methods_text(organism=9606)[source]
Generates a methods text for used uniprot data
- Return type:
tuple
- Returns:
a tuple of (readable reference information (str), PMID (str), uniprot description (str))
- app.components.api_tools.uniprot.retrieve_protein_group(name, query_col='protein_name', reviewed=True)[source]
Utility function to quickly download tsvs describing each common protein class used in ProteoGyver.
- Parameters:
name (str) – protein group name to search for
query_col (str) – column to search. Protein name by default.
- Return type:
DataFrame
- Returns:
pandas dataframe of the given protein group
- app.components.api_tools.uniprot.retrieve_uniprot(uniprotfile='Full human uniprot.tsv', **kwargs)[source]
Downloads full uniprot (reviewed entries only) to a file and returns the dataframe.
- Parameters:
uniprotfile (str) – path of the output file
kwargs – kwargs to pass down to download_full_uniprot_for_organism, e.g. to specify which organism or if progress should be reported.
- Return type:
DataFrame
- Returns:
full uniprot as a pandas dataframe.
Module contents
API tools subpackage for external data sources.