Working with Data Sources
This guide explains how to use the DataSourcesClient
to interact with OpenAIRE's data source information (e.g., repositories, journals, aggregators). You'll learn how to fetch individual data sources, search for them using various filters, and iterate over large result sets.
Accessing the Client
The DataSourcesClient
is accessed via an AireloomSession
instance:
import asyncio
from aireloom import AireloomSession
from bibliofabric.auth import NoAuth # Or your preferred auth strategy
async def main():
async with AireloomSession(auth_strategy=NoAuth()) as session:
# You can now access the data sources client
ds_client = session.data_sources
# ... use ds_client to make calls ...
print("DataSourcesClient is ready.")
if __name__ == "__main__":
asyncio.run(main())
Fetching a Single Data Source
To retrieve a specific data source by its OpenAIRE ID, use the get()
method.
import asyncio
from aireloom import AireloomSession
from bibliofabric.auth import NoAuth
from bibliofabric.exceptions import NotFoundError, BibliofabricError
# Example OpenAIRE ID for a data source
# These IDs are often prefixed like 'openaire____::datasourceId:'
DS_ID = "openaire____::datasourceId:doaj" # Directory of Open Access Journals
# DS_ID = "openaire____::datasourceId:zenodo" # Zenodo Repository
async def fetch_single_data_source():
async with AireloomSession(auth_strategy=NoAuth()) as session:
try:
print(f"Attempting to fetch data source with ID: {DS_ID}")
data_source = await session.data_sources.get(DS_ID)
print(f"\nSuccessfully fetched data source:")
print(f" ID: {data_source.id}")
print(f" Official Name: {data_source.officialName}")
print(f" English Name: {data_source.englishName if data_source.englishName else 'N/A'}")
if data_source.type:
print(f" Type: {data_source.type.value} (Scheme: {data_source.type.scheme})")
print(f" Website URL: {data_source.websiteUrl if data_source.websiteUrl else 'N/A'}")
print(f" OpenAIRE Compatibility: {data_source.openaireCompatibility if data_source.openaireCompatibility else 'N/A'}")
if data_source.contentTypes:
print(f" Content Types: {', '.join(data_source.contentTypes)}")
except NotFoundError:
print(f"Error: Data source with ID '{DS_ID}' not found.")
except BibliofabricError as e:
print(f"An Aireloom error occurred: {e}")
except Exception as e:
print(f"An unexpected error occurred: {e}")
if __name__ == "__main__":
asyncio.run(fetch_single_data_source())
The data_source
object returned is an instance of the DataSource
Pydantic model.
Searching Data Sources
To search for data sources based on various criteria, use the search()
method. This method supports pagination, sorting, and filtering.
import asyncio
from math import ceil
from aireloom import AireloomSession
from bibliofabric.auth import NoAuth
from aireloom.endpoints import DataSourcesFilters # Import the filter model
from bibliofabric.exceptions import ValidationError, BibliofabricError
async def search_data_sources_example():
async with AireloomSession(auth_strategy=NoAuth()) as session:
try:
print("Searching for data sources that are 'Journal' type and compatible with OpenAIRE...")
# Define filters using the DataSourcesFilters model
filters = DataSourcesFilters(
dataSourceTypeName="Journal", # Filter by data source type name
openaireCompatibility="openaire_guidelines_3.0_literature_repositories" # Example compatibility level
# officialName="Zenodo" # Example: filter by official name
# subjects=["biology"] # Example: filter by subject
)
# Perform the search
search_response = await session.data_sources.search(
filters=filters,
page=1, # Page number (1-indexed)
page_size=5, # Number of results per page
sort_by="relevance desc" # Sort by relevance (currently the only sort option for data sources)
)
header = search_response.header
results = search_response.results
total_results = header.numFound if header.numFound is not None else 0
page_size = header.pageSize if header.pageSize is not None else 5
total_pages = ceil(total_results / page_size) if page_size > 0 else 0
print(f"\nFound {total_results} data sources matching criteria.")
if total_results > 0:
print(f"Displaying page 1 of {total_pages} (approx.):")
if results:
for i, ds in enumerate(results):
print(f" Result {i+1}:")
print(f" Official Name: {ds.officialName}")
ds_type = ds.type.value if ds.type and ds.type.value else "N/A"
print(f" Type: {ds_type}")
print(f" ID: {ds.id}")
else:
print(" No data sources found for this page/filter combination.")
except ValidationError as e:
print(f"Validation error during search: {e}")
except BibliofabricError as e:
print(f"An Aireloom error occurred during search: {e}")
except Exception as e:
print(f"An unexpected error occurred during search: {e}")
if __name__ == "__main__":
asyncio.run(search_data_sources_example())
Filters (DataSourcesFilters
)
The filters
parameter takes an instance of DataSourcesFilters
from aireloom.endpoints
. Key filter fields include:
* search
: General keyword search.
* officialName
: Filter by official name.
* englishName
: Filter by English name.
* legalShortName
: Filter by legal short name (often of the owning organization).
* id
: Filter by OpenAIRE data source ID.
* pid
: Filter by a persistent identifier value.
* subjects
: List of subject keywords.
* dataSourceTypeName
: Filter by the type of data source (e.g., "Journal", "Repository", "Aggregator").
* contentTypes
: List of content types hosted (e.g., "publication", "dataset").
* openaireCompatibility
: Filter by OpenAIRE compatibility level string.
* relOrganizationId
: Filter by related organization ID.
* relCommunityId
: Filter by related community ID.
* relCollectedFromDatasourceId
: Filter by the data source ID from which this data source was collected (for metadata aggregations).
Refer to the DataSourcesFilters
model definition in aireloom.endpoints
for a complete list.
Sorting (sort_by
)
Currently, the primary valid sort field for data sources is:
* relevance
Format: "relevance asc"
or "relevance desc"
.
Response (DataSourceResponse
)
The search()
method returns a DataSourceResponse
object (ApiResponse[DataSource]
), containing:
* header
: A Header
object with pagination info.
* results
: A list of DataSource
model instances.
Iterating Over All Data Sources
To process all data sources matching criteria without manual pagination, use the iterate()
method.
import asyncio
from aireloom import AireloomSession
from bibliofabric.auth import NoAuth
from aireloom.endpoints import DataSourcesFilters
from bibliofabric.exceptions import ValidationError, BibliofabricError
async def iterate_all_data_sources():
async with AireloomSession(auth_strategy=NoAuth()) as session:
print("Iterating through 'Repository' type data sources...")
count = 0
max_results_to_show = 10 # Limit for example display
try:
filters = DataSourcesFilters(
dataSourceTypeName="Repository"
)
async for ds in session.data_sources.iterate(
filters=filters,
page_size=20, # How many to fetch per underlying API call
# sort_by="relevance desc" # Optional: sorting by relevance
):
count += 1
print(f" #{count}: {ds.officialName} (ID: {ds.id})")
if count >= max_results_to_show:
print(f"\nStopping iteration early after fetching {max_results_to_show} results for this example.")
break
print(f"\nFinished iterating. Total data sources processed in this run (up to limit): {count}")
except ValidationError as e:
print(f"Validation error during iteration: {e}")
except BibliofabricError as e:
print(f"An Aireloom error occurred during iteration: {e}")
except Exception as e:
print(f"An unexpected error occurred during iteration: {e}")
if __name__ == "__main__":
asyncio.run(iterate_all_data_sources())
The DataSource
Model
The DataSource
Pydantic model (defined in aireloom.models.data_source
) provides structured access to data source information. Key attributes include:
id
(str): The OpenAIRE ID of the data source.originalIds
(Optional[list[str]]): List of original IDs from other systems.pids
(Optional[list[ControlledField]]): List of persistent identifiers (e.g., OAI-PMH base URL).ControlledField
hasscheme
andvalue
.type
(Optional[ControlledField]): The type of data source (e.g., repository, journal).ControlledField
hasscheme
andvalue
.openaireCompatibility
(Optional[str]): OpenAIRE compatibility level.officialName
(Optional[str]): The official name.englishName
(Optional[str]): The English name, if different.websiteUrl
(Optional[str]): URL of the data source's website.logoUrl
(Optional[str]): URL of the data source's logo.dateOfValidation
(Optional[str]): Date of last validation by OpenAIRE.description
(Optional[str]): Description of the data source.subjects
(Optional[list[str]]): List of main subjects covered.languages
(Optional[list[str]]): List of languages of content.contentTypes
(Optional[list[str]]): List of content types (e.g., "publication", "dataset", "software").accessRights
(Optional[Literal["open", "restricted", "closed"]]): Default access rights.policies
(Optional[list[str]]): Links to policies.journal
(Optional[Container]): If the data source is a journal, this field (fromaireloom.models.research_product.Container
) may contain journal-specific metadata like ISSNs.
Refer to the aireloom.models.data_source.DataSource
model definition for all available fields.