Data Process

fedgraph.data_process.GC_rand_split_chunk(graphs: list, num_trainer: int = 10, overlap: bool = False, seed: int = 42) list[source]

Randomly split graphs into chunks for each trainer.

Parameters:
  • graphs (list) – The list of graphs.

  • num_trainer (int) – The number of trainers.

  • overlap (bool) – Whether trainers have overlapping data.

  • seed (int) – Seed for randomness.

Returns:

graphs_chunks – The list of chunks for each trainer.

Return type:

list
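
Example

A minimal usage sketch, assuming torch_geometric is installed; the PROTEINS graphs are only an illustration of a list of graph objects.

from torch_geometric.datasets import TUDataset

from fedgraph.data_process import GC_rand_split_chunk

# Any list of graph objects works; here we use the PROTEINS graphs from TUDataset.
graphs = list(TUDataset(root="./data", name="PROTEINS"))

# Split the graphs into disjoint chunks, one chunk per trainer.
chunks = GC_rand_split_chunk(graphs, num_trainer=4, overlap=False, seed=42)

print(len(chunks))                        # 4 chunks, one per trainer
print([len(chunk) for chunk in chunks])   # size of each trainer's chunk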

fedgraph.data_process.NC_load_data(dataset_str: str) tuple[source]

Loads input data from the ‘gcn/data’ directory and processes it into a format suitable for training GCN and similar models.

Parameters:

dataset_str (str) – Name of the dataset to be loaded.

Returns:

  • features (torch.Tensor) – Node feature matrix as a float tensor.

  • adj (torch.Tensor or torch_sparse.tensor.SparseTensor) – Adjacency matrix of the graph.

  • labels (torch.Tensor) – Labels of the nodes.

  • idx_train (torch.LongTensor) – Indices of training nodes.

  • idx_val (torch.LongTensor) – Indices of validation nodes.

  • idx_test (torch.LongTensor) – Indices of test nodes.

Note

  • ind.dataset_str.x – the feature vectors of the training instances as a scipy.sparse.csr.csr_matrix object;

  • ind.dataset_str.tx – the feature vectors of the test instances as a scipy.sparse.csr.csr_matrix object;

  • ind.dataset_str.allx – the feature vectors of both labeled and unlabeled training instances (a superset of ind.dataset_str.x) as a scipy.sparse.csr.csr_matrix object;

  • ind.dataset_str.y – the one-hot labels of the labeled training instances as a numpy.ndarray object;

  • ind.dataset_str.ty – the one-hot labels of the test instances as a numpy.ndarray object;

  • ind.dataset_str.ally – the labels for instances in ind.dataset_str.allx as a numpy.ndarray object;

  • ind.dataset_str.graph – a dict in the format {index: [index_of_neighbor_nodes]} as a collections.defaultdict object;

  • ind.dataset_str.test.index – the indices of test instances in the graph, for the inductive setting, as a list object.

All objects above must be saved using the Python pickle module.
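
Example

A minimal sketch, assuming the pickled files described in the note above are available for the "cora" dataset in the expected data directory.

from fedgraph.data_process import NC_load_data

features, adj, labels, idx_train, idx_val, idx_test = NC_load_data("cora")

print(features.shape)    # (num_nodes, num_features), float tensor
print(labels.shape)      # (num_nodes,)
print(idx_train.shape)   # number of training node indices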

fedgraph.data_process.NC_parse_index_file(filename: str) list[source]

Reads and parses an index file.

Parameters:

filename (str) – Name or path of the file to parse.

Returns:

index – List of integers, one parsed from each line of the input file.

Return type:

list
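
Example

An illustrative sketch; the file path is hypothetical and should point to an index file with one integer per line, such as the ind.dataset_str.test.index file described above.

from fedgraph.data_process import NC_parse_index_file

# Hypothetical path to a test-index file with one integer per line.
test_idx = NC_parse_index_file("gcn/data/ind.cora.test.index")

print(test_idx[:5])                                # first five parsed indices
print(all(isinstance(i, int) for i in test_idx))   # True, per the return type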

fedgraph.data_process.data_loader(args: AttriDict) Any[source]

Load data for federated learning tasks.

Parameters:

args (attridict) – The configuration of the task.

Returns:

data – The data for the task.

Return type:

Any

Note

The function dispatches to the task-specific data loader: data_loader_NC for “NC” and data_loader_GC for “GC”. If the task is “LP”, only the country code needs to be specified at this stage, and the function returns None.
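
Example

A minimal configuration sketch for a node classification task; the exact keys expected in args depend on the fedgraph configuration files, so the keys shown here are assumptions.

import attridict

from fedgraph.data_process import data_loader

config = {
    "fedgraph_task": "NC",   # assumed key selecting the NC / GC / LP branch
    "dataset": "cora",       # assumed key naming the dataset to load
}
args = attridict(config)

data = data_loader(args)     # dispatches to data_loader_NC for the "NC" task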

fedgraph.data_process.data_loader_GC(args: AttriDict) dict[source]

Load data for graph classification tasks.

Parameters:

args (attridict) – The configuration of the task.

Returns:

data – The data for the task.

Return type:

dict

fedgraph.data_process.data_loader_GC_multiple(datapath: str, dataset_group: str = 'small', batch_size: int = 32, convert_x: bool = False, seed: int = 42) dict[source]

Graph Classification: prepare data from a group of datasets for multiple trainers.

Parameters:
  • datapath (str) – The path to the input data.

  • dataset_group (str) – The name of the dataset group.

  • batch_size (int) – The batch size for graph classification.

  • convert_x (bool) – Whether to convert node features to one-hot degree.

  • seed (int) – Seed for randomness.

Returns:

splited_data – The data for each trainer.

Return type:

dict
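
Example

A hypothetical call that prepares the predefined "small" dataset group; each dataset in the group is presumably assigned to its own trainer.

from fedgraph.data_process import data_loader_GC_multiple

splited_data = data_loader_GC_multiple(
    datapath="./data",
    dataset_group="small",
    batch_size=32,
    convert_x=False,
    seed=42,
)

print(list(splited_data.keys()))   # expected: one entry per trainer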

fedgraph.data_process.data_loader_GC_single(datapath: str, dataset: str = 'PROTEINS', num_trainer: int = 10, batch_size: int = 128, convert_x: bool = False, seed: int = 42, overlap: bool = False) dict[source]

Graph Classification: prepare data from a single dataset for multiple trainers.

Parameters:
  • datapath (str) – The path to the input data.

  • dataset (str) – The name of the dataset, which should be available in TUDataset.

  • num_trainer (int) – The number of trainers.

  • batch_size (int) – The batch size for graph classification.

  • convert_x (bool) – Whether to convert node features to one-hot degree.

  • seed (int) – Seed for randomness.

  • overlap (bool) – Whether trainers have overlapping data.

Returns:

splited_data – The data for each trainer.

Return type:

dict
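
Example

A hypothetical call that splits the graphs of the PROTEINS dataset into disjoint chunks for four trainers.

from fedgraph.data_process import data_loader_GC_single

splited_data = data_loader_GC_single(
    datapath="./data",
    dataset="PROTEINS",
    num_trainer=4,
    batch_size=128,
    convert_x=False,
    seed=42,
    overlap=False,
)

print(list(splited_data.keys()))   # expected: one entry per trainer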

fedgraph.data_process.data_loader_NC(args: AttriDict) tuple[source]

Load data for node classification tasks.

fedgraph.data_process.download_file_from_github(url: str, save_path: str)[source]

Downloads a file from a GitHub URL and saves it to a specified local path.

Note