Data Process

fedgraph.data_process.GC_rand_split_chunk(graphs: list, num_trainer: int = 10, overlap: bool = False, seed: int = 42) list[source]

Randomly split graphs into chunks for each trainer.

Parameters:
  • graphs (list) – The list of graphs.

  • num_trainer (int) – The number of trainers.

  • overlap (bool) – Whether trainers have overlapping data.

  • seed (int) – Seed for randomness.

Returns:

graphs_chunks – The list of chunks for each trainer.

Return type:

list
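
Example

A minimal usage sketch, assuming torch_geometric is installed; the PROTEINS graphs are only an illustration of a list of graph objects.

from torch_geometric.datasets import TUDataset

from fedgraph.data_process import GC_rand_split_chunk

# Any list of graph objects works; here we use the PROTEINS graphs from TUDataset.
graphs = list(TUDataset(root="./data", name="PROTEINS"))

# Split the graphs into disjoint chunks, one chunk per trainer.
chunks = GC_rand_split_chunk(graphs, num_trainer=4, overlap=False, seed=42)

print(len(chunks))                        # 4 chunks, one per trainer
print([len(chunk) for chunk in chunks])   # size of each trainer's chunk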

fedgraph.data_process.NC_load_data(dataset_str: str) tuple[source]

Loads input data from the ‘gcn/data’ directory and processes it into a format suitable for training GCN and similar models.

Parameters:

dataset_str (str) – Name of the dataset to be loaded.

Returns:

  • features (torch.Tensor) – Node feature matrix as a float tensor.

  • adj (torch.Tensor or torch_sparse.tensor.SparseTensor) – Adjacency matrix of the graph.

  • labels (torch.Tensor) – Labels of the nodes.

  • idx_train (torch.LongTensor) – Indices of training nodes.

  • idx_val (torch.LongTensor) – Indices of validation nodes.

  • idx_test (torch.LongTensor) – Indices of test nodes.

Note

  • ind.dataset_str.x – the feature vectors of the training instances as a scipy.sparse.csr.csr_matrix object;

  • ind.dataset_str.tx – the feature vectors of the test instances as a scipy.sparse.csr.csr_matrix object;

  • ind.dataset_str.allx – the feature vectors of both labeled and unlabeled training instances (a superset of ind.dataset_str.x) as a scipy.sparse.csr.csr_matrix object;

  • ind.dataset_str.y – the one-hot labels of the labeled training instances as a numpy.ndarray object;

  • ind.dataset_str.ty – the one-hot labels of the test instances as a numpy.ndarray object;

  • ind.dataset_str.ally – the labels for instances in ind.dataset_str.allx as a numpy.ndarray object;

  • ind.dataset_str.graph – a dict in the format {index: [index_of_neighbor_nodes]} as a collections.defaultdict object;

  • ind.dataset_str.test.index – the indices of test instances in the graph, for the inductive setting, as a list object.

All objects above must be saved using the Python pickle module.
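
Example

A minimal sketch, assuming the pickled files described in the note above are available for the "cora" dataset in the expected data directory.

from fedgraph.data_process import NC_load_data

features, adj, labels, idx_train, idx_val, idx_test = NC_load_data("cora")

print(features.shape)    # (num_nodes, num_features), float tensor
print(labels.shape)      # (num_nodes,)
print(idx_train.shape)   # number of training node indices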

fedgraph.data_process.NC_parse_index_file(filename: str) list[source]

Reads and parses an index file.

Parameters:

filename (str) – Name or path of the file to parse.

Returns:

index – List of integers, one parsed from each line of the input file.

Return type:

list
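
Example

An illustrative sketch; the file path is hypothetical and should point to an index file with one integer per line, such as the ind.dataset_str.test.index file described above.

from fedgraph.data_process import NC_parse_index_file

# Hypothetical path to a test-index file with one integer per line.
test_idx = NC_parse_index_file("gcn/data/ind.cora.test.index")

print(test_idx[:5])                                # first five parsed indices
print(all(isinstance(i, int) for i in test_idx))   # True, per the return type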

fedgraph.data_process.data_loader(args: AttriDict) Any[source]

Load data for federated learning tasks.

Parameters:

args (attridict) – The configuration of the task.

Returns:

data – The data for the task.

Return type:

Any

Note

The function dispatches to the task-specific data loader: data_loader_NC for “NC” and data_loader_GC for “GC”. If the task is “LP”, only the country code needs to be specified at this stage, and the function returns None.
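
Example

A minimal configuration sketch for a node classification task; the exact keys expected in args depend on the fedgraph configuration files, so the keys shown here are assumptions.

import attridict

from fedgraph.data_process import data_loader

config = {
    "fedgraph_task": "NC",   # assumed key selecting the NC / GC / LP branch
    "dataset": "cora",       # assumed key naming the dataset to load
}
args = attridict(config)

data = data_loader(args)     # dispatches to data_loader_NC for the "NC" task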

fedgraph.data_process.data_loader_GC(args: AttriDict) dict[source]

Load data for graph classification tasks.

Parameters:

args (attridict) – The configuration of the task.

Returns:

data – The data for the task.

Return type:

dict

fedgraph.data_process.data_loader_GC_multiple(datapath: str, dataset_group: str = 'small', batch_size: int = 32, convert_x: bool = False, seed: int = 42) dict[source]

Graph Classification: prepare data from a group of datasets for multiple trainers.

Parameters:
  • datapath (str) – The path to the input data.

  • dataset_group (str) – The name of the dataset group.

  • batch_size (int) – The batch size for graph classification.

  • convert_x (bool) – Whether to convert node features to one-hot degree.

  • seed (int) – Seed for randomness.

Returns:

splited_data – The data for each trainer.

Return type:

dict
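
Example

A hypothetical call that prepares the predefined "small" dataset group; each dataset in the group is presumably assigned to its own trainer.

from fedgraph.data_process import data_loader_GC_multiple

splited_data = data_loader_GC_multiple(
    datapath="./data",
    dataset_group="small",
    batch_size=32,
    convert_x=False,
    seed=42,
)

print(list(splited_data.keys()))   # expected: one entry per trainer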

fedgraph.data_process.data_loader_GC_single(datapath: str, dataset: str = 'PROTEINS', num_trainer: int = 10, batch_size: int = 128, convert_x: bool = False, seed: int = 42, overlap: bool = False) dict[source]

Graph Classification: prepare data from a single dataset for multiple trainers.

Parameters:
  • datapath (str) – The path to the input data.

  • dataset (str) – The name of the dataset, which should be available in TUDataset.

  • num_trainer (int) – The number of trainers.

  • batch_size (int) – The batch size for graph classification.

  • convert_x (bool) – Whether to convert node features to one-hot degree.

  • seed (int) – Seed for randomness.

  • overlap (bool) – Whether trainers have overlapping data.

Returns:

splited_data – The data for each trainer.

Return type:

dict
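
Example

A hypothetical call that splits the graphs of the PROTEINS dataset into disjoint chunks for four trainers.

from fedgraph.data_process import data_loader_GC_single

splited_data = data_loader_GC_single(
    datapath="./data",
    dataset="PROTEINS",
    num_trainer=4,
    batch_size=128,
    convert_x=False,
    seed=42,
    overlap=False,
)

print(list(splited_data.keys()))   # expected: one entry per trainer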

fedgraph.data_process.data_loader_NC(args: AttriDict) tuple[source]

Load data for node classification tasks.

fedgraph.data_process.download_file_from_github(url: str, save_path: str)[source]

Downloads a file from a GitHub URL and saves it to a specified local path.

Note