privkit.data package#

class privkit.data.DataType#

Bases: ABC

DataType is an abstract class for a generic type of data. Defines a series of methods common to all data types. Provides basic functions to load, process, and save data. Requires the definition of the DATA_TYPE_ID, DATA_TYPE_NAME, and DATA_TYPE_INFO.

property DATA_TYPE_ID: str#

Identifier of the data type

property DATA_TYPE_INFO: str#

Information of the data type, specifically the format of the files to be read

property DATA_TYPE_NAME: str#

Name of the data type

abstract load_data(*args)#

Loads data. This is specific to the data type

abstract process_data(*args)#

Performs data processing or returns data processing methods. This is specific to the data type

abstract save_data(*args)#

Saves data to a file. This is specific to the data type
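The contract above (three identifying attributes plus three abstract methods) can be sketched with the standard abc module. This is a minimal stand-in, not the privkit source: DataTypeSketch and ListData are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod


class DataTypeSketch(ABC):
    """Hypothetical stand-in that mirrors the documented DataType contract."""

    @property
    @abstractmethod
    def DATA_TYPE_ID(self) -> str: ...

    @property
    @abstractmethod
    def DATA_TYPE_NAME(self) -> str: ...

    @property
    @abstractmethod
    def DATA_TYPE_INFO(self) -> str: ...

    @abstractmethod
    def load_data(self, *args): ...

    @abstractmethod
    def process_data(self, *args): ...

    @abstractmethod
    def save_data(self, *args): ...


class ListData(DataTypeSketch):
    # Plain class attributes satisfy the abstract properties.
    DATA_TYPE_ID = "list_data"
    DATA_TYPE_NAME = "List Data"
    DATA_TYPE_INFO = "Data is any Python iterable of numbers."

    def __init__(self):
        self.data = None

    def load_data(self, values):
        self.data = list(values)

    def process_data(self):
        self.data = sorted(self.data)

    def save_data(self, filepath):
        with open(filepath, "w") as f:
            f.write("\n".join(str(v) for v in self.data))


data = ListData()
data.load_data([3, 1, 2])
data.process_data()
```

Instantiating DataTypeSketch directly raises TypeError, which is how an abstract base enforces that every concrete data type defines all six members.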

class privkit.data.FacialData(id_name: str | None = None)#

Bases: DataType

FacialData is a privkit.DataType to handle facial data. Facial data is defined as a collection of points in 3D space, where each point is represented by its coordinates (x, y, z) and, optionally, additional attributes like color or normal vectors. It is stored as an Open3D data structure with the PointCloud class.

DATA_TYPE_ID = 'facial_data'#
DATA_TYPE_INFO = 'Facial data can be imported through an Open3D data structure or read by a PLY file. To be supported, data should contain at least one point (x, y, z).'#
DATA_TYPE_NAME = 'Facial Data'#
crop_pcd(xmin: float, xmax: float, ymin: float, ymax: float, zmin: float, zmax: float)#

Segment the point cloud with a bounding box

Parameters:
  • xmin (float) – minimum x-coordinate of the bounding box

  • xmax (float) – maximum x-coordinate of the bounding box

  • ymin (float) – minimum y-coordinate of the bounding box

  • ymax (float) – maximum y-coordinate of the bounding box

  • zmin (float) – minimum z-coordinate of the bounding box

  • zmax (float) – maximum z-coordinate of the bounding box
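As an illustration of bounding-box segmentation, the sketch below filters plain (x, y, z) tuples. It does not use Open3D and is not the privkit implementation; crop_points is a hypothetical helper, and treating the bounds as inclusive is an assumption.

```python
def crop_points(points, xmin, xmax, ymin, ymax, zmin, zmax):
    """Keep the (x, y, z) points inside the axis-aligned box (bounds inclusive)."""
    return [
        (x, y, z)
        for x, y, z in points
        if xmin <= x <= xmax and ymin <= y <= ymax and zmin <= z <= zmax
    ]


points = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (2.0, 0.0, 0.0)]
cropped = crop_points(points, -1, 1, -1, 1, -1, 1)  # drops the point at x = 2.0
```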

data#

Facial data is stored as an Open3D data structure

fp_downsample(N: int)#

Downsample the point cloud with the Farthest Point Sampling technique

Parameters:

N (int) – number of points in the sampled point cloud
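Farthest Point Sampling greedily picks, at each step, the point that is farthest from all points selected so far. The sketch below shows the idea in plain Python; it is not the privkit or Open3D code, and the function name and the start parameter are assumptions made for illustration.

```python
import math


def farthest_point_sample(points, n, start=0):
    """Return the indices of n points chosen by greedy farthest point sampling."""
    if not 0 < n <= len(points):
        raise ValueError("n must be between 1 and the number of points")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (5, 0, 0)]
idx = farthest_point_sample(points, 3)  # indices 0, 2, 3
```

The sample spreads over the whole cloud: the near-duplicate point at x = 1 is the one left out.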

get_color()#

Returns a boolean value indicating whether the point cloud has color

Returns:

True if the point cloud has color, False otherwise

get_number_of_points()#

Returns the number of points of the point cloud

Returns:

number of points of the point cloud

get_point_mean_std()#

Returns the average coordinate of the point cloud points along with the standard deviation

Returns:

average point cloud coordinate and standard deviation

get_point_median()#

Returns the median coordinate of the point cloud points

Returns:

median point cloud coordinate
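A minimal sketch of these summary statistics, assuming they are computed independently per axis and that the standard deviation is the population one (both are assumptions; the documentation does not say). The helper names are hypothetical.

```python
import statistics


def point_mean_std(points):
    """Per-axis mean and population standard deviation of (x, y, z) points."""
    columns = list(zip(*points))
    mean = tuple(statistics.fmean(c) for c in columns)
    std = tuple(statistics.pstdev(c) for c in columns)
    return mean, std


def point_median(points):
    """Per-axis median of (x, y, z) points."""
    return tuple(statistics.median(c) for c in zip(*points))


points = [(0.0, 0.0, 1.0), (2.0, 4.0, 1.0), (4.0, 8.0, 1.0)]
mean, std = point_mean_std(points)
median = point_median(points)
```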

id_name#

Identifier name of this facial data instance

load_data(pcd_or_filepath: PointCloud)#

Loads facial data from a PointCloud, an array of point coordinates with shape (number of points, 3), or a file that can be read using Open3D’s read_point_cloud() method.

Parameters:

pcd_or_filepath (PointCloud or str or Path or array) – either a PointCloud instance, a file path to a point cloud, or an array with the point coordinates

print_data_summary()#

Prints data summary, specifying the number of points, and color availability

process_data()#

Performs data processing or returns data processing methods. This is specific to the data type

remove_outliers_statistical(nb_neighbors: float, std_ratio: float)#

Remove outliers from the point cloud based on the distance of each point to its neighbors

Parameters:
  • nb_neighbors (float) – number of neighbors for outlier detection

  • std_ratio (float) – standard deviation for threshold computation
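The usual statistical outlier criterion computes, for every point, the mean distance to its nearest neighbors, and discards points whose mean distance exceeds the global mean by more than std_ratio standard deviations. The brute-force sketch below illustrates that criterion in plain Python; it is not the privkit or Open3D code (which would use a spatial index), and it assumes nb_neighbors is an integer smaller than the number of points.

```python
import math
import statistics


def remove_statistical_outliers(points, nb_neighbors, std_ratio):
    """Drop points whose mean neighbor distance is above mean + std_ratio * std."""
    mean_dists = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first (O(n^2) overall).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_dists.append(statistics.fmean(dists[:nb_neighbors]))
    threshold = statistics.fmean(mean_dists) + std_ratio * statistics.pstdev(mean_dists)
    return [p for p, d in zip(points, mean_dists) if d <= threshold]


points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (50, 50, 50)]
inliers = remove_statistical_outliers(points, nb_neighbors=2, std_ratio=1.0)
```

A smaller std_ratio makes the filter more aggressive, since the threshold moves closer to the average neighbor distance.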

remove_points_outside_sphere(center: ndarray, radius: float)#

Segment the point cloud with a sphere

Parameters:
  • center (ndarray) – coordinates of the center of the sphere

  • radius (float) – radius of the sphere
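Sphere segmentation reduces to a distance test against the center. A minimal sketch on plain tuples follows; the helper name is hypothetical and keeping points that lie exactly on the surface is an assumption.

```python
import math


def keep_points_inside_sphere(points, center, radius):
    """Keep the points whose distance to center is at most radius."""
    return [p for p in points if math.dist(p, center) <= radius]


points = [(0, 0, 0), (0, 0, 2), (3, 4, 0)]
kept = keep_points_inside_sphere(points, center=(0, 0, 0), radius=2.0)
```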

save_data(filepath: str = './input/data/', filename: str | None = None, extension: str = 'ply')#

Saves data to a file.

Parameters:
  • filepath (str) – path where data should be saved.

  • filename (str) – name of the file to be saved.

  • extension (str) – extension of the format of how the file should be saved. The default value is ‘ply’.

class privkit.data.LocationData(id_name: str | None = None)#

Bases: DataType

LocationData is a privkit.DataType to handle location data. Location data is defined by <latitude, longitude> coordinates (and, optionally, a datetime) and is stored as a Pandas DataFrame.

DATA_TYPE_ID = 'location_data'#
DATA_TYPE_INFO = 'Location data can be imported through a Pandas dataframe or a read by a delimited file, a file-like object or an object. To be supported, data should contain at least one point (latitude, longitude). '#
DATA_TYPE_NAME = 'Location Data'#
average_update_rate()#

Returns the average update rate (i.e. timedelta between subsequent points)

Returns:

average update rate

compute_timedelta(time_unit: str = 's', boxplot: bool = False) List#

Computes the timedelta between the datetime of the trajectory points

Parameters:
  • time_unit (str) – time unit (e.g. seconds, minutes or hours). The default is s (seconds).

  • boxplot (bool) – if True, a boxplot is generated

Returns:

list of timedelta values
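The timedelta computation can be illustrated on a plain list of datetimes. This is a sketch, not the privkit implementation: the unit labels in SECONDS_PER_UNIT are hypothetical (only 's' appears in the documented signature), and the boxplot option is left out.

```python
import statistics
from datetime import datetime

# Hypothetical unit labels; only "s" is confirmed by the documented default.
SECONDS_PER_UNIT = {"s": 1, "min": 60, "h": 3600}


def timedeltas(datetimes, time_unit="s"):
    """Gaps between consecutive datetimes, expressed in the given unit."""
    scale = SECONDS_PER_UNIT[time_unit]
    return [
        (later - earlier).total_seconds() / scale
        for earlier, later in zip(datetimes, datetimes[1:])
    ]


stamps = [
    datetime(2024, 1, 1, 8, 0, 0),
    datetime(2024, 1, 1, 8, 0, 30),
    datetime(2024, 1, 1, 8, 2, 30),
]
deltas = timedeltas(stamps)            # [30.0, 120.0]
average = statistics.fmean(deltas)     # 75.0, an average update rate
```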

create_grid(min_lat: float, max_lat: float, min_lon: float, max_lon: float, spacing: float, timestamp: int | None = None)#

Discretizes the space defined by the min and max latitude and longitude of the location data

Parameters:
  • min_lat (float) – minimum latitude coordinate

  • max_lat (float) – maximum latitude coordinate

  • min_lon (float) – minimum longitude coordinate

  • max_lon (float) – maximum longitude coordinate

  • spacing (float) – grid cell spacing in meters

  • timestamp (int) – time interval
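One way to discretize a latitude/longitude box into cells of roughly spacing meters is to convert degrees to meters with a spherical-Earth approximation. The sketch below follows that approach; it is an assumption about the technique rather than the privkit implementation, the constant and helper names are hypothetical, and the timestamp dimension is omitted.

```python
import math

METERS_PER_DEGREE = 111_320.0  # rough length of one degree of latitude


def grid_shape(min_lat, max_lat, min_lon, max_lon, spacing):
    """Number of (rows, cols) needed for cells of about `spacing` meters."""
    mid_lat = math.radians((min_lat + max_lat) / 2)
    height_m = (max_lat - min_lat) * METERS_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    width_m = (max_lon - min_lon) * METERS_PER_DEGREE * math.cos(mid_lat)
    rows = max(1, math.ceil(height_m / spacing))
    cols = max(1, math.ceil(width_m / spacing))
    return rows, cols


def cell_of(lat, lon, min_lat, max_lat, min_lon, max_lon, spacing):
    """Map a point inside the box to its (row, col) cell."""
    rows, cols = grid_shape(min_lat, max_lat, min_lon, max_lon, spacing)
    row = min(int((lat - min_lat) / (max_lat - min_lat) * rows), rows - 1)
    col = min(int((lon - min_lon) / (max_lon - min_lon) * cols), cols - 1)
    return row, col


box = (0.0, 0.01, 0.0, 0.01)  # about 1.1 km on each side at the equator
shape = grid_shape(*box, spacing=500)
center_cell = cell_of(0.005, 0.005, *box, spacing=500)
```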

data#

Location data is stored as a pd.DataFrame

divide_data(test_size: float = 0.2)#

Divides data into train and test

Parameters:

test_size (float) – size of test data. The default is 0.2.

filter_by_distance(min_distance: float = 0, max_distance: float = 2000)#

Filters trajectories by distance to avoid either extremely long or short trajectories

Parameters:
  • min_distance (float) – minimum distance that the trajectory must have. The default is 0 meters.

  • max_distance (float) – maximum distance that the trajectory must have. The default is 2000 meters = 2 km.
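A distance filter of this kind can be sketched with the haversine formula. The example assumes that the distance of a trajectory is its path length (the sum of the distances between consecutive points); the documentation does not state this, and the helper names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def trajectory_length_m(points):
    """Sum of the distances between consecutive (lat, lon) points."""
    return sum(haversine_m(*a, *b) for a, b in zip(points, points[1:]))


def keep_by_length(trajectories, min_distance=0, max_distance=2000):
    """Keep trajectories whose length lies within [min_distance, max_distance]."""
    return {
        tid: pts
        for tid, pts in trajectories.items()
        if min_distance <= trajectory_length_m(pts) <= max_distance
    }


trajectories = {
    "short": [(41.0, -8.0), (41.001, -8.0)],  # roughly 111 m
    "long": [(41.0, -8.0), (41.1, -8.0)],     # roughly 11 km
}
kept = keep_by_length(trajectories)
```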

filter_by_duration(min_duration: float = 60, max_duration: float = 7200, time_unit: str = 's')#

Filters trajectories by duration to avoid either extremely long or short trajectories

Parameters:
  • min_duration (float) – minimum duration that the trajectory must have. The default is 60 seconds.

  • max_duration (float) – maximum duration that the trajectory must have. The default is 2 hours = 2x3600 seconds.

  • time_unit (str) – time unit (e.g. seconds, minutes or hours). The default is s (seconds).
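The duration filter can be illustrated in the same spirit, taking the duration of a trajectory as the time between its first and last point (an assumption). The sketch works in seconds only and uses a hypothetical helper name.

```python
from datetime import datetime, timedelta


def keep_by_duration(trajectories, min_duration=60, max_duration=7200):
    """Keep trajectories lasting between min_duration and max_duration seconds."""
    kept = {}
    for tid, stamps in trajectories.items():
        duration = (max(stamps) - min(stamps)).total_seconds()
        if min_duration <= duration <= max_duration:
            kept[tid] = stamps
    return kept


start = datetime(2024, 1, 1, 8, 0, 0)
trajectories = {
    "too_short": [start, start + timedelta(seconds=10)],
    "ok": [start, start + timedelta(minutes=10)],
    "too_long": [start, start + timedelta(hours=3)],
}
kept = keep_by_duration(trajectories)
```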

filter_by_timedelta(timedelta: float, time_unit: str = 's')#

Filters trajectories by timedelta to avoid discontinuity between points

Parameters:
  • timedelta (float) – defines the maximum timedelta between subsequent points

  • time_unit (str) – time unit (e.g. seconds, minutes or hours). The default is s (seconds).

filter_outside_points(min_latitude: float, max_latitude: float, min_longitude: float, max_longitude: float)#

Filters all location points that fall outside the given latitude/longitude grid/bounding-box. Note: this can produce time gaps

Parameters:
  • min_latitude (float) – minimum latitude coordinate

  • max_latitude (float) – maximum latitude coordinate

  • min_longitude (float) – minimum longitude coordinate

  • max_longitude (float) – maximum longitude coordinate

generate_ground_truth(mechanism: str, G: MultiDiGraph | None = None, sigma: float = 6.86, error_range: float = 50, lambda_y: float = 0.69, lambda_z: float = 13.35)#

Generates ground truth by applying a mechanism.

For the Map-Matching mechanism, the following references were used:

[1] Goh, C. Y., Dauwels, J., Mitrovic, N., Asif, M. T., Oran, A., & Jaillet, P. (2012, September). Online map-matching based on hidden markov model for real-time traffic sensing applications. In 2012 15th International IEEE Conference on Intelligent Transportation Systems (pp. 776-781). IEEE.

[2] Jagadeesh, G. R., & Srikanthan, T. (2017). Online map-matching of noisy and sparse location data with hidden Markov and route choice models. IEEE Transactions on Intelligent Transportation Systems, 18(9), 2423-2434.

Parameters:
  • mechanism (str) – mechanism identifier that should be used to generate ground truth data

  • G (networkx.MultiDiGraph) – road network represented as a directed graph

  • sigma (float) – standard deviation of the location measurement noise in meters. The default value is 6.86 [1].

  • error_range (float) – defines the range to search for states for each observation point. The default value is 50 [1].

  • lambda_y (float) – multiplier of circuitousness used to compute the probability of transition. The default value is 0.69 [2].

  • lambda_z (float) – multiplier of temporal implausibility used to compute the probability of transition. The default value is 13.35 [2].

get_bounding_box_range()#

Returns the minimum and maximum latitude/longitude covered by the data

Returns:

min_latitude, max_latitude, min_longitude, max_longitude

get_datetime_range()#

Returns the datetime range of the data

Returns:

minimum datetime, maximum datetime

get_number_of_trajectories()#

Returns the number of trajectories

Returns:

number of trajectories

get_number_of_users()#

Returns the number of users

Returns:

number of users

get_test_data()#

Returns test data

Returns:

test data

get_train_data()#

Returns train data

Returns:

train data

get_trajectories(user_id: int | None = None, trajectory_id: int | None = None)#

Gets trajectories from the given location data grouped by trajectory id and/or user id.

Parameters:
  • user_id (int) – user identifier whose trajectories should be returned.

  • trajectory_id (int) – trajectory identifier to return

Returns:

trajectories

Return type:

pd.DataFrameGroupBy

get_trajectory_statistics()#

Returns trajectory statistics, specifically the total number of points, the minimum and maximum number of points in a trajectory, and the average number of points per trajectory.

Returns:

total_number_of_points, min_number_of_points, max_number_of_points, average_number_of_points
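These four values can be derived from the trajectory identifier of each point alone. A minimal sketch, assuming one identifier per location point; trajectory_statistics here is a hypothetical free function, not the method itself.

```python
from collections import Counter


def trajectory_statistics(trajectory_ids):
    """Return (total, min, max, average) number of points per trajectory."""
    counts = Counter(trajectory_ids)
    total = sum(counts.values())
    return total, min(counts.values()), max(counts.values()), total / len(counts)


# One trajectory id per location point: three trajectories with 3, 2 and 5 points.
tids = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
stats = trajectory_statistics(tids)
```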

id_name#

Identifier name of this location data instance

load_data(data_to_load: DataFrame, latitude: str = 'lat', longitude: str = 'lon', datetime: str = 'datetime', user_id: str = 'uid', trajectory_id: str = 'tid', save: bool = False, **kwargs)#

Loads location data from a pd.DataFrame or from a file that can be read by one of the Pandas read methods. The parameter definitions follow those of Pandas.

Parameters:
  • data_to_load (DataFrame or str or Path or object) – either a Pandas DataFrame, a path to a file (str or Path), or any object that can be read by the Pandas read methods.

  • latitude (str or int) – the position or the name of the column containing the latitude. The default is constants.LATITUDE.

  • longitude (str or int) – the position or the name of the column containing the longitude. The default is constants.LONGITUDE.

  • datetime (str or int) – the position or the name of the column containing the datetime. The default is constants.DATETIME.

  • user_id (str or int) – the position or the name of the column containing the user id. The default is constants.UID.

  • trajectory_id (str or int) – the position or the name of the column containing the trajectory id. The default is constants.TID.

  • save (bool) – if True, data is saved to a file. The default is False.

  • kwargs – parameters for the Pandas read methods.

mm_data_processing(G: MultiDiGraph, sigma: float = 6.86, error_range: float = 50, lambda_y: float = 0.69, lambda_z: float = 13.35)#

Applies map-matching as a pre-processing method to generate the ground truth. The default values come from the following papers:

[1] Goh, C. Y., Dauwels, J., Mitrovic, N., Asif, M. T., Oran, A., & Jaillet, P. (2012, September). Online map-matching based on hidden markov model for real-time traffic sensing applications. In 2012 15th International IEEE Conference on Intelligent Transportation Systems (pp. 776-781). IEEE.

[2] Jagadeesh, G. R., & Srikanthan, T. (2017). Online map-matching of noisy and sparse location data with hidden Markov and route choice models. IEEE Transactions on Intelligent Transportation Systems, 18(9), 2423-2434.

Parameters:
  • G (networkx.MultiDiGraph) – road network represented as a directed graph

  • sigma (float) – standard deviation of the location measurement noise in meters. The default value is 6.86 [1].

  • error_range (float) – defines the range to search for states for each observation point. The default value is 50 [1].

  • lambda_y (float) – multiplier of circuitousness used to compute the probability of transition. The default value is 0.69 [2].

  • lambda_z (float) – multiplier of temporal implausibility used to compute the probability of transition. The default value is 13.35 [2].
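To give an intuition for sigma and error_range, the sketch below uses the zero-mean Gaussian emission model that is common in HMM-based map-matching: candidate roads farther than error_range from the observation are ignored, and the remaining ones are scored by their distance. Whether privkit computes emissions exactly this way is an assumption; the function names and the road labels are hypothetical, and the transition model (lambda_y, lambda_z) is not covered.

```python
import math


def emission_probability(distance_m, sigma=6.86):
    """Zero-mean Gaussian density of the point-to-road distance (in meters)."""
    return math.exp(-0.5 * (distance_m / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)


def candidate_states(distances_m, error_range=50, sigma=6.86):
    """Score the road candidates that lie within error_range of the observation."""
    return {
        road: emission_probability(d, sigma)
        for road, d in distances_m.items()
        if d <= error_range
    }


# Hypothetical distances (meters) from one observed point to three road segments.
scores = candidate_states({"road_a": 3.0, "road_b": 20.0, "road_c": 80.0})
```

A larger sigma flattens the scores, so distant candidates stay competitive when the location measurements are noisy.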

original_data#

Original location data is stored as a pd.DataFrame

print_data_summary(dataset_name: str | None = None)#

Prints data summary, specifying the number of users, trajectories, and other statistics

Parameters:

dataset_name (str) – dataset name (optional). The default value is None.

print_statistics_by_user(dataset_name: str | None = None)#

Prints statistics of data by user, specifying the number of trajectories, points, and other statistics per user

Parameters:

dataset_name (str) – dataset name (optional). The default value is None.

process_data()#

Performs location data processing. The first step consists of sorting location data by datetime (if user id and trajectory id are columns of the dataframe, it first sorts location data by uid and/or tid).

resample(user_id: int, start_index: int, end_index: int)#

Given two indexes, calculates the center coordinates as the mean of all reports within the interval given by the indexes.

Parameters:
  • user_id (int) – user whose reports are being resampled

  • start_index (int) – starting index of the time interval

  • end_index (int) – ending index of the time interval

Returns:

mean of the reports in the range of the two given indexes

resample_by_time(R: int)#

Resample the location reports on the time axis to address their non-uniform distribution over time. Within each resampling interval R, calculates the center coordinates as the mean of all reports within that interval.

Parameters:

R (int) – resampling interval in minutes
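The resampling idea can be sketched on a list of (datetime, latitude, longitude) reports for a single user. This is an illustration only: anchoring the intervals at the first report and skipping empty intervals are assumptions, and the function below is a hypothetical stand-in for the method.

```python
from collections import defaultdict
from datetime import datetime
from statistics import fmean


def resample_reports(reports, interval_minutes):
    """Return (interval index, mean lat, mean lon) for every non-empty interval."""
    start = min(t for t, _, _ in reports)
    bins = defaultdict(list)
    for t, lat, lon in reports:
        index = int((t - start).total_seconds() // (interval_minutes * 60))
        bins[index].append((lat, lon))
    return [
        (index, fmean([lat for lat, _ in pts]), fmean([lon for _, lon in pts]))
        for index, pts in sorted(bins.items())
    ]


reports = [
    (datetime(2024, 1, 1, 8, 0), 41.0, -8.0),
    (datetime(2024, 1, 1, 8, 2), 41.2, -8.2),
    (datetime(2024, 1, 1, 8, 7), 42.0, -9.0),
]
resampled = resample_reports(reports, interval_minutes=5)
```

The first two reports fall in the same 5-minute interval and are replaced by their mean; the third stays on its own.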

save_data(filepath: str = './input/data/', filename: str | None = None, extension: str = 'pkl')#

Saves data to a file.

Parameters:
  • filepath (str) – path where data should be saved.

  • filename (str) – name of the file to be saved.

  • extension (str) – extension of the format of how the file should be saved. The default value is ‘pkl’.

subsample_data(min_timedelta: float, save_subsampled_data: bool = False)#

Subsamples data according to a minimum timedelta between subsequent points

Parameters:
  • min_timedelta (float) – minimum timedelta between subsequent points

  • save_subsampled_data (bool) – if True, subsampled data is saved to a file
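Subsampling by a minimum timedelta can be sketched as a single pass that keeps a point only when enough time has passed since the last kept point. The example assumes min_timedelta is expressed in seconds, which the documentation does not state, and it uses a hypothetical helper on plain datetimes.

```python
from datetime import datetime, timedelta


def subsample(stamps, min_timedelta):
    """Keep a point only if min_timedelta seconds passed since the last kept one."""
    kept = []
    for t in sorted(stamps):
        if not kept or (t - kept[-1]).total_seconds() >= min_timedelta:
            kept.append(t)
    return kept


start = datetime(2024, 1, 1, 8, 0, 0)
stamps = [start + timedelta(seconds=s) for s in (0, 5, 12, 20, 25)]
kept = subsample(stamps, min_timedelta=10)
offsets = [(t - start).total_seconds() for t in kept]  # [0.0, 12.0, 25.0]
```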