torchvision.ops — Torchvision master documentation (2024)

torchvision.ops

torchvision.ops implements operators that are specific to computer vision.

Note

All operators have native support for TorchScript.

torchvision.ops.nms(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float) → torch.Tensor

Performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU).

NMS iteratively removes lower scoring boxes which have an IoU greater than iou_threshold with another (higher scoring) box.

If multiple boxes have the exact same score and satisfy the IoU criterion with respect to a reference box, the selected box is not guaranteed to be the same between CPU and GPU. This is similar to the behavior of argsort in PyTorch when repeated values are present.

Parameters:

boxes (Tensor[N, 4]) – boxes to perform NMS on. They are expected to be in `(x1, y1, x2, y2)` format with `0 <= x1 < x2` and `0 <= y1 < y2`.
scores (Tensor[N]) – scores for each one of the boxes
iou_threshold (float) – discards all overlapping boxes with IoU > iou_threshold

Returns:	int64 tensor with the indices of the elements that have been kept by NMS, sorted in decreasing order of scores
Return type:	keep (Tensor)
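The greedy procedure described for nms can be illustrated without tensors. The following is a minimal pure-Python sketch, not the torchvision implementation; the helper names `iou` and `greedy_nms` are hypothetical and chosen only for this illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) tuples."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of the kept boxes, sorted by decreasing score."""
    # Visit candidates from the highest to the lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Box i survives only if no already-kept (higher scoring) box
        # overlaps it with IoU > iou_threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores, iou_threshold=0.5)
print(kept)  # [0, 2]: box 1 overlaps box 0 with IoU 81/119 > 0.5
```

With tensors, the equivalent call would be torchvision.ops.nms(boxes, scores, iou_threshold), which returns the kept indices as an int64 tensor.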

torchvision.ops.batched_nms(boxes: torch.Tensor, scores: torch.Tensor, idxs: torch.Tensor, iou_threshold: float) → torch.Tensor

Performs non-maximum suppression in a batched fashion.

Each index value corresponds to a category, and NMS will not be applied between elements of different categories.

Parameters:

boxes (Tensor[N, 4]) – boxes where NMS will be performed. They are expected to be in `(x1, y1, x2, y2)` format with `0 <= x1 < x2` and `0 <= y1 < y2`.
scores (Tensor[N]) – scores for each one of the boxes
idxs (Tensor[N]) – indices of the categories for each one of the boxes.
iou_threshold (float) – discards all overlapping boxes with IoU > iou_threshold

Returns:	int64 tensor with the indices of the elements that have been kept by NMS, sorted in decreasing order of scores
Return type:	keep (Tensor)
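The per-category behaviour can be sketched in plain Python as follows. This is only an illustration of the semantics under the assumption that suppression is checked against kept boxes of the same category; it is not how torchvision implements the operator, and the names `iou` and `nms_within_categories` are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) tuples."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms_within_categories(boxes, scores, idxs, iou_threshold):
    """Greedy NMS where only boxes of the same category suppress each other."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        suppressed = any(
            idxs[j] == idxs[i] and iou(boxes[i], boxes[j]) > iou_threshold
            for j in keep
        )
        if not suppressed:
            keep.append(i)
    return keep


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (0.5, 0.5, 10.5, 10.5)]
scores = [0.9, 0.8, 0.7]
idxs = [0, 1, 0]  # boxes 0 and 2 share a category, box 1 does not
kept = nms_within_categories(boxes, scores, idxs, iou_threshold=0.5)
print(kept)  # [0, 1]: box 2 is suppressed by box 0, box 1 is untouched
```

If all three boxes were assigned the same category, only box 0 would remain.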

torchvision.ops.remove_small_boxes(boxes: torch.Tensor, min_size: float) → torch.Tensor

Removes boxes that have at least one side smaller than min_size.

Parameters:

boxes (Tensor[N, 4]) – boxes in `(x1, y1, x2, y2)` format with `0 <= x1 < x2` and `0 <= y1 < y2`.
min_size (float) – minimum size

Returns:	indices of the boxes that have both sides larger than min_size
Return type:	keep (Tensor[K])
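A minimal pure-Python sketch of this filter is shown below. The function name `keep_large_boxes` is hypothetical, and the sketch assumes that a side exactly equal to min_size is kept, following the "smaller than min_size" wording of the description.

```python
def keep_large_boxes(boxes, min_size):
    """Return indices of boxes whose width and height are both >= min_size."""
    keep = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        width, height = x2 - x1, y2 - y1
        # Assumption: the comparison is inclusive at the boundary.
        if width >= min_size and height >= min_size:
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (0, 0, 2, 10), (5, 5, 9, 6)]
kept = keep_large_boxes(boxes, min_size=3)
print(kept)  # [0]: box 1 is too narrow, box 2 is too short
```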

torchvision.ops.clip_boxes_to_image(boxes: torch.Tensor, size: Tuple[int, int]) → torch.Tensor

Clip boxes so that they lie inside an image of size `size`.

Parameters:

boxes (Tensor[N, 4]) – boxes in `(x1, y1, x2, y2)` format with `0 <= x1 < x2` and `0 <= y1 < y2`.
size (Tuple[height, width]) – size of the image

Returns:	clipped_boxes (Tensor[N, 4])
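Because size is given as (height, width) while boxes are given as (x1, y1, x2, y2), it is easy to mix up the axes. The sketch below assumes x coordinates are clamped to [0, width] and y coordinates to [0, height]; `clip_box` and `clamp` are hypothetical helper names.

```python
def clamp(value, upper):
    """Clamp value into the closed interval [0, upper]."""
    return max(0, min(value, upper))


def clip_box(box, size):
    """Clip one (x1, y1, x2, y2) box to an image of size (height, width)."""
    height, width = size
    x1, y1, x2, y2 = box
    # x coordinates are limited by the width, y coordinates by the height.
    return (clamp(x1, width), clamp(y1, height), clamp(x2, width), clamp(y2, height))


clipped = clip_box((-5, 10, 120, 90), size=(80, 100))
print(clipped)  # (0, 10, 100, 80)
```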

torchvision.ops.box_convert(boxes: torch.Tensor, in_fmt: str, out_fmt: str) → torch.Tensor

Converts boxes from the given in_fmt to out_fmt. Supported in_fmt and out_fmt are:

‘xyxy’: boxes are represented via corners, x1, y1 being top left and x2, y2 being bottom right.

‘xywh’: boxes are represented via corner, width and height, x1, y1 being top left, w, h being width and height.

‘cxcywh’: boxes are represented via centre, width and height, cx, cy being center of box, w, h being width and height.

Parameters:

boxes (Tensor[N, 4]) – boxes which will be converted.
in_fmt (str) – Input format of given boxes. Supported formats are [‘xyxy’, ‘xywh’, ‘cxcywh’].
out_fmt (str) – Output format of given boxes. Supported formats are [‘xyxy’, ‘xywh’, ‘cxcywh’].

Returns:	boxes in the converted format.
Return type:	boxes (Tensor[N, 4])
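The three formats are related by simple arithmetic. The following pure-Python sketch spells out the conversions for a single box; the function names are hypothetical and only mirror the format strings listed above.

```python
def xyxy_to_xywh(box):
    """Corners -> top-left corner plus width and height."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


def xyxy_to_cxcywh(box):
    """Corners -> center plus width and height."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def cxcywh_to_xyxy(box):
    """Center plus width and height -> corners."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


box = (10, 20, 50, 80)
as_xywh = xyxy_to_xywh(box)               # (10, 20, 40, 60)
as_cxcywh = xyxy_to_cxcywh(box)           # (30.0, 50.0, 40, 60)
round_trip = cxcywh_to_xyxy(as_cxcywh)    # (10.0, 20.0, 50.0, 80.0)
print(as_xywh, as_cxcywh, round_trip)
```

With tensors, the corresponding call would be torchvision.ops.box_convert(boxes, in_fmt='xyxy', out_fmt='cxcywh').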

torchvision.ops.box_area(boxes: torch.Tensor) → torch.Tensor

Computes the area of a set of bounding boxes, which are specified by their (x1, y1, x2, y2) coordinates.

Parameters:	boxes (Tensor[N, 4]) – boxes for which the area will be computed. They are expected to be in (x1, y1, x2, y2) format with `0 <= x1 < x2` and `0 <= y1 < y2`.
Returns:	area for each box
Return type:	area (Tensor[N])

torchvision.ops.box_iou(boxes1: torch.Tensor, boxes2: torch.Tensor) → torch.Tensor

Return intersection-over-union (Jaccard index) of boxes.

Both sets of boxes are expected to be in (x1, y1, x2, y2) format with `0 <= x1 < x2` and `0 <= y1 < y2`.

Parameters:

boxes1 (Tensor[N, 4]) – first set of boxes
boxes2 (Tensor[M, 4]) – second set of boxes

Returns:	the NxM matrix containing the pairwise IoU values for every element in boxes1 and boxes2
Return type:	iou (Tensor[N, M])
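The N x M layout of the result can be seen in a small pure-Python sketch that uses nested lists in place of tensors. The names `box_area_xyxy` and `pairwise_iou` are hypothetical.

```python
def box_area_xyxy(box):
    """Area of one (x1, y1, x2, y2) box."""
    return (box[2] - box[0]) * (box[3] - box[1])


def pairwise_iou(boxes1, boxes2):
    """Return an N x M nested list with IoU(boxes1[i], boxes2[j]) at [i][j]."""
    result = []
    for a in boxes1:
        row = []
        for b in boxes2:
            inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
            inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
            inter = inter_w * inter_h
            union = box_area_xyxy(a) + box_area_xyxy(b) - inter
            row.append(inter / union)
        result.append(row)
    return result


boxes1 = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 5.0, 5.0)]
boxes2 = [(0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 15.0, 15.0), (20.0, 20.0, 30.0, 30.0)]
iou_matrix = pairwise_iou(boxes1, boxes2)
print(len(iou_matrix), len(iou_matrix[0]))  # 2 3
print(iou_matrix[0])  # identical, partially overlapping, disjoint
```

Row i corresponds to boxes1[i] and column j to boxes2[j], so identical boxes give 1.0 and disjoint boxes give 0.0.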

torchvision.ops.generalized_box_iou(boxes1: torch.Tensor, boxes2: torch.Tensor) → torch.Tensor

Return generalized intersection-over-union (Jaccard index) of boxes.

Both sets of boxes are expected to be in (x1, y1, x2, y2) format with `0 <= x1 < x2` and `0 <= y1 < y2`.

Parameters:

boxes1 (Tensor[N, 4]) – first set of boxes
boxes2 (Tensor[M, 4]) – second set of boxes

Returns:	the NxM matrix containing the pairwise generalized_IoU values for every element in boxes1 and boxes2
Return type:	generalized_iou (Tensor[N, M])
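Generalized IoU subtracts from the plain IoU the fraction of the smallest enclosing box that is not covered by the union, so disjoint boxes receive negative values instead of 0. The sketch below computes it for one pair of boxes in pure Python; the name `generalized_iou` is hypothetical and the formula is the standard GIoU definition rather than a copy of the torchvision code.

```python
def generalized_iou(a, b):
    """GIoU of two (x1, y1, x2, y2) boxes: IoU - (hull - union) / hull."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box that encloses both inputs.
    hull_w = max(a[2], b[2]) - min(a[0], b[0])
    hull_h = max(a[3], b[3]) - min(a[1], b[1])
    hull = hull_w * hull_h
    return inter / union - (hull - union) / hull


same = generalized_iou((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0))
apart = generalized_iou((0.0, 0.0, 10.0, 10.0), (20.0, 0.0, 30.0, 10.0))
print(same)   # 1.0 for identical boxes
print(apart)  # negative: the boxes do not overlap and leave a gap in the hull
```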

torchvision.ops.roi_align(input: torch.Tensor, boxes: torch.Tensor, output_size: None, spatial_scale: float = 1.0, sampling_ratio: int = -1, aligned: bool = False) → torch.Tensor

Performs the Region of Interest (RoI) Align operator described in Mask R-CNN.

Parameters:

input (Tensor[N, C, H, W]) – input tensor
boxes (Tensor[K, 5] or List[Tensor[L, 4]]) – the box coordinates in (x1, y1, x2, y2) format where the regions will be taken from. The coordinates must satisfy `0 <= x1 < x2` and `0 <= y1 < y2`. If a single Tensor is passed, then the first column should contain the batch index. If a list of Tensors is passed, then each Tensor will correspond to the boxes for an element i in a batch
output_size (int or Tuple[int, int]) – the size of the output after the cropping is performed, as (height, width)
spatial_scale (float) – a scaling factor that maps the input coordinates to the box coordinates. Default: 1.0
sampling_ratio (int) – number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. If > 0, then exactly sampling_ratio x sampling_ratio grid points are used. If <= 0, then an adaptive number of grid points are used (computed as ceil(roi_width / pooled_w), and likewise for height). Default: -1
aligned (bool) – If False, use the legacy implementation. If True, shift the box coordinates by -0.5 pixels for better alignment with the two neighboring pixel indices. This is the version used in Detectron2.

Returns:

output (Tensor[K, C, output_size[0], output_size[1]])
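Two conventions in the parameter list are easy to misread: the relation between the List[Tensor[L, 4]] and Tensor[K, 5] forms of boxes, and the adaptive number of grid points used when sampling_ratio <= 0. The pure-Python sketch below illustrates both with plain lists; `boxes_list_to_rows` and `grid_points` are hypothetical helper names, not torchvision functions.

```python
import math


def boxes_list_to_rows(boxes_per_image):
    """Flatten per-image box lists into rows of (batch_index, x1, y1, x2, y2)."""
    rows = []
    for batch_index, boxes in enumerate(boxes_per_image):
        for x1, y1, x2, y2 in boxes:
            # The first column carries the index of the batch element.
            rows.append((float(batch_index), x1, y1, x2, y2))
    return rows


def grid_points(roi_size, pooled_size, sampling_ratio=-1):
    """Sampling points per bin along one axis, following the rule quoted above."""
    if sampling_ratio > 0:
        return sampling_ratio
    return math.ceil(roi_size / pooled_size)


boxes_per_image = [
    [(0.0, 0.0, 8.0, 8.0), (4.0, 4.0, 12.0, 16.0)],  # boxes for batch element 0
    [(2.0, 2.0, 6.0, 10.0)],                         # boxes for batch element 1
]
rows = boxes_list_to_rows(boxes_per_image)
print(rows)                    # K = 3 rows, each with 5 values
print(grid_points(15.0, 7))    # 3, adaptive: ceil(15 / 7)
print(grid_points(15.0, 7, 2)) # 2, fixed by sampling_ratio
```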

torchvision.ops.ps_roi_align(input: torch.Tensor, boxes: torch.Tensor, output_size: int, spatial_scale: float = 1.0, sampling_ratio: int = -1) → torch.Tensor

Performs the Position-Sensitive Region of Interest (RoI) Align operator mentioned in Light-Head R-CNN.

Parameters:

input (Tensor[N, C, H, W]) – input tensor
boxes (Tensor[K, 5] or List[Tensor[L, 4]]) – the box coordinates in (x1, y1, x2, y2) format where the regions will be taken from. The coordinates must satisfy `0 <= x1 < x2` and `0 <= y1 < y2`. If a single Tensor is passed, then the first column should contain the batch index. If a list of Tensors is passed, then each Tensor will correspond to the boxes for an element i in a batch
output_size (int or Tuple[int, int]) – the size of the output after the cropping is performed, as (height, width)
spatial_scale (float) – a scaling factor that maps the input coordinates to the box coordinates. Default: 1.0
sampling_ratio (int) – number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. If > 0, then exactly sampling_ratio x sampling_ratio grid points are used. If <= 0, then an adaptive number of grid points are used (computed as ceil(roi_width / pooled_w), and likewise for height). Default: -1

Returns:

output (Tensor[K, C, output_size[0], output_size[1]])

torchvision.ops.roi_pool(input: torch.Tensor, boxes: torch.Tensor, output_size: None, spatial_scale: float = 1.0) → torch.Tensor

Performs the Region of Interest (RoI) Pool operator described in Fast R-CNN.

Parameters:

input (Tensor[N, C, H, W]) – input tensor
boxes (Tensor[K, 5] or List[Tensor[L, 4]]) – the box coordinates in (x1, y1, x2, y2) format where the regions will be taken from. The coordinates must satisfy `0 <= x1 < x2` and `0 <= y1 < y2`. If a single Tensor is passed, then the first column should contain the batch index. If a list of Tensors is passed, then each Tensor will correspond to the boxes for an element i in a batch
output_size (int or Tuple[int, int]) – the size of the output after the cropping is performed, as (height, width)
spatial_scale (float) – a scaling factor that maps the input coordinates to the box coordinates. Default: 1.0

Returns:

output (Tensor[K, C, output_size[0], output_size[1]])

torchvision.ops.ps_roi_pool(input: torch.Tensor, boxes: torch.Tensor, output_size: int, spatial_scale: float = 1.0) → torch.Tensor

Performs the Position-Sensitive Region of Interest (RoI) Pool operator described in R-FCN.

Parameters:

input (Tensor[N, C, H, W]) – input tensor
boxes (Tensor[K, 5] or List[Tensor[L, 4]]) – the box coordinates in (x1, y1, x2, y2) format where the regions will be taken from. The coordinates must satisfy `0 <= x1 < x2` and `0 <= y1 < y2`. If a single Tensor is passed, then the first column should contain the batch index. If a list of Tensors is passed, then each Tensor will correspond to the boxes for an element i in a batch
output_size (int or Tuple[int, int]) – the size of the output after the cropping is performed, as (height, width)
spatial_scale (float) – a scaling factor that maps the input coordinates to the box coordinates. Default: 1.0

Returns:

output (Tensor[K, C, output_size[0], output_size[1]])

torchvision.ops.deform_conv2d(input: torch.Tensor, offset: torch.Tensor, weight: torch.Tensor, bias: Union[torch.Tensor, NoneType] = None, stride: Tuple[int, int] = (1, 1), padding: Tuple[int, int] = (0, 0), dilation: Tuple[int, int] = (1, 1), mask: Union[torch.Tensor, NoneType] = None) → torch.Tensor

Performs Deformable Convolution v2, described in Deformable ConvNets v2: More Deformable, Better Results, if mask is not None, and performs Deformable Convolution, described in Deformable Convolutional Networks, if mask is None.

Parameters:

input (Tensor[batch_size, in_channels, in_height, in_width]) – input tensor
offset (Tensor[batch_size, 2 * offset_groups * kernel_height * kernel_width, out_height, out_width]) – offsets to be applied for each position in the convolution kernel.
weight (Tensor[out_channels, in_channels // groups, kernel_height, kernel_width]) – convolution weights, split into groups of size (in_channels // groups)
bias (Tensor[out_channels]) – optional bias of shape (out_channels,). Default: None
stride (int or Tuple[int, int]) – distance between convolution centers. Default: 1
padding (int or Tuple[int, int]) – height/width of padding of zeroes around each image. Default: 0
dilation (int or Tuple[int, int]) – the spacing between kernel elements. Default: 1
mask (Tensor[batch_size, offset_groups * kernel_height * kernel_width, out_height, out_width]) – masks to be applied for each position in the convolution kernel. Default: None

Returns:	result of convolution
Return type:	output (Tensor[batch_sz, out_channels, out_h, out_w])
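The offset and mask tensors must match the spatial size of the convolution output, which is the usual source of shape errors. The pure-Python sketch below derives the expected shapes from the input size and kernel, assuming the standard convolution output-size formula; `conv_output_size` and `deform_conv2d_shapes` are hypothetical helper names.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0, dilation=1):
    """Standard convolution output size along one spatial axis."""
    return (in_size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def deform_conv2d_shapes(batch_size, in_h, in_w, kh, kw,
                         stride=1, padding=0, dilation=1, offset_groups=1):
    """Expected shapes of the offset and mask arguments."""
    out_h = conv_output_size(in_h, kh, stride, padding, dilation)
    out_w = conv_output_size(in_w, kw, stride, padding, dilation)
    # Two offset values (one per spatial axis) for every kernel position.
    offset_shape = (batch_size, 2 * offset_groups * kh * kw, out_h, out_w)
    # One modulation scalar for every kernel position.
    mask_shape = (batch_size, offset_groups * kh * kw, out_h, out_w)
    return offset_shape, mask_shape


offset_shape, mask_shape = deform_conv2d_shapes(4, 10, 10, kh=3, kw=3)
print(offset_shape)  # (4, 18, 8, 8)
print(mask_shape)    # (4, 9, 8, 8)
```

For a 10 x 10 input, a 3 x 3 kernel, stride 1 and no padding, the output is 8 x 8, so offset has 2 * 3 * 3 = 18 channels and mask has 9.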

Examples:

>>> import torch
>>> from torchvision.ops import deform_conv2d
>>> input = torch.rand(4, 3, 10, 10)
>>> kh, kw = 3, 3
>>> weight = torch.rand(5, 3, kh, kw)
>>> # offset and mask should have the same spatial size as the output
>>> # of the convolution. In this case, for an input of 10, stride of 1
>>> # and kernel size of 3, without padding, the output size is 8
>>> offset = torch.rand(4, 2 * kh * kw, 8, 8)
>>> mask = torch.rand(4, kh * kw, 8, 8)
>>> out = deform_conv2d(input, offset, weight, mask=mask)
>>> print(out.shape)
torch.Size([4, 5, 8, 8])

torchvision.ops.sigmoid_focal_loss(inputs: torch.Tensor, targets: torch.Tensor, alpha: float = 0.25, gamma: float = 2, reduction: str = 'none')

Original implementation from https://github.com/facebookresearch/fvcore/blob/master/fvcore/nn/focal_loss.py

Loss used in RetinaNet for dense detection: https://arxiv.org/abs/1708.02002

Parameters:

inputs – A float tensor of arbitrary shape. The predictions for each example.
targets – A float tensor with the same shape as inputs. Stores the binary classification label for each element in inputs (0 for the negative class and 1 for the positive class).
alpha – (optional) Weighting factor in range (0,1) to balance positive vs negative examples or -1 for ignore. Default: 0.25
gamma – Exponent of the modulating factor (1 - p_t) to balance easy vs hard examples.
reduction – ‘none’ | ‘mean’ | ‘sum’. ‘none’: No reduction will be applied to the output. ‘mean’: The output will be averaged. ‘sum’: The output will be summed.

Returns:

Loss tensor with the reduction option applied.
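The roles of alpha and gamma are easier to see on a single element. The sketch below evaluates the focal loss for one scalar in pure Python, assuming that inputs are raw logits to which a sigmoid is applied (as the function name suggests) and that a negative alpha disables the class weighting. The name `focal_loss_scalar` is hypothetical.

```python
import math


def focal_loss_scalar(logit, target, alpha=0.25, gamma=2.0):
    """Focal loss for one logit and one binary target (0.0 or 1.0)."""
    p = 1.0 / (1.0 + math.exp(-logit))  # predicted probability of class 1
    # Binary cross-entropy for this element.
    ce = -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    # p_t is the probability assigned to the true class.
    p_t = p * target + (1.0 - p) * (1.0 - target)
    loss = ce * (1.0 - p_t) ** gamma  # down-weight easy examples
    if alpha >= 0:
        alpha_t = alpha * target + (1.0 - alpha) * (1.0 - target)
        loss = alpha_t * loss  # balance positive vs negative examples
    return loss


easy = focal_loss_scalar(4.0, 1.0)    # confident and correct
hard = focal_loss_scalar(-4.0, 1.0)   # confident and wrong
neutral = focal_loss_scalar(0.0, 1.0)
print(easy < neutral < hard)  # True: easy examples contribute far less
```

With reduction set to ‘mean’ or ‘sum’, the per-element values computed this way would be averaged or summed over all elements.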

class torchvision.ops.RoIAlign(output_size: None, spatial_scale: float, sampling_ratio: int, aligned: bool = False)

See roi_align

class torchvision.ops.PSRoIAlign(output_size: int, spatial_scale: float, sampling_ratio: int)

See ps_roi_align

class torchvision.ops.RoIPool(output_size: None, spatial_scale: float)

See roi_pool

class torchvision.ops.PSRoIPool(output_size: int, spatial_scale: float)

See ps_roi_pool

class torchvision.ops.DeformConv2d(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, groups: int = 1, bias: bool = True)

See deform_conv2d

class torchvision.ops.MultiScaleRoIAlign(featmap_names: List[str], output_size: Union[int, Tuple[int], List[int]], sampling_ratio: int, *, canonical_scale: int = 224, canonical_level: int = 4)

Multi-scale RoIAlign pooling, which is useful for detection with or without FPN.

It infers the scale of the pooling via the heuristics specified in eq. 1 of the Feature Pyramid Network paper. The keyword-only parameters canonical_scale and canonical_level correspond respectively to 224 and k0=4 in eq. 1, and have the following meaning: canonical_level is the target level of the pyramid from which to pool a region of interest with w x h = canonical_scale x canonical_scale.
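The heuristic in eq. 1 maps a box of width w and height h to the level k = floor(k0 + log2(sqrt(w * h) / 224)). The pure-Python sketch below applies it to single boxes; clamping the result to the range of available levels is an assumption of this sketch, and `fpn_level`, `k_min` and `k_max` are hypothetical names. Any numerical-stability term a real implementation might add is omitted.

```python
import math


def fpn_level(box, k_min, k_max, canonical_scale=224, canonical_level=4):
    """Pyramid level for one (x1, y1, x2, y2) box, following eq. 1."""
    x1, y1, x2, y2 = box
    scale = math.sqrt((x2 - x1) * (y2 - y1))
    level = math.floor(canonical_level + math.log2(scale / canonical_scale))
    # Assumption: levels outside the available pyramid are clamped.
    return max(k_min, min(level, k_max))


levels = [
    fpn_level((0, 0, 224, 224), k_min=2, k_max=5),  # canonical size
    fpn_level((0, 0, 112, 112), k_min=2, k_max=5),  # half the canonical scale
    fpn_level((0, 0, 448, 448), k_min=2, k_max=5),  # twice the canonical scale
    fpn_level((0, 0, 900, 900), k_min=2, k_max=5),  # clamped to k_max
]
print(levels)  # [4, 3, 5, 5]
```

A box of the canonical size is pooled from canonical_level, and each halving or doubling of the box scale moves one level down or up the pyramid.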

Parameters:

featmap_names (List[str]) – the names of the feature maps that will be used for the pooling.
output_size (List[Tuple[int, int]] or List[int]) – output size for the pooled region
sampling_ratio (int) – sampling ratio for ROIAlign
canonical_scale (int, optional) – canonical_scale for LevelMapper
canonical_level (int, optional) – canonical_level for LevelMapper

Examples:

>>> from collections import OrderedDict
>>> import torch
>>> import torchvision
>>> m = torchvision.ops.MultiScaleRoIAlign(['feat1', 'feat3'], 3, 2)
>>> i = OrderedDict()
>>> i['feat1'] = torch.rand(1, 5, 64, 64)
>>> i['feat2'] = torch.rand(1, 5, 32, 32)  # this feature won't be used in the pooling
>>> i['feat3'] = torch.rand(1, 5, 16, 16)
>>> # create some random bounding boxes
>>> boxes = torch.rand(6, 4) * 256; boxes[:, 2:] += boxes[:, :2]
>>> # original image size, before computing the feature maps
>>> image_sizes = [(512, 512)]
>>> output = m(i, [boxes], image_sizes)
>>> print(output.shape)
torch.Size([6, 5, 3, 3])

class torchvision.ops.FeaturePyramidNetwork(in_channels_list: List[int], out_channels: int, extra_blocks: Union[torchvision.ops.feature_pyramid_network.ExtraFPNBlock, NoneType] = None)

Module that adds an FPN on top of a set of feature maps. This is based on “Feature Pyramid Networks for Object Detection”.

The feature maps are currently supposed to be in increasing depth order.

The input to the model is expected to be an OrderedDict[Tensor], containing the feature maps on top of which the FPN will be added.

Parameters:

in_channels_list (list[int]) – number of channels for each feature map that is passed to the module
out_channels (int) – number of channels of the FPN representation
extra_blocks (ExtraFPNBlock or None) – if provided, extra operations will be performed. It is expected to take the fpn features, the original features and the names of the original features as input, and to return a new list of feature maps and their corresponding names

Examples:

>>> from collections import OrderedDict
>>> import torch
>>> import torchvision
>>> m = torchvision.ops.FeaturePyramidNetwork([10, 20, 30], 5)
>>> # get some dummy data
>>> x = OrderedDict()
>>> x['feat0'] = torch.rand(1, 10, 64, 64)
>>> x['feat2'] = torch.rand(1, 20, 16, 16)
>>> x['feat3'] = torch.rand(1, 30, 8, 8)
>>> # compute the FPN on top of x
>>> output = m(x)
>>> print([(k, v.shape) for k, v in output.items()])
[('feat0', torch.Size([1, 5, 64, 64])),
 ('feat2', torch.Size([1, 5, 16, 16])),
 ('feat3', torch.Size([1, 5, 8, 8]))]
