Peer-reviewed article

Asymmetrically frame-compatible depth video coding

2015; Institution of Engineering and Technology; Volume: 51; Issue: 22; Pages: 1780–1782; First published: 12 October 2015; Language: English

10.1049/el.2015.0913

ISSN

1350-911X

Authors

Jui-Chiu Chiang, Jain-Ron Wu

Topic(s)

Advanced Data Compression Techniques

Abstract

To achieve efficient coding while maintaining the quality of rendered views, a frame-compatible coding scheme for two-view depth video is proposed. The idea is to remove the inter-view redundancy in the depth video before packing and encoding. The content of the two views is analysed after warping the primary view to the secondary view. The uncovered region is defined as the missing region for which no corresponding information can be found in the primary view. Only the uncovered region is retained in the secondary view, and it is combined with the primary view to form a frame-compatible video. The experimental results indicate that the proposed scheme outperforms the conventional frame-compatible technology, providing better objective and subjective quality.

Introduction

To reduce the stereo data rate and to reuse the existing infrastructure and equipment for two-dimensional (2D) video, a frame-compatible stereo video format [1] is usually adopted. In the frame-compatible stereo format, the two views are downsampled spatially or temporally before being packed into a single video. In the literature, only texture video has been considered for the frame-compatible format. In this Letter, we propose a frame-compatible coding scheme for two-view depth video. Since the reliability of the depth map is critical for rendering a high-quality virtual view, the challenge addressed here is how to perform the downsampling while preserving the precision of the depth video.

Background

A stereoscopic perception can be realised on TV if the two eyes receive a pair of images with appropriate disparity. Free-viewpoint television [2] is one goal for the next TV generation: it allows audiences to view a scene over an expanded viewing angle. To reach this goal, several issues must be solved, such as scene capture, scene data representation and storage/transmission.
The Moving Picture Experts Group and the Joint Collaborative Team on 3D Video Coding (JCT-3V) commenced the standardisation of 3D video coding (3DV) [3], in which multi-view images and the corresponding depth images form the scene representation. Without transmitting dense views, many views can be generated once the depth information associated with the multi-view images is available at the decoder. Thus, multi-view video plus depth (MVD) is an efficient data format for free-viewpoint applications. Some works represent MVD with mixed resolution [4–6]. In [5], a global format is proposed in which both the multi-view texture and depth video are represented as a base texture/depth plus residual texture/depth; in the case of five views, four residual views are downsampled and packed into a frame before encoding, while the base view is encoded at unchanged resolution. In [6], the depth map is divided into several horizontal and vertical stripes, and each stripe is downsampled with a downscaling factor determined by rate–distortion (R–D) optimisation for the rendered virtual view.

Proposed scheme

VSRS 3.5 [7] is the standard software for generating the virtual view. First, forward warping is performed to generate the depth map of the target view; then backward warping is applied to render its texture. Usually there are two reference views, so two depth maps are generated for the target view. Owing to the large redundancy between the two reference views, these two rendered depth maps are very similar, except near the frame boundary and in the occlusion region. The basic idea of the proposed technique is to remove this redundancy between the two depth videos before transmission. One view is chosen as the primary view and the other as the secondary view; the choice of primary view depends on the position of the virtual view. The procedure of the proposed scheme is as follows.

Find the uncovered region

We warp the primary-view depth Vp to the secondary viewpoint, forming the warped depth Vps. Since the content of Vp and that of the secondary view Vs are not exactly the same, Vps is incomplete: some of its pixels have no corresponding information from Vp. These missing pixels fall into two groups. The first group comprises holes caused by rounding during the warping process; the second comprises the frame boundary and the occlusion region. For example, if the primary view is on the right side and the virtual view lies between the two views, the left frame boundary of Vps has no information to be filled; similarly, occlusion occurs at the left object boundaries in Vps. The uncovered region we want to find is the second group, so the proposed algorithm locates it after excluding the holes. If the cameras are arranged in parallel, the width of a hole is usually smaller than two pixels in the horizontal direction. On the basis of this observation, for each time instant i we divide Vps(i) and Vs(i) into several strips, denoted Vps(i, j) and Vs(i, j), where j is the strip index and each strip is a collection of several columns. If any row of Vps(i, j) contains a run of consecutive missing pixels wider than one pixel, the strip is recognised as an uncovered strip and the corresponding strip Vs(i, j) is retained.

Build the uncovered map

For each frame, only the uncovered strips are retained and all other strips are discarded. The retained strips are then concatenated to form an uncovered map, denoted Um(i).
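To make these two steps concrete, the following is a minimal sketch in Python/NumPy. It assumes the missing pixels of Vps are given as a boolean mask and uses a strip width of 16 columns; both the mask representation and the strip width are our assumptions, as the Letter does not specify them.

```python
import numpy as np

STRIP_WIDTH = 16  # columns per strip; an assumed value, not from the Letter


def uncovered_strip_indices(missing):
    """Find strips of Vps containing horizontal runs of >= 2 missing pixels.

    missing: H x W boolean array, True where the warped primary-view depth
    Vps has no corresponding information from Vp. Single-pixel runs are
    treated as rounding holes and ignored, as described above.
    """
    adjacent = missing[:, :-1] & missing[:, 1:]  # True inside runs of >= 2
    cols = np.nonzero(adjacent.any(axis=0))[0]   # columns touched by such runs
    strips = set(cols // STRIP_WIDTH) | set((cols + 1) // STRIP_WIDTH)
    return sorted(strips)


def build_uncovered_map(v_s, indices):
    """Concatenate the retained strips of the secondary view Vs into Um(i).

    Assumes at least one uncovered strip was found.
    """
    parts = [v_s[:, j * STRIP_WIDTH:(j + 1) * STRIP_WIDTH] for j in indices]
    return np.concatenate(parts, axis=1)
```

Only Um(i) and the strip indices need to be conveyed for the secondary view; as described next, the indices are unified over a whole GOP to bound the signalling overhead.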
In a frame-compatible side-by-side (SBS) format, the combined width of the two views must equal the width of a single frame. In our strategy, the width assigned to the primary view is determined after the uncovered map Um(i) is obtained. The widths of the uncovered maps of consecutive frames may differ, which causes two problems. First, the varying width would force consecutive primary-view frames to be downscaled with varying factors, attenuating the coding efficiency. Secondly, the indices of the uncovered strips must be transmitted to the decoder so that the uncovered region can be reconstructed, and this overhead becomes a burden if every frame carries its own strip indices. To avoid both problems, all frames within a group of pictures (GOP) share the same set of retained strip indices, determined as the union of the uncovered-strip indices over all frames in the GOP.

Determine the downscaling factors

Once the uncovered map is built, its downscaling factor is determined; three cases are distinguished. Assume that the width of a single frame is W, and let Wom and Ws denote the widths of the original and the downscaled uncovered map, respectively. In the first case, Wom is smaller than W/4 and the uncovered map is not downscaled (Ws = Wom). Otherwise, the uncovered map is downscaled to Ws = W/4 (case 2) or Ws = W/2 (case 3), and the primary view is downscaled to the width Wp given by

Wp = W − Ws    (1)

In the conventional SBS format, each view is downscaled to W/2 with a downscaling factor of 2. In the proposed scheme, if Wom is very small, Wp can exceed W/2 and a better reconstruction quality is achieved for the primary view. In case 1, Wp is wider, with Wp ≥ 3W/4. In case 2, Ws = W/4 and Wp = 3W/4; the downscaling factor of the uncovered map is smaller than 2, which ensures that both the uncovered map and the primary view are reconstructed better than in the conventional method. In case 3, the quality of the primary view equals that of the conventional method, since the downscaling factor is the same; however, because Wom is always smaller than W, the scaling factor of the uncovered map remains smaller than 2 and the map is still reconstructed better than in the conventional SBS format.

Pack the downscaled views into a frame

Finally, the downscaled primary view and the (optionally downscaled) uncovered map are packed into one frame. The primary view is placed in the left part to avoid possible coding loss caused by the inconsistent temporal structure of the uncovered map across GOPs.
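As a rough illustration of the width assignment and the SBS packing, the sketch below uses OpenCV resizing for the horizontal downscaling. The exact case thresholds on Wom are inferred from the three cases above, and the interpolation choice is ours; for depth maps a nearest-neighbour filter may preserve the sharp discontinuities better than area averaging.

```python
import numpy as np
import cv2  # OpenCV, used here only for horizontal rescaling


def assign_widths(w, w_om):
    """Return (Ws, Wp) for frame width w and original uncovered-map width w_om.

    Case 1: w_om < w/4, the map is kept as is. Case 2: the map is scaled
    to w/4. Case 3: the map is scaled to w/2, so the primary view gets the
    same width as in conventional FC-SBS. Wp follows from (1): Wp = W - Ws.
    """
    if w_om < w // 4:
        w_s = w_om
    elif w_om < w // 2:
        w_s = w // 4
    else:
        w_s = w // 2
    return w_s, w - w_s


def pack_sbs(primary, u_map, w):
    """Downscale both parts horizontally and pack them, primary on the left.

    Assumes a non-empty uncovered map (u_map has at least one column).
    """
    w_s, w_p = assign_widths(w, u_map.shape[1])
    h = primary.shape[0]
    p = cv2.resize(primary, (w_p, h), interpolation=cv2.INTER_AREA)
    u = cv2.resize(u_map, (w_s, h), interpolation=cv2.INTER_AREA)
    return np.concatenate([p, u], axis=1)
```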
View synthesis

At the decoder, the primary view and the uncovered region of the secondary view are reconstructed individually, and the upsampling is performed with the filter used in scalable video coding [8]. Note that the depth map of the secondary view is not completely reconstructed, since depth maps serve view synthesis and are not watched directly. To generate the depth map of the virtual view there are two reference sources, and three cases govern their selection. In the first case, a pixel of the virtual view is visible from only one reference view and is necessarily rendered from that view. In the second case, the pixel is visible from both views and one of them must be chosen as the reference. As an example, assume that the primary view is on the right side and the virtual view lies between the two views. The primary view then provides more reliable information for the right object boundaries in the virtual view, so the reference source is determined after identifying the boundary direction: the reconstructed primary view is warped to the virtual view, denoted Vpv, and the boundary information is derived from Vpv. If the current pixel (x, y) belongs to a right-side object boundary, the right view is chosen as the reference source, as expressed in (2). In the third case, the pixel in the virtual view is invisible from both reference views and the inpainting technique of VSRS 3.5 is applied.
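A sketch of the per-pixel reference selection is given below. The Letter's decision rule (2) is not reproduced here, so the right-boundary test (a sharp foreground-to-background drop in the horizontal depth gradient of Vpv) and its threshold are our assumptions, and the visibility masks are assumed to be available from the warping stage.

```python
import numpy as np

EDGE_T = 8  # depth-discontinuity threshold; an assumed value


def choose_reference(v_pv, left_visible, right_visible):
    """Per-pixel reference map: 0 = left view, 1 = right view, 2 = inpaint.

    v_pv: primary (right) view warped to the virtual position; larger depth
    values are assumed closer. left_visible / right_visible: boolean masks.
    """
    grad = np.zeros(v_pv.shape, dtype=np.int32)
    grad[:, 1:] = v_pv[:, 1:].astype(np.int32) - v_pv[:, :-1].astype(np.int32)
    right_boundary = grad < -EDGE_T  # foreground ends: right-side boundary

    choice = np.full(v_pv.shape, 2, dtype=np.uint8)  # case 3: inpaint (VSRS 3.5)
    choice[left_visible & ~right_visible] = 0        # case 1: one view visible
    choice[~left_visible & right_visible] = 1
    both = left_visible & right_visible              # case 2: boundary decides
    choice[both] = np.where(right_boundary[both], 1, 0)
    return choice
```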
Experimental results

To evaluate the performance of the proposed scheme, four sequences are tested under the common test conditions suggested in [9]; they are summarised in Table 1. The coding platform is JMVC 8.5 [10] with base-view coding only; the GOP size is 8 and the QPs are 34, 39, 42, 45 and 49. Two schemes are compared with the proposed one. The first, FC-SBS, is the conventional frame-compatible SBS format. The second, FR-SBS (full-resolution SBS), encodes the two views in the SBS format without downsampling.

Table 1. Test sequences

Sequence       balloon      undo_dancer   poznan_str.   GT_FLY
Resolution     1024 × 768   1920 × 1088   1920 × 1088   1920 × 1088
Input views    1, 3         1, 5          3, 4          1, 5
Virtual view   2.5          4.75          3.75          4.75

Fig. 1. R–D performance: (a) virtual depth, (b) virtual texture

Table 2. BDBR and BDPSNR of the proposed scheme with respect to the FR-SBS scheme for virtual depth (D) and texture (T)

Sequence       BDBR (%) D   BDBR (%) T   BDPSNR (dB) D   BDPSNR (dB) T
undo_dancer    −24.95       −6.36        0.78            0.25
GT_FLY         −19.46       −5.34        0.73            0.21
poznan_str.    −28.53       −21.04       0.56            0.32
balloon        −22.65       −8.80        0.75            0.15

Table 3. BDBR and BDPSNR of the proposed scheme with respect to the FC-SBS scheme for virtual depth (D) and texture (T)

Sequence       BDBR (%) D   BDBR (%) T   BDPSNR (dB) D   BDPSNR (dB) T
undo_dancer    −13.13       −24.50       0.44            0.49
GT_FLY         −7.63        −10.83       0.34            0.17
poznan_str.    −15.57       −14.15       0.20            0.23
balloon        −15.45       −9.82        0.38            0.16

Fig. 2. Rendered depth image of the virtual view at QP = 42: (a) FC-SBS, (b) proposed

Fig. 1 shows the R–D performance for the depth and texture video of the virtual view for the sequence 'undo_dancer'; the texture video of the reference views is left uncompressed, to better isolate the influence of the compressed depth video on the virtual view, and the ground truth is the image rendered from uncompressed reference views. Fig. 1 indicates that the proposed scheme always outperforms the conventional FC-SBS scheme, and that it is superior to the FR-SBS scheme in the low-bit-rate scenario. The R–D performance of the virtual view is also evaluated with the Bjontegaard metric [11] with respect to the FR-SBS and FC-SBS schemes, as summarised in Tables 2 and 3, respectively, for both the depth (D) and the texture (T) of the virtual view. The proposed scheme achieves significant bit-rate savings over the FR-SBS scheme. Fig. 2 presents a cropped part of the virtual depth map; as expected, the quality obtained with the proposed scheme is better than that of the FC-SBS scheme, which in turn yields a virtual texture of improved quality.

Conclusion

In this Letter, a frame-compatible depth video coding scheme is proposed. The secondary view preserves only the uncovered region, removing the redundancy with respect to the primary view before downscaling. This strategy guarantees better reconstruction of both views. The experimental results confirm that the proposed scheme outperforms the full-resolution scheme as well as the conventional frame-compatible scheme.

References

1 Vetro, A.: 'Frame compatible formats for 3D video distribution'. IEEE Int. Conf. Image Processing, Hong Kong, September 2010, pp. 2405–2408
2 Tanimoto, M.: 'Overview of FTV (free-viewpoint television)'. IEEE Int. Conf. Multimedia and Expo, New York, NY, USA, July 2009, pp. 1552–1553
3 'Applications and requirements on 3D video coding', ISO/IEC JTC1/SC29/WG11 (MPEG), N11829, 2011
4 'AHG08: Technical description of GVD (global view and depth) 3D format', ISO/IEC JTC1/SC29/WG11, JCT3V-B0075, 2012
5 'AHG08: Revision of global view and depth format description for depth view SEI message', ISO/IEC JTC1/SC29/WG11, JCT3V-E0106, 2013
6 Homayouni, M., Aminlou, A., Aflaki, P., et al.: 'Content adaptive depth map resampling scheme in multiview video plus depth'. IEEE Int. Symp. Circuits and Systems, Melbourne, Australia, June 2014, pp. 538–541
7 ISO/IEC JTC1/SC29/WG11 (MPEG): 'View synthesis software manual', release 3.5, September 2009
8 Schwarz, H., Marpe, D., Wiegand, T.: 'Overview of the scalable video coding extension of the H.264/AVC standard', IEEE Trans. Circuits Syst. Video Technol., 2007, 17, (9), pp. 1103–1120 (https://doi.org/10.1109/TCSVT.2007.905532)
9 Müller, K., Vetro, A.: 'Common test conditions of 3DV core experiments', ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, JCT3V-G1100, 2014
10 JMVC 8.5, garcon.ient.rwth-aachen.de, September 2011
11 Bjontegaard, G.: 'Calculation of average PSNR differences between RD-curves'. VCEG-M33, 13th VCEG Meeting, Austin, TX, USA, April 2001
