{"title": "Combining Fully Convolutional and Recurrent Neural Networks for 3D Biomedical Image Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 3036, "page_last": 3044, "abstract": "Segmentation of 3D images is a fundamental problem in biomedical image analysis. Deep learning (DL) approaches have achieved the state-of-the-art segmentation performance. To exploit the 3D contexts using neural networks, known DL segmentation methods, including 3D convolution, 2D convolution on the planes orthogonal to 2D slices, and LSTM in multiple directions, all suffer incompatibility with the highly anisotropic dimensions in common 3D biomedical images. In this paper, we propose a new DL framework for 3D image segmentation, based on a combination of a fully convolutional network (FCN) and a recurrent neural network (RNN), which are responsible for exploiting the intra-slice and inter-slice contexts, respectively. To our best knowledge, this is the first DL framework for 3D image segmentation that explicitly leverages 3D image anisotropism. Evaluating using a dataset from the ISBI Neuronal Structure Segmentation Challenge and in-house image stacks for 3D fungus segmentation, our approach achieves promising results, comparing to the known DL-based 3D segmentation approaches.", "full_text": "Combining Fully Convolutional and Recurrent Neural Networks for 3D Biomedical Image Segmentation\n\nJianxu Chen\nUniversity of Notre Dame\njchen16@nd.edu\n\nLin Yang\nUniversity of Notre Dame\nlyang5@nd.edu\n\nYizhe Zhang\nUniversity of Notre Dame\nyzhang29@nd.edu\n\nMark Alber\nUniversity of Notre Dame\nmalber@nd.edu\n\nDanny Z. Chen\nUniversity of Notre Dame\ndchen@nd.edu\n\nAbstract\n\nSegmentation of 3D images is a fundamental problem in biomedical image analysis.\nDeep learning (DL) approaches have achieved state-of-the-art segmentation\nperformance. 
To exploit the 3D contexts using neural networks, known DL segmentation\nmethods, including 3D convolution, 2D convolution on planes orthogonal to 2D\nimage slices, and LSTM in multiple directions, all suffer incompatibility with the\nhighly anisotropic dimensions in common 3D biomedical images. In this paper,\nwe propose a new DL framework for 3D image segmentation, based on a com-\nbination of a fully convolutional network (FCN) and a recurrent neural network\n(RNN), which are responsible for exploiting the intra-slice and inter-slice contexts,\nrespectively. To our best knowledge, this is the \ufb01rst DL framework for 3D image\nsegmentation that explicitly leverages 3D image anisotropism. Evaluating using a\ndataset from the ISBI Neuronal Structure Segmentation Challenge and in-house\nimage stacks for 3D fungus segmentation, our approach achieves promising results\ncomparing to the known DL-based 3D segmentation approaches.\n\n1\n\nIntroduction\n\nIn biomedical image analysis, a fundamental problem is the segmentation of 3D images, to identify\ntarget 3D objects such as neuronal structures [1] and knee cartilage [15]. In biomedical imaging, 3D\nimages often consist of highly anisotropic dimensions [11], that is, the scale of each voxel in depth\n(the z-axis) can be much larger (e.g., 5\u223c 10 times) than that in the xy plane.\nOn various biomedical image segmentation tasks, deep learning (DL) methods have achieved tremen-\ndous success in terms of accuracy (outperforming classic methods by a large margin [4]) and\ngenerality (mostly application-independent [16]). For 3D segmentation, known DL schemes can be\nbroadly classi\ufb01ed into four categories. (I) 2D fully convolutional networks (FCN), such as U-Net\n[16] and DCAN [2], can be applied to each 2D image slice, and 3D segmentation is then generated\nby concatenating the 2D results. 
(II) 3D convolutions can be employed to replace 2D convolutions\n[10], or combined with 2D convolutions into a hybrid network [11]. (III) Tri-planar schemes (e.g.,\n[15]) apply three 2D convolutional networks based on orthogonal planes (i.e., the xy, yz, and xz\nplanes) to perform voxel classification. (IV) 3D segmentation can also be conducted by recurrent\nneural networks (RNN). The most representative RNN-based scheme is Pyramid-LSTM [18], which\nuses six generalized long short-term memory networks to exploit the 3D context.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: An overview of our DL framework for 3D segmentation. There are two key components in\nthe architecture: kU-Net and BDC-LSTM. kU-Net is a type of FCN and is applied to 2D slices to\nexploit intra-slice contexts. BDC-LSTM, a generalized LSTM network, is applied to a sequence of\n2D feature maps, from 2D slice z \u2212 \u03c1 to 2D slice z + \u03c1, extracted by kU-Nets, to extract hierarchical\nfeatures from the 3D contexts. Finally, a softmax function (the green arrows) is applied to the result\nof each slice in order to build the segmentation probability map.\n\nThere are three main issues with the known DL-based 3D segmentation methods. First, simply linking\n2D segmentations into 3D cannot leverage the spatial correlation along the z-direction. Second,\nincorporating 3D convolutions may incur extremely high computation costs (e.g., high memory\nconsumption and long training time [10]). Third, both 3D convolution and the workarounds that\nreduce its intensive computation, such as tri-planar schemes or Pyramid-LSTM, perform\nconvolutions with isotropic kernels on anisotropic 3D images. This could be\nproblematic, especially for images with substantially lower resolution in depth (the z-axis). For\ninstance, both the tri-planar schemes and Pyramid-LSTM perform 2D convolutions on the xz and\nyz planes. 
For two orthogonal one-voxel wide lines in the xz plane, one along the z-direction and\nthe other along the x-direction, they may correspond to two structures at very different scales, and\nconsequently may correspond to different types of objects \u2014 or even may not both correspond\nto objects of interest. But, 2D convolutions on the xz plane with isotropic kernel are not able to\ndifferentiate these two lines. On the other hand, 3D objects of a same type, if rotated in 3D, may have\nvery different appearances in the xz or yz plane. This fact makes the features extracted by such 2D\nisotropic convolutions in the xz or yz plane suffer poor generality (e.g., may cause over\ufb01tting).\nIn common practice, a 3D biomedical image is often represented as a sequence of 2D slices (called\na z-stack). Recurrent neural networks, especially LSTM [8], are an effective model to process\nsequential data [14, 17]. Inspired by these facts, we propose a new framework combining two DL\ncomponents: a fully convolutional network (FCN) to extract intra-slice contexts, and a recurrent\nneural network (RNN) to extract inter-slice contexts. Our framework is based on the following ideas.\nOur FCN component employs a new deep architecture for 2D feature extraction. It aims to ef\ufb01ciently\ncompress the intra-slice information into hierarchical features. Comparing to known FCN for 2D\nbiomedical imaging (e.g., U-Net [16]), our new FCN is considerably more effective in dealing with\nobjects of very different scales by simulating human behaviors in perceiving multi-scale information.\nWe introduce a generalized RNN to exploit 3D contexts, which essentially applies a series of 2D\nconvolutions on the xy plane in a recurrent fashion to interpret 3D contexts while propagating\ncontextual information in the z-direction. Our key idea is to hierarchically assemble intra-slice\ncontexts into 3D contexts by leveraging the inter-slice correlations. 
The insight is that our RNN can\ndistill 3D contexts in the same spirit as a 2D convolutional neural network (CNN) extracts a\nhierarchy of contexts from a 2D image. Compared to known RNN models for 3D segmentation,\nsuch as Pyramid-LSTM [18], our RNN model is free of the problematic isotropic convolutions on\nanisotropic images, and can exploit 3D contexts more efficiently by combining with FCN.\nThe essential difference between our new DL framework and the known DL-based 3D segmentation\napproaches is that we explicitly leverage the anisotropism of 3D images and efficiently construct a\nhierarchy of discriminative features from 3D contexts by performing systematic 2D operations. Our\nframework can serve as a new paradigm for migrating 2D DL architectures (e.g., CNN) to effectively\nexploit 3D contexts and solve 3D image segmentation problems.\n\n2 Methodology\n\nFigure 2: Illustrating four different ways to organize k submodule U-Nets in kU-Net (here k = 2).\nU-Net-2 works in a coarser scale (downsampled once from the original image), while U-Net-1 works\nin a finer scale (directly cropped from the original image). kU-Net propagates high level information\nextracted by U-Net-2 to U-Net-1. (A) U-Net-1 fuses the output of U-Net-2 in the downsampling\nstream. (B) U-Net-1 fuses the output of U-Net-2 in the upsampling stream. (C) U-Net-1 fuses the\nintermediate result of U-Net-2 in the most abstract layer. (D) U-Net-1 takes every piece of information\nfrom U-Net-2 in the commensurate layers. Architecture (A) is finally adopted for kU-Net.\n\nA schematic view of our DL framework is given in Fig. 1. This framework is a combination of two\nkey components: an FCN (called kU-Net) and an RNN (called BDC-LSTM), to exploit intra-slice\nand inter-slice contexts, respectively. Section 2.1 presents the kU-Net, and Section 2.2 introduces the\nderivation of the BDC-LSTM. 
We then show how to combine these two components in the framework\nto conduct 3D segmentation. Finally, we discuss the training strategy.\n\n2.1 The FCN Component: kU-Net\n\nThe FCN component aims to construct a feature map for each 2D slice, from which object-relevant\ninformation (e.g., texture, shapes) will be extracted and object-irrelevant information (e.g., uneven\nillumination, imaging contrast) will be discarded. By doing so, the next RNN component can\nconcentrate on the inter-slice context.\nA key challenge to the FCN component is the multi-scale issue. Namely, objects in biomedical images,\nspeci\ufb01cally in 2D slices, can have very different scales and shapes. But, the common FCN [13]\nand other known variants for segmenting biomedical images (e.g., U-Net [16]) work on a \ufb01xed-size\nperception \ufb01eld (e.g., a 500 \u00d7 500 region in the whole 2D slice). When objects are of larger scale\nthan the pre-de\ufb01ned perception \ufb01eld size, it can be troublesome for such FCN methods to capture the\nhigh level context (e.g., the overall shapes). In the literature, a multi-stream FCN was proposed in\nProNet [19] to address this multi-scale issue in natural scene images. In ProNet, the same image is\nresized to different scales and fed in parallel to a shared FCN with the same parameters. However,\nthe mechanism of shared parameters may make it not suitable for biomedical images, because objects\nof different scales may have very different appearances and require different FCNs to process.\nWe propose a new FCN architecture to simulate how human experts perceive multi-scale information,\nin which multiple submodule FCNs are employed to work on different image scales systematically.\nHere, we use U-Net [16] as the submodule FCN and call the new architecture kU-Net. 
U-Net [16] is\nchosen because it is a well-known FCN that has achieved great success in biomedical image segmentation.\nU-Net [16] consists of four downsampling steps followed by four upsampling steps. Skip-layer\nconnections exist between each downsampled feature map and the commensurate upsampled feature\nmap. We refer to [16] for the detailed structure of U-Net.\nWe observed that, when human experts label the ground truth, they tend to first zoom out of the image\nto figure out where the target objects are and then zoom in to label the accurate boundaries of\nthose targets. There are two critical mechanisms in kU-Net to simulate such human behaviors. (1)\nkU-Net employs a sequence of submodule FCNs to extract information at different scales sequentially\n(from the coarsest scale to the finest scale). (2) The information extracted by the submodule FCN\nresponsible for a coarser scale will be propagated to the subsequent submodule FCN to assist the\nfeature extraction in a finer scale.\nFirst, we create different scales of an original input 2D image by a series of connections of k \u2212 1\nmax-pooling layers. Let I_t be the image of scale t (t = 1, . . . , k), i.e., the result after t \u2212 1\nmax-pooling layers (I_1 is the original image). Each pixel in I_t corresponds to 2^(t\u22121) pixels in the original\nimage. Then, we use U-Net-t (t = 1, . . . , k), i.e., the t-th submodule, to process I_t. We keep the\ninput window size the same across all U-Nets by using crop layers. Intuitively, U-Net-1 to U-Net-k\nall have the same input size, while U-Net-1 views the smallest region with the highest resolution and\nU-Net-k views the largest region with the lowest resolution. 
In other words, for any 1 \u2264 t1 < t2 \u2264 k,\nU-Net-t2 is responsible for a larger image scale than U-Net-t1.\nSecond, we need to propagate the higher level information extracted by U-Net-t (2 \u2264 t \u2264 k) to\nthe next submodule, i.e., U-Net-(t \u2212 1), so that clues from a coarser scale can assist the work in\na \ufb01ner scale. A natural strategy is to copy the result from U-Net-t to the commensurate layer in\nU-Net-(t \u2212 1). As shown in Fig. 2, there are four typical ways to achieve this: (A) U-Net-(t \u2212 1) only\nuses the \ufb01nal result from U-Net-t and uses it at the start; (B) U-Net-(t \u2212 1) only uses the \ufb01nal result\nfrom U-Net-t and uses it at the end; (C) U-Net-(t \u2212 1) only uses the most abstract information from\nU-Net-t; (D) U-Net-(t \u2212 1) uses every piece of information from U-Net-t. Based on our trial studies,\ntype (A) and type (D) achieved the best performance. Since type (A) has fewer parameters than (D),\nwe chose type (A) as our \ufb01nal architecture to organize the sequence of submodule FCNs.\nFrom a different perspective, each submodule U-Net can be viewed as a \u201csuper layer\". Therefore,\nthe kU-Net is a \u201cdeep\u201d deep learning model. Because the parameter k exponentially increases the\ninput window size of the network, a small k is suf\ufb01cient to handle many biomedical images (we use\nk = 2 in our experiments). Appended with a 1\u00d71 convolution (to convert the number of channels in\nthe feature map) and a softmax layer, the kU-Net can be used for 2D segmentation problems. We\nwill show (see Table 1) that kU-Net (i.e., a sequence of collaborative U-Nets) can achieve better\nperformance than a single U-Net in terms of segmentation accuracy.\n\n2.2 The RNN Component: BDC-LSTM\n\nIn this section, we \ufb01rst review the classic LSTM network [8], and the generalized convolutional\nLSTM [14, 17, 18] (denoted by CLSTM). 
Next, we describe how our RNN component, called\nBDC-LSTM, is extended from CLSTM. Finally, we propose a deep architecture for BDC-LSTM,\nand discuss its advantages over other variants.\nLSTM and CLSTM: RNN (e.g., LSTM) is a neural network that maintains a self-connected internal\nstatus acting as a \u201cmemory\u201d. The ability to \u201cremember\u201d what has been seen allows RNN to attain\nexceptional performance in processing sequential data.\nRecently, a generalized LSTM, denoted by CLSTM, was developed [14, 17, 18]. CLSTM explicitly\nassumes that the input is images and replaces the vector multiplication in LSTM gates by convolutional\noperators. It is particularly efficient in exploiting image sequences. For instance, it can be used for\nimage sequence prediction either in an encoder-decoder framework [17] or by combining with optical\nflows [14]. Specifically, CLSTM can be formulated as follows.\n\n\\begin{equation}\n\\left\\{\n\\begin{aligned}\ni_z &= \\sigma(x_z \\ast W_{xi} + h_{z-1} \\ast W_{hi} + b_i)\\\\\nf_z &= \\sigma(x_z \\ast W_{xf} + h_{z-1} \\ast W_{hf} + b_f)\\\\\nc_z &= c_{z-1} \\odot f_z + i_z \\odot \\tanh(x_z \\ast W_{xc} + h_{z-1} \\ast W_{hc} + b_c)\\\\\no_z &= \\sigma(x_z \\ast W_{xo} + h_{z-1} \\ast W_{ho} + b_o)\\\\\nh_z &= o_z \\odot \\tanh(c_z)\n\\end{aligned}\n\\right.\n\\tag{1}\n\\end{equation}\n\nHere, \u2217 denotes convolution and \u2299 denotes element-wise product. \u03c3() and tanh() are logistic\nsigmoid and hyperbolic tangent functions; i_z, f_z, o_z are the input gate, forget gate, and output gate,\nb_i, b_f, b_c, b_o are bias terms, and x_z, c_z, h_z are the input, the cell activation state, and the hidden\nstate, at slice z. W_{\u2217\u2217} are diagonal weight matrices governing the value transitions. For instance, W_{hf}\ncontrols how the forget gate takes values from the hidden state. The input to CLSTM is a feature\nmap of size f_in\u00d7l_in\u00d7w_in, and the output is a feature map of size f_out\u00d7l_out\u00d7w_out, l_out \u2264 l_in and\nw_out \u2264 w_in. 
lout and wout depend on the size of the convolution kernels and whether padding is used.\nBDC-LSTM: We extend CLSTM to Bi-Directional Convolutional LSTM (BDC-LSTM). The key\nextension is to stack two layers of CLSTM, which work in two opposite directions (see Fig. 3(A)).\nThe contextual information carried in the two layers, one in z\u2212-direction and the other in z+-direction,\nis concatenated as output. It can be interpreted as follows. To determine the hidden state at a slice\nz, we take the 2D hierarchical features in slice z (i.e., xz) and the contextual information from both\nthe z+ and z\u2212 directions. One layer of CLSTM will integrate the information from the z\u2212-direction\n(resp., z+-direction) and xz to capture the minus-side (resp., plus-side) context (see Fig. 3(B)). Then,\nthe two one-side contexts (z+ and z\u2212) will be fused.\nIn fact, Pyramid-LSTM [18] can be viewed as a different extension of CLSTM, which employs six\nCLSTMs in six different directions (x+/\u2212, y+/\u2212, and z+/\u2212) and sums up the outputs of the six\nCLSTMs. However, useful information may be lost during the output summation. Intuitively, the sum\nof six outputs can only inform a simpli\ufb01ed context instead of the exact situations in different directions.\nIt should be noted that concatenating six outputs may greatly increase the memory consumption, and\nis thus impractical in Pyramid-LSTM. Hence, besides avoiding problematic convolutions on the xz\nand yz planes (as discussed in Section 1), BDC-LSTM is in principle more effective in exploiting\ninter-slice contexts than Pyramid-LSTM.\nDeep Architectures: Multiple BDC-LSTMs can be stacked into a deep structure by taking the output\nfeature map of one BDC-LSTM as the input to another BDC-LSTM. In this sense, each BDC-LSTM\ncan be viewed as a super \u201clayer\" in the deep structure. 
Besides simply taking one output as another\ninput, we can also insert other operations, like max-pooling or deconvolution, in between BDC-LSTM\nlayers. As a consequence, deep architectures for 2D CNN can be easily migrated or generalized to\nbuild deep architectures for BDC-LSTM. This is shown in Fig. 3(C)-(D). The underlying relationship\nbetween deep BDC-LSTM and 2D deep CNN is that deep CNN extracts a hierarchy of non-linear\nfeatures from a 2D image and a deeper layer aims to interpret higher level information of the image,\nwhile deep BDC-LSTM extracts a hierarchy of hierarchical contextual features from the 3D context\nand a deeper BDC-LSTM layer seeks to interpret higher level 3D contexts.\nIn [14, 17, 18], multiple CLSTMs were simply stacked one by one, maybe with different kernel sizes,\nin which a CLSTM \u201clayer\u201d may be viewed as a degenerated BDC-LSTM \u201clayer\u201d. When considering\nthe problem in the context of CNN, as discussed above, one can see that no feature hierarchy was\neven formed in these simple architectures. Usually, convolutional layers are followed by subsampling,\nsuch as max-pooling, in order to form the hierarchy.\nWe propose a deep architecture combining max-pooling, dropout and deconvolution layers with the\nBDC-LSTM layers. The detailed structure is as follows (the numbers in parentheses indicate the size\nchanges of the feature map in each 2D slice). 
Input (64\u00d7126\u00d7126), dropout layer with p = 0.5, two\nBDC-LSTMs with 64 hidden units and 5\u00d75 kernels (64\u00d7118\u00d7118), 2\u00d72 max-pooling (64\u00d759\u00d759),\ndropout layer with p = 0.5, two BDC-LSTMs with 64 hidden units and 5\u00d75 kernels (64\u00d751\u00d751), 2\u00d72\ndeconvolution (64\u00d7102\u00d7102), dropout layer with p = 0.5, 3\u00d73 convolution layer without recurrent\nconnections (64\u00d7100\u00d7100), 1\u00d71 convolution layer without recurrent connections (2\u00d7100\u00d7100).\n(Note: All convolutions in BDC-LSTM use the same kernel size as indicated in the layers.) Thus,\nto predict the probability map of a 100\u00d7100 region, we need the 126\u00d7126 region centered at the\nsame position as the input. In the evaluation stage, the whole feature map can be processed using the\noverlapping-tile strategy [16], because deep BDC-LSTM is fully convolutional along the z-direction.\nSuppose the feature map of a whole slice is of size 64\u00d7W\u00d7H. The input tensor will be padded with\nzeros on the borders to a size of 64\u00d7(W+26)\u00d7(H+26). Then, a sequence of 64\u00d7126\u00d7126\npatches will be processed each time. The results are stitched to form the 3D segmentation.\n\nFigure 3: (A) The structure of BDC-LSTM, where two layers of CLSTM modules are connected in a\nbi-directional manner. (B) A graphical illustration of information propagation through BDC-LSTM\nalong the z-direction. (C) The circuit diagram of BDC-LSTM. The green arrows represent the\nrecurrent connections in opposite directions. When this diagram is rotated by 90 degrees, it has a\nstructure similar to that of a layer in CNN, except for the recurrent connections. (D) The deep structure of\nBDC-LSTM used in our method. BDC-LSTM can be stacked in a way analogous to a layer in CNN.\nThe red arrows are 5 \u00d7 5 convolutions. The yellow and purple arrows indicate max-pooling and\ndeconvolution, respectively. 
The rightmost blue arrow indicates a 1 \u00d7 1 convolution. Dropout is\napplied (not shown) after the input layer, the max-pooling layer and the deconvolution layer.\n\n2.3 Combining kU-Net and BDC-LSTM\n\nThe motivation for solving 3D segmentation by combining FCN (kU-Net) and RNN (BDC-LSTM) is\nto distribute the burden of exploiting 3D contexts. kU-Net extracts and compresses the hierarchy of\nintra-slice contexts into feature maps, and BDC-LSTM distills the 3D context from a sequence of\nabstracted 2D contexts. These two components work in coordination, as follows.\nSuppose the 3D image consists of N_z 2D slices of size N_x \u00d7 N_y each. First, kU-Net extracts feature\nmaps of size 64 \u00d7 N_x \u00d7 N_y, denoted by f_2D^z, from each slice z. The overlapping-tile strategy [16]\nwill be adopted when the 2D images are too big to be processed by kU-Net in one shot. Second,\nBDC-LSTM works on f_2D^z to build the hierarchy of non-linear features from 3D contexts and\ngenerate another 64 \u00d7 N_x \u00d7 N_y feature map, denoted by f_3D^z, z = 1, . . . , N_z. For each slice z, f_2D^h\n(h = z\u2212\u03c1, . . . , z, . . . , z+\u03c1) will serve as the context (\u03c1 = 1 in our implementation). Finally, a\nsoftmax function is applied to f_3D^z to generate the 3D segmentation probability map.\n\n2.4 Training Strategy\n\nOur whole network, including kU-Net and BDC-LSTM, can be trained either end-to-end or in a\ndecoupled manner. Sometimes, biomedical images are too big to be processed as a whole. Overlapping-tile\nis a common approach [16], but can also reduce the range of the context utilized by the networks.\nThe decoupled training, namely, training kU-Net and BDC-LSTM separately, is especially useful\nin situations where the effective context of each voxel is very large. Given the same amount of\ncomputing resources (e.g., GPU memory), when allocating all resources to train one component\nonly, both kU-Net and BDC-LSTM can take much larger tiles as input. 
In practice, even though\nend-to-end training has the advantages of simplicity and consistency, the decoupled training strategy is\npreferred for challenging problems.\nkU-Net is initialized using the strategy in [7] and trained using Adam [9], with first moment coefficient\n(\u03b21)=0.9, second moment coefficient (\u03b22)=0.999, \u03b5=1e\u221210, and a constant learning rate 5e\u22125. The\ntraining method for BDC-LSTM is RMSprop [6], with smoothing constant (\u03b1)=0.9 and \u03b5=1e\u22125.\nThe initial learning rate is set to 1e\u22123 and is halved every 2000 iterations until it reaches 1e\u22125. In each iteration,\none training example is randomly selected. The training data is augmented with rotation, flipping,\nand mirroring. To avoid gradient explosion, the gradient is clipped to [\u22125, 5] in each iteration. The\nparameters in BDC-LSTM are initialized with random values uniformly selected from [\u22120.02, 0.02].\n\nTable 1: Experimental results on the ISBI neuron dataset and in-house 3D fungus datasets.\n\nMethod | Neuron V_rand | Neuron V_info | Fungus Pixel Error\nPyramid-LSTM [18] | 0.9677 | 0.9829 | N/A\nU-Net [16] | 0.9728 | 0.9866 | 0.0263\nTri-Planar [15] | 0.8462 | 0.9180 | 0.0375\n3D Conv [10] | 0.8178 | 0.9125 | 0.0630\nOurs (FCN only) | 0.9749 | 0.9869 | 0.0242\nOurs (FCN+simple RNN) | 0.9742 | 0.9869 | 0.0241\nOurs (FCN+deep RNN) | 0.9753 | 0.9870 | 0.0215\n\nWe use a weighted cross-entropy loss in both the kU-Net and BDC-LSTM training. In biomedical\nimage segmentation, there may often be certain important regions in which errors should be reduced\nas much as possible. For instance, when two objects touch each other tightly, it is important to\nsegment correctly along the separating boundary between the two objects, while errors near\nthe non-touching boundaries are of less importance. 
Hence, we adopt the idea in [16] to assign a\nunique weight for each voxel in the loss calculation.\n\n3 Experiments\n\nOur framework was implemented using Torch7 [5] and the rnn package [12]. We conducted experiments\non a workstation with a 12GB NVIDIA Tesla K40m GPU, using the cuDNN library (v5) for GPU\nacceleration. Our approach was evaluated in two 3D segmentation applications and compared with\nseveral state-of-the-art DL methods.\n3D Neuron Structures: The first evaluation dataset was from the ISBI challenge on the segmentation\nof neuronal structures in 3D electron microscopic (EM) images [1]. The objective is to segment the\nneuron boundaries. Briefly, there are two image stacks of 512 \u00d7 512 \u00d7 30 voxels, where each voxel\nmeasures 4 \u00d7 4 \u00d7 50 nm. Noise and section alignment errors exist in both stacks. One stack (with\nground truth) was used for training, and the other was used for evaluation. We adopted the same metrics\nas in [1], i.e., the foreground-restricted Rand score (V_rand) and the information theoretic score (V_info) after\nborder thinning. As shown in [1], V_rand and V_info are good approximations of the difficulty for a human\nto correct the segmentation errors, and are robust to border variations due to the thickness.\n3D Fungus Structures: Our method was also evaluated on in-house datasets for the segmentation of\ntubular fungus structures in 3D images from a Serial Block-Face Scanning Electron Microscope. The\nratio of the voxel scales is x : y : z = 1 : 1 : 3.45. There are five stacks, in all of which each slice is\na grayscale image of 853 \u00d7 877 pixels. We manually labeled the first 16 slices in one stack as the\ntraining data and used the other four stacks, each containing 81 sections, for evaluation. The metric\nto quantify the segmentation accuracy is pixel error, defined as the Euclidean distance between the\nground truth label (0 or 1) and segmentation probability (a value in the range of [0, 1]). 
Note that we\ndo not use the same metric as for the neuron dataset, because \u201cborder thinning\u201d is not applicable to\nthe fungus datasets. The pixel error, a well-recognized metric for quantifying pixel-level accuracy,\nwas also adopted at the time of the ISBI neuron segmentation challenge. It is also worth\nmentioning that it is impractical to label four whole stacks for evaluation due to the intensive labor required. Hence,\nwe prepared the ground truth for every fifth section in each evaluation stack (i.e., sections 5, 10, 15, . . ., 75, 80).\nIn total, 16 sections were selected to estimate the performance on a whole stack. Namely, all 81\nsections in each stack were segmented, but only 16 of them were used to compute the evaluation score of\nthe corresponding stack. The reported performance is the average of the scores for all four stacks.\nRecall the four categories of known deep learning based 3D segmentation methods described in\nSection 1. We selected one typical method from each category for comparison. (1) U-Net [16],\nwhich achieved state-of-the-art segmentation accuracy on 2D biomedical images, is selected as\nthe representative scheme of linking 2D segmentations into 3D results. (Note: We are aware of the\nmethod [3], which is another variant of 2D FCN and achieved excellent performance on the neuron\ndataset. However, unlike U-Net, the generality of [3] in different applications is not yet clear. Our\ntest of [3] on the in-house datasets showed an F1-score at least 5% lower than that of U-Net. Thus, we\ndecided to take U-Net as the representative method in this category.) (2) 3D-Conv [10] is a method\nusing CNN with 3D convolutions. (3) Tri-planar [15] is a classic solution to avoid the high computing\ncosts of 3D convolutions, which replaces 3D convolution with three 2D convolutions on orthogonal\nplanes. (4) Pyramid-LSTM [18] is the best known generalized LSTM network for 3D segmentation.\n\nFigure 4: (A) A cropped region in a 2D fungus image. (B) The result using only the FCN component.\n(C) The result of combining FCN and RNN. (D) The true fungi to be segmented in (A).\n\nResults: The results on the 3D neuron dataset and the fungus datasets are shown in Table 1. Our\nproposed kU-Net, when used alone, improves over U-Net [16] on all three metrics (e.g., a pixel error\nof 0.0242 versus 0.0263 on the fungus datasets). Our approach outperforms the known DL methods utilizing 3D contexts. Moreover, one can\nsee that our proposed deep architecture achieves better performance than simply stacking multiple\nBDC-LSTMs together. As discussed in Section 2.2, adding subsampling layers as in 2D CNN\nenables the RNN component to perceive higher level 3D contexts. It is worth mentioning that our\ntwo evaluation datasets are quite representative. The fungus data has small anisotropism (z resolution\nis close to xy resolution). The 3D neuron dataset has large anisotropism (z resolution is much lower\nthan xy resolution). The results on both datasets thus demonstrate the effectiveness of our framework\nin handling and leveraging anisotropism.\nWe should mention that we re-implemented Pyramid-LSTM [18] in Torch7 and tested it on the fungus\ndatasets. However, the memory requirement of Pyramid-LSTM, when implemented in Torch7, was too\nlarge for our GPU. For the original network structure, the largest possible cubical region to process\neach time within our GPU memory capacity was 40 \u00d7 40 \u00d7 8. Using the same hyper-parameters\nas in [18], we could not obtain acceptable results due to the limited processing cube. (The result of\nPyramid-LSTM on the 3D neuron dataset was fetched from the ISBI challenge leader board (see footnote 1) on\nMay 10, 2016.) Here, one may see that our method is much more efficient in GPU memory, when\nimplemented under the same deep learning framework and tested on the same machine.\nSome results are shown in Fig. 
4 to qualitatively compare the results using the FCN component\nalone and the results of combining RNN and FCN. In general, both methods make almost no false\nnegative errors. However, the RNN component can help to (1) suppress false positive errors by maintaining\ninter-slice consistency, and (2) make more confident predictions in ambiguous cases by leveraging\nthe 3D context. In a nutshell, FCN collects as much discriminative information as possible within\neach slice and RNN makes further refinement according to inter-slice correlation, so that an accurate\nsegmentation can be made at each voxel.\n\n4 Conclusions and Future Work\n\nIn this paper, we introduce a new deep learning framework for 3D image segmentation, based on\na combination of an FCN (i.e., kU-Net) to exploit 2D contexts and an RNN (i.e., BDC-LSTM) to\nintegrate contextual information along the z-direction. Evaluated in two different 3D biomedical\nimage segmentation applications, our proposed approach can achieve state-of-the-art performance\nand outperform known DL schemes utilizing 3D contexts. Our framework provides a new paradigm\nto migrate the superior performance of 2D deep architectures to exploit 3D contexts. Following\nthis new paradigm, we will explore BDC-LSTMs in different deep architectures to achieve\nfurther improvement and conduct more extensive evaluations on different datasets, such as BraTS\n(http://www.braintumorsegmentation.org/) and MRBrainS (http://mrbrains13.isi.uu.nl).\n\n5 Acknowledgement\n\nThis research was supported in part by NSF Grants CCF-1217906 and CCF-1617735 and NIH Grants\nR01-GM095959 and U01-HL116330. Also, we would like to thank Dr. Viorica Patraucean at the\nUniversity of Cambridge (UK) for discussion of BDC-LSTM, and Prof. David P. Hughes and\nDr. Maridel Fredericksen at Pennsylvania State University (US) for providing the 3D fungus datasets.\n\nFootnote 1: http://brainiac2.mit.edu/isbi_challenge/leaders-board-new\n\nReferences\n[1] A. Cardona, S. 
Saalfeld, S. Preibisch, B. Schmid, A. Cheng, J. Pulokas, P. Tomancak, and V. Hartenstein. An integrated micro- and macroarchitectural analysis of the Drosophila brain by computer-assisted serial section electron microscopy. PLoS Biol, 8(10):e1000502, 2010.\n\n[2] H. Chen, X. Qi, L. Yu, and P.-A. Heng. DCAN: Deep contour-aware networks for accurate gland segmentation. arXiv preprint arXiv:1604.02677, 2016.\n\n[3] H. Chen, X. J. Qi, J. Z. Cheng, and P. A. Heng. Deep contextual networks for neuronal structure segmentation. In AAAI Conference on Artificial Intelligence, 2016.\n\n[4] D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, pages 2843\u20132851, 2012.\n\n[5] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.\n\n[6] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.\n\n[7] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In CVPR, pages 1026\u20131034, 2015.\n\n[8] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780, 1997.\n\n[9] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[10] M. Lai. Deep learning for medical image segmentation. arXiv preprint arXiv:1505.02000, 2015.\n\n[11] K. Lee, A. Zlateski, V. Ashwin, and H. S. Seung. Recursive training of 2D-3D convolutional networks for neuronal boundary prediction. In NIPS, pages 3559\u20133567, 2015.\n\n[12] N. L\u00e9onard, S. Waghmare, and Y. Wang. rnn: Recurrent library for Torch. arXiv preprint arXiv:1511.07889, 2015.\n\n[13] J. Long, E. Shelhamer, and T. Darrell. 
Fully convolutional networks for semantic segmentation. In CVPR, pages 3431\u20133440, 2015.\n\n[14] V. Patraucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309, 2015.\n\n[15] A. Prasoon, K. Petersen, C. Igel, F. Lauze, E. Dam, and M. Nielsen. Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network. In MICCAI, pages 246\u2013253, 2013.\n\n[16] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234\u2013241, 2015.\n\n[17] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W. chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. arXiv preprint arXiv:1506.04214, 2015.\n\n[18] M. F. Stollenga, W. Byeon, M. Liwicki, and J. Schmidhuber. Parallel multi-dimensional LSTM, with application to fast biomedical volumetric image segmentation. In NIPS, pages 2980\u20132988, 2015.\n\n[19] C. Sun, M. Paluri, R. Collobert, R. Nevatia, and L. Bourdev. ProNet: Learning to propose object-specific boxes for cascaded neural networks. arXiv preprint arXiv:1511.03776, 2015.", "award": [], "sourceid": 1507, "authors": [{"given_name": "Jianxu", "family_name": "Chen", "institution": "University of Notre Dame"}, {"given_name": "Lin", "family_name": "Yang", "institution": "University of Notre Dame"}, {"given_name": "Yizhe", "family_name": "Zhang", "institution": "University of Notre Dame"}, {"given_name": "Mark", "family_name": "Alber", "institution": "University of Notre Dame"}, {"given_name": "Danny", "family_name": "Chen", "institution": "University of Notre Dame"}]}