Semantic labeling for the very high resolution (VHR) image of urban areas is challenging, because of many complex manmade objects with different materials and fine-structured objects located together. Under the framework of convolutional neural networks (CNNs), this paper proposes a novel end-to-end network for semantic labeling. Specifically, our network not only improves the labeling accuracy of complex manmade objects by aggregating multiple context semantics with a cascaded architecture, but also refines fine-structured objects by utilizing the low-level detail in shallow layers of CNNs with a hierarchical pyramid structure. Throughout the network, a dedicated residual correction scheme is employed to amend the latent fitting residual. As a result of these specific components, the whole model works in a global-to-local and coarse-to-fine manner. Experimental results show that our network outperforms the state-of-the-art methods on the large-scale ISPRS Vaihingen 2D Semantic Labeling Challenge dataset.