Abstract:
Bridging the gap between national park management practices and public perceptions hinges on a precise understanding of visitors' perceptions of cultural ecosystem services (CES). Current research in this domain relies predominantly on single-modal data, which fails to capture the multidimensional nature of visitor experiences. This study develops and validates a dual-perspective analytical framework that integrates textual and visual data. Unlike previous studies that depend on strictly paired data, the framework is specifically designed to accommodate the inherently unpaired nature of user-generated content (UGC) on social media, where texts and images often portray the same experience from different yet complementary dimensions. Taking Wuyi Mountain National Park as an empirical case, a fine-tuned BERT model is applied to perform multi-label CES classification and sentiment analysis on 2741 textual reviews. Fourteen visual categories aligned with the CES typology are defined, and a fine-tuned ResNet-50 model is used to identify landscape elements in 3521 visitor photographs. The results from both modalities are mapped onto a unified CES classification framework to enable a macro-level comparison between the textual and visual perspectives. Three core findings emerge. First, texts and images exhibit significant complementarity in representing CES. Texts excel at conveying intrinsic spiritual experiences and recreational sentiments, capturing internal emotional states that are difficult to represent visually. In contrast, images are more adept at documenting tangible aesthetic landscapes and educational scenes, such as geological formations, vegetation, and interpretive facilities. This complementary relationship between implicit perceptions and explicit landscapes suggests that single-modality data can lead to perceptual bias.
Relying solely on text risks overlooking visually prominent landscape elements, while analyzing only images tends to underrepresent visitors' intrinsic experiential values. Second, the integrated dual-perspective analysis reveals that visitor perceptions are dominated by three core dimensions: aesthetic value, recreational value, and spiritual value, suggesting that the visitor experience functions as an integrated whole rather than as a set of isolated service categories. In comparison, cultural heritage and educational functions remain relatively peripheral and exhibit a pattern of "see more but say less": images capture educational scenes far more frequently than textual descriptions mention them. Third, sentiment analysis identifies four interrelated management pain points: pricing, service quality, queuing, and transportation. These issues primarily affect recreational and aesthetic services and constitute the main drivers of negative visitor experiences. From a methodological perspective, the study makes use of unpaired social media data and constructs a replicable and generalizable analytical framework, offering more comprehensive empirical support for assessing visitor perceptions in national parks. From a practical standpoint, it clarifies the hierarchical and complementary structure of CES perceptions and identifies management shortcomings through sentiment feedback, providing actionable references for optimizing visitor experiences and addressing operational challenges. Finally, the adaptability of this framework to other protected areas, along with its potential for integration with emerging multimodal large models, is discussed to chart directions for future research.
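
The macro-level comparison of unpaired texts and images described above can be illustrated with a minimal sketch. It assumes that each review already carries a set of CES labels and each photograph a single visual category. The mapping `VISUAL_TO_CES`, the function names, and the toy inputs are hypothetical and are not taken from the study. The abstract does not list the 14 visual categories or their CES assignments.

```python
from collections import Counter

# Hypothetical mapping from visual categories to CES classes (illustrative only;
# the study's actual 14 categories and their assignments are not given here).
VISUAL_TO_CES = {
    "geological_formation": "aesthetic",
    "vegetation": "aesthetic",
    "interpretive_facility": "educational",
}


def ces_shares(label_sets):
    """Return the share of items carrying each CES label.

    Items may carry several labels, so the shares can sum to more than 1.
    """
    counts = Counter(label for labels in label_sets for label in set(labels))
    n = len(label_sets)
    return {label: count / n for label, count in counts.items()}


def compare_modalities(text_labels, image_categories):
    """Compare CES shares across two unpaired collections.

    No review is matched to a photograph; only the aggregate
    distributions over the shared CES typology are compared.
    """
    image_labels = [[VISUAL_TO_CES[c]] for c in image_categories]
    text_share = ces_shares(text_labels)
    image_share = ces_shares(image_labels)
    all_labels = sorted(set(text_share) | set(image_share))
    return {
        label: (text_share.get(label, 0.0), image_share.get(label, 0.0))
        for label in all_labels
    }


# Toy inputs for illustration, not data from the study.
toy_text = [["spiritual", "recreational"], ["aesthetic"], ["recreational"], ["spiritual"]]
toy_images = ["geological_formation", "vegetation", "interpretive_facility", "vegetation"]
result = compare_modalities(toy_text, toy_images)
for label, (text_share, image_share) in result.items():
    print(f"{label}: text={text_share:.2f}, image={image_share:.2f}")
```

Because the comparison is made on aggregate shares rather than on matched text-image pairs, the two collections may differ in size and need not come from the same posts.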