5.1 向量数据库基础
5.1.1 向量数据库概念
向量数据库是专门用于存储和检索高维向量数据的数据库系统,在AI应用中主要用于语义搜索、相似性匹配和检索增强生成(RAG)。
核心概念:
1. 向量嵌入:将文本、图像等数据转换为数值向量
2. 相似性搜索:基于向量距离的相似内容检索
3. 索引优化:高效的向量索引算法(如HNSW、IVF)
4. 元数据过滤:结合结构化数据的混合查询
5.1.2 Spring AI向量存储抽象
// VectorStore接口
/**
 * Abstraction over a vector database: stores document embeddings and
 * retrieves documents by vector similarity.
 */
public interface VectorStore {

    /**
     * Adds the given documents to the store (embedding their content).
     *
     * @param documents documents to index
     */
    void add(List<Document> documents);

    /**
     * Deletes the documents with the given ids.
     *
     * @param idList ids of the documents to remove
     * @return whether the deletion succeeded, when the backend reports it
     */
    Optional<Boolean> delete(List<String> idList);

    /**
     * Similarity search with default parameters.
     *
     * @param query natural-language query text
     * @return documents most similar to the query
     */
    List<Document> similaritySearch(String query);

    /**
     * Similarity search configured by the given request
     * (top-k, similarity threshold, metadata filter).
     *
     * @param request search configuration
     * @return matching documents
     */
    List<Document> similaritySearch(SearchRequest request);
}
// Document类
/**
 * A unit of content managed by the vector store, carrying an id,
 * arbitrary metadata, and (once computed) its embedding vector.
 */
public class Document {

    private String id;
    private String content;
    private Map<String, Object> metadata;
    private List<Double> embedding;

    /** Creates a document with a random id and empty metadata. */
    public Document(String content) {
        this(content, new HashMap<>());
    }

    /** Creates a document with a random id and the given metadata. */
    public Document(String content, Map<String, Object> metadata) {
        this(UUID.randomUUID().toString(), content, metadata);
    }

    /** Creates a fully specified document. */
    public Document(String id, String content, Map<String, Object> metadata) {
        this.id = id;
        this.content = content;
        this.metadata = metadata;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public Map<String, Object> getMetadata() {
        return metadata;
    }

    public void setMetadata(Map<String, Object> metadata) {
        this.metadata = metadata;
    }

    public List<Double> getEmbedding() {
        return embedding;
    }

    public void setEmbedding(List<Double> embedding) {
        this.embedding = embedding;
    }
}
// SearchRequest类
/**
 * Fluent configuration of a similarity search: query text, result count
 * (top-k), minimum similarity score, and an optional metadata filter.
 */
public class SearchRequest {

    private String query;
    private int topK = 4;                     // default number of results
    private double similarityThreshold = 0.0; // 0.0 = no score cutoff
    private Filter filterExpression;

    // Instances are built through the query(...) factory. This constructor
    // was missing in the original, so query(...) did not compile.
    private SearchRequest(String query) {
        this.query = query;
    }

    /**
     * Entry point of the fluent API.
     *
     * @param query natural-language query text
     * @return a new request with default top-k and threshold
     */
    public static SearchRequest query(String query) {
        return new SearchRequest(query);
    }

    /** Sets the maximum number of results to return. */
    public SearchRequest withTopK(int topK) {
        this.topK = topK;
        return this;
    }

    /** Sets the minimum similarity score a result must reach. */
    public SearchRequest withSimilarityThreshold(double threshold) {
        this.similarityThreshold = threshold;
        return this;
    }

    /** Restricts results with a metadata filter expression. */
    public SearchRequest withFilterExpression(Filter filter) {
        this.filterExpression = filter;
        return this;
    }

    public String getQuery() { return query; }
    public int getTopK() { return topK; }
    public double getSimilarityThreshold() { return similarityThreshold; }
    public Filter getFilterExpression() { return filterExpression; }
}
5.2 向量数据库集成
5.2.1 Chroma数据库集成
// ChromaVectorStoreConfig.java
package com.example.springai.config;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.vectorstore.ChromaVectorStore;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;
/**
 * Vector-store wiring for the Chroma backend; active only under the
 * "chroma" Spring profile.
 */
@Configuration
@Profile("chroma")
public class ChromaVectorStoreConfig {

    // Chroma connection settings for the local development instance.
    private static final String CHROMA_HOST = "localhost";
    private static final int CHROMA_PORT = 8000;
    private static final String COLLECTION_NAME = "spring_ai_docs";

    /** Builds the Chroma-backed vector store over the given embedding model. */
    @Bean
    public ChromaVectorStore chromaVectorStore(EmbeddingModel embeddingModel) {
        return ChromaVectorStore.builder()
                .embeddingModel(embeddingModel)
                .host(CHROMA_HOST)
                .port(CHROMA_PORT)
                .collectionName(COLLECTION_NAME)
                .build();
    }
}
// ChromaVectorService.java
package com.example.springai.service;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.ChromaVectorStore;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.stereotype.Service;
import java.util.List;
import java.util.Map;
/**
 * High-level operations over the Chroma vector store: indexing, semantic
 * search, metadata-filtered search and deletion.
 */
@Service
public class ChromaVectorService {

    private final ChromaVectorStore vectorStore;

    public ChromaVectorService(ChromaVectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    /** Indexes the given documents in the vector store. */
    public void addDocuments(List<Document> documents) {
        vectorStore.add(documents);
    }

    /** Semantic search returning at most {@code topK} documents scoring >= 0.7. */
    public List<Document> semanticSearch(String query, int topK) {
        return vectorStore.similaritySearch(
                SearchRequest.query(query)
                        .withTopK(topK)
                        .withSimilarityThreshold(0.7));
    }

    /** Semantic search (top 10) restricted by equality filters on metadata. */
    public List<Document> searchWithMetadata(String query, Map<String, Object> metadataFilter) {
        SearchRequest request = SearchRequest.query(query)
                .withTopK(10)
                .withFilterExpression(buildMetadataFilter(metadataFilter));
        return vectorStore.similaritySearch(request);
    }

    /** Deletes documents by id; returns false when the store reports no result. */
    public boolean deleteDocuments(List<String> documentIds) {
        return vectorStore.delete(documentIds).orElse(false);
    }

    /** Translates a key/value map into a conjunction of equality filters. */
    private Filter buildMetadataFilter(Map<String, Object> metadataFilter) {
        Filter.Builder builder = new Filter.Builder();
        metadataFilter.forEach(builder::eq);
        return builder.build();
    }
}
5.2.2 Pinecone数据库集成
// PineconeVectorStoreConfig.java
package com.example.springai.config;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.vectorstore.PineconeVectorStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;
/**
 * Vector-store wiring for the Pinecone backend; active only under the
 * "pinecone" Spring profile. Credentials and index coordinates are read
 * from spring.ai.vectorstore.pinecone.* configuration properties.
 */
@Configuration
@Profile("pinecone")
public class PineconeVectorStoreConfig {

    // Pinecone API key — keep this out of source control; it is injected
    // from external configuration.
    @Value("${spring.ai.vectorstore.pinecone.api-key}")
    private String apiKey;

    @Value("${spring.ai.vectorstore.pinecone.environment}")
    private String environment;

    @Value("${spring.ai.vectorstore.pinecone.project-id}")
    private String projectId;

    @Value("${spring.ai.vectorstore.pinecone.index-name}")
    private String indexName;

    /** Builds the Pinecone-backed vector store over the given embedding model. */
    @Bean
    public PineconeVectorStore pineconeVectorStore(EmbeddingModel embeddingModel) {
        return PineconeVectorStore.builder()
                .apiKey(apiKey)
                .environment(environment)
                .projectId(projectId)
                .indexName(indexName)
                .embeddingModel(embeddingModel)
                .build();
    }
}
// PineconeVectorService.java
package com.example.springai.service;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.PineconeVectorStore;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.stereotype.Service;
import java.util.List;
import java.util.Map;
/**
 * Pinecone-backed vector operations: rate-limited batch indexing and
 * similarity search driven by {@link SearchOptions}.
 */
@Service
public class PineconeVectorService {

    private final PineconeVectorStore vectorStore;

    public PineconeVectorService(PineconeVectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    /**
     * Adds documents in batches of {@code batchSize}, pausing briefly
     * between batches to stay under Pinecone API rate limits.
     *
     * @throws RuntimeException if the thread is interrupted while pausing
     *                          (the interrupt flag is restored first)
     */
    public void batchAddDocuments(List<Document> documents, int batchSize) {
        for (int i = 0; i < documents.size(); i += batchSize) {
            int endIndex = Math.min(i + batchSize, documents.size());
            vectorStore.add(documents.subList(i, endIndex));
            // Throttle between batches; no need to sleep after the final one.
            if (endIndex < documents.size()) {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("批量添加被中断", e);
                }
            }
        }
    }

    /**
     * Similarity search configured by {@link SearchOptions}; when a
     * namespace is set, results are filtered by a "namespace" metadata
     * equality.
     */
    public List<Document> advancedSearch(String query, SearchOptions options) {
        // SearchRequest is a fluent value object (there is no separate
        // Builder type — the original referenced a nonexistent
        // SearchRequest.Builder); each with* call returns the request, so
        // keep the reference updated.
        SearchRequest request = SearchRequest.query(query)
                .withTopK(options.getTopK())
                .withSimilarityThreshold(options.getSimilarityThreshold());
        if (options.getNamespace() != null) {
            request = request.withFilterExpression(
                    new Filter.Builder().eq("namespace", options.getNamespace()).build());
        }
        return vectorStore.similaritySearch(request);
    }

    /** Fluent holder for advanced-search parameters. */
    public static class SearchOptions {
        private int topK = 5;
        private double similarityThreshold = 0.7;
        private String namespace;

        public int getTopK() { return topK; }
        public SearchOptions setTopK(int topK) { this.topK = topK; return this; }
        public double getSimilarityThreshold() { return similarityThreshold; }
        public SearchOptions setSimilarityThreshold(double threshold) {
            this.similarityThreshold = threshold;
            return this;
        }
        public String getNamespace() { return namespace; }
        public SearchOptions setNamespace(String namespace) {
            this.namespace = namespace;
            return this;
        }
    }
}
5.2.3 Redis向量数据库集成
// RedisVectorStoreConfig.java
package com.example.springai.config;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.vectorstore.RedisVectorStore;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.core.RedisTemplate;
/**
 * Vector-store wiring for the Redis backend; active only under the
 * "redis" Spring profile. Provides the Redis connection, a generic
 * RedisTemplate, and the RedisVectorStore bean.
 */
@Configuration
@Profile("redis")
public class RedisVectorStoreConfig {

    /** Lettuce connection to a local Redis instance (localhost:6379). */
    @Bean
    public RedisConnectionFactory redisConnectionFactory() {
        return new LettuceConnectionFactory("localhost", 6379);
    }

    /** General-purpose RedisTemplate bound to the connection factory. */
    @Bean
    public RedisTemplate<String, Object> redisTemplate(RedisConnectionFactory connectionFactory) {
        RedisTemplate<String, Object> template = new RedisTemplate<>();
        template.setConnectionFactory(connectionFactory);
        return template;
    }

    /**
     * RedisVectorStore over the template: documents are stored under keys
     * prefixed with "doc:" and indexed in "spring_ai_index".
     */
    @Bean
    public RedisVectorStore redisVectorStore(
            EmbeddingModel embeddingModel,
            RedisTemplate<String, Object> redisTemplate
    ) {
        return RedisVectorStore.builder()
                .embeddingModel(embeddingModel)
                .redisTemplate(redisTemplate)
                .indexName("spring_ai_index")
                .prefix("doc:")
                .build();
    }
}
// RedisVectorService.java
package com.example.springai.service;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.RedisVectorStore;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.stereotype.Service;
import java.util.List;
import java.util.Map;
/**
 * Redis-backed vector operations: TTL-tagged indexing, tag- and
 * range-filtered search, and expiry-based cleanup.
 */
@Service
public class RedisVectorService {

    private final RedisVectorStore vectorStore;

    public RedisVectorService(RedisVectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    /**
     * Indexes documents after stamping each with "ttl" (seconds) and
     * "created_at" (epoch millis) metadata used later by
     * {@link #cleanupExpiredDocuments()}.
     */
    public void addDocumentsWithTTL(List<Document> documents, long ttlSeconds) {
        documents.forEach(doc -> {
            doc.getMetadata().put("ttl", ttlSeconds);
            doc.getMetadata().put("created_at", System.currentTimeMillis());
        });
        vectorStore.add(documents);
    }

    /**
     * Semantic search (top 10) with one equality filter per tag.
     * NOTE(review): chaining eq("tags", tag) for every tag presumably ANDs
     * the conditions — confirm the Filter.Builder semantics match the
     * intent (any-of vs all-of).
     */
    public List<Document> searchByTags(String query, List<String> tags) {
        Filter.Builder filterBuilder = new Filter.Builder();
        for (String tag : tags) {
            filterBuilder.eq("tags", tag);
        }
        SearchRequest request = SearchRequest.query(query)
                .withTopK(10)
                .withFilterExpression(filterBuilder.build());
        return vectorStore.similaritySearch(request);
    }

    /** Search restricted to documents whose numeric {@code field} lies in [minValue, maxValue]. */
    public List<Document> rangeSearch(String query, String field, double minValue, double maxValue) {
        Filter filter = new Filter.Builder()
                .gte(field, minValue)
                .lte(field, maxValue)
                .build();
        SearchRequest request = SearchRequest.query(query)
                .withTopK(20)
                .withFilterExpression(filter);
        return vectorStore.similaritySearch(request);
    }

    /**
     * Deletes documents whose TTL has elapsed and returns how many were
     * removed.
     * NOTE(review): only the top 1000 matches of the "*" query are
     * examined, so documents beyond that window are never cleaned up —
     * consider paging through the store instead.
     */
    public int cleanupExpiredDocuments() {
        long currentTime = System.currentTimeMillis();
        SearchRequest request = SearchRequest.query("*")
                .withTopK(1000);
        List<Document> allDocs = vectorStore.similaritySearch(request);
        List<String> expiredIds = allDocs.stream()
                .filter(doc -> isExpired(doc, currentTime))
                .map(Document::getId)
                .toList();
        if (!expiredIds.isEmpty()) {
            vectorStore.delete(expiredIds);
        }
        return expiredIds.size();
    }

    /** True when the document carries TTL metadata and its lifetime has elapsed. */
    private boolean isExpired(Document doc, long currentTime) {
        Map<String, Object> metadata = doc.getMetadata();
        if (!metadata.containsKey("ttl") || !metadata.containsKey("created_at")) {
            return false;
        }
        // Numeric metadata may round-trip through the store as Integer or
        // Long depending on serialization; widen via Number instead of the
        // original blind (Long) casts, which would throw ClassCastException.
        long createdAt = ((Number) metadata.get("created_at")).longValue();
        long ttl = ((Number) metadata.get("ttl")).longValue();
        return (currentTime - createdAt) > (ttl * 1000);
    }
}
5.3 文档处理与分割
5.3.1 文档加载器
// DocumentLoaderService.java
package com.example.springai.service;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.JsonReader;
import org.springframework.ai.reader.TextReader;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
/**
 * Loads documents from files (text, PDF, JSON), directories and URLs,
 * attaching provenance metadata to each loaded document.
 */
@Service
public class DocumentLoaderService {

    /** Loads a plain-text resource as one or more documents. */
    public List<Document> loadTextFile(Resource resource) {
        TextReader textReader = new TextReader(resource);
        return textReader.get();
    }

    /** Loads a PDF, producing one document per page. */
    public List<Document> loadPdfByPage(Resource resource) {
        PagePdfDocumentReader pdfReader = new PagePdfDocumentReader(resource);
        return pdfReader.get();
    }

    /** Loads a PDF, producing one document per paragraph. */
    public List<Document> loadPdfByParagraph(Resource resource) {
        ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader(resource);
        return pdfReader.get();
    }

    /** Loads a JSON resource, extracting only the given keys when provided. */
    public List<Document> loadJsonFile(Resource resource, String... jsonKeysToUse) {
        JsonReader jsonReader = new JsonReader(resource, jsonKeysToUse);
        return jsonReader.get();
    }

    /**
     * Recursively loads every supported file (.txt, .pdf, .json) under the
     * given directory, tagging each document with source-file metadata.
     * Unsupported files are skipped; per-file failures are logged and do
     * not abort the walk.
     *
     * @throws IOException if the directory itself cannot be traversed
     */
    public List<Document> loadDirectory(Path directoryPath) throws IOException {
        List<Document> allDocuments = new ArrayList<>();
        // Files.walk returns a lazy stream that holds open directory
        // handles; it must be closed (the original leaked it), hence
        // try-with-resources.
        try (var paths = Files.walk(directoryPath)) {
            paths.filter(Files::isRegularFile)
                 .forEach(filePath -> {
                     try {
                         String fileName = filePath.getFileName().toString().toLowerCase();
                         Resource resource = new org.springframework.core.io.FileSystemResource(filePath);
                         List<Document> documents;
                         if (fileName.endsWith(".txt")) {
                             documents = loadTextFile(resource);
                         } else if (fileName.endsWith(".pdf")) {
                             documents = loadPdfByParagraph(resource);
                         } else if (fileName.endsWith(".json")) {
                             documents = loadJsonFile(resource);
                         } else {
                             return; // skip unsupported file types
                         }
                         // Provenance metadata for traceability.
                         documents.forEach(doc -> {
                             doc.getMetadata().put("source_file", fileName);
                             doc.getMetadata().put("file_path", filePath.toString());
                             doc.getMetadata().put("file_size", filePath.toFile().length());
                             doc.getMetadata().put("loaded_at", System.currentTimeMillis());
                         });
                         allDocuments.addAll(documents);
                     } catch (Exception e) {
                         System.err.println("加载文件失败: " + filePath + ", 错误: " + e.getMessage());
                     }
                 });
        }
        return allDocuments;
    }

    /** Creates a document from raw content, filling in default metadata. */
    public Document createDocument(String content, Map<String, Object> metadata) {
        Document document = new Document(content, metadata);
        document.getMetadata().putIfAbsent("created_at", System.currentTimeMillis());
        document.getMetadata().putIfAbsent("content_length", content.length());
        document.getMetadata().putIfAbsent("content_type", "text/plain");
        return document;
    }

    /**
     * Fetches a document from a URL and tags it with source metadata;
     * returns an empty list on any failure.
     * NOTE(review): binding the response straight to a Spring Resource via
     * RestTemplate is unusual — confirm a suitable HttpMessageConverter is
     * registered for Resource.
     */
    public List<Document> loadFromUrl(String url) {
        try {
            Resource resource = new org.springframework.web.client.RestTemplate()
                    .getForObject(url, org.springframework.core.io.Resource.class);
            if (resource != null) {
                TextReader textReader = new TextReader(resource);
                List<Document> documents = textReader.get();
                documents.forEach(doc -> {
                    doc.getMetadata().put("source_url", url);
                    doc.getMetadata().put("loaded_from", "url");
                });
                return documents;
            }
        } catch (Exception e) {
            System.err.println("从URL加载文档失败: " + url + ", 错误: " + e.getMessage());
        }
        return new ArrayList<>();
    }
}
5.3.2 文档分割器
// DocumentSplitterService.java
package com.example.springai.service;
import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TextSplitter;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.stereotype.Service;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
/**
 * Splits documents into embedding-sized chunks using several strategies:
 * token-based (via TokenTextSplitter), character-count, paragraph,
 * sentence, and a size-aware "smart" paragraph packer. Every chunk keeps
 * its parent document id and split strategy in metadata.
 */
@Service
public class DocumentSplitterService {

    /**
     * Token-based splitting delegated to Spring AI's TokenTextSplitter.
     * NOTE(review): confirm the two-arg constructor (chunk size, overlap)
     * matches the TokenTextSplitter version in use.
     */
    public List<Document> splitByTokens(List<Document> documents, int chunkSize, int chunkOverlap) {
        TokenTextSplitter splitter = new TokenTextSplitter(chunkSize, chunkOverlap);
        return splitter.apply(documents);
    }

    /**
     * Character-count splitting with overlap; chunk boundaries snap to the
     * nearest preceding space where possible.
     */
    public List<Document> splitByCharacters(List<Document> documents, int chunkSize, int chunkOverlap) {
        List<Document> splitDocuments = new ArrayList<>();
        for (Document doc : documents) {
            splitDocuments.addAll(splitDocumentByCharacters(doc, chunkSize, chunkOverlap));
        }
        return splitDocuments;
    }

    /** One chunk per non-empty paragraph (paragraphs separated by blank lines). */
    public List<Document> splitByParagraphs(List<Document> documents) {
        List<Document> splitDocuments = new ArrayList<>();
        for (Document doc : documents) {
            String[] paragraphs = doc.getContent().split("\\n\\s*\\n");
            for (int i = 0; i < paragraphs.length; i++) {
                String paragraph = paragraphs[i].trim();
                if (!paragraph.isEmpty()) {
                    Map<String, Object> metadata = new java.util.HashMap<>(doc.getMetadata());
                    metadata.put("chunk_index", i);
                    metadata.put("chunk_type", "paragraph");
                    metadata.put("parent_id", doc.getId());
                    splitDocuments.add(new Document(paragraph, metadata));
                }
            }
        }
        return splitDocuments;
    }

    /**
     * Groups sentences into chunks of {@code sentencesPerChunk}.
     * NOTE(review): the original terminators (!, ?, ...) are replaced by
     * ". " when sentences are re-joined, so the chunk text is not
     * byte-identical to the source — acceptable for embedding, but verify
     * downstream consumers don't rely on exact text.
     */
    public List<Document> splitBySentences(List<Document> documents, int sentencesPerChunk) {
        List<Document> splitDocuments = new ArrayList<>();
        Pattern sentencePattern = Pattern.compile("[.!?]+\\s*");
        for (Document doc : documents) {
            String[] sentences = sentencePattern.split(doc.getContent());
            for (int i = 0; i < sentences.length; i += sentencesPerChunk) {
                StringBuilder chunkContent = new StringBuilder();
                int endIndex = Math.min(i + sentencesPerChunk, sentences.length);
                for (int j = i; j < endIndex; j++) {
                    if (j > i) chunkContent.append(". ");
                    chunkContent.append(sentences[j].trim());
                }
                if (chunkContent.length() > 0) {
                    Map<String, Object> metadata = new java.util.HashMap<>(doc.getMetadata());
                    metadata.put("chunk_index", i / sentencesPerChunk);
                    metadata.put("chunk_type", "sentences");
                    metadata.put("sentence_count", endIndex - i);
                    metadata.put("parent_id", doc.getId());
                    splitDocuments.add(new Document(chunkContent.toString(), metadata));
                }
            }
        }
        return splitDocuments;
    }

    /**
     * Size-aware splitting: packs whole paragraphs into chunks of at most
     * {@code options.getMaxChunkSize()} characters.
     */
    public List<Document> smartSplit(List<Document> documents, SplitOptions options) {
        List<Document> result = new ArrayList<>();
        for (Document doc : documents) {
            result.addAll(smartSplitDocument(doc, options));
        }
        return result;
    }

    /**
     * Packs the paragraphs of one document into size-bounded chunks. A
     * single paragraph longer than the limit still becomes its own chunk.
     */
    private List<Document> smartSplitDocument(Document doc, SplitOptions options) {
        String content = doc.getContent();
        List<Document> chunks = new ArrayList<>();
        String[] paragraphs = content.split("\\n\\s*\\n");
        StringBuilder currentChunk = new StringBuilder();
        int chunkIndex = 0;
        for (String paragraph : paragraphs) {
            paragraph = paragraph.trim();
            if (paragraph.isEmpty()) continue;
            // Flush the current chunk when adding this paragraph would
            // exceed the size limit.
            if (currentChunk.length() + paragraph.length() > options.getMaxChunkSize() &&
                    currentChunk.length() > 0) {
                chunks.add(createChunk(currentChunk.toString(), doc, chunkIndex++, "smart"));
                currentChunk = new StringBuilder();
            }
            if (currentChunk.length() > 0) {
                currentChunk.append("\n\n");
            }
            currentChunk.append(paragraph);
        }
        // Flush the trailing chunk.
        if (currentChunk.length() > 0) {
            chunks.add(createChunk(currentChunk.toString(), doc, chunkIndex, "smart"));
        }
        return chunks;
    }

    /**
     * Character-window splitting of one document: windows of at most
     * {@code chunkSize} chars, overlapping by {@code chunkOverlap}, with
     * the right edge snapped back to a space when possible.
     */
    private List<Document> splitDocumentByCharacters(Document doc, int chunkSize, int chunkOverlap) {
        List<Document> chunks = new ArrayList<>();
        String content = doc.getContent();
        int start = 0;
        int chunkIndex = 0;
        while (start < content.length()) {
            int end = Math.min(start + chunkSize, content.length());
            // Snap to a word boundary unless we are at the end of the text.
            if (end < content.length()) {
                int lastSpace = content.lastIndexOf(' ', end);
                if (lastSpace > start) {
                    end = lastSpace;
                }
            }
            String chunkContent = content.substring(start, end).trim();
            if (!chunkContent.isEmpty()) {
                chunks.add(createChunk(chunkContent, doc, chunkIndex++, "character"));
            }
            // Once the final window has been emitted, stop: the original
            // "start = end - chunkOverlap" stepped back from the end and
            // never terminated when chunkOverlap > 0.
            if (end == content.length()) {
                break;
            }
            // Overlap with the previous window, but always advance by at
            // least one character so forward progress is guaranteed even
            // when boundary snapping shrinks the window.
            start = Math.max(end - chunkOverlap, start + 1);
        }
        return chunks;
    }

    /**
     * Builds a chunk document inheriting the parent's metadata plus chunk
     * bookkeeping fields.
     */
    private Document createChunk(String content, Document originalDoc, int chunkIndex, String splitType) {
        Map<String, Object> metadata = new java.util.HashMap<>(originalDoc.getMetadata());
        metadata.put("chunk_index", chunkIndex);
        metadata.put("chunk_type", splitType);
        metadata.put("parent_id", originalDoc.getId());
        metadata.put("chunk_size", content.length());
        return new Document(content, metadata);
    }

    /**
     * Tuning knobs for {@link #smartSplit}.
     * NOTE(review): only maxChunkSize is currently consulted by
     * smartSplitDocument; chunkOverlap and the preserve* flags are unused.
     */
    public static class SplitOptions {
        private int maxChunkSize = 1000;
        private int chunkOverlap = 200;
        private boolean preserveParagraphs = true;
        private boolean preserveSentences = true;

        public int getMaxChunkSize() { return maxChunkSize; }
        public SplitOptions setMaxChunkSize(int maxChunkSize) {
            this.maxChunkSize = maxChunkSize;
            return this;
        }
        public int getChunkOverlap() { return chunkOverlap; }
        public SplitOptions setChunkOverlap(int chunkOverlap) {
            this.chunkOverlap = chunkOverlap;
            return this;
        }
        public boolean isPreserveParagraphs() { return preserveParagraphs; }
        public SplitOptions setPreserveParagraphs(boolean preserveParagraphs) {
            this.preserveParagraphs = preserveParagraphs;
            return this;
        }
        public boolean isPreserveSentences() { return preserveSentences; }
        public SplitOptions setPreserveSentences(boolean preserveSentences) {
            this.preserveSentences = preserveSentences;
            return this;
        }
    }
}
5.3.3 文档增强器
// DocumentEnhancerService.java
package com.example.springai.service;
import org.springframework.ai.document.Document;
import org.springframework.stereotype.Service;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
 * Enriches documents with derived metadata: content statistics, extracted
 * entities (emails, URLs, dates, numbers), a naive summary, a rough
 * language guess, and keyword tags.
 */
@Service
public class DocumentEnhancerService {

    // Patterns are compiled once: Pattern instances are immutable and
    // thread-safe, so caching them avoids recompiling on every document.
    // The email TLD class is [A-Za-z] — the original [A-Z|a-z] accidentally
    // accepted a literal '|' character.
    private static final Pattern EMAIL_PATTERN =
            Pattern.compile("\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b");
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[\\w\\-._~:/?#\\[\\]@!$&'()*+,;=%]+");
    private static final Pattern DATE_PATTERN =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2}|\\d{2}/\\d{2}/\\d{4}|\\d{2}-\\d{2}-\\d{4}");
    private static final Pattern NUMBER_PATTERN =
            Pattern.compile("\\b\\d+(?:\\.\\d+)?\\b");

    /** Keyword list used for naive tag extraction. */
    private static final String[] TAG_KEYWORDS = {
            "技术", "编程", "开发", "算法", "数据", "AI", "机器学习", "深度学习",
            "technology", "programming", "development", "algorithm", "data", "artificial intelligence"
    };

    /** Enhances each document in the list. */
    public List<Document> enhanceDocuments(List<Document> documents) {
        return documents.stream()
                .map(this::enhanceDocument)
                .toList();
    }

    /**
     * Returns a document with the same id/content whose metadata has been
     * enriched. The input document's metadata map is mutated in place and
     * reused by the returned document.
     */
    public Document enhanceDocument(Document document) {
        String content = document.getContent();
        Map<String, Object> metadata = document.getMetadata();
        // Content statistics (counts of chars/words/sentences/paragraphs/lines).
        addContentStatistics(content, metadata);
        // Extracted entities (emails, URLs, dates, numbers).
        extractKeyInformation(content, metadata);
        // Processing timestamp.
        metadata.put("enhanced_at", LocalDateTime.now().format(DateTimeFormatter.ISO_LOCAL_DATE_TIME));
        // Naive summary, language guess and keyword tags.
        metadata.put("summary", generateSummary(content));
        metadata.put("language", detectLanguage(content));
        metadata.put("tags", extractTags(content));
        return new Document(document.getId(), content, metadata);
    }

    /** Adds character/word/sentence/paragraph/line counts to the metadata. */
    private void addContentStatistics(String content, Map<String, Object> metadata) {
        metadata.put("character_count", content.length());
        metadata.put("word_count", countWords(content));
        metadata.put("sentence_count", countSentences(content));
        metadata.put("paragraph_count", countParagraphs(content));
        metadata.put("line_count", countLines(content));
    }

    /** Adds extracted emails/URLs/dates/numbers when any are found. */
    private void extractKeyInformation(String content, Map<String, Object> metadata) {
        List<String> emails = extractEmails(content);
        if (!emails.isEmpty()) {
            metadata.put("emails", emails);
        }
        List<String> urls = extractUrls(content);
        if (!urls.isEmpty()) {
            metadata.put("urls", urls);
        }
        List<String> dates = extractDates(content);
        if (!dates.isEmpty()) {
            metadata.put("dates", dates);
        }
        List<String> numbers = extractNumbers(content);
        if (!numbers.isEmpty()) {
            metadata.put("numbers", numbers);
        }
    }

    /** Naive summary: the first ~150 characters, cut at a word boundary. */
    private String generateSummary(String content) {
        if (content.length() <= 150) {
            return content;
        }
        String summary = content.substring(0, 150);
        int lastSpace = summary.lastIndexOf(' ');
        if (lastSpace > 0) {
            summary = summary.substring(0, lastSpace);
        }
        return summary + "...";
    }

    /**
     * Very rough language guess: "zh" when CJK Unified Ideographs outnumber
     * ASCII letters, "en" when any ASCII letters exist, else "unknown".
     */
    private String detectLanguage(String content) {
        long chineseChars = content.chars()
                .filter(ch -> ch >= 0x4E00 && ch <= 0x9FFF)
                .count();
        long englishChars = content.chars()
                .filter(ch -> (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z'))
                .count();
        if (chineseChars > englishChars) {
            return "zh";
        } else if (englishChars > 0) {
            return "en";
        } else {
            return "unknown";
        }
    }

    /** Naive tag extraction: keywords found in the content (case-insensitive). */
    private List<String> extractTags(String content) {
        // Lower-case the content once instead of once per keyword.
        String lowered = content.toLowerCase();
        return java.util.Arrays.stream(TAG_KEYWORDS)
                .filter(keyword -> lowered.contains(keyword.toLowerCase()))
                .distinct()
                .toList();
    }

    // --- Counting helpers ---

    private int countWords(String content) {
        return content.trim().isEmpty() ? 0 : content.trim().split("\\s+").length;
    }

    private int countSentences(String content) {
        return content.split("[.!?]+").length;
    }

    private int countParagraphs(String content) {
        return content.split("\\n\\s*\\n").length;
    }

    private int countLines(String content) {
        return content.split("\\n").length;
    }

    // --- Extraction helpers ---

    private List<String> extractEmails(String content) {
        Matcher matcher = EMAIL_PATTERN.matcher(content);
        return matcher.results()
                .map(result -> result.group())
                .distinct()
                .toList();
    }

    private List<String> extractUrls(String content) {
        Matcher matcher = URL_PATTERN.matcher(content);
        return matcher.results()
                .map(result -> result.group())
                .distinct()
                .toList();
    }

    private List<String> extractDates(String content) {
        Matcher matcher = DATE_PATTERN.matcher(content);
        return matcher.results()
                .map(result -> result.group())
                .distinct()
                .toList();
    }

    private List<String> extractNumbers(String content) {
        Matcher matcher = NUMBER_PATTERN.matcher(content);
        return matcher.results()
                .map(result -> result.group())
                .distinct()
                .limit(10) // cap the number of extracted values
                .toList();
    }
}
5.4 RAG(检索增强生成)实现
5.4.1 RAG服务核心
// RAGService.java
package com.example.springai.service;
import org.springframework.ai.chat.ChatModel;
import org.springframework.ai.chat.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
@Service
public class RAGService {
private final VectorStore vectorStore;
private final ChatModel chatModel;
private final PromptTemplateService promptTemplateService;
public RAGService(
VectorStore vectorStore,
ChatModel chatModel,
PromptTemplateService promptTemplateService
) {
this.vectorStore = vectorStore;
this.chatModel = chatModel;
this.promptTemplateService = promptTemplateService;
}
/**
* 基础RAG查询
*/
public RAGResponse query(String question) {
return query(question, new RAGOptions());
}
/**
* 带选项的RAG查询
*/
public RAGResponse query(String question, RAGOptions options) {
// 1. 检索相关文档
List<Document> relevantDocs = retrieveRelevantDocuments(question, options);
// 2. 构建上下文
String context = buildContext(relevantDocs, options);
// 3. 生成提示词
String prompt = buildRAGPrompt(question, context, options);
// 4. 调用LLM生成回答
ChatResponse response = chatModel.call(new Prompt(prompt));
String answer = response.getResult().getOutput().getContent();
// 5. 构建响应
return new RAGResponse(
question,
answer,
relevantDocs,
context,
response.getMetadata()
);
}
/**
* 检索相关文档
*/
private List<Document> retrieveRelevantDocuments(String question, RAGOptions options) {
SearchRequest searchRequest = SearchRequest.query(question)
.withTopK(options.getTopK())
.withSimilarityThreshold(options.getSimilarityThreshold());
// 添加过滤器
if (options.getMetadataFilter() != null) {
searchRequest = searchRequest.withFilterExpression(options.getMetadataFilter());
}
return vectorStore.similaritySearch(searchRequest);
}
/**
* 构建上下文
*/
private String buildContext(List<Document> documents, RAGOptions options) {
if (documents.isEmpty()) {
return "没有找到相关信息。";
}
StringBuilder contextBuilder = new StringBuilder();
for (int i = 0; i < documents.size(); i++) {
Document doc = documents.get(i);
contextBuilder.append("[文档 ").append(i + 1).append("]\n");
// 添加元数据信息(如果需要)
if (options.isIncludeMetadata()) {
Map<String, Object> metadata = doc.getMetadata();
if (metadata.containsKey("source_file")) {
contextBuilder.append("来源:").append(metadata.get("source_file")).append("\n");
}
}
contextBuilder.append(doc.getContent());
if (i < documents.size() - 1) {
contextBuilder.append("\n\n");
}
}
return contextBuilder.toString();
}
/**
* 构建RAG提示词
*/
private String buildRAGPrompt(String question, String context, RAGOptions options) {
String template = """
你是一个专业的AI助手。请基于以下提供的上下文信息来回答用户的问题。
上下文信息:
{context}
用户问题:{question}
回答要求:
1. 仅基于提供的上下文信息回答
2. 如果上下文中没有相关信息,请明确说明
3. 保持回答的准确性和客观性
4. 如果可能,请引用具体的文档来源
请提供详细的回答:
""";
PromptTemplate promptTemplate = new PromptTemplate(template);
Map<String, Object> variables = Map.of(
"context", context,
"question", question
);
return promptTemplate.create(variables).getContents();
}
/**
* 多轮对话RAG
*/
public RAGResponse conversationalQuery(String question, List<ConversationTurn> conversationHistory, RAGOptions options) {
// 构建对话上下文
String conversationContext = buildConversationContext(conversationHistory);
// 结合对话历史重写查询
String rewrittenQuery = rewriteQueryWithContext(question, conversationContext);
// 执行RAG查询
RAGResponse response = query(rewrittenQuery, options);
// 更新响应以包含对话上下文
response.setConversationContext(conversationContext);
response.setRewrittenQuery(rewrittenQuery);
return response;
}
/**
* 构建对话上下文
*/
private String buildConversationContext(List<ConversationTurn> history) {
return history.stream()
.map(turn -> "用户:" + turn.getQuestion() + "\n助手:" + turn.getAnswer())
.collect(Collectors.joining("\n\n"));
}
/**
* 基于上下文重写查询
*/
private String rewriteQueryWithContext(String question, String conversationContext) {
if (conversationContext.isEmpty()) {
return question;
}
String template = """
基于以下对话历史,请重写用户的最新问题,使其更加完整和明确:
对话历史:
{context}
最新问题:{question}
重写后的问题(保持简洁明确):
""";
PromptTemplate promptTemplate = new PromptTemplate(template);
Map<String, Object> variables = Map.of(
"context", conversationContext,
"question", question
);
String prompt = promptTemplate.create(variables).getContents();
ChatResponse response = chatModel.call(new Prompt(prompt));
return response.getResult().getOutput().getContent().trim();
}
/**
* RAG选项
*/
public static class RAGOptions {
private int topK = 5;
private double similarityThreshold = 0.7;
private boolean includeMetadata = false;
private Filter metadataFilter;
// Getters and Setters
public int getTopK() { return topK; }
public RAGOptions setTopK(int topK) { this.topK = topK; return this; }
public double getSimilarityThreshold() { return similarityThreshold; }
public RAGOptions setSimilarityThreshold(double threshold) {
this.similarityThreshold = threshold;
return this;
}
public boolean isIncludeMetadata() { return includeMetadata; }
public RAGOptions setIncludeMetadata(boolean includeMetadata) {
this.includeMetadata = includeMetadata;
return this;
}
public Filter getMetadataFilter() { return metadataFilter; }
public RAGOptions setMetadataFilter(Filter metadataFilter) {
this.metadataFilter = metadataFilter;
return this;
}
}
/**
* RAG响应
*/
/**
 * Result of a RAG query: the generated answer together with the retrieval
 * context it was grounded on. The core fields are set once via the
 * constructor; the two conversational fields are filled in later by the
 * multi-turn flow, hence their setters.
 */
public static class RAGResponse {

    private String question;
    private String answer;
    private List<Document> sourceDocuments;
    private String context;
    private Map<String, Object> metadata;
    // Populated only for multi-turn conversations.
    private String conversationContext;
    // Populated only when query rewriting was applied.
    private String rewrittenQuery;

    public RAGResponse(String question, String answer, List<Document> sourceDocuments,
                       String context, Map<String, Object> metadata) {
        this.question = question;
        this.answer = answer;
        this.sourceDocuments = sourceDocuments;
        this.context = context;
        this.metadata = metadata;
    }

    public String getQuestion() {
        return question;
    }

    public String getAnswer() {
        return answer;
    }

    public List<Document> getSourceDocuments() {
        return sourceDocuments;
    }

    public String getContext() {
        return context;
    }

    public Map<String, Object> getMetadata() {
        return metadata;
    }

    public String getConversationContext() {
        return conversationContext;
    }

    public void setConversationContext(String conversationContext) {
        this.conversationContext = conversationContext;
    }

    public String getRewrittenQuery() {
        return rewrittenQuery;
    }

    public void setRewrittenQuery(String rewrittenQuery) {
        this.rewrittenQuery = rewrittenQuery;
    }
}
/**
* 对话轮次
*/
/**
 * One question/answer exchange within a conversation. Immutable: both fields
 * are assigned at construction time and only exposed through getters.
 */
public static class ConversationTurn {

    private String question;
    private String answer;

    public ConversationTurn(String question, String answer) {
        this.question = question;
        this.answer = answer;
    }

    public String getQuestion() {
        return question;
    }

    public String getAnswer() {
        return answer;
    }
}
}
5.4.2 高级RAG功能
// AdvancedRAGService.java
package com.example.springai.service;
import org.springframework.ai.chat.ChatModel;
import org.springframework.ai.chat.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.document.Document;
import org.springframework.stereotype.Service;
import java.util.*;
import java.util.stream.Collectors;
@Service
public class AdvancedRAGService {

    private final RAGService ragService;
    private final ChatModel chatModel;

    public AdvancedRAGService(RAGService ragService, ChatModel chatModel) {
        this.ragService = ragService;
        this.chatModel = chatModel;
    }

    /**
     * Hybrid retrieval: combines semantic (vector) search with keyword search,
     * merges and re-ranks the two result sets, then generates a single answer
     * grounded on the merged context.
     *
     * @param question user question
     * @param options  per-strategy retrieval parameters
     * @return a RAGResponse whose metadata marks the search type as "hybrid"
     */
    public RAGService.RAGResponse hybridSearch(String question, HybridSearchOptions options) {
        // 1. Semantic retrieval via the underlying RAG service.
        RAGService.RAGOptions semanticOptions = new RAGService.RAGOptions()
                .setTopK(options.getSemanticTopK())
                .setSimilarityThreshold(options.getSemanticThreshold());
        RAGService.RAGResponse semanticResponse = ragService.query(question, semanticOptions);
        List<Document> semanticDocs = semanticResponse.getSourceDocuments();
        // 2. Keyword retrieval (placeholder implementation, see below).
        List<Document> keywordDocs = performKeywordSearch(question, options);
        // 3. Merge, deduplicate and re-rank.
        List<Document> mergedDocs = mergeAndRerankDocuments(semanticDocs, keywordDocs, question);
        // 4. Generate the final answer from the merged context.
        String context = buildHybridContext(mergedDocs);
        String answer = generateAnswerWithHybridContext(question, context);
        return new RAGService.RAGResponse(
                question,
                answer,
                mergedDocs,
                context,
                Map.of("search_type", "hybrid")
        );
    }

    /**
     * Multi-step reasoning RAG: iteratively decomposes a complex question into
     * sub-questions, answers each one, and synthesizes a final answer.
     *
     * @param question the original (possibly complex) question
     * @param maxSteps upper bound on reasoning iterations; if the question is
     *                 never judged "complete" within this budget, the final
     *                 answer is synthesized from the partial steps
     */
    public MultiStepRAGResponse multiStepReasoning(String question, int maxSteps) {
        List<ReasoningStep> steps = new ArrayList<>();
        String currentQuestion = question;
        for (int step = 0; step < maxSteps; step++) {
            // Decide whether the current question can be answered directly.
            QuestionAnalysis analysis = analyzeQuestion(currentQuestion);
            if (analysis.isComplete()) {
                // Answerable as-is: record the final step and stop.
                RAGService.RAGResponse response = ragService.query(currentQuestion);
                steps.add(new ReasoningStep(step + 1, currentQuestion, response.getAnswer(), true));
                break;
            } else {
                // Needs decomposition: answer the next sub-question first.
                String subQuestion = analysis.getNextSubQuestion();
                RAGService.RAGResponse subResponse = ragService.query(subQuestion);
                steps.add(new ReasoningStep(step + 1, subQuestion, subResponse.getAnswer(), false));
                // Fold the sub-answer back into the question context.
                currentQuestion = updateQuestionWithAnswer(currentQuestion, subQuestion, subResponse.getAnswer());
            }
        }
        // Synthesize the final answer from all recorded steps.
        String finalAnswer = synthesizeFinalAnswer(question, steps);
        return new MultiStepRAGResponse(question, finalAnswer, steps);
    }

    /**
     * Adaptive retrieval: chooses retrieval parameters based on the estimated
     * question complexity, then widens the search once if the first answer
     * scores below the quality threshold.
     */
    public RAGService.RAGResponse adaptiveRetrieval(String question) {
        // 1. Estimate question complexity.
        QuestionComplexity complexity = assessQuestionComplexity(question);
        // 2. Pick a retrieval strategy for that complexity.
        RAGService.RAGOptions options = new RAGService.RAGOptions();
        switch (complexity) {
            case SIMPLE:
                options.setTopK(3).setSimilarityThreshold(0.8);
                break;
            case MEDIUM:
                options.setTopK(5).setSimilarityThreshold(0.7);
                break;
            case COMPLEX:
                options.setTopK(8).setSimilarityThreshold(0.6);
                break;
        }
        // 3. Run the query.
        RAGService.RAGResponse response = ragService.query(question, options);
        // 4. Score the answer.
        double answerQuality = evaluateAnswerQuality(question, response.getAnswer());
        // 5. If quality is insufficient, retry once with a wider net.
        if (answerQuality < 0.7 && complexity != QuestionComplexity.COMPLEX) {
            options.setTopK(options.getTopK() + 3)
                    .setSimilarityThreshold(options.getSimilarityThreshold() - 0.1);
            response = ragService.query(question, options);
        }
        return response;
    }

    /**
     * Keyword search placeholder. A production implementation would delegate
     * to a full-text engine such as Elasticsearch; here we only demonstrate
     * the hook and return no results.
     */
    private List<Document> performKeywordSearch(String question, HybridSearchOptions options) {
        // Keywords would be fed to the search engine; extraction kept to show intent.
        String[] keywords = extractKeywords(question);
        // Empty by design until a real engine is wired in.
        return new ArrayList<>();
    }

    /**
     * Naive keyword extraction: lowercase, strip everything except ASCII
     * alphanumerics, CJK characters and whitespace, then split on whitespace.
     */
    private String[] extractKeywords(String text) {
        return text.toLowerCase()
                .replaceAll("[^a-zA-Z0-9\\u4e00-\\u9fa5\\s]", "")
                .split("\\s+");
    }

    /**
     * Merges semantic and keyword results (semantic first, so it wins ties on
     * duplicate ids), then re-ranks by a simple relevance score and keeps the
     * top 10.
     */
    private List<Document> mergeAndRerankDocuments(List<Document> semanticDocs,
                                                   List<Document> keywordDocs,
                                                   String question) {
        // Merge with id-based deduplication; Set.add returns false on repeats.
        Set<String> seenIds = new HashSet<>();
        List<Document> merged = new ArrayList<>();
        // Semantic results first (higher priority).
        for (Document doc : semanticDocs) {
            if (seenIds.add(doc.getId())) {
                merged.add(doc);
            }
        }
        // Then keyword results.
        for (Document doc : keywordDocs) {
            if (seenIds.add(doc.getId())) {
                merged.add(doc);
            }
        }
        // Re-rank descending by (simplified) relevance score.
        return merged.stream()
                .sorted((d1, d2) -> Double.compare(
                        calculateRelevanceScore(d2, question),
                        calculateRelevanceScore(d1, question)
                ))
                .limit(10)
                .collect(Collectors.toList());
    }

    /**
     * Fraction of question words that occur verbatim in the document content.
     * Returns 0.0 for a blank question: previously a whitespace-only question
     * made split("\\s+") yield a zero-length array, so the division produced
     * NaN and broke the sort comparator in mergeAndRerankDocuments.
     */
    private double calculateRelevanceScore(Document document, String question) {
        String[] questionWords = question.toLowerCase().split("\\s+");
        if (questionWords.length == 0) {
            return 0.0; // guard against 0.0/0 -> NaN
        }
        String content = document.getContent().toLowerCase();
        double score = 0.0;
        for (String word : questionWords) {
            if (content.contains(word)) {
                score += 1.0;
            }
        }
        // Normalize to [0, 1].
        return score / questionWords.length;
    }

    /**
     * Concatenates documents into a numbered context section, separated by
     * blank lines.
     */
    private String buildHybridContext(List<Document> documents) {
        StringBuilder context = new StringBuilder();
        for (int i = 0; i < documents.size(); i++) {
            Document doc = documents.get(i);
            context.append("[文档 ").append(i + 1).append("]\n");
            context.append(doc.getContent());
            if (i < documents.size() - 1) {
                context.append("\n\n");
            }
        }
        return context.toString();
    }

    /**
     * Asks the chat model to answer the question using the merged hybrid
     * context.
     */
    private String generateAnswerWithHybridContext(String question, String context) {
        String template = """
                基于以下混合检索的上下文信息,请回答用户问题:
                上下文:
                {context}
                问题:{question}
                请提供准确、详细的回答:
                """;
        String prompt = template.replace("{context}", context).replace("{question}", question);
        ChatResponse response = chatModel.call(new Prompt(prompt));
        return response.getResult().getOutput().getContent();
    }

    /**
     * Heuristic question analysis: a question containing any "complexity
     * indicator" phrase is treated as decomposable; the sub-question simply
     * asks for a detailed explanation.
     */
    private QuestionAnalysis analyzeQuestion(String question) {
        String[] complexIndicators = {"如何", "为什么", "比较", "分析", "解释", "步骤"};
        boolean isComplex = Arrays.stream(complexIndicators)
                .anyMatch(question::contains);
        return new QuestionAnalysis(question, !isComplex,
                isComplex ? "请详细解释" + question : null);
    }

    /**
     * Appends a resolved sub-question/answer pair to the running question
     * context for the next reasoning iteration.
     */
    private String updateQuestionWithAnswer(String originalQuestion, String subQuestion, String subAnswer) {
        return originalQuestion + "\n\n已知:" + subQuestion + " -> " + subAnswer;
    }

    /**
     * Builds a summary of all reasoning steps and asks the chat model to
     * synthesize a final answer from it.
     */
    private String synthesizeFinalAnswer(String originalQuestion, List<ReasoningStep> steps) {
        StringBuilder synthesis = new StringBuilder();
        synthesis.append("基于多步推理,针对问题:").append(originalQuestion).append("\n\n");
        for (ReasoningStep step : steps) {
            synthesis.append("步骤 ").append(step.getStepNumber()).append(": ")
                    .append(step.getQuestion()).append("\n")
                    .append("答案: ").append(step.getAnswer()).append("\n\n");
        }
        synthesis.append("综合结论:\n");
        // Let the LLM produce the final synthesis.
        String template = """
                基于以下推理步骤,请为原始问题提供综合性的最终答案:
                原始问题:{question}
                推理过程:
                {steps}
                最终答案:
                """;
        String prompt = template.replace("{question}", originalQuestion)
                .replace("{steps}", synthesis.toString());
        ChatResponse response = chatModel.call(new Prompt(prompt));
        return response.getResult().getOutput().getContent();
    }

    /**
     * Heuristic complexity score based on indicator words, question length,
     * and the number of question marks (multiple sub-questions).
     */
    private QuestionComplexity assessQuestionComplexity(String question) {
        int complexityScore = 0;
        // Indicator words suggesting a complex question.
        String[] complexWords = {"如何", "为什么", "比较", "分析", "详细", "步骤", "过程"};
        for (String word : complexWords) {
            if (question.contains(word)) {
                complexityScore++;
            }
        }
        // Longer questions tend to be more complex.
        if (question.length() > 50) complexityScore++;
        if (question.length() > 100) complexityScore++;
        // Several question marks (ASCII or full-width) imply sub-questions.
        if (question.contains("?") || question.contains("?")) {
            long questionMarks = question.chars().filter(ch -> ch == '?' || ch == '?').count();
            if (questionMarks > 1) complexityScore += 2;
        }
        if (complexityScore <= 1) return QuestionComplexity.SIMPLE;
        if (complexityScore <= 3) return QuestionComplexity.MEDIUM;
        return QuestionComplexity.COMPLEX;
    }

    /**
     * Heuristic answer quality score in [0, 1]: rewards length and keyword
     * overlap with the question, penalizes "cannot answer" phrases. The
     * keyword-overlap term is skipped for a blank question, which previously
     * caused a 0/0 division and turned the whole score into NaN (defeating
     * the quality gate in adaptiveRetrieval).
     */
    private double evaluateAnswerQuality(String question, String answer) {
        double score = 0.0;
        // Length contribution.
        if (answer.length() > 50) score += 0.2;
        if (answer.length() > 200) score += 0.2;
        // Keyword-overlap contribution.
        String[] questionWords = question.toLowerCase().split("\\s+");
        String answerLower = answer.toLowerCase();
        if (questionWords.length > 0) {
            int matchedWords = 0;
            for (String word : questionWords) {
                if (answerLower.contains(word)) {
                    matchedWords++;
                }
            }
            score += (double) matchedWords / questionWords.length * 0.4;
        }
        // Penalize phrases that suggest the model could not answer.
        String[] negativeWords = {"不知道", "无法", "没有信息", "不清楚"};
        boolean hasNegative = Arrays.stream(negativeWords)
                .anyMatch(answerLower::contains);
        if (hasNegative) score -= 0.3;
        return Math.max(0.0, Math.min(1.0, score));
    }

    // ---- Inner types ----

    /** Parameters for hybrid search, fluent-builder style. */
    public static class HybridSearchOptions {
        private int semanticTopK = 5;
        private double semanticThreshold = 0.7;
        private int keywordTopK = 5;

        public int getSemanticTopK() {
            return semanticTopK;
        }

        public HybridSearchOptions setSemanticTopK(int topK) {
            this.semanticTopK = topK;
            return this;
        }

        public double getSemanticThreshold() {
            return semanticThreshold;
        }

        public HybridSearchOptions setSemanticThreshold(double threshold) {
            this.semanticThreshold = threshold;
            return this;
        }

        public int getKeywordTopK() {
            return keywordTopK;
        }

        public HybridSearchOptions setKeywordTopK(int topK) {
            this.keywordTopK = topK;
            return this;
        }
    }

    /** Result of analyzing whether a question is directly answerable. */
    public static class QuestionAnalysis {
        private String question;
        private boolean isComplete;
        // Next sub-question to resolve; null when the question is complete.
        private String nextSubQuestion;

        public QuestionAnalysis(String question, boolean isComplete, String nextSubQuestion) {
            this.question = question;
            this.isComplete = isComplete;
            this.nextSubQuestion = nextSubQuestion;
        }

        public String getQuestion() {
            return question;
        }

        public boolean isComplete() {
            return isComplete;
        }

        public String getNextSubQuestion() {
            return nextSubQuestion;
        }
    }

    /** One step of the multi-step reasoning trace. */
    public static class ReasoningStep {
        private int stepNumber;
        private String question;
        private String answer;
        // True when this step produced the direct (final) answer.
        private boolean isFinal;

        public ReasoningStep(int stepNumber, String question, String answer, boolean isFinal) {
            this.stepNumber = stepNumber;
            this.question = question;
            this.answer = answer;
            this.isFinal = isFinal;
        }

        public int getStepNumber() {
            return stepNumber;
        }

        public String getQuestion() {
            return question;
        }

        public String getAnswer() {
            return answer;
        }

        public boolean isFinal() {
            return isFinal;
        }
    }

    /** Full result of a multi-step reasoning run. */
    public static class MultiStepRAGResponse {
        private String originalQuestion;
        private String finalAnswer;
        private List<ReasoningStep> reasoningSteps;

        public MultiStepRAGResponse(String originalQuestion, String finalAnswer, List<ReasoningStep> reasoningSteps) {
            this.originalQuestion = originalQuestion;
            this.finalAnswer = finalAnswer;
            this.reasoningSteps = reasoningSteps;
        }

        public String getOriginalQuestion() {
            return originalQuestion;
        }

        public String getFinalAnswer() {
            return finalAnswer;
        }

        public List<ReasoningStep> getReasoningSteps() {
            return reasoningSteps;
        }
    }

    /** Coarse complexity buckets used to choose retrieval parameters. */
    public enum QuestionComplexity {
        SIMPLE, MEDIUM, COMPLEX
    }
}
5.5 向量数据库管理控制器
// VectorStoreController.java
package com.example.springai.controller;
import com.example.springai.service.*;
import org.springframework.ai.document.Document;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
@RestController
@RequestMapping("/api/vector-store")
public class VectorStoreController {

    private final DocumentLoaderService documentLoaderService;
    private final DocumentSplitterService documentSplitterService;
    private final DocumentEnhancerService documentEnhancerService;
    private final ChromaVectorService chromaVectorService;
    private final RAGService ragService;
    private final AdvancedRAGService advancedRAGService;

    public VectorStoreController(
            DocumentLoaderService documentLoaderService,
            DocumentSplitterService documentSplitterService,
            DocumentEnhancerService documentEnhancerService,
            ChromaVectorService chromaVectorService,
            RAGService ragService,
            AdvancedRAGService advancedRAGService
    ) {
        this.documentLoaderService = documentLoaderService;
        this.documentSplitterService = documentSplitterService;
        this.documentEnhancerService = documentEnhancerService;
        this.chromaVectorService = chromaVectorService;
        this.ragService = ragService;
        this.advancedRAGService = advancedRAGService;
    }

    /**
     * Uploads a document, runs it through the load → split → enhance pipeline
     * and stores the resulting chunks in the vector database.
     *
     * Fixes over the previous version: the temporary file is now deleted in a
     * finally block (it used to leak whenever any pipeline step threw), and a
     * null getOriginalFilename() no longer crashes the Map.of response
     * (Map.of rejects null values).
     */
    @PostMapping("/upload")
    public ResponseEntity<Map<String, Object>> uploadDocument(
            @RequestParam("file") MultipartFile file,
            @RequestParam(value = "chunkSize", defaultValue = "1000") int chunkSize,
            @RequestParam(value = "chunkOverlap", defaultValue = "200") int chunkOverlap
    ) {
        // getOriginalFilename() may be null depending on the client/browser.
        String originalFilename = file.getOriginalFilename();
        String safeFilename = originalFilename != null ? originalFilename : "unnamed";
        Path tempFile = null;
        try {
            // 1. Persist the upload to a temporary file.
            tempFile = Files.createTempFile("upload_", "_" + safeFilename);
            file.transferTo(tempFile.toFile());
            // 2. Load the document.
            org.springframework.core.io.Resource resource =
                    new org.springframework.core.io.FileSystemResource(tempFile);
            List<Document> documents = documentLoaderService.loadTextFile(resource);
            // 3. Split into chunks.
            List<Document> splitDocuments = documentSplitterService
                    .splitByCharacters(documents, chunkSize, chunkOverlap);
            // 4. Enhance with metadata.
            List<Document> enhancedDocuments = documentEnhancerService
                    .enhanceDocuments(splitDocuments);
            // 5. Store in the vector database.
            chromaVectorService.addDocuments(enhancedDocuments);
            return ResponseEntity.ok(Map.of(
                    "message", "文档上传成功",
                    "documentCount", enhancedDocuments.size(),
                    "filename", safeFilename
            ));
        } catch (IOException e) {
            return ResponseEntity.badRequest().body(Map.of(
                    "error", "文件处理失败: " + e.getMessage()
            ));
        } finally {
            // 6. Always clean up the temp file, even on failure.
            if (tempFile != null) {
                try {
                    Files.deleteIfExists(tempFile);
                } catch (IOException ignored) {
                    // Best-effort cleanup; nothing useful to report to the caller.
                }
            }
        }
    }

    /**
     * Semantic (vector similarity) search.
     * Body: { "query": "...", "topK": 5 }
     */
    @PostMapping("/search")
    public ResponseEntity<List<Document>> semanticSearch(
            @RequestBody Map<String, Object> request
    ) {
        String query = (String) request.get("query");
        // Number conversion: JSON decoders may deliver Integer, Long or Double.
        int topK = ((Number) request.getOrDefault("topK", 5)).intValue();
        List<Document> results = chromaVectorService.semanticSearch(query, topK);
        return ResponseEntity.ok(results);
    }

    /**
     * Basic RAG query.
     * Body: { "question": "...", "topK": 5, "similarityThreshold": 0.7 }
     */
    @PostMapping("/rag/query")
    public ResponseEntity<RAGService.RAGResponse> ragQuery(
            @RequestBody Map<String, Object> request
    ) {
        String question = (String) request.get("question");
        RAGService.RAGOptions options = new RAGService.RAGOptions();
        if (request.containsKey("topK")) {
            options.setTopK(((Number) request.get("topK")).intValue());
        }
        if (request.containsKey("similarityThreshold")) {
            // A raw (Double) cast would throw ClassCastException for a whole
            // number like 1, which JSON decoders deliver as Integer.
            options.setSimilarityThreshold(((Number) request.get("similarityThreshold")).doubleValue());
        }
        RAGService.RAGResponse response = ragService.query(question, options);
        return ResponseEntity.ok(response);
    }

    /**
     * Hybrid (semantic + keyword) RAG query.
     * Body: { "question": "...", "semanticTopK": 5 }
     */
    @PostMapping("/rag/hybrid")
    public ResponseEntity<RAGService.RAGResponse> hybridSearch(
            @RequestBody Map<String, Object> request
    ) {
        String question = (String) request.get("question");
        AdvancedRAGService.HybridSearchOptions options =
                new AdvancedRAGService.HybridSearchOptions();
        if (request.containsKey("semanticTopK")) {
            options.setSemanticTopK(((Number) request.get("semanticTopK")).intValue());
        }
        RAGService.RAGResponse response = advancedRAGService.hybridSearch(question, options);
        return ResponseEntity.ok(response);
    }

    /**
     * Multi-step reasoning RAG query.
     * Body: { "question": "...", "maxSteps": 3 }
     */
    @PostMapping("/rag/multi-step")
    public ResponseEntity<AdvancedRAGService.MultiStepRAGResponse> multiStepReasoning(
            @RequestBody Map<String, Object> request
    ) {
        String question = (String) request.get("question");
        int maxSteps = ((Number) request.getOrDefault("maxSteps", 3)).intValue();
        AdvancedRAGService.MultiStepRAGResponse response =
                advancedRAGService.multiStepReasoning(question, maxSteps);
        return ResponseEntity.ok(response);
    }

    /**
     * Deletes documents from the vector store by id.
     * Body: a JSON array of document ids.
     */
    @DeleteMapping("/documents")
    public ResponseEntity<Map<String, Object>> deleteDocuments(
            @RequestBody List<String> documentIds
    ) {
        boolean success = chromaVectorService.deleteDocuments(documentIds);
        return ResponseEntity.ok(Map.of(
                "success", success,
                "deletedCount", documentIds.size()
        ));
    }
}
5.6 配置文件
# application.yml
# NOTE: the original listing had its indentation flattened, which makes YAML
# structurally invalid; nesting is reconstructed here per Spring AI property
# conventions.
spring:
  ai:
    vectorstore:
      chroma:
        host: localhost
        port: 8000
        collection-name: spring_ai_docs
      pinecone:
        api-key: ${PINECONE_API_KEY}
        environment: ${PINECONE_ENVIRONMENT}
        project-id: ${PINECONE_PROJECT_ID}
        index-name: spring-ai-index
      redis:
        host: localhost
        port: 6379
        index-name: spring_ai_index
        prefix: "doc:"
    embedding:
      openai:
        api-key: ${OPENAI_API_KEY}
        model: text-embedding-ada-002
    chat:
      openai:
        api-key: ${OPENAI_API_KEY}
        model: gpt-3.5-turbo

# Document processing settings
document:
  processing:
    default-chunk-size: 1000
    default-chunk-overlap: 200
    max-file-size: 10MB
    supported-formats:
      - txt
      - pdf
      - json
      - md

# RAG settings
rag:
  default:
    top-k: 5
    similarity-threshold: 0.7
    include-metadata: false
  advanced:
    max-reasoning-steps: 5
    answer-quality-threshold: 0.7
本章总结
本章深入介绍了Spring AI中的向量数据库与文档处理功能:
核心要点
向量数据库基础
- 向量存储抽象接口
- 文档和搜索请求模型
- 相似性搜索机制
多种向量数据库集成
- Chroma数据库集成
- Pinecone云服务集成
- Redis向量存储集成
文档处理流水线
- 多格式文档加载器
- 智能文档分割策略
- 文档元数据增强
RAG系统实现
- 基础RAG查询流程
- 多轮对话支持
- 高级RAG功能(混合检索、多步推理、自适应检索)
REST API接口
- 文档上传和处理
- 语义搜索接口
- RAG查询接口
最佳实践
文档分割策略
- 根据内容类型选择合适的分割方法
- 保持语义完整性
- 设置合理的重叠区域
向量数据库选择
- 开发环境使用Chroma
- 生产环境考虑Pinecone或专业向量数据库
- 根据数据规模选择合适的存储方案
RAG优化
- 调整检索参数以平衡准确性和召回率
- 使用混合检索提高结果质量
- 实现答案质量评估机制
练习题
- 实现一个支持多种文件格式的文档加载器
- 设计一个文档版本管理系统
- 创建一个RAG系统的性能监控面板
- 实现文档的增量更新机制
- 开发一个智能问答系统的评估框架