Full-Text Search Server: Solr - FreeOA

2013-11-18 14:04:23

阿炯

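Apache Solr (pronounced "solar") is an open-source search server. It is written in Java and is built mainly on HTTP and Apache Lucene. Resources in Solr are stored as Document objects. Each document consists of a series of Fields, and each Field represents one attribute of the resource. Every Document in Solr needs an attribute that uniquely identifies it; by default this attribute is named id, and it is declared in the Schema configuration file with <uniqueKey>id</uniqueKey>. Solr is a Lucene-based full-text search server and one of the most popular enterprise search engines. Its main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and handling of rich documents (such as Word and PDF). Solr is highly scalable and provides distributed search and index replication. It is released under the Apache License.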
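
The document model described above can be sketched in a few lines of Python. This is only an illustration: the helper make_document and the field names title and pages are made up for the example, and the only thing taken from the text is that each document is a set of fields with a unique id.

```python
import json

def make_document(doc_id, **fields):
    """Return one Solr-style document: a flat mapping of field name to value."""
    if not doc_id:
        raise ValueError("every document needs a non-empty unique key")
    doc = {"id": doc_id}   # "id" is the default uniqueKey field name
    doc.update(fields)
    return doc

# Hypothetical documents; the values are placeholders, not real data.
docs = [
    make_document("book-1", title="Example title A", pages=100),
    make_document("book-2", title="Example title B", pages=200),
]

# Solr's JSON update handler accepts a JSON array of documents as the request body.
payload = json.dumps(docs, ensure_ascii=False)
print(payload)
```

The resulting string is what a client would send as the body of an update request; sending it is left out so the sketch runs without a server.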

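Documents are added to a Solr search collection as XML over HTTP, and the collection is queried over HTTP as well, with the response returned as XML or JSON. Its main features include efficient and flexible caching, vertical search, highlighted search results, improved availability through index replication, a powerful data schema for defining fields, types and text analysis, and a web-based administration interface.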
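
As a rough sketch of what such an XML message looks like, the following Python builds an <add> request body with the standard library. The function name build_add_message and the sample field names are assumptions made for the example; the body would normally be POSTed to the update handler of a core, which is not done here.

```python
import xml.etree.ElementTree as ET

def build_add_message(docs):
    """Build a Solr XML update message: <add><doc><field name="...">...</field></doc></add>."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A multi-valued field is sent as repeated <field> elements with the same name.
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")

# Hypothetical document with one multi-valued field.
xml_body = build_add_message([{"id": "doc-1", "title": "hello solr", "tag": ["a", "b"]}])
print(xml_body)
```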

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

Features
Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP. You query it via HTTP GET and receive XML, JSON, CSV or binary results.
Advanced Full-Text Search Capabilities
Optimized for High Volume Web Traffic
Standards Based Open Interfaces - XML, JSON and HTTP
Comprehensive HTML Administration Interfaces
Server statistics exposed over JMX for monitoring
Linearly scalable, auto index replication, auto failover and recovery
Near Real-time indexing
Flexible and Adaptable with XML configuration
Extensible Plugin Architecture
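
The "query via HTTP GET, receive JSON" round trip mentioned at the top of this list might look like the following minimal sketch. The host, port and core name in BASE_URL are assumptions, the helper names are made up, and the sample response is hand-written in the usual response layout rather than captured from a server.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/demo"  # assumed host and core name

def select_url(query, rows=10, wt="json"):
    """Build a GET URL for the /select handler."""
    params = {"q": query, "rows": rows, "wt": wt}
    return f"{BASE_URL}/select?{urlencode(params)}"

def doc_ids(response_text):
    """Extract the document ids from a JSON response body."""
    body = json.loads(response_text)
    return [doc["id"] for doc in body["response"]["docs"]]

url = select_url("title:solr", rows=5)

# Hand-written sample, not output from a real server.
sample = (
    '{"responseHeader": {"status": 0},'
    ' "response": {"numFound": 1, "start": 0, "docs": [{"id": "doc-1"}]}}'
)
print(url)
print(doc_ids(sample))
```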

Detailed Features
Schema
Defines the field types and fields of documents
Can drive more intelligent processing
Declarative Lucene Analyzer specification
Dynamic Fields enable on-the-fly addition of new fields
CopyField functionality allows indexing a single field multiple ways, or combining multiple fields into a single searchable field
Explicit types eliminate the need for guessing types of fields
External file-based configuration of stopword lists, synonym lists, and protected word lists
Many additional text analysis components including word splitting, regex and sounds-like filters
Pluggable similarity model per field
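
To make the dynamic field and CopyField ideas more concrete, here is a toy Python model of them. It is not Solr's implementation: the patterns, type names and copy rules are hypothetical, and the matching is deliberately simplified to a first-match glob lookup.

```python
from fnmatch import fnmatchcase

# Hypothetical rules, in the spirit of dynamic field and copy field declarations.
DYNAMIC_FIELDS = {"*_s": "string", "*_i": "int", "*_txt": "text_general"}
COPY_FIELDS = [("title", "text_all"), ("body", "text_all")]

def resolve_type(field_name, explicit_fields):
    """Return the field type: explicit declarations win, then dynamic patterns."""
    if field_name in explicit_fields:
        return explicit_fields[field_name]
    for pattern, field_type in DYNAMIC_FIELDS.items():
        if fnmatchcase(field_name, pattern):
            return field_type
    raise KeyError(f"undefined field: {field_name}")

def apply_copy_fields(doc):
    """Copy source field values into their destination fields, leaving sources intact."""
    out = dict(doc)
    for source, dest in COPY_FIELDS:
        if source in doc:
            out[dest] = out.get(dest, []) + [doc[source]]
    return out

explicit = {"id": "string", "title": "text_general", "body": "text_general"}
print(resolve_type("price_i", explicit))
print(apply_copy_fields({"id": "1", "title": "Solr", "body": "search server"}))
```

A field such as price_i never has to be declared up front, and text_all ends up holding the values of both title and body so that one field can be searched instead of two.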

Query
HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby, PHP, Velocity, CSV, binary)
Sort by any number of fields, and by complex functions of numeric fields
Advanced DisMax query parser for high relevancy results from user-entered queries
Highlighted context snippets
Faceted Searching based on unique field values, explicit queries, date ranges, numeric ranges or pivot
Multi-Select Faceting by tagging and selectively excluding filters
Spelling suggestions for user queries
More Like This suggestions for given document
Function Query - influence the score by user specified complex functions of numeric fields or query relevancy scores.
Range filter over Function Query results
Date Math - specify dates relative to "NOW" in queries and updates
Dynamic search results clustering using Carrot2
Numeric field statistics such as min, max, average, standard deviation
Combine queries derived from different syntaxes
Auto-suggest functionality for completing user queries
Allow configuration of top results for a query, overriding normal scoring and sorting
Simple join capability between two document types
Performance Optimizations
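
Several of the query features above are driven purely by request parameters. The sketch below assembles one such parameter set in Python; the field names (title, body, filetype, modified) are assumptions, and the request is only encoded, not sent.

```python
from urllib.parse import urlencode, parse_qs

params = [
    ("q", "apache solr"),
    ("defType", "dismax"),                   # DisMax parser for user-entered text
    ("qf", "title^2 body"),                  # search title and body, boosting title
    ("fq", "{!tag=ft}filetype:pdf"),         # tagged filter ...
    ("fq", "modified:[NOW-7DAYS TO NOW]"),   # date math relative to NOW
    ("facet", "true"),
    ("facet.field", "{!ex=ft}filetype"),     # ... excluded when counting this facet
    ("hl", "true"),
    ("hl.fl", "body"),                       # highlight snippets from the body field
    ("sort", "score desc, modified desc"),
]
query_string = urlencode(params)
print(query_string)

# Round-trip check: both filter queries survive URL encoding.
print(parse_qs(query_string)["fq"])
```

The tag/ex pair is what makes multi-select faceting work: the filetype filter narrows the result list, but the filetype facet is counted as if that filter were absent, so the other file types stay selectable.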

Core
Dynamically create and delete document collections without restarting
Pluggable query handlers and extensible XML data format
Pluggable user functions for Function Query
Customizable component based request handler with distributed search support
Document uniqueness enforcement based on unique key field
Duplicate document detection, including fuzzy near duplicates
Custom index processing chains, allowing document manipulation before indexing
User configurable commands triggered on index changes
Ability to control where docs with the sort field missing will be placed
"Luke" request handler for corpus information
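
Two of the behaviors listed above, uniqueness enforcement on the key field and control over where documents without the sort field are placed, can be illustrated with a toy in-memory model. TinyIndex is a made-up class for this sketch and has nothing to do with Solr's internals.

```python
class TinyIndex:
    """Toy in-memory stand-in for an index keyed by a unique key field."""

    def __init__(self, unique_key="id"):
        self.unique_key = unique_key
        self._docs = {}

    def add(self, doc):
        # A document whose key already exists replaces the earlier version.
        self._docs[doc[self.unique_key]] = dict(doc)

    def sorted_by(self, field, missing_last=True):
        """Sort ascending by field; documents lacking the field go first or last."""
        have = sorted(
            (d for d in self._docs.values() if field in d),
            key=lambda d: d[field],
        )
        missing = [d for d in self._docs.values() if field not in d]
        return have + missing if missing_last else missing + have

index = TinyIndex()
index.add({"id": "a", "price": 30})
index.add({"id": "b"})                 # no price field
index.add({"id": "a", "price": 10})    # same id: replaces the first "a"
index.add({"id": "c", "price": 20})
print([d["id"] for d in index.sorted_by("price")])
print([d["id"] for d in index.sorted_by("price", missing_last=False)])
```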

Caching
Configurable Query Result, Filter, and Document cache instances
Pluggable Cache implementations, including a lock free, high concurrency implementation
Cache warming in background
When a new searcher is opened, configurable searches are run against it in order to warm it up to avoid slow first hits. During warming, the current searcher handles live requests.
Autowarming in background
The most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabling high cache hit rates across index/searcher changes.
Fast/small filter implementation
User level caching with autowarming support
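
The autowarming idea described above (re-populating a new searcher's cache from the most recently accessed entries of the old one) can be sketched as follows. LRUCache and its parameters are hypothetical names for this illustration and are not Solr's cache classes.

```python
from collections import OrderedDict

class LRUCache:
    """Small LRU cache that can be warmed from the cache of a previous searcher."""

    def __init__(self, size, autowarm_count):
        self.size = size
        self.autowarm_count = autowarm_count
        self._items = OrderedDict()   # oldest first, most recently used last

    def get(self, key, compute):
        if key in self._items:
            self._items.move_to_end(key)      # mark as most recently used
            return self._items[key]
        value = compute(key)
        self._items[key] = value
        if len(self._items) > self.size:
            self._items.popitem(last=False)   # evict the least recently used entry
        return value

    def warm_from(self, old, compute):
        """Re-compute the most recently used keys of `old` against the new searcher."""
        if self.autowarm_count <= 0:
            return
        for key in list(old._items)[-self.autowarm_count:]:
            self.get(key, compute)

old = LRUCache(size=3, autowarm_count=2)
for q in ["q1", "q2", "q3", "q1"]:
    old.get(q, lambda k: f"{k} results from old searcher")

new = LRUCache(size=3, autowarm_count=2)
new.warm_from(old, lambda k: f"{k} results from new searcher")
print(list(new._items))
```

Only the keys are carried over; the values are recomputed, because results cached against the old searcher may no longer match the changed index.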

SolrCloud
Centralized Apache ZooKeeper based configuration
Automated distributed indexing/sharding - send documents to any node and it will be forwarded to correct shard
Near Real-Time indexing with immediate push-based replication (also support for slower pull-based replication)
Transaction log ensures no updates are lost even if the documents are not yet indexed to disk
Automated query failover, index leader election and recovery in case of failure
No single point of failure
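
The "send documents to any node and it will be forwarded to correct shard" behavior rests on routing by document id. The sketch below shows the general idea only: the shard names are hypothetical, and CRC32 with a modulo is used as a stand-in because it ships with Python. SolrCloud's own routers differ in detail, and neither their hash function nor their range assignment is reproduced here.

```python
import zlib

SHARDS = ["shard1", "shard2", "shard3"]   # hypothetical shard names

def shard_for(doc_id, shards=SHARDS):
    """Map a document id to a shard by hashing it (CRC32 here, as a stand-in)."""
    digest = zlib.crc32(doc_id.encode("utf-8"))
    return shards[digest % len(shards)]

def route(docs):
    """Group documents by target shard, as a receiving node would before forwarding."""
    batches = {name: [] for name in SHARDS}
    for doc in docs:
        batches[shard_for(doc["id"])].append(doc)
    return batches

docs = [{"id": f"doc-{n}"} for n in range(10)]
batches = route(docs)
for name, batch in batches.items():
    print(name, [d["id"] for d in batch])
```

Because the target shard depends only on the id, every node computes the same answer, and an update to an existing document always lands on the shard that already holds it.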

Admin Interface
Comprehensive statistics on cache utilization, updates, and queries
Interactive schema browser that includes index statistics
Replication monitoring
SolrCloud dashboard with graphical cluster node status
Full logging control
Text analysis debugger, showing result of every stage in an analyzer
Web Query Interface with debugging output
Parsed query output
Lucene explain() document score detailing
Explain score for documents outside of the requested range to debug why a given document wasn't ranked higher.
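
The same debugging output is also reachable through request parameters. A minimal sketch, assuming hypothetical field names and a made-up helper debug_params, which only encodes the parameters and sends nothing:

```python
from urllib.parse import urlencode

def debug_params(query, explain_other=None):
    """Encode a query that asks for parsed-query and score-explanation output."""
    params = {"q": query, "debugQuery": "true", "wt": "json"}
    if explain_other:
        # Also explain documents matching this query, even if they fall
        # outside the requested range of results.
        params["explainOther"] = explain_other
    return urlencode(params)

print(debug_params("title:solr"))
print(debug_params("title:solr", explain_other="id:doc-42"))
```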

Latest version: 5.2
Includes many other new features and optimizations as well as bug fixes; see the project homepage for details.

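Latest version: 8.0
The Lucene PMC announced the release of Apache Solr 8.0.0. Notable changes in this release: inter-node communication in Solr now uses HTTP/2. Many deprecated APIs have been removed, and various parameter defaults and behaviors have changed. Some of these changes may require you to reindex your content. The release also includes many other new features, as well as optimizations and fixes from the new Apache Lucene release. See the changelog for details.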

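Latest version: 8.8
Apache Solr 8.8.0 was released on February 2, 2021. Some of the changes:
The internal logic for identifying "Solr Home" has been refactored to make testing less error-prone. Plugin developers using SolrPaths.locateSolrHome() or 'new SolrResourceLoader' should check for deprecation warnings, because some existing functionality will be removed in 9.0
Added a v2 API for configSet upload, including uploading a single file
Added interleaving support to Learning To Rank
Reduced Overseer bottlenecks by using per-replica states
Reduced leader election time on node shutdown by removing election nodes before closing cores
Added env var options for the Prometheus Exporter bin scripts
Made JSON Facets extensible
Refactored schema loading so that it does not use XPath
Improved log timestamp handling, among other changes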

Project homepage: http://lucene.apache.org/solr/

This article was last updated by 阿炯 on 2021-02-02 14:49:17 and is currently at revision 2.