Episode 117: Full Text Search (Faceoff Show podcast)

13+ y ago 34:26

Content provided by Jade Robbins and Mark Sanborn. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Jade Robbins and Mark Sanborn or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described here: https://uk.player.fm/legal.

Add enterprise level search into your site.

News and Follow/Ups – 01:00

Geek Tools – 14:13

Yikerz! – Super fun magnet game

Webapps – 16:12

Surfboard – Flipboard as a web app
InstaLyrics – Find lyrics quickly

Full Text Search – 22:11

Options
- Google Custom Search
  - Commercial
  - Benefits
    - Super fast to set up
    - Easy to implement
    - Ability to add AdSense to search results
  - Downsides
    - No way to adjust result ranking or do custom integration
    - Mainly for indexing HTML pages, not for searching database content or other text
- Sphinx
  - “Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.”
  - Open source with commercial support
  - Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
  - The search service daemon (searchd) is pretty low on memory usage – and you can set limits on how much memory the indexer process uses too.
  - API for:
    - Java, PHP, Python, Ruby, Perl, C, and other languages.
  - Written in C++
  - Stats
    - 60+ MB/sec per server
    - 500+ queries/sec
    - Biggest known Sphinx cluster indexes 5 billion documents, resulting in over 6 TB of data. The busiest known one is, unsurprisingly, Craigslist, which serves 50+ million search queries/day.
  - Companies using Sphinx
    - Craigslist
    - Slashdot
    - Mozilla
    - WordPress.org
- Lucene
  - Developed by the Apache Software Foundation
  - Open source
  - Written in Java
  - Search types
    - ranked searching — best results returned first
    - many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
    - fielded searching (e.g., title, author, contents)
    - date-range searching
    - sorting by any field
    - multiple-index searching with merged results
    - allows simultaneous update and searching
  - Stats
    - over 95GB/hour on modern hardware
    - small RAM requirements — only 1MB heap
    - index size roughly 20-30% the size of text indexed
- Solr
  - Lucene is a library, whereas Solr is a search server built on Lucene that supports XML and REST
  - Benefits over Sphinx
    - Solr is easily embeddable in Java applications.
    - Solr can be integrated with Hadoop to build distributed applications
    - Solr can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can’t.
  - Companies using Solr
    - eHarmony
    - Ticketmaster
    - Digg
    - AOL
    - Zappos
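
The notes above mention ranked searching, fielded searching, and giving specific fields higher weightings. As a rough illustration of how those ideas fit together, here is a toy inverted index in plain Python. It is not how Sphinx, Lucene, or Solr implement ranking; the field names, the weights, and the function names (`build_index`, `search`) are all made up for this sketch, and the score is just a weighted term count.

```python
import re
from collections import defaultdict

# Hypothetical field weights (an assumption for this sketch):
# a hit in the title counts three times as much as a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> field -> {doc_id: term count}."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in tokenize(text):
                postings = index[term][field]
                postings[doc_id] = postings.get(doc_id, 0) + 1
    return index


def search(index, query, weights=FIELD_WEIGHTS):
    """Return (doc_id, score) pairs, best match first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        # .get avoids creating empty entries for unknown terms
        for field, postings in index.get(term, {}).items():
            for doc_id, count in postings.items():
                scores[doc_id] += weights.get(field, 1.0) * count
    # Sort by descending score, then by doc_id to keep ties stable
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample documents based on the tools discussed in the episode
docs = {
    1: {"title": "Sphinx search server", "body": "Sphinx is written in C++."},
    2: {"title": "Apache Lucene", "body": "Lucene is a search library written in Java."},
    3: {"title": "Apache Solr", "body": "Solr is a search server built on Lucene."},
}
index = build_index(docs)
results = search(index, "search server")
print(results)
```

Document 1 ranks first because both query terms appear in its title, which carries the higher weight. Real engines use more elaborate scoring, so treat this only as a way to see why field weights change the order of results.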
