CommonCrawl Extractor 1.0 documentation

Contents:

  • Installation
  • Quick Start Guide
    • Quick Overview
    • Quickstart
    • Artemis Queue
  • API
    • Aggregator
      • Aggregator.App
        • Aggregator.App.index_query
        • Aggregator.App.ndjson_decoder
        • Aggregator.App.utils
      • Aggregator.aggregator
    • Processor
      • Processor.App
        • Processor.App.Downloader
        • Processor.App.Extractor
        • Processor.App.OutStreamer
        • Processor.App.Pipeline
        • Processor.App.Router
        • Processor.App.processor_utils
        • Processor.App.ArticleUtils
      • Processor.process_article
      • Processor.processor
        • Processor.processor.Listener
        • Processor.processor.ListnerStats
        • Processor.processor.Message

Welcome to CommonCrawl Extractor’s documentation!

Contents:

  • Installation
    • Docker
  • Quick Start Guide
    • Quick Overview
      • 1. Querying CommonCrawl
      • 2. Downloading a file
      • 3. Choose parser
      • 4. Filtering out the web page
      • 5. Extract fields from the page
      • 6. File saving
    • Quickstart
      • Extractor
      • download_article.py
      • Extracting (Transformations)
      • Extracting (BS4 version)
      • Filtering
      • config.json
      • Testing our extractor
      • Running the extractor
    • Artemis Queue
  • API
    • Aggregator
      • Aggregator.App
      • Aggregator.aggregator
    • Processor
      • Processor.App
      • Processor.process_article
      • Processor.processor

Indices and tables

  • Index
  • Module Index
  • Search Page

By Hynek Kydlíček
© Copyright 2022, Hynek Kydlíček.