Ecosyste.ms: Issues

An open API service that provides issue and pull request metadata for open source projects.

GitHub / adbar/trafilatura issues and pull requests

#743 - docs: remove from published packages

Pull Request - State: closed - Opened by adbar 4 days ago - 1 comment

#742 - extraction: move max_tree_size parameter to settings.cfg

Pull Request - State: closed - Opened by adbar 5 days ago - 1 comment

#741 - Extraction: move `max_tree_size` to config file

Issue - State: closed - Opened by adbar 5 days ago
Labels: enhancement

#740 - setup: explicit exports through `__all__`

Pull Request - State: closed - Opened by adbar 9 days ago - 1 comment

#739 - Extracting full text from an URL returns None

Issue - State: open - Opened by vrnch 10 days ago - 2 comments
Labels: question

#738 - Explicitly and fully support type hinting

Issue - State: open - Opened by adbar 11 days ago
Labels: enhancement

#737 - build(deps): bump the dependencies group with 5 updates

Pull Request - State: closed - Opened by dependabot[bot] 15 days ago - 1 comment
Labels: dependencies

#736 - downloads: cleaner urllib3 code

Pull Request - State: closed - Opened by adbar 15 days ago - 1 comment

#735 - downloads: better urllib3 setup

Pull Request - State: closed - Opened by adbar 16 days ago

#734 - CLI downloads: use all information in settings file

Pull Request - State: closed - Opened by adbar 16 days ago - 1 comment

#733 - Downloads: fully use information from both `config` and `options` variables

Issue - State: closed - Opened by adbar 17 days ago
Labels: maintenance

#732 - CLI downloads: make sure all user-specified options are used

Issue - State: closed - Opened by andyskipper 19 days ago - 4 comments
Labels: enhancement

#731 - evaluation: review data, update packages, add magic_html

Pull Request - State: closed - Opened by adbar 19 days ago - 1 comment

#730 - extraction: deprecate no_fallback and as_dict parameters

Pull Request - State: closed - Opened by adbar 23 days ago - 1 comment

#729 - `bare_extraction()`: deprecate `as_dict` parameter

Issue - State: closed - Opened by adbar 24 days ago
Labels: maintenance

#728 - typing: fix mypy errors

Pull Request - State: closed - Opened by adbar 24 days ago - 1 comment

#727 - simplify trim() function

Pull Request - State: closed - Opened by adbar 25 days ago - 1 comment

#726 - Focused crawler returns 404 response for robots.txt and stops crawling

Issue - State: closed - Opened by Guthman 28 days ago - 1 comment

#725 - `extract()`: replace `no_fallback` argument by `fast`

Issue - State: closed - Opened by adbar 29 days ago
Labels: maintenance

#724 - downloads: remove `decode` argument in `fetch_url()`

Pull Request - State: closed - Opened by adbar 29 days ago - 1 comment

#723 - refactoring: add type hints

Pull Request - State: closed - Opened by adbar about 1 month ago - 1 comment

#722 - Deprecate `fetch_url(decode=False)`

Issue - State: closed - Opened by adbar about 1 month ago
Labels: maintenance

#721 - fix: more robust mapping for conversion to HTML

Pull Request - State: closed - Opened by adbar about 1 month ago - 1 comment

#720 - Review HTML element list and conversion

Issue - State: open - Opened by adbar about 1 month ago
Labels: enhancement

#718 - setup: set `__all__` in `__init__.py`

Issue - State: closed - Opened by adbar about 1 month ago
Labels: maintenance

#717 - fix: robust encoding in options.source

Pull Request - State: closed - Opened by adbar about 1 month ago - 1 comment

#716 - breaking: remove deprecated functions and args

Pull Request - State: closed - Opened by adbar about 1 month ago - 1 comment

#715 - setup: use pyproject.toml file

Pull Request - State: closed - Opened by adbar about 1 month ago - 1 comment

#714 - logging: better debug messages in main_extractor

Pull Request - State: closed - Opened by adbar about 1 month ago - 1 comment

#713 - setup: deprecate current GUI

Pull Request - State: closed - Opened by adbar about 1 month ago - 1 comment

#712 - setup: use `pyproject.toml` file

Issue - State: closed - Opened by adbar about 1 month ago
Labels: maintenance

#711 - Use rst link instead of markdown link in `docs/index.html`

Pull Request - State: closed - Opened by nzw0301 about 1 month ago - 1 comment

#710 - metadata: more robust URL extraction

Pull Request - State: closed - Opened by adbar about 1 month ago - 1 comment

#709 - maintenance: deprecate 3.6 & 3.7 and simplify code base

Pull Request - State: closed - Opened by adbar about 1 month ago - 1 comment

#708 - maintenance: remove superfluous RuntimeError catch

Pull Request - State: closed - Opened by adbar about 1 month ago - 1 comment

#707 - fix: set options.source before raising error on empty doc tree

Pull Request - State: closed - Opened by dmoklaf about 2 months ago - 2 comments

#706 - build(deps): bump the dependencies group with 5 updates

Pull Request - State: closed - Opened by dependabot[bot] about 2 months ago - 1 comment
Labels: dependencies

#705 - Trafilatura crashing due to `options` variable not backfilled yet

Issue - State: closed - Opened by dmoklaf about 2 months ago - 1 comment
Labels: bug

#704 - extract function runs indefinitely on large HTML body content

Issue - State: closed - Opened by hitesh1997 about 2 months ago - 1 comment
Labels: question

#703 - Download multiple urls with download timeout

Issue - State: closed - Opened by vodkaslime about 2 months ago - 2 comments
Labels: documentation

#702 - I can't extract main content from this html, could anyone help me?

Issue - State: closed - Opened by CNXDZS about 2 months ago - 1 comment
Labels: feedback

#701 - HTML_TAG_MAPPING error during scrape

Issue - State: closed - Opened by beefyandbeef about 2 months ago - 2 comments
Labels: bug

#700 - prepare v1.12.2

Pull Request - State: closed - Opened by adbar 2 months ago - 1 comment

#699 - update docs

Pull Request - State: closed - Opened by adbar 2 months ago - 1 comment

#698 - Docs: add page explaining how to run tests

Issue - State: open - Opened by adbar 2 months ago
Labels: documentation

#697 - Downloads: add support to switch between proxies

Issue - State: open - Opened by adbar 2 months ago
Labels: enhancement

#696 - Empty Results When Using Spider Function with Category URL

Issue - State: open - Opened by felipehertzer 2 months ago - 5 comments
Labels: question

#695 - Link on the quickstart page to the overview notebook is broken

Issue - State: closed - Opened by cdfuller 2 months ago - 1 comment
Labels: documentation

#694 - metadata: review and lint code

Pull Request - State: closed - Opened by adbar 2 months ago - 1 comment

#693 - ImportError: lxml.html.clean module is now a separate project

Issue - State: closed - Opened by regstuff 2 months ago - 2 comments
Labels: feedback

#692 - Javascript port of all 35 files

Pull Request - State: closed - Opened by vtempest 2 months ago - 1 comment

#691 - maintenance: make compression libraries optional

Pull Request - State: closed - Opened by adbar 2 months ago - 1 comment

#690 - Add max_sitemaps parameter to sitemap_search

Pull Request - State: closed - Opened by felipehertzer 2 months ago - 2 comments

#689 - build(deps): bump the dependencies group with 4 updates

Pull Request - State: closed - Opened by dependabot[bot] 3 months ago - 1 comment
Labels: dependencies

#688 - Javascript Version has landed. 🚀

Issue - State: closed - Opened by vtempest 3 months ago - 3 comments
Labels: question

#687 - spider: relax strict parameter for link extraction

Pull Request - State: closed - Opened by adbar 3 months ago - 1 comment

#685 - extraction fix: ValueError in table spans

Pull Request - State: closed - Opened by adbar 3 months ago - 1 comment

#684 - Added prune xpath to spider

Pull Request - State: closed - Opened by felipehertzer 3 months ago - 9 comments

#682 - Add SOCKS Proxy support

Pull Request - State: closed - Opened by gremid 3 months ago - 8 comments

#681 - ValueError in xml

Issue - State: closed - Opened by Honesty-of-the-Cavernous-Tissue 3 months ago - 3 comments
Labels: bug

#680 - Crawler doesn't extract any links from Google Cloud documentation website

Issue - State: closed - Opened by Guthman 3 months ago - 6 comments
Labels: bug

#679 - prepare version 1.12.1

Pull Request - State: closed - Opened by adbar 3 months ago - 1 comment

#678 - Fixed incorrect variable passed to extract_metadata

Pull Request - State: closed - Opened by jpigla 3 months ago - 2 comments

#677 - CLI: review code, add types and tests

Pull Request - State: closed - Opened by adbar 3 months ago - 1 comment

#676 - Remove deprecations (mostly CLI)

Issue - State: closed - Opened by adbar 3 months ago
Labels: maintenance

#675 - crawler: add params class

Pull Request - State: closed - Opened by adbar 3 months ago - 1 comment

#674 - maintenance: simplify link discovery

Pull Request - State: closed - Opened by adbar 3 months ago - 1 comment

#673 - spider: restrict search to site section targeted by input URL

Pull Request - State: closed - Opened by adbar 3 months ago - 1 comment

#672 - spider: restrict search to given URL pattern

Issue - State: closed - Opened by adbar 3 months ago
Labels: enhancement

#670 - trafilatura version > 1.10.0 doesn't fetch images

Issue - State: closed - Opened by rkiacnhg 3 months ago - 3 comments

#669 - build(deps): bump the dependencies group with 2 updates

Pull Request - State: closed - Opened by dependabot[bot] 4 months ago - 1 comment
Labels: dependencies

#668 - robust element deletion: fix AttributeError

Pull Request - State: closed - Opened by adbar 4 months ago - 1 comment

#667 - AttributeError in prune_unwanted_sections

Issue - State: closed - Opened by Honesty-of-the-Cavernous-Tissue 4 months ago - 3 comments
Labels: bug

#665 - table fix: maximum number of header columns

Pull Request - State: closed - Opened by adbar 4 months ago - 1 comment

#664 - prepare v1.12.0

Pull Request - State: closed - Opened by adbar 4 months ago - 1 comment

#663 - feat(cli/lib): Add tqdm based progress bar as an option

Issue - State: open - Opened by chitralverma 4 months ago - 1 comment
Labels: enhancement

#662 - Bug or feature, I'm not sure!

Issue - State: closed - Opened by szj2ys 4 months ago - 1 comment
Labels: duplicate

#661 - Investigate spacing in element tails

Issue - State: open - Opened by adbar 4 months ago - 3 comments
Labels: question

#660 - Faulty extraction for very short documents

Issue - State: open - Opened by Psynbiotik 4 months ago - 4 comments
Labels: enhancement

#658 - table fix: MemoryError & ValueError during conversion to text

Pull Request - State: closed - Opened by adbar 4 months ago - 3 comments

#657 - MemoryError in table conversion

Issue - State: closed - Opened by Honesty-of-the-Cavernous-Tissue 4 months ago - 2 comments
Labels: bug

#656 - formatting & markdown fix: add newlines

Pull Request - State: closed - Opened by adbar 4 months ago - 1 comment

#655 - XML-TEI: replace RelaxNG by DTD, remove pickle, and update

Pull Request - State: closed - Opened by adbar 4 months ago

#654 - images fix: use a length threshold on src attribute

Pull Request - State: closed - Opened by adbar 4 months ago - 1 comment

#653 - extraction: review link and structure checks

Pull Request - State: closed - Opened by adbar 4 months ago - 1 comment

#652 - extraction: improve justext fallback

Pull Request - State: closed - Opened by adbar 4 months ago - 1 comment

#651 - Extraction with `include_images=True` takes too much time

Issue - State: closed - Opened by Honesty-of-the-Cavernous-Tissue 4 months ago - 3 comments
Labels: bug

#650 - Add magic_html to benchmarks

Issue - State: closed - Opened by dantetemplar 4 months ago - 2 comments
Labels: evaluation

#649 - CLI fix: markdown format should trigger include_formatting

Pull Request - State: closed - Opened by adbar 4 months ago - 1 comment

#648 - CLI: Trigger formatting parameter when the output is in Markdown format

Issue - State: closed - Opened by adbar 4 months ago
Labels: bug

#647 - output formats: enforce fixed list, deprecate -out on the CLI

Pull Request - State: closed - Opened by adbar 4 months ago - 1 comment

#646 - precision fix: do not use baseline as backup extraction

Pull Request - State: closed - Opened by adbar 4 months ago - 1 comment

#645 - review XPaths for undesirable content

Pull Request - State: closed - Opened by adbar 4 months ago - 1 comment

#644 - Validate value of `output_format` in `extract()` and `bare_extraction()`

Issue - State: closed - Opened by adbar 4 months ago
Labels: enhancement

#643 - baseline fix: prevent LXML error in JSON-LD

Pull Request - State: closed - Opened by adbar 4 months ago - 1 comment

#642 - Missing h1 heading if <header> outside of <article>

Issue - State: open - Opened by chrisgoddard 4 months ago - 2 comments
Labels: question

#641 - Impossible to extract Ryan Reynolds website

Issue - State: closed - Opened by Philrobots 4 months ago - 1 comment
Labels: feedback

#640 - AttributeError in baseline extraction of JSON text

Issue - State: closed - Opened by Honesty-of-the-Cavernous-Tissue 4 months ago
Labels: bug
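
Each entry in the listing above follows the same three-part pattern: a `#number - title` line, a metadata line (`Issue` or `Pull Request`, state, author, age, optional comment count), and an optional `Labels:` line. The following is a minimal sketch of how that plain-text layout could be parsed into records. The names `Entry` and `parse_listing` are hypothetical helpers introduced here for illustration and are not part of Ecosyste.ms or trafilatura. The comma-separated handling of multiple labels is an assumption, since every entry shown carries at most one label.

```python
import re
from typing import List, NamedTuple

class Entry(NamedTuple):
    """One issue or pull request from the plain-text listing (hypothetical type)."""
    number: int
    title: str
    kind: str        # "Issue" or "Pull Request"
    state: str       # e.g. "open" or "closed"
    author: str
    age: str         # relative age as shown, e.g. "about 1 month ago"
    comments: int    # 0 when the listing shows no comment count
    labels: List[str]

# "#741 - Extraction: move `max_tree_size` to config file"
TITLE_RE = re.compile(r"^#(\d+) - (.+)$")

# "Pull Request - State: closed - Opened by adbar 4 days ago - 1 comment"
META_RE = re.compile(
    r"^(Issue|Pull Request) - State: (\w+) - Opened by (\S+) (.+? ago)"
    r"(?: - (\d+) comments?)?$"
)

def parse_listing(text: str) -> List[Entry]:
    """Parse title / metadata / optional labels triples, skipping other lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries: List[Entry] = []
    i = 0
    while i < len(lines):
        title = TITLE_RE.match(lines[i])
        meta = META_RE.match(lines[i + 1]) if i + 1 < len(lines) else None
        if not (title and meta):
            i += 1  # header text or an unrecognized line: move on
            continue
        j = i + 2
        labels: List[str] = []
        if j < len(lines) and lines[j].startswith("Labels:"):
            # Assumption: several labels would be comma-separated.
            raw = lines[j][len("Labels:"):]
            labels = [part.strip() for part in raw.split(",") if part.strip()]
            j += 1
        entries.append(Entry(
            number=int(title.group(1)),
            title=title.group(2),
            kind=meta.group(1),
            state=meta.group(2),
            author=meta.group(3),
            age=meta.group(4),
            comments=int(meta.group(5) or 0),
            labels=labels,
        ))
        i = j
    return entries
```

With records in this shape, a filter such as `[e for e in parse_listing(text) if e.state == "open" and e.kind == "Issue"]` would list the issues still open.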