Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / adbar/trafilatura issues and pull requests

#639 - Can I get an extracted element's CSS selector?

Issue - State: closed - Opened by theabhinavdas 4 months ago - 2 comments
Labels: question

#638 - Update crawls.rst typo: `known` is an unexpected argument

Pull Request - State: closed - Opened by tommytyc 5 months ago - 1 comment

#637 - build(deps): bump the dependencies group with 2 updates

Pull Request - State: closed - Opened by dependabot[bot] 5 months ago - 1 comment
Labels: dependencies

#636 - links/urls are not appearing using extract

Issue - State: closed - Opened by alroythalus 5 months ago - 1 comment
Labels: feedback

#635 - fix: avoid faulty readability_lxml content

Pull Request - State: closed - Opened by adbar 5 months ago - 1 comment

#634 - some extraction duplicated in xml

Issue - State: open - Opened by fortyfourforty 5 months ago - 3 comments
Labels: question

#633 - Account for empty cells in table extraction (xml)

Issue - State: open - Opened by fortyfourforty 5 months ago - 3 comments
Labels: enhancement

#632 - weird xml extraction

Issue - State: closed - Opened by fortyfourforty 5 months ago - 2 comments
Labels: bug

#631 - prepare v1.11.0

Pull Request - State: closed - Opened by adbar 5 months ago - 1 comment

#630 - Deprecate Python 3.6 & 3.7

Issue - State: closed - Opened by adbar 5 months ago
Labels: maintenance

#629 - Deprecate GUI in its current form (Gooey)

Issue - State: closed - Opened by adbar 5 months ago
Labels: maintenance

#628 - extraction: simplify XML handling

Pull Request - State: closed - Opened by adbar 5 months ago - 1 comment

#627 - Sometimes, html tags remain on the string

Issue - State: closed - Opened by masylum 5 months ago - 2 comments
Labels: feedback

#626 - Error parsing non-English web pages

Issue - State: closed - Opened by vodkaslime 5 months ago - 2 comments
Labels: question

#625 - deduplication: shorter, more efficient code

Pull Request - State: closed - Opened by adbar 5 months ago - 1 comment

#624 - review spider code, add types and tests

Pull Request - State: closed - Opened by adbar 5 months ago - 1 comment

#623 - downloads: review code, tests, and add types

Pull Request - State: closed - Opened by adbar 5 months ago - 1 comment

#622 - Parts of article block are sometimes not being extracted

Issue - State: closed - Opened by naktinis 5 months ago - 3 comments
Labels: feedback

#621 - trafilatura.fetch_url Timeout is set but does not work

Issue - State: closed - Opened by Storm0921 5 months ago - 2 comments
Labels: question

#620 - metadata: simplify code and tests, add typing

Pull Request - State: closed - Opened by adbar 5 months ago - 1 comment

#619 - baseline: review extractor sequence, JSON parsing, and cleaning

Pull Request - State: closed - Opened by adbar 5 months ago - 1 comment

#618 - docs: update and extend

Pull Request - State: closed - Opened by adbar 5 months ago - 1 comment

#617 - Footer removal

Issue - State: closed - Opened by hamsarajan 5 months ago - 1 comment
Labels: bug

#616 - Image/Video caption and credits removal

Issue - State: open - Opened by hamsarajan 5 months ago - 3 comments
Labels: question, documentation

#615 - extraction: fix processing syntax and simplify code

Pull Request - State: closed - Opened by adbar 5 months ago - 1 comment

#614 - extraction: add HTML as output format

Pull Request - State: closed - Opened by adbar 5 months ago - 1 comment

#613 - use with_metadata argument as switch

Pull Request - State: closed - Opened by adbar 6 months ago - 2 comments

#611 - build(deps): bump the dependencies group with 5 updates

Pull Request - State: closed - Opened by dependabot[bot] 6 months ago - 1 comment
Labels: dependencies

#610 - It's set include_images=True, but there is no picture

Issue - State: open - Opened by dark2star 6 months ago - 5 comments
Labels: bug

#609 - Remove HTML doc pages from package and add instructions to build them

Issue - State: closed - Opened by adbar 6 months ago
Labels: documentation, maintenance

#608 - prepare version 1.10.0

Pull Request - State: closed - Opened by adbar 6 months ago - 1 comment

#607 - CLI fix: read standard input as binary

Pull Request - State: closed - Opened by adbar 6 months ago - 1 comment

#606 - Evaluation adjusted

Pull Request - State: closed - Opened by LydiaKoerber 6 months ago - 1 comment

#605 - CLI fixes: file processing options, mtime, and tests

Pull Request - State: closed - Opened by adbar 6 months ago - 1 comment

#604 - New port of readability.js?

Issue - State: open - Opened by zirkelc 6 months ago - 4 comments
Labels: question

#603 - fix typos

Pull Request - State: closed - Opened by RainRat 6 months ago - 2 comments

#602 - Enhancement using LLM based approach

Issue - State: closed - Opened by alroythalus 6 months ago - 1 comment

#601 - Markdown table fixes

Pull Request - State: closed - Opened by naktinis 6 months ago - 8 comments

#600 - Unordered list markdown syntax is incorrect

Issue - State: closed - Opened by naktinis 6 months ago - 2 comments

#599 - Table markdown syntax incorrect in some cases

Issue - State: closed - Opened by naktinis 6 months ago - 2 comments
Labels: bug

#598 - fix: list spacing in TXT output

Pull Request - State: closed - Opened by adbar 6 months ago - 1 comment

#597 - <li> tag output in TXT

Issue - State: closed - Opened by ethael 6 months ago - 1 comment
Labels: bug

#596 - Add option to provide XPaths for content extraction

Issue - State: open - Opened by klvbdmh 6 months ago - 2 comments
Labels: enhancement

#595 - `utils.decode_file()`: add switch for full detection or GZip only

Issue - State: open - Opened by adbar 6 months ago
Labels: enhancement

#594 - downloads: fix deflate and add optional zstd to accepted encodings

Pull Request - State: closed - Opened by adbar 6 months ago - 1 comment

#593 - setup: update justext and lxml dependencies

Pull Request - State: closed - Opened by adbar 6 months ago - 1 comment

#591 - simplify code: unique function for length tests

Pull Request - State: closed - Opened by adbar 6 months ago - 1 comment

#590 - spider fix: use internal download utilities for robots.txt

Pull Request - State: closed - Opened by adbar 6 months ago - 1 comment

#589 - focused_crawl returns nothing

Issue - State: closed - Opened by bezir 6 months ago - 6 comments
Labels: feedback

#588 - <main> Content gets missed out

Issue - State: closed - Opened by alroythalus 6 months ago - 1 comment
Labels: feedback

#587 - Port of is_probably_readerable from mozilla

Pull Request - State: closed - Opened by zirkelc 6 months ago - 15 comments

#586 - Extracting content from an URL is getting none

Issue - State: open - Opened by Fabiha15 7 months ago - 1 comment
Labels: question

#585 - Wrong links position in text from telegram post

Issue - State: open - Opened by RedHotUnicorn 7 months ago - 2 comments
Labels: question

#584 - Removing related links at end of article/sidebar on news websites?

Issue - State: open - Opened by rahulbot 7 months ago - 3 comments
Labels: bug

#583 - Simple content scoring prototype

Pull Request - State: closed - Opened by zirkelc 7 months ago - 8 comments

#582 - re-group classes and functions linked to deduplication

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#581 - breaking: raise errors on deprecated CLI and function arguments

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#580 - prepare version 1.9.0

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#579 - build(deps): bump the dependencies group with 4 updates

Pull Request - State: closed - Opened by dependabot[bot] 7 months ago - 1 comment
Labels: dependencies

#578 - docs: general update

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#577 - Update XML-TEI reference data

Issue - State: closed - Opened by adbar 7 months ago
Labels: maintenance

#576 - Regroup deduplication functions in same submodule

Issue - State: closed - Opened by adbar 7 months ago
Labels: documentation, maintenance

#575 - maintenance: reflect latest courlan changes

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#574 - tests: upgrade Python versions

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#573 - Extract text from buttons for semantic elements

Issue - State: open - Opened by zirkelc 7 months ago - 1 comment
Labels: question

#572 - Question: check if page is readable?

Issue - State: closed - Opened by zirkelc 7 months ago - 9 comments
Labels: question

#571 - extractor: improve recall preset

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#570 - fix download tests

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#569 - Content extraction failure on dozens of related sites

Issue - State: closed - Opened by praveng 7 months ago - 4 comments
Labels: bug

#568 - Content failed to be extracted

Issue - State: closed - Opened by alroythalus 7 months ago - 1 comment

#567 - metadata: add author XPaths

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#566 - No timeout in urllib.robotparser with focused_crawler

Issue - State: closed - Opened by JER-CE 7 months ago - 2 comments
Labels: bug

#565 - CLI & downloads: revamp options and make sure they are used

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#564 - docs: convert readme to markdown

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#563 - fix: table cell separators in non-XML output

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#562 - Markdown tables have incorrect format

Issue - State: closed - Opened by zirkelc 7 months ago - 1 comment
Labels: bug

#561 - metadata: add file creation date (date extraction, JSON & XML-TEI)

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#560 - Use `with_metadata` parameter to decide whether to run metadata extraction

Issue - State: closed - Opened by adbar 7 months ago
Labels: enhancement

#559 - Why lzma for data compression?

Issue - State: closed - Opened by Yomguithereal 7 months ago - 6 comments
Labels: maintenance

#558 - Scraping websites which are protected by WAF

Issue - State: closed - Opened by thebigbone 7 months ago - 7 comments
Labels: question

#557 - Readme.md table is broken.

Issue - State: closed - Opened by AnishPimpley 7 months ago - 1 comment
Labels: bug

#556 - restructure code

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#555 - strikethrough text is returned as normal

Issue - State: closed - Opened by snarb 7 months ago - 1 comment
Labels: question

#554 - fix: raise error if config file does not exist

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#553 - Preserve horizontal space in code blocks

Issue - State: open - Opened by mittsommer 7 months ago - 3 comments
Labels: enhancement

#552 - add global options object for extraction and use it in CLI

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#551 - Scraping directly from wayback machine (newbie question)

Issue - State: closed - Opened by scaramouche88 7 months ago - 6 comments
Labels: question

#550 - add markdown as explicit output

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#549 - maintenance: deprecate `process_record()`

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#548 - fix: better encoding detection

Pull Request - State: closed - Opened by adbar 7 months ago - 1 comment

#547 - speedup for readability-lxml

Pull Request - State: closed - Opened by adbar 8 months ago - 1 comment

#546 - Refactor and improve readability-lxml syntax

Issue - State: closed - Opened by adbar 8 months ago
Labels: enhancement

#545 - Fixed Extraction when Meta tag has an empty content

Pull Request - State: closed - Opened by felipehertzer 8 months ago - 4 comments

#544 - Respect no_fallback

Pull Request - State: closed - Opened by co-odw 8 months ago - 3 comments

#543 - refactoring: simplify code

Pull Request - State: closed - Opened by adbar 8 months ago - 1 comment

#542 - eval: review code, add guidelines and small benchmark

Pull Request - State: closed - Opened by adbar 8 months ago - 1 comment

#541 - Wrong encoding detected: gb2312

Issue - State: closed - Opened by s-jse 8 months ago - 3 comments
Labels: bug

#540 - Fixed bug with @data-testid and removed some classes

Pull Request - State: closed - Opened by felipehertzer 8 months ago - 14 comments

#539 - prepare version 1.8.1

Pull Request - State: closed - Opened by adbar 8 months ago - 1 comment

#538 - Make cascade of different content extractors explicit and configurable

Issue - State: open - Opened by adbar 8 months ago
Labels: enhancement