Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / adbar/trafilatura issues and pull requests

#435 - Improve SEO by adding sitemap to sphinx docs

Pull Request - State: closed - Opened by tonyyanga about 1 year ago - 3 comments

#434 - add htmldate extensive search to config

Pull Request - State: closed - Opened by adbar about 1 year ago - 1 comment

#433 - trafilatura fails extracting

Issue - State: closed - Opened by tejeshbhalla about 1 year ago - 7 comments
Labels: question

#432 - Entire/majority content of these 2 sites being missed out

Issue - State: open - Opened by alroythalus about 1 year ago - 4 comments
Labels: enhancement

#431 - List items are being missed

Issue - State: open - Opened by alroythalus about 1 year ago - 9 comments
Labels: bug

#430 - Parts are getting missed out after using extract function

Issue - State: open - Opened by alroythalus about 1 year ago - 1 comment
Labels: enhancement

#429 - preserve space in certain elements

Pull Request - State: closed - Opened by idoshamun about 1 year ago - 20 comments

#428 - docs: update and extend

Pull Request - State: closed - Opened by adbar about 1 year ago - 1 comment

#427 - crawl only sub-pages from an arbitrary URL?

Issue - State: closed - Opened by pchalasani about 1 year ago - 3 comments
Labels: question

#426 - Unable to extract text from a given site, TypeError: unhashable type: 'set'

Issue - State: closed - Opened by noobistz about 1 year ago - 5 comments

#425 - Test on Python 3.12 production release

Pull Request - State: closed - Opened by cclauss about 1 year ago - 6 comments

#424 - build(deps): bump trafilatura from 1.5.0 to 1.6.2

Pull Request - State: closed - Opened by dependabot[bot] about 1 year ago - 2 comments
Labels: dependencies

#423 - Inconsistent behavior on macOS

Issue - State: closed - Opened by p-linnane about 1 year ago - 3 comments
Labels: feedback

#422 - Multiple spaces within a text element are not supported

Issue - State: closed - Opened by idoshamun about 1 year ago - 5 comments
Labels: enhancement

#420 - Error when multiprocessing

Issue - State: closed - Opened by fortyfourforty about 1 year ago - 1 comment

#419 - docs: fix quickstart

Pull Request - State: closed - Opened by sashkab about 1 year ago - 1 comment

#417 - build(deps): bump resiliparse from 0.14.3 to 0.14.5

Pull Request - State: closed - Opened by dependabot[bot] about 1 year ago - 2 comments
Labels: dependencies

#416 - build(deps): bump news-please from 1.5.22 to 1.5.35

Pull Request - State: closed - Opened by dependabot[bot] about 1 year ago - 2 comments
Labels: dependencies

#415 - prepare v1.6.2

Pull Request - State: closed - Opened by adbar about 1 year ago - 1 comment

#414 - added possibility to prune xPaths

Pull Request - State: closed - Opened by HeLehm about 1 year ago - 5 comments

#413 - Error installing trafilatura on playwright focal image

Issue - State: open - Opened by jaekunchoi about 1 year ago - 1 comment
Labels: question

#412 - Consider switching from lxml's clean_html for enhanced security (and possibly performance)

Issue - State: closed - Opened by frenzymadness about 1 year ago - 1 comment
Labels: enhancement

#411 - include_links breaks the extraction for https://news.ycombinator.com

Issue - State: open - Opened by shivanker about 1 year ago - 2 comments
Labels: bug

#410 - Returns horribly bad result for MSN page

Issue - State: open - Opened by TheRabidWolverine about 1 year ago - 1 comment
Labels: bug

#409 - Installation problem on Mac due to charset version mismatch

Issue - State: closed - Opened by TheRabidWolverine about 1 year ago - 1 comment

#408 - maintenance: simplify code

Pull Request - State: closed - Opened by adbar about 1 year ago - 1 comment

#407 - docs: fix typo in usage-python.rst

Pull Request - State: closed - Opened by eltociear about 1 year ago - 1 comment

#406 - Some language tidy-ups

Pull Request - State: closed - Opened by marksmayo over 1 year ago - 3 comments

#405 - Web API idea

Issue - State: closed - Opened by clach04 over 1 year ago - 4 comments
Labels: enhancement

#404 - Corrupted Markdown output when TXT+formatting

Issue - State: closed - Opened by clach04 over 1 year ago - 2 comments
Labels: bug

#403 - Use of Signal Prevents Multithreading on Linux

Issue - State: closed - Opened by simplexx over 1 year ago - 1 comment

#402 - Question about the title

Issue - State: open - Opened by pieterhartel over 1 year ago - 5 comments
Labels: question

#401 - improve code support

Pull Request - State: closed - Opened by idoshamun over 1 year ago - 20 comments

#400 - Empty h1 blocks non-empty h2

Issue - State: open - Opened by pieterhartel over 1 year ago - 1 comment
Labels: bug

#399 - build(deps): bump lxml from 4.9.2 to 4.9.3

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago - 2 comments
Labels: dependencies

#398 - build(deps): bump goose3 from 3.1.13 to 3.1.17

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago - 2 comments
Labels: dependencies

#397 - author metadata field is null for YouTube videos

Issue - State: closed - Opened by basilioss over 1 year ago - 2 comments
Labels: enhancement

#396 - `included_images` failed when trying to extract images in a table

Issue - State: open - Opened by ChangyaoTian over 1 year ago - 7 comments
Labels: bug

#395 - Redirecting https://twitter.com

Issue - State: closed - Opened by proteusbr1 over 1 year ago - 1 comment
Labels: question

#393 - fix: pinned LXML version for MacOS

Pull Request - State: closed - Opened by adbar over 1 year ago

#392 - add checks to probing mode

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#391 - Is it possible to get the metadata with markdown format?

Issue - State: closed - Opened by charleshan over 1 year ago - 1 comment
Labels: enhancement

#390 - Code tags are not parsed properly

Issue - State: closed - Opened by charleshan over 1 year ago - 3 comments
Labels: question

#389 - courlan changes: adapt parameter and tests

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#388 - Image markdown not included during processing

Issue - State: open - Opened by kianwilcox over 1 year ago - 5 comments
Labels: bug

#387 - build(deps): bump goose3 from 3.1.13 to 3.1.16

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago - 2 comments
Labels: dependencies

#386 - build(deps): bump trafilatura from 1.5.0 to 1.6.1

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago - 2 comments
Labels: dependencies

#385 - Code example for Multi-Threaded downloads seems out of date

Issue - State: closed - Opened by github-mickael-leclerc over 1 year ago - 4 comments
Labels: documentation

#384 - remove signal from core and use on CLI only

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#383 - Fix JSON-LD list on sitename

Pull Request - State: closed - Opened by felipehertzer over 1 year ago - 2 comments

#382 - Check URLs passed to courlan functions `extract_links` and `fix_relative_urls`

Issue - State: open - Opened by adbar over 1 year ago - 1 comment
Labels: question

#381 - CLI: more robust processing with chunks

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#380 - setup: fix and update CI workflows

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#379 - Doesn't seem to work with recent charset-normalizer

Issue - State: closed - Opened by Stevod over 1 year ago - 2 comments
Labels: feedback

#378 - CLI: add option to probe for extractable content, more robust downloads and html2txt

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#377 - Convert relative URLs in links to absolute by default

Pull Request - State: closed - Opened by feltcat over 1 year ago - 7 comments

#376 - Option to convert relative links

Issue - State: closed - Opened by feltcat over 1 year ago - 1 comment
Labels: enhancement

#375 - XMLSyntaxError during conversion to XML output

Issue - State: closed - Opened by fortyfourforty over 1 year ago - 2 comments
Labels: bug

#374 - metadata extraction problem

Issue - State: closed - Opened by fortyfourforty over 1 year ago - 8 comments

#372 - improve code block support

Pull Request - State: closed - Opened by idoshamun over 1 year ago - 4 comments

#371 - prepare version 1.6.1

Pull Request - State: closed - Opened by adbar over 1 year ago

#370 - more efficient HTML parsing code

Pull Request - State: closed - Opened by adbar over 1 year ago - 4 comments

#369 - Function to use part of the heuristics on bare HTML fragments

Issue - State: open - Opened by adbar over 1 year ago
Labels: enhancement

#368 - Improving JSON tests

Pull Request - State: closed - Opened by felipehertzer over 1 year ago - 2 comments

#367 - Gooey dependency seems unmaintained and broken

Issue - State: open - Opened by tkapias over 1 year ago - 1 comment
Labels: wontfix

#366 - More robust backup parsing

Issue - State: closed - Opened by adbar over 1 year ago - 1 comment
Labels: enhancement

#365 - metadata fixes: authors, JSON parser, Unicode

Pull Request - State: closed - Opened by felipehertzer over 1 year ago - 7 comments

#364 - docs roundup

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#363 - [#362] Fix metadata extraction w/o 'additionalName' field

Pull Request - State: closed - Opened by awwitecki over 1 year ago - 3 comments

#362 - Unable to extract metadata w/o authors `additionalName`

Issue - State: closed - Opened by awwitecki over 1 year ago

#361 - build(deps): bump trafilatura from 1.5.0 to 1.6.0

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago - 1 comment
Labels: dependencies

#360 - adopt latest courlan changes

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#359 - fix spider init

Pull Request - State: closed - Opened by adbar over 1 year ago

#358 - Restrictions on Web Crawling

Issue - State: closed - Opened by conceptofmind over 1 year ago - 2 comments
Labels: bug

#357 - extraction: bypass for tables in figures (#301)

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#356 - minor extraction fixes

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#355 - [1.6.0] New Content Hashes

Issue - State: closed - Opened by felipehertzer over 1 year ago - 2 comments
Labels: documentation

#354 - Cannot extract Heading tags

Issue - State: closed - Opened by fortyfourforty over 1 year ago - 26 comments
Labels: bug

#353 - fix: relax constraints on spider tests

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#352 - simplify code for JSON metadata extraction

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#351 - Codeblock Markdown formatting is missing

Issue - State: closed - Opened by niksite over 1 year ago - 3 comments
Labels: enhancement

#350 - sitemaps: use class and simplify code structure

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#348 - prepare v1.6.0

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#347 - review logging levels

Pull Request - State: closed - Opened by adbar over 1 year ago

#346 - Unnecessary Comments LOG INFO

Issue - State: closed - Opened by andremacola over 1 year ago - 1 comment
Labels: enhancement

#345 - upgrade dependencies to allow for urllib3 v2

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#344 - build(deps): bump beautifulsoup4 from 4.12.1 to 4.12.2

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago - 1 comment
Labels: dependencies

#343 - build(deps): update urllib3 requirement from <2,>=1.26 to >=1.26,<3

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago - 1 comment
Labels: dependencies

#342 - build(deps): bump goose3 from 3.1.13 to 3.1.14

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago - 1 comment
Labels: dependencies

#341 - build(deps): bump news-please from 1.5.22 to 1.5.33

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago - 1 comment
Labels: dependencies

#340 - settings: upper bound on links examined

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#339 - review url blacklisting

Pull Request - State: closed - Opened by adbar over 1 year ago

#338 - CLI: more efficient downloads

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#337 - Simple HTML processing issue?

Issue - State: closed - Opened by alroythalus over 1 year ago - 4 comments
Labels: question

#336 - feeds & sitemaps: check domain similarity

Pull Request - State: closed - Opened by adbar over 1 year ago - 1 comment

#335 - Doesn't detect bullet points within tables

Issue - State: closed - Opened by alroythalus over 1 year ago - 5 comments
Labels: enhancement

#334 - Paras get broken up into fragments

Issue - State: closed - Opened by alroythalus over 1 year ago - 3 comments

#333 - Headers with classes don't get detected

Issue - State: closed - Opened by alroythalus over 1 year ago - 1 comment

#332 - feat: use proxy to extract data

Pull Request - State: closed - Opened by andremacola over 1 year ago - 7 comments
Labels: feedback

#331 - Update core.py

Pull Request - State: closed - Opened by Korben00 over 1 year ago - 3 comments
Labels: feedback