Useful packages in Python

Dec 5, 2022 00:00 · 395 words · 2 minute read python

A collection of niche but frequently used Python packages and other software.

PYTHON

pynvml

Provides a Python interface to GPU management and monitoring functions. This is a wrapper around the NVML library. For information about the NVML library, see the NVML developer page: http://developer.nvidia.com/nvidia-management-library-nvml. As of version 11.0.0, the NVML wrappers used in pynvml are identical to those published through nvidia-ml-py. Note that the pynvml README can be run with ‘python -m doctest -v README.txt’, although the results are system dependent.

Usage:

You can use the lower-level NVML bindings:

>>> from pynvml import *
>>> nvmlInit()
>>> print("Driver Version:", nvmlSystemGetDriverVersion())
Driver Version: 410.00
>>> deviceCount = nvmlDeviceGetCount()
>>> for i in range(deviceCount):
...     handle = nvmlDeviceGetHandleByIndex(i)
...     print("Device", i, ":", nvmlDeviceGetName(handle))
...
Device 0 : Tesla V100

>>> nvmlShutdown()

Or the higher-level nvidia_smi API:

from pynvml.smi import nvidia_smi

nvsmi = nvidia_smi.getInstance()
# Query free and total memory for every GPU
nvsmi.DeviceQuery('memory.free, memory.total')
# Show the query fields that DeviceQuery accepts
print(nvsmi.DeviceQuery('--help-query-gpu'))

TQDM

tqdm derives from the Arabic word taqaddum (تقدّم) which can mean “progress,” and is an abbreviation for “I love you so much” in Spanish (te quiero demasiado). Instantly make your loops show a smart progress meter - just wrap any iterable with tqdm(iterable), and you’re done!

Usage:

from tqdm import tqdm
for i in tqdm(range(10000)):
    ...

Change the bar color by passing colour='red' to tqdm. For manual updates, see:

https://tqdm.github.io/docs/tqdm/#update
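The linked page documents update(), which is useful when the loop cannot be wrapped directly. Below is a minimal sketch that combines manual updates with the colour option; the function name sum_squares and the workload are made up for illustration.

```python
from tqdm import tqdm


def sum_squares(n: int) -> int:
    """Sum i*i for i in range(n) while driving the progress bar by hand."""
    total = 0
    # total=n tells tqdm how many steps to expect; colour sets the bar color.
    with tqdm(total=n, colour="red", desc="squares") as bar:
        for i in range(n):
            total += i * i
            bar.update(1)  # advance the bar by one step
    return total


result = sum_squares(1000)
print(result)
```

The context manager closes the bar cleanly even if the loop raises.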

PARAMIKO

Paramiko is a pure-Python (2.7, 3.4+) implementation of the SSHv2 protocol, providing both client and server functionality. It provides the foundation for the high-level SSH library Fabric, which is what the Paramiko authors recommend for common client use cases such as running remote shell commands or transferring files.

viztracer

https://github.com/gaogaotiantian/viztracer

VizTracer is a low-overhead logging/debugging/profiling tool that can trace and visualize your Python code execution.

The front-end UI is powered by Perfetto. Use “AWSD” to zoom/navigate. More help can be found in “Support - Controls”.


CDX toolkit

https://github.com/cocrawler/cdx_toolkit/

Related links

skeptric - Searching 100 Billion Webpages Pages With Capture Index

cc-pyspark

https://github.com/commoncrawl/cc-pyspark

Works with the columnar URL index.

The columnar index files can be used to filter for Chinese-language text.
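A minimal sketch of that filtering step, in plain Python rather than Spark. It assumes each index row exposes a content_languages field holding comma-separated ISO 639-3 codes such as "zho,eng"; check this against the actual index schema before relying on it. The sample rows and the helper name filter_by_language are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Optional

Row = Dict[str, Optional[str]]


def filter_by_language(rows: Iterable[Row], code: str = "zho") -> Iterator[Row]:
    """Yield rows whose content_languages list contains the given code."""
    for row in rows:
        # Assumed format: comma-separated language codes, possibly missing.
        langs = (row.get("content_languages") or "").split(",")
        if code in langs:
            yield row


# Hypothetical sample rows, not real index records.
sample_rows = [
    {"url": "https://example.cn/", "content_languages": "zho,eng"},
    {"url": "https://example.com/", "content_languages": "eng"},
    {"url": "https://example.org/", "content_languages": None},
]

chinese_rows = list(filter_by_language(sample_rows))
print([row["url"] for row in chinese_rows])
```

In a real job the same predicate would be expressed as a filter on the index table so that only matching WARC records are fetched.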

Markdownify

Converts HTML to Markdown.

https://github.com/matthewwithanm/python-markdownify

Deduplicate

Data deduplication

https://github.com/google-research/deduplicate-text-datasets

mrjob (MapReduce): https://mrjob.readthedocs.io/


NETWORK

shadowsocks-local

proxychains4

DISPLAY

x11vnc

Common Crawl

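There are 8 kinds of files in total.

WARC files hold the raw responses, including the HTML; WAT files hold metadata extracted from the WARC records; WET files hold the extracted plain text; the URL index and columnar URL index are index files.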

| | File List | #Files | Total Size Compressed (TiB) |
|---|---|---|---|
| Segments | CC-MAIN-2022-49/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2022-49/warc.paths.gz | 88000 | 92.59 |
| WAT files | CC-MAIN-2022-49/wat.paths.gz | 88000 | 22.89 |
| WET files | CC-MAIN-2022-49/wet.paths.gz | 88000 | 9.58 |
| Robots.txt files | CC-MAIN-2022-49/robotstxt.paths.gz | 88000 | 0.15 |
| Non-200 responses files | CC-MAIN-2022-49/non200responses.paths.gz | 88000 | 2.43 |
| URL index files | CC-MAIN-2022-49/cc-index.paths.gz | 302 | 0.25 |
| Columnar URL index files | CC-MAIN-2022-49/cc-index-table.paths.gz | 900 | 0.28 |
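Each file list above is a gzip-compressed text file with one relative path per line. The sketch below turns such a listing into download URLs. It assumes the public base URL https://data.commoncrawl.org/ and that listings live under crawl-data/. The sample path is a placeholder, not a real file, and the helper names are made up.

```python
import gzip
import io
from typing import List

BASE_URL = "https://data.commoncrawl.org/"  # assumed public download host


def paths_url(crawl: str, kind: str) -> str:
    """URL of a file list, e.g. kind='warc' for warc.paths.gz."""
    return f"{BASE_URL}crawl-data/{crawl}/{kind}.paths.gz"


def read_paths(gz_bytes: bytes) -> List[str]:
    """Decompress a *.paths.gz listing and return full download URLs."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as fh:
        # Skip blank lines; every other line is a path relative to BASE_URL.
        return [BASE_URL + line.strip() for line in fh if line.strip()]


# In-memory stand-in for a downloaded listing (placeholder path).
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2022-49/segments/EXAMPLE/warc/EXAMPLE-00000.warc.gz\n"
)

print(paths_url("CC-MAIN-2022-49", "warc"))
print(read_paths(sample))
```

Working from the listing makes it easy to sample a handful of files instead of downloading a whole crawl.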

https://github.com/cocrawler/cdx_toolkit/blob/main/examples/iter-and-warc.py