My Note / Zeliang YAO

Collections

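Based on what I've learned so far, these types come up most often:

  • defaultdict (a dict subclass that calls a factory function to supply missing values)

  • Counter (a dict subclass for counting hashable objects)

  • deque (a list-like container that can be modified from both ends)

  • namedtuple (a factory function for creating tuple subclasses with named fields)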

defaultdict

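Basic concept

"defaultdict" is a container defined in the "collections" module. It takes a function (the default factory) as its first argument. The factory is optional and defaults to None, in which case a missing key raises KeyError just like a plain dict. Passing "int" gives a default of 0. When a key is missing, defaultdict calls the factory, stores the result under that key, and returns it.

In short, it is a dict that does not raise an error when a key cannot be found.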
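
A minimal sketch of the behaviour described above, using int as the factory. The variable names are illustrative only. It also shows a detail that is easy to miss: indexing a missing key stores the default in the dict, whereas .get() does not.

```python
from collections import defaultdict

word_counts = defaultdict(int)      # int() returns 0, so missing keys start at 0
for word in ["a", "b", "a"]:
    word_counts[word] += 1

missing = word_counts["zzz"]        # calls int(), stores 0 under "zzz", returns it
has_key_after_lookup = "zzz" in word_counts   # the lookup inserted the key

peeked = word_counts.get("yyy")     # .get() bypasses the factory and returns None
has_key_after_get = "yyy" in word_counts      # .get() did not insert anything

no_factory = defaultdict()          # default_factory is None here
try:
    no_factory["x"]
    raised = False
except KeyError:
    raised = True                   # without a factory it behaves like a plain dict
```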

Examples

Create a dict called person with the keys name and age. Accessing person['city'] then raises KeyError, because there is no city key:

person = {'name':'xiaobai','age':18}
print ("The value of key  'name' is : ",person['name'])
print ("The value of key  'city' is : ",person['city'])

Out: The value of key  'name' is :  xiaobai
Traceback (most recent call last):
  File "C:\Users\E560\Desktop\test.py", line 17, in <module>
    print ("The value of key  'city' is : ",person['city'])
KeyError: 'city'

Now try the same thing with defaultdict:

from collections import defaultdict
person = defaultdict(lambda : 'Key Not found') # every missing key defaults to 'Key Not found'

person['name'] = 'xiaobai'
person['age'] = 18

print ("The value of key  'name' is : ",person['name'])
print ("The value of key  'city' is : ",person['city'])

Out:The value of key  'name' is :  xiaobai
     The value of key  'city' is :  Key Not found

Beyond that, because the factory passed at creation supplies the default value for every missing key, we can build other behaviour on top of it. For example:

from collections import defaultdict
d = defaultdict(list)
d['person'].append("xiaobai")
d['city'].append("paris")
d['person'].append("student")

for i in d.items():
    print(i)

Out: ('person', ['xiaobai', 'student'])
     ('city', ['paris'])

Since every key defaults to a list, we can call list's append method right away without initializing the key first. Another example:

from collections import defaultdict
food = (
    ('jack', 'milk'),
    ('Ann', 'fruits'),
    ('Arham', 'ham'),
    ('Ann', 'soda'),
    ('jack', 'dumplings'),
    ('Ahmed', 'fried chicken'),
)

favourite_food = defaultdict(list)

for n, f in food:
    favourite_food[n].append(f)

print(favourite_food)

Out:defaultdict(<class 'list'>, {'jack': ['milk', 'dumplings'], 'Ann': ['fruits', 'soda'], 'Arham': ['ham'], 'Ahmed': ['fried chicken']})

from collections import defaultdict
pets = [
    ("dog", "Affenpinscher"),
    ("dog", "Terrier"),
    ("dog", "Boxer"),
    ("cat", "Abyssinian"),
    ("cat", "Birman"),
]

group_pets = defaultdict(list)

for pet, breed in pets:
    group_pets[pet].append(breed)

for pet, breeds in group_pets.items():
    print(pet, "->", breeds)

dog -> ['Affenpinscher', 'Terrier', 'Boxer']
cat -> ['Abyssinian', 'Birman']
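
For comparison, the same grouping can be written with a plain dict and setdefault. This is only a sketch to show what defaultdict(list) saves you from writing; the pets data is a shortened copy of the list above.

```python
pets = [("dog", "Terrier"), ("cat", "Birman"), ("dog", "Boxer")]

grouped = {}
for pet, breed in pets:
    # setdefault returns the existing list, or inserts a new empty list and returns it
    grouped.setdefault(pet, []).append(breed)
```

The defaultdict version avoids building a throwaway empty list on every iteration and reads more directly.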

Counter

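Basic concept

Counter is a dict subclass that works as a counter.

It returns a dict-like object whose keys are the elements and whose values are the number of times each element occurs.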
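
A short sketch of two behaviours that distinguish Counter from a plain dict: most_common() returns the highest counts first, and looking up a missing element returns 0 instead of raising KeyError. The input string is just an illustration.

```python
from collections import Counter

letters = Counter("banana")            # counts each character
top_two = letters.most_common(2)       # highest counts first
missing = letters["z"]                 # missing elements count as 0, no KeyError
stored = "z" in letters                # the lookup does not insert the key
```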

Examples

from collections import Counter

count_list = Counter(['B','B','A','B','C','A','B','B','A','C'])  # count the items of a list
print (count_list)


count_tuple = Counter((2,2,2,3,1,3,1,1,1))  # count the items of a tuple
print(count_tuple)

Out:Counter({'B': 5, 'A': 3, 'C': 2})
     Counter({1: 4, 2: 3, 3: 2})

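It also works on a DataFrame column, although pandas has its own value_counts() method:

import pandas as pd
from collections import Counter

df= pd.DataFrame({'name':['a','b','c','a','a','b'],'value':[1,2,3,4,5,6]})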

counter_result = Counter(df['name'])
counter_result

Out:Counter({'a': 3, 'b': 2, 'c': 1})

# df['frequency'] =[ counter_result[n] for n in df['name'] ] 
df['frequency'] = df['name'].map(df['name'].value_counts())

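Counter is generally not used to count the contents of a dict or a set, because dict keys are unique and a set cannot contain duplicates.

We can also count the food tuple used in the defaultdict example directly: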

from collections import Counter
food = (
    ('jack', 'milk'),
    ('Ann', 'fruits'),
    ('Arham', 'ham'),
    ('Ann', 'soda'),
    ('jack', 'dumplings'),
    ('Ahmed', 'fried chicken'),
)

favourite_food_count = Counter(n for n,f in food)  # count how often each name appears
print(favourite_food_count)

Out: Counter({'jack': 2, 'Ann': 2, 'Arham': 1, 'Ahmed': 1})

subtract:

from collections import Counter

inventory = Counter(dogs=23, cats=14, pythons=7)

adopted = Counter(dogs=2, cats=5, pythons=1)
inventory.subtract(adopted)
inventory

=>Counter({'dogs': 21, 'cats': 9, 'pythons': 6})


new_pets = {"dogs": 4, "cats": 1}
inventory.update(new_pets)
inventory
=>Counter({'dogs': 25, 'cats': 10, 'pythons': 6})

new_pets = {"dogs": 4, "pythons": 2}
inventory += new_pets
inventory
=>Counter({'dogs': 29, 'cats': 10, 'pythons': 8})
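
One difference worth knowing: the subtract() method keeps zero and negative counts, while the - operator drops any count that is not positive. A minimal sketch with made-up stock numbers:

```python
from collections import Counter

stock = Counter(apples=3, pears=1)
sold = Counter(apples=1, pears=4)

in_place = Counter(stock)      # work on a copy so stock stays untouched
in_place.subtract(sold)        # keeps zero and negative counts

difference = stock - sold      # drops counts that are not positive

combined = stock + sold        # adds counts key by key
```

Use subtract() when a negative balance is meaningful (for example, oversold items) and - when only what remains matters.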

Deque

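Basic concept

Use deque when you need fast appends and pops at both ends of a container. The way I think of it, a deque is a container you can operate on from both ends. It is similar to a list, but appends and pops at either end are O(1), whereas inserting or removing at the front of a list is O(n).

Examples

deque has many methods, and many operations mirror those of list. It supports indexing, but not slicing.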

from collections import deque
d = deque()
d.append(1)
d.append(2)
d.append(3)

print(len(d))
print(d[0])
print(d[-1])

Out: 3
     1
     3
===============================================
print(deque([1, 2, 3, 4]))
print(deque(range(1, 5)))
print(deque("abcd"))
numbers = {"one": 1, "two": 2, "three": 3, "four": 4}
print(deque(numbers.keys()))
print(deque(numbers.values()))
print(deque(numbers.items()))

deque([1, 2, 3, 4])
deque([1, 2, 3, 4])
deque(['a', 'b', 'c', 'd'])
deque(['one', 'two', 'three', 'four'])
deque([1, 2, 3, 4])
deque([('one', 1), ('two', 2), ('three', 3), ('four', 4)])
======================================
numbers = deque([1, 2, 3, 4])
numbers.popleft()  #1
numbers.popleft()  #2

numbers = deque([1, 2, 3, 4])
numbers.pop()
numbers  # deque([1, 2, 3])


letters = deque("abde")
letters.insert(2, "c")
letters   #deque(['a', 'b', 'c', 'd', 'e'])

letters.remove("d")
letters   # deque(['a', 'b', 'c', 'e'])
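
A deque can be indexed with an integer, but slicing it raises TypeError. A minimal sketch of two workarounds, itertools.islice or copying to a list first:

```python
from collections import deque
from itertools import islice

letters = deque("abcde")

try:
    letters[1:3]                       # slicing is not supported on deque
    sliced_ok = True
except TypeError:
    sliced_ok = False

middle = list(islice(letters, 1, 3))   # items at positions 1 and 2
as_list_slice = list(letters)[1:3]     # alternative: copy to a list, then slice
```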

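The defining feature of deque is that it can be operated on from both ends:

d = deque(range(5))
print(len(d))
# Output: 5

d.popleft()   # remove and return the leftmost item
# Output: 0

d.pop()       # remove and return the rightmost item
# Output: 4

print(d)
# Output: deque([1, 2, 3])

d.append(100)  # add an item at the right end

d.appendleft(-100) # add an item at the left end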

print(d)
# Output: deque([-100, 1, 2, 3, 100])

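A few more common examples. You can set a maximum length when creating a deque. Like list, deque supports extend for concatenation, and it additionally provides extendleft for the left end:

from collections import deque
d = deque([1,2,3,4,5], maxlen=9)  # cap the deque at 9 items
d.extendleft([0])  # extend from the left with a list
d.extend([6,7,8])   # extend from the right with a list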
print(d)

Out:deque([0, 1, 2, 3, 4, 5, 6, 7, 8], maxlen=9)

d now holds 9 items, which equals maxlen=9. From here on, adding an item on the left automatically drops the rightmost item, and vice versa:

d.append(100)
print(d)
d.appendleft(-100)
print(d)

Out: deque([1, 2, 3, 4, 5, 6, 7, 8, 100], maxlen=9)
     deque([-100, 1, 2, 3, 4, 5, 6, 7, 8], maxlen=9)
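
A common use of this discard-on-overflow behaviour is keeping only the last n items of a stream. A minimal sketch; last_n is a hypothetical helper name, not part of the standard library:

```python
from collections import deque

def last_n(iterable, n):
    """Return the final n items of any iterable as a list."""
    # A bounded deque discards the oldest item on each append once it is full,
    # so at most n items are held in memory at any time.
    return list(deque(iterable, maxlen=n))

recent = last_n(range(10), 3)
short_input = last_n("ab", 5)   # fewer than n items: everything is kept
```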

deque has many other uses, for example as the basis of a custom queue class:

# custom_queue.py

from collections import deque

class Queue:
    def __init__(self):
        self._items = deque()

    def enqueue(self, item):
        self._items.append(item)

    def dequeue(self):
        try:
            return self._items.popleft()
        except IndexError:
            raise IndexError("dequeue from an empty queue") from None

    def __len__(self):
        return len(self._items)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        yield from self._items

    def __reversed__(self):
        yield from reversed(self._items)

    def __repr__(self):
        return f"Queue({list(self._items)})"

numbers = Queue()
# Enqueue items
for number in range(1, 5):
    numbers.enqueue(number)

numbers   # Queue([1, 2, 3, 4])

Limiting the length:

four_numbers = deque([0, 1, 2, 3, 4], maxlen=4) # Discard 0
four_numbers  # deque([1, 2, 3, 4])

four_numbers.append(5)  # Automatically remove 1
four_numbers  #deque([2, 3, 4, 5])

four_numbers.append(6)  # Automatically remove 2
four_numbers  # deque([3, 4, 5, 6])

four_numbers.appendleft(2) # Automatically remove 6
four_numbers # deque([2, 3, 4, 5])

Another example: web browsing history

sites = (
    "google.com",
    "yahoo.com",
    "bing.com"
)

pages = deque(maxlen=3)
pages.maxlen #3

for site in sites:
    pages.appendleft(site)
pages   #deque(['bing.com', 'yahoo.com', 'google.com'])

pages.appendleft("facebook.com")
pages.appendleft("twitter.com")
pages  #deque(['twitter.com', 'facebook.com', 'bing.com'])

Namedtuple

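Basic concept

Named tuple. namedtuple turns a tuple into a convenient container: with a namedtuple we no longer need integer indexes to access the members of a tuple.

I like to think of a namedtuple as an immutable dictionary.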
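
The "immutable dictionary" analogy only goes so far. The sketch below (Point is a hypothetical example type) shows that a namedtuple is still a real tuple: it unpacks and indexes by position, its fields cannot be reassigned, and dict-style key access does not work.

```python
from collections import namedtuple

Point = namedtuple("Point", "x y")   # hypothetical example type
p = Point(1, 2)

x, y = p                 # unpacks like a normal tuple
first = p[0]             # integer indexing still works
by_name = p.x            # field access by name

try:
    p.x = 10             # fields cannot be reassigned
    mutated = True
except AttributeError:
    mutated = False

try:
    p["x"]               # unlike a dict, string keys are not accepted
    key_access = True
except TypeError:
    key_access = False
```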

Examples

from collections import namedtuple

Person = namedtuple('Person', 'name age city')        # similar to defining a class
xiaobai = Person(name="xiaobai", age=18, city="paris") # similar to creating an instance
print(xiaobai)

Out:Person(name='xiaobai', age=18, city='paris')


print(xiaobai.name)
print(xiaobai.age)
print(xiaobai.city)

Out:xiaobai
     18
     paris

Use case: read a CSV file, store each row in a namedtuple, then add a hash_id or switch to a DataFrame

# create hash id for a namedtuple
import hashlib
from collections import namedtuple

import pandas as pd

cols = ['a', 'b', 'c']
fields = cols + ['hash_id']

# every field defaults to an empty string
Data = namedtuple('Data', fields, defaults=[''] * len(fields))

raw_result = []
with open('xx.csv', encoding='utf-8') as file:
    file_iter = iter(file)
    _ = next(file_iter)  # skip the header line
    for line in file_iter:
        each_line = line.strip('\n').split(';') + ['']  # empty placeholder for hash_id
        raw_result.append(Data(*each_line))


def create_hash_id(each):
    # hash_id is still '' at this point, so it does not affect the digest
    text = "".join(x.replace(" ", "") for x in each._asdict().values())
    return each._replace(hash_id=hashlib.sha256(text.encode('utf-8')).hexdigest())

# A list of namedtuples
a = [Data(...),Data(...)]
new_a = list(map(create_hash_id,a))

# Get values for a namedtuple
a[0]._asdict().values()

# Get fields for a namedtuple
a[0]._fields

# Turn the result to dataframe
df = pd.DataFrame(a, columns=Data._fields)

With class

class DataPoint(namedtuple('DataPoint', ['date', 'value'])):
    __slots__ = ()

    def __le__(self, other):
        return self.value <= other.value

    def __lt__(self, other):
        return self.value < other.value

    def __gt__(self, other):
        return self.value > other.value

City = namedtuple('City', 'name country population coordinates')
tokyo = City('Tokyo', 'JP', 36.933, (35.689722, 139.691667))
tokyo
=>City(name='Tokyo', country='JP', population=36.933, coordinates=(35.689722, 139.691667))

tokyo._fields
=>('name', 'country', 'population', 'coordinates')

LatLong = namedtuple('LatLong', 'lat long')
delhi_data = ('Delhi NCR', 'IN', 21.935, LatLong(28.613889, 77.208889))
delhi = City._make(delhi_data) 
delhi._asdict() 
for key, value in delhi._asdict().items():
    print(key + ':', value)

name: Delhi NCR
country: IN
population: 21.935
coordinates: LatLong(lat=28.613889, long=77.208889)

delhi.coordinates.lat  # 28.613889

Person = namedtuple("Person", "name age height")
jane = Person("Jane", 25, 1.75)
print(jane._asdict())
jane._asdict()['name']

{'name': 'Jane', 'age': 25, 'height': 1.75}  # a regular dict on Python 3.8+; an OrderedDict on earlier versions
'Jane'

Replacing Fields in Existing namedtuple Instances

from collections import namedtuple

Person = namedtuple("Person", "name age height")
jane = Person("Jane", 25, 1.75)
# After Jane's birthday
jane = jane._replace(age=26)
jane

=>Person(name='Jane', age=26, height=1.75)

Exploring Additional namedtuple Attributes

Person = namedtuple("Person", "name age height")

ExtendedPerson = namedtuple(
    "ExtendedPerson",
    [*Person._fields, "weight"]
)

jane = ExtendedPerson("Jane", 26, 1.75, 67)
jane

jane.weight
=>67

For loop namedtuple

Person = namedtuple("Person", "name age height weight")
jane = Person("Jane", 26, 1.75, 67)
for field, value in zip(jane._fields, jane):
    print(field, "->", value)
    
name -> Jane
age -> 26
height -> 1.75
weight -> 67

Default Values

Person = namedtuple(
    "Person",
    "name age height weight country",
    defaults=[185, 75, "Canada"]  # applied to the rightmost fields: height, weight, country
)
print(Person._field_defaults)
{'height': 185, 'weight': 75, 'country': 'Canada'}

Mike= Person("Mike",24)
Mike
Person(name='Mike', age=24, height=185, weight=75, country='Canada')

Returning Multiple Named Values From Functions

def custom_divmod(a, b):
    DivMod = namedtuple("DivMod", "quotient remainder")
    return DivMod(*divmod(a, b))
    
custom_divmod(8, 4)
=>DivMod(quotient=2, remainder=0)

Reducing the Number of Arguments to Functions

User = namedtuple("User", "username client_name plan")
user = User("john", "John Doe", "Premium")

def create_user(db, user):
    db.add_user(user.username)
    db.complete_user_profile(
        user.username,
        user.client_name,
        user.plan
    )

namedtuple vs Data Class

Data Classes can be thought of as “mutable namedtuples with defaults.” However, it’d be more accurate to say that data classes are like mutable named tuples with type hints. The “defaults” part isn’t a difference at all because named tuples can also have default values for their fields. So, at first glance, the main differences are mutability and type hints.

from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int
    height: float
    weight: float
    country: str = "Canada"


jane = Person("Jane", 25, 1.75, 67)
print(jane.name)
jane.name = "Mike"
jane.name

'Jane'
'Mike'

Add frozen=True to make instances immutable; assigning to a field then raises FrozenInstanceError

@dataclass(frozen=True)
class Person:
    name: str
    ....
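
A runnable sketch of the frozen behaviour. FrozenPerson is a hypothetical name chosen so it does not clash with the Person classes above, and the fields are cut down for brevity. dataclasses.replace plays the same role as namedtuple's _replace.

```python
from dataclasses import dataclass, replace, FrozenInstanceError

@dataclass(frozen=True)
class FrozenPerson:          # hypothetical name, for illustration only
    name: str
    age: int

jane = FrozenPerson("Jane", 25)

try:
    jane.age = 26            # assignment is blocked on frozen instances
    changed = True
except FrozenInstanceError:
    changed = False

older = replace(jane, age=26)   # builds a modified copy; jane is unchanged
```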

Subclassing namedtuple Classes

from collections import namedtuple
from datetime import date

BasePerson = namedtuple(
    "BasePerson",
    "name birthdate country",
    defaults=["Canada"]
)

class Person(BasePerson):
    """A namedtuple subclass to hold a person's data."""
    __slots__ = ()
    def __repr__(self):
        return f"Name: {self.name}, age: {self.age} years old."
    @property
    def age(self):
        return (date.today() - self.birthdate).days // 365


print(Person.__doc__)
jane = Person("Jane", date(1996, 3, 5))
jane.age

A namedtuple subclass to hold a person's data.
25  # when this note was written; the value depends on today's date

OrderedDict

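Basic concept

“OrderedDict” is itself a dict. What makes it special is that it remembers the order in which keys and values were inserted. Since Python 3.7 a regular dict also preserves insertion order, but OrderedDict still offers extras such as move_to_end() and order-sensitive equality.

Examples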

from collections import OrderedDict
d = OrderedDict()
d['a'] = 1
d['b'] = 2
d['c'] = 3
d['d'] = 4
print(d)

Out:OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])

Deleting a key does not change the order of the remaining keys in an OrderedDict, and re-inserting the key places it at the end:

from collections import OrderedDict
print("Before deleting:\n")
od = OrderedDict()
od['a'] = 1
od['b'] = 2
od['c'] = 3
od['d'] = 4

for key, value in od.items():
    print(key, value)

print("\nAfter deleting:\n")
od.pop('c')
for key, value in od.items():
    print(key, value)

print("\nAfter re-inserting:\n")
od['c'] = 3
for key, value in od.items():
    print(key, value) 
    

Out:Before deleting:

    a 1
    b 2
    c 3
    d 4

    After deleting:

    a 1
    b 2
    d 4

    After re-inserting:

    a 1
    b 2
    d 4
    c 3
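
Beyond remembering insertion order, OrderedDict has a few operations that a regular dict lacks. A minimal sketch of reordering with move_to_end(), popping from the front with popitem(last=False), and order-sensitive equality:

```python
from collections import OrderedDict

od = OrderedDict(a=1, b=2, c=3)

od.move_to_end("a")                 # "a" becomes the last key
order_after_move = list(od)

od.move_to_end("c", last=False)     # "c" becomes the first key
order_after_front = list(od)

oldest = od.popitem(last=False)     # removes and returns the first pair

# Equality between two OrderedDicts is order-sensitive; plain dicts ignore order
ordered_equal = OrderedDict(x=1, y=2) == OrderedDict(y=2, x=1)
plain_equal = dict(x=1, y=2) == dict(y=2, x=1)
```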

ChainMap

from collections import ChainMap

cmd_proxy = {}  # The user doesn't provide a proxy
local_proxy = {"proxy": "proxy.local.com"}
global_proxy = {"proxy": "proxy.global.com"}

config = ChainMap(cmd_proxy, local_proxy, global_proxy)
config.maps
=>[{}, {'proxy': 'proxy.local.com'}, {'proxy': 'proxy.global.com'}]
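
The point of chaining the mappings is the lookup order: a key is searched in each mapping from left to right and the first match wins, while writes always go to the first mapping. A minimal sketch reusing the dicts above; the "timeout" key is an illustrative addition, not part of the original example.

```python
from collections import ChainMap

cmd_proxy = {}  # The user doesn't provide a proxy
local_proxy = {"proxy": "proxy.local.com"}
global_proxy = {"proxy": "proxy.global.com", "timeout": 30}  # "timeout" is an illustrative extra key

config = ChainMap(cmd_proxy, local_proxy, global_proxy)

proxy = config["proxy"]        # first match wins: found in local_proxy
timeout = config["timeout"]    # falls through to global_proxy

config["proxy"] = "proxy.cmd.com"   # writes always go to the first mapping
proxy_after_write = config["proxy"]
```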

ChainMap provides a .new_child() method and a .parents property:

from collections import ChainMap

dad = {"name": "John", "age": 35}
mom = {"name": "Jane", "age": 31}
family = ChainMap(mom, dad)
family

=>ChainMap({'name': 'Jane', 'age': 31}, {'name': 'John', 'age': 35})

son = {"name": "Mike", "age": 0}
family = family.new_child(son)

for person in family.maps:
   print(person)

{'name': 'Mike', 'age': 0}
{'name': 'Jane', 'age': 31}
{'name': 'John', 'age': 35}

family.parents
=>ChainMap({'name': 'Jane', 'age': 31}, {'name': 'John', 'age': 35})

Last updated 3 years ago
